3. Audit Taxonomy Schema Coverage

Purpose: Compare a taxonomy's fact schema against external structured data (Google Knowledge Graph, Wikidata, Wikipedia, domain APIs) to surface missing fields, type mismatches, vocabulary gaps, and ownership gaps (subcategories with no vocabulary or voice entries of their own). Supports single-node audits and recursive tree walks that produce a structured action plan.

Prerequisites:

  • DATABASE_URL configured (samples entity titles from fact_records)
  • GOOGLE_KG_API_KEY optional (KG checks gracefully skip without it)
  • The target taxonomy must have fact records in the DB — the script samples entity titles (e.g., "LeBron James", "Manchester United") to query external APIs
  • Any active topic_categories slug is valid — subcategories that inherit rules/voice from parents are fully supported

Cost / Duration: $0 (free APIs only) | 1-3 minutes per node (depends on --sample size and --depth)

Prompt

Run a schema coverage audit for the [taxonomy] taxonomy to find
gaps in the fact schema.

```bash
# Single node (backward compatible)
bun scripts/taxonomy/audit-schema-coverage.ts [taxonomy-slug]

# Recursive tree walk
bun scripts/taxonomy/audit-schema-coverage.ts [taxonomy-slug] --depth=1

# Start at a subcategory
bun scripts/taxonomy/audit-schema-coverage.ts basketball --depth=1
```

Options:
- `--sample=N`    — Sample size per node (default 10, max 50)
- `--json`        — Output full action plan as JSON to stdout
- `--summary`     — Output summary counts only
- `--depth=N`     — Recursion depth 0-3 (default: 0 = single-node)
- `--output=PATH` — Output directory (default: docs/reports/taxonomy/)
- `--no-write`    — Skip writing files, console only (default when depth=0)
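
The defaults and limits listed above can be sketched as a small argument parser. This is a minimal illustration, not the script's actual code: the `AuditOptions` shape, the `parseAuditOptions` name, and the choice to clamp out-of-range values (rather than reject them) are all assumptions, as is the lower bound of 1 for `--sample`.

```typescript
// Hypothetical option shape; the real script's internals are not shown here.
interface AuditOptions {
  sample: number; // --sample=N, default 10, max 50
  depth: number; // --depth=N, 0-3, default 0
  json: boolean; // --json
  summary: boolean; // --summary
  output: string; // --output=PATH
  write: boolean; // false with --no-write, and by default at depth 0
}

function clamp(value: number, min: number, max: number): number {
  return Math.min(max, Math.max(min, value));
}

function parseAuditOptions(argv: string[]): AuditOptions {
  // Return the text after `--name=` for the first matching argument, if any.
  const valueOf = (name: string): string | undefined => {
    const prefix = `--${name}=`;
    const hit = argv.find((arg) => arg.startsWith(prefix));
    return hit?.slice(prefix.length);
  };
  // Parse an integer flag, falling back when it is absent or not a number.
  const intOf = (name: string, fallback: number): number => {
    const parsed = Number.parseInt(valueOf(name) ?? "", 10);
    return Number.isNaN(parsed) ? fallback : parsed;
  };

  const depth = clamp(intOf("depth", 0), 0, 3);
  return {
    sample: clamp(intOf("sample", 10), 1, 50), // assumed lower bound of 1
    depth,
    json: argv.includes("--json"),
    summary: argv.includes("--summary"),
    output: valueOf("output") ?? "docs/reports/taxonomy/",
    write: !argv.includes("--no-write") && depth > 0,
  };
}
```

For example, `parseAuditOptions(["basketball", "--depth=1", "--sample=80"])` yields a depth of 1, a sample size capped at 50, and file writing enabled.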

The script will:
1. Resolve the start node from DB (any active slug, not just root categories)
2. At each node: resolve inherited rules/voice, fetch schema, sample entities
3. Query KG, Wikidata, Wikipedia, and domain-specific APIs per entity
4. Run 6 analysis checks comparing external data against the schema
5. Flag ownership gaps (subcategories missing own vocabulary/voice entries)
6. If depth>0: recurse into children, generate structured action plan
7. Batch-resolve Wikidata property labels after the full tree walk
8. Output report to console and optionally write JSON + Markdown files

Review the suggestions and decide which to act on:
- **Schema field additions:** Add new fact_keys via migration
- **Type fixes:** Update fact_key types via migration
- **Vocabulary additions:** Edit taxonomy-rules-data.ts directly
- **Voice additions:** Edit taxonomy-voices-data.ts directly
- **Ownership gaps:** Create dedicated entries for subcategories

Available root slugs (35 active roots):
animals, architecture, art, auto, business, cooking, culture,
current-events, design, entertainment, fashion, food-beverage,
games, geography, geology, governments, health-medicine, history,
home-living, how-things-work, language-linguistics, math, movies,
music, people, places, publishing, records, science, space-astronomy,
sports, technology, travel, tv, weather-climate

Any active topic_categories slug is valid at any depth (1,104 total
categories). Use `--depth=1` or `--depth=2` to walk children from
any starting node.
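
To make the `--depth` semantics concrete, here is a hedged sketch of a depth-limited walk. It assumes an in-memory `CategoryNode` tree, whereas the real script resolves nodes from the database. The `nba` and `soccer` slugs and the placement of `basketball` under `sports` are illustrative only.

```typescript
// Hypothetical in-memory shape; the real script reads topic_categories from the DB.
interface CategoryNode {
  slug: string;
  children: CategoryNode[];
}

// Visit the start node, then descend at most `maxDepth` levels below it.
function walkTree(start: CategoryNode, maxDepth: number): string[] {
  const visited: string[] = [];
  const visit = (node: CategoryNode, level: number): void => {
    visited.push(node.slug); // an audit of this node would run here
    if (level >= maxDepth) return; // depth 0 stops at the start node
    for (const child of node.children) visit(child, level + 1);
  };
  visit(start, 0);
  return visited;
}

// Illustrative tree (slugs below the root are made up for the example).
const sports: CategoryNode = {
  slug: "sports",
  children: [
    { slug: "basketball", children: [{ slug: "nba", children: [] }] },
    { slug: "soccer", children: [] },
  ],
};

console.log(walkTree(sports, 0)); // ["sports"]
console.log(walkTree(sports, 1)); // ["sports", "basketball", "soccer"]
```

Under these assumptions, depth 0 reproduces the single-node audit, and each extra level adds one generation of children.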

Analysis Checks

| Check | What It Surfaces | Threshold |
| --- | --- | --- |
| Schema field coverage | Wikidata properties missing from factKeys | ≥70% of entities |
| Type alignment | factKey type vs Wikidata value type mismatches | Any mismatch |
| Vocabulary gaps | Wikipedia category terms missing from domain_terms | ≥30% of entities |
| Entity type distribution | KG type breakdown (Person, Org, Place, etc.) | Informational |
| Domain-specific coverage | TheSportsDB/MusicBrainz fields (sports/music only) | ≥50% of entities |
| Sitelink notability | Entities with <20 Wikipedia language links | <20 sitelinks |
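
As an illustration of how a percentage threshold like the 70% schema field check might be applied, the sketch below counts how many sampled entities carry each Wikidata property and keeps those at or above the threshold that the schema does not already cover. The input shapes, the `suggestMissingFields` name, and the `coveredProperties` set (how factKeys map to Wikidata properties) are assumptions, not the script's actual data model.

```typescript
// Property IDs seen on each sampled entity, keyed by entity title (assumed shape).
type EntityProperties = Record<string, string[]>;

interface FieldSuggestion {
  property: string;
  coverage: number; // fraction of sampled entities carrying the property
}

function suggestMissingFields(
  entities: EntityProperties,
  coveredProperties: Set<string>, // properties the schema already maps (assumed)
  threshold = 0.7,
): FieldSuggestion[] {
  const total = Object.keys(entities).length;
  if (total === 0) return [];

  // Count each property once per entity, even if it is listed more than once.
  const counts = new Map<string, number>();
  for (const props of Object.values(entities)) {
    for (const prop of Array.from(new Set(props))) {
      counts.set(prop, (counts.get(prop) ?? 0) + 1);
    }
  }

  return Array.from(counts.entries())
    .map(([property, count]) => ({ property, coverage: count / total }))
    .filter((s) => s.coverage >= threshold && !coveredProperties.has(s.property))
    .sort((a, b) => b.coverage - a.coverage);
}

// P569 = date of birth, P54 = member of sports team, P2048 = height.
const sampled: EntityProperties = {
  a: ["P569", "P54"],
  b: ["P569", "P54"],
  c: ["P569", "P54", "P2048"],
  d: ["P569"],
};
console.log(suggestMissingFields(sampled, new Set(["P569"])));
```

Here P54 appears on 3 of 4 entities (75%) and is suggested. P2048 (25%) falls below the threshold, and P569 is skipped because the schema already covers it.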

Action Item Priorities (depth>0)

| Source | Priority |
| --- | --- |
| Schema field suggestion (≥90% coverage) | high |
| Schema field suggestion (≥70% coverage) | medium |
| Type alignment mismatch | high |
| Missing vocabulary entry (subcategory) | medium |
| Missing voice entry (subcategory) | medium |
| Vocabulary gap (Wikipedia categories) | low |
| Low notability entity (<20 sitelinks) | low |
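
The priority rules above can be written as a single mapping function. This is a sketch only: the `Finding` union and its `kind` labels are hypothetical names chosen for the example, and it assumes schema field suggestions reach this step only after passing the 70% threshold.

```typescript
type Priority = "high" | "medium" | "low";

// Hypothetical finding kinds, one per row of the priority table.
type Finding =
  | { kind: "schema-field"; coverage: number } // coverage as a fraction, 0..1
  | { kind: "type-mismatch" }
  | { kind: "missing-vocabulary-entry" }
  | { kind: "missing-voice-entry" }
  | { kind: "vocabulary-gap" }
  | { kind: "low-notability" };

function priorityFor(finding: Finding): Priority {
  switch (finding.kind) {
    case "schema-field":
      // At or above 90% coverage is high; suggestions between 70% and 90% are medium.
      return finding.coverage >= 0.9 ? "high" : "medium";
    case "type-mismatch":
      return "high";
    case "missing-vocabulary-entry":
    case "missing-voice-entry":
      return "medium";
    case "vocabulary-gap":
    case "low-notability":
      return "low";
  }
}
```

Modelling findings as a discriminated union lets the compiler flag any new finding kind that has no priority assigned.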

Verification

  • Script runs without errors for the target taxonomy
  • --depth=0 behaves identically to the original single-node audit
  • --depth=1 walks children and generates action plan files
  • Suggested field additions reviewed and either adopted (via migration) or dismissed with rationale
  • Type mismatches reviewed and corrected if appropriate
  • Vocabulary suggestions reviewed and added to taxonomy-rules-data.ts if appropriate
  • Ownership gaps reviewed and subcategory entries created if appropriate
  • If schema changes were made: bun run typecheck passes
  • If migrations were applied: bun run migrations:index and bun run migrations:check pass
