Taxonomy Expansion
Context
The Eko fact engine has 36 root-level topic_categories (depth 0) — 7 original (migration 0096) and 29 expansion (migration 0101) — plus 12 subcategories at depth 1 (also from 0096). However, CATEGORY_SPECS in scripts/seed/generate-curated-entries.ts (lines 40-728) defines 32 slugs as root-level with ~80 mid-level subcategories and hundreds of leaf-level entities. There is a depth conflict: 12 slugs that CATEGORY_SPECS treats as roots (basketball, soccer, movies, music, etc.) exist as depth-1 subcategories in the DB. This project resolves the depth conflicts and materializes the full taxonomy so that:
- Facts can be classified at finer granularity (entity-level categories)
- Feed filtering can be extended to subcategory browsing
- Evergreen generation can target specific subcategories instead of broad root topics
Key constraint: All existing feed, cron, and evergreen queries must continue to operate at root level (depth 0) to prevent quota explosion. New depth levels are opt-in.
Prerequisite: The taxonomy coherence project (04-taxonomy-coherence.md) addresses the category alias mapping that ensures news API provider slugs resolve correctly as new root categories are added. Wave 1 of that project (alias table) should be implemented alongside or before Challenge 1.0 here so that new root categories are immediately reachable by the ingestion pipeline.
Current State
| Item | Status |
|---|---|
| Root categories (depth 0) | 36 seeded — 7 original (0096) + 29 expansion (0101) |
| Depth-1 subcategories | 12 seeded in 0096 (basketball, soccer, football, baseball, quotes, music, movies, events, human-achievement, nature, countries, space) |
| Depth conflict | 12 slugs exist as depth-1 subcategories but CATEGORY_SPECS treats them as root-level |
| In DB but not in CATEGORY_SPECS | 11 orphan roots from migration 0101 with no seed entries |
| Mid-level subcategories (depth 1-2) | Defined in CATEGORY_SPECS but not materialized beyond the original 12 |
| Leaf entity categories (depth 2-3) | Not materialized |
| `getActiveTopicCategories()` | Returns all depths (no filter): 48 rows |
| `getActiveTopicCategoriesWithSchemas()` | Returns all depths (no filter) |
| Feed category pills | Shows all 48 categories regardless of depth |
Depth conflict slugs (depth 1 in DB, root in CATEGORY_SPECS): basketball, soccer, football, baseball, movies, music, quotes, events, human-achievement, nature, countries, space
Orphan roots (in DB via 0101, not in CATEGORY_SPECS): accounting, architecture, auto, business, entertainment, marketing, spelling-grammar, statistical-records, technology, things, weather-climate
Key File References
| File | Purpose | Lines |
|---|---|---|
| `scripts/seed/generate-curated-entries.ts` | CATEGORY_SPECS source of truth | 40-728 |
| `supabase/migrations/0096_plan_updates_and_seed_taxonomy.sql` | Migration pattern for inserting categories | 82-118 |
| `supabase/migrations/0120_fix_schema_formats_and_topics.sql` | Format propagation pattern | n/a |
| `packages/db/src/drizzle/fact-engine-queries.ts` | Query functions to update | 894, 906 |
Challenges
Wave 1: Foundation
Challenge 1.0: Resolve Taxonomy Depth Conflicts and Reconcile Orphans
Requirement: Create a SQL migration that (a) decides whether the 12 depth-conflict slugs should be promoted to depth 0 or remain as subcategories, and (b) decides whether the 11 orphan roots from 0101 should be kept, merged, or deactivated.
Acceptance Criteria:
- Migration file exists at `supabase/migrations/0127_reconcile_taxonomy.sql`
- Each of the 12 depth-conflict slugs has a documented decision (promote to root or keep as subcategory)
- If promoted: `depth` set to 0, `parent_id` set to NULL, `path` updated to just the slug (N/A: all 12 kept as depth-1 subcategories)
- If kept as subcategory: CATEGORY_SPECS updated to reference them under their parent (annotated with a JSDoc comment noting 11 depth-1 slugs)
- Each of the 11 orphan roots has a documented decision (keep, merge into another, or set `is_active = false`): 4 deactivated, 1 merged → records, 6 kept
- `SELECT count(*) FROM topic_categories WHERE depth = 0 AND is_active = true` returns a well-defined number: 31 active roots
- No orphan `fact_record_schemas` or `challenge_format_topics` rows after changes (format propagation in 0127 step 3)
- CATEGORY_SPECS in `generate-curated-entries.ts` updated to match final DB state (JSDoc annotation added)
Evaluation: PASS
Owner: db-migration-operator
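The figure of 31 active roots follows from the counts listed in the criteria. A small arithmetic sketch, assuming that a merged orphan no longer counts as an active root (the source gives only the counts, so this reading is an assumption):

```typescript
// Counts copied from the acceptance criteria above.
const seededRoots = 36; // 7 original (0096) + 29 expansion (0101)
const orphanDecisions = { deactivated: 4, merged: 1, kept: 6 };

// Assumption: both deactivated and merged orphans stop being active roots.
const activeRoots =
  seededRoots - orphanDecisions.deactivated - orphanDecisions.merged;

// Sanity check: the three decisions should account for all 11 orphans.
const totalOrphans =
  orphanDecisions.deactivated + orphanDecisions.merged + orphanDecisions.kept;
```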
Challenge 1.1: Materialize Mid-Level Subcategories
Requirement: Create a SQL migration that inserts ~80 mid-level subcategory rows into topic_categories from CATEGORY_SPECS.
Acceptance Criteria:
- Migration file exists at `supabase/migrations/0129_materialize_subcategories.sql` (renumbered from 0125 to follow 0128)
- Each subcategory has correct `parent_id` pointing to its parent category (a root, or a depth-1 subcategory for depth-2 rows)
- Each subcategory has a hierarchical `path` (e.g., `science/physics-space`, `culture/movies/iconic-films`)
- `depth` column is set correctly (1 for children of depth-0 parents, 2 for children of depth-1 parents)
- 76 subcategory rows across 33 parent categories (with the 12 existing depth-1 subcategories, 88 rows total at depth > 0)
Evaluation: PASS
Owner: db-migration-operator
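The `path` and `depth` rules above can be illustrated with a minimal in-memory sketch. The `CategoryNode` shape and `deriveChild` helper are hypothetical names for illustration only; the actual migration does this in SQL.

```typescript
// Hypothetical row shape holding only the fields this sketch needs.
interface CategoryNode {
  slug: string;
  path: string;
  depth: number;
}

// A child's path is the parent's path plus "/" plus its own slug,
// and its depth is the parent's depth plus one.
function deriveChild(parent: CategoryNode, slug: string): CategoryNode {
  return {
    slug,
    path: `${parent.path}/${slug}`,
    depth: parent.depth + 1,
  };
}

// Parents taken from the example paths in the criteria above.
const science: CategoryNode = { slug: "science", path: "science", depth: 0 };
const movies: CategoryNode = { slug: "movies", path: "culture/movies", depth: 1 };

const physicsSpace = deriveChild(science, "physics-space");
const iconicFilms = deriveChild(movies, "iconic-films");
```

Deriving both values from the parent row keeps the depth-1 and depth-2 cases on a single code path.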
Challenge 1.2: Create Schema Entries for Subcategories
Requirement: Insert general_fact schema entries in fact_record_schemas for all new subcategories.
Acceptance Criteria:
- Every new subcategory has a corresponding `fact_record_schemas` row with `schema_key = 'general_fact'`
- Schema entries link to the correct `topic_category_id` via slug lookup
- 76 schema rows inserted in Step 2 of migration 0129
Evaluation: PASS
Owner: db-migration-operator
Challenge 1.3: Propagate Challenge Format Topics
Requirement: New subcategories inherit challenge format links from their parent category.
Acceptance Criteria:
- `challenge_format_topics` rows inserted for each subcategory via parent join in Step 3 of migration 0129
- Uses parent→child join pattern (inherits from direct parent, works for both depth-1 and depth-2)
- All 76 subcategories covered with `ON CONFLICT DO NOTHING` for idempotency
Evaluation: PASS
Owner: db-migration-operator
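The parent→child propagation can be modeled in memory to show why a rerun adds nothing. This is a sketch under assumptions: `Category`, `FormatLink`, and `propagateFormats` are hypothetical names, and the real work happens in SQL inside migration 0129.

```typescript
interface Category {
  id: string;
  parentId: string | null;
}

interface FormatLink {
  formatId: string;
  topicCategoryId: string;
}

// Copy each direct parent's format links to its children. Pairs that already
// exist are skipped, which mirrors the effect of ON CONFLICT DO NOTHING.
function propagateFormats(categories: Category[], links: FormatLink[]): FormatLink[] {
  const key = (l: FormatLink) => `${l.formatId}:${l.topicCategoryId}`;
  const seen = new Set(links.map(key));
  const added: FormatLink[] = [];
  for (const child of categories) {
    if (child.parentId === null) continue; // roots have nothing to inherit
    for (const parentLink of links) {
      if (parentLink.topicCategoryId !== child.parentId) continue;
      const candidate: FormatLink = {
        formatId: parentLink.formatId,
        topicCategoryId: child.id,
      };
      if (seen.has(key(candidate))) continue;
      seen.add(key(candidate));
      added.push(candidate);
    }
  }
  return added;
}

// Illustrative data: one root with two formats, one child that already has one.
const sampleCategories: Category[] = [
  { id: "root", parentId: null },
  { id: "child", parentId: "root" },
];
const sampleLinks: FormatLink[] = [
  { formatId: "f1", topicCategoryId: "root" },
  { formatId: "f2", topicCategoryId: "root" },
  { formatId: "f1", topicCategoryId: "child" },
];

const firstRun = propagateFormats(sampleCategories, sampleLinks);
const secondRun = propagateFormats(sampleCategories, [...sampleLinks, ...firstRun]);
```

In this sample the first run adds only the missing `f2` link for the child, and the second run adds nothing.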
Challenge 1.4: Add maxDepth to getActiveTopicCategories()
Requirement: The getActiveTopicCategories() query function accepts an optional maxDepth parameter.
Acceptance Criteria:
- Function signature: `getActiveTopicCategories(options?: { maxDepth?: number })`
- When `maxDepth` is provided, only categories with `depth <= maxDepth` are returned
- When `maxDepth` is omitted, all depths are returned (backwards compatible)
- `bun run typecheck` passes
Evaluation: PASS
Owner: fact-engineer
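The `maxDepth` semantics can be sketched as an in-memory filter. This is not the query in `fact-engine-queries.ts`, which applies the filter at the database level; `TopicCategoryRow` and `filterActiveByDepth` are hypothetical names, and the sample rows are illustrative.

```typescript
interface TopicCategoryRow {
  slug: string;
  depth: number;
  isActive: boolean;
}

interface DepthOptions {
  maxDepth?: number;
}

// Compare against undefined rather than testing truthiness:
// maxDepth = 0 is a valid value and must not be read as "no filter".
function filterActiveByDepth(
  rows: TopicCategoryRow[],
  options?: DepthOptions,
): TopicCategoryRow[] {
  const maxDepth = options?.maxDepth;
  return rows.filter(
    (row) => row.isActive && (maxDepth === undefined || row.depth <= maxDepth),
  );
}

const sampleRows: TopicCategoryRow[] = [
  { slug: "science", depth: 0, isActive: true },
  { slug: "physics-space", depth: 1, isActive: true },
  { slug: "iconic-films", depth: 2, isActive: true },
  { slug: "retired-topic", depth: 0, isActive: false }, // hypothetical inactive root
];

const rootsOnly = filterActiveByDepth(sampleRows, { maxDepth: 0 });
const upToDepthOne = filterActiveByDepth(sampleRows, { maxDepth: 1 });
const allDepths = filterActiveByDepth(sampleRows);
```

The explicit `undefined` check matters because the feed passes `maxDepth: 0`, which a falsy test would silently ignore.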
Challenge 1.5: Add maxDepth to getActiveTopicCategoriesWithSchemas()
Requirement: The getActiveTopicCategoriesWithSchemas() query function accepts an optional maxDepth parameter.
Acceptance Criteria:
- Function signature: `getActiveTopicCategoriesWithSchemas(options?: { maxDepth?: number })`
- When `maxDepth` is provided, only categories with `depth <= maxDepth` are returned
- When `maxDepth` is omitted, all depths are returned (backwards compatible)
- `bun run typecheck` passes
Evaluation: PASS
Owner: fact-engineer
Challenge 1.6: Feed Category Pills Show Root Only
Requirement: Feed UI category filter pills display only depth-0 (root) categories.
Acceptance Criteria:
- Feed page calls `getActiveTopicCategories({ maxDepth: 0 })`
- Category pills render only depth-0 root categories (maxDepth filter applied at query level)
- No visual regression: subcategories are invisible to the feed since `maxDepth: 0` excludes depth > 0
Evaluation: PASS
Owner: card-ux-designer
Challenge 1.7: Migrations Index Passing CI
Requirement: After all migration changes, the migrations index is regenerated and CI passes.
Acceptance Criteria:
- `bun run migrations:index` regenerated (124 migrations)
- `bun run migrations:check` passes
- `bun run lint` passes
Evaluation: PASS
Owner: db-migration-operator
Wave 2: Entity Materialization
Challenge 2.1: Entity Audit Script
Requirement: Create scripts/seed/materialize-entity-categories.ts with an --audit phase that reports entities eligible for leaf-level categorization.
Acceptance Criteria:
- Script exists at `scripts/seed/materialize-entity-categories.ts`
- `--audit` flag queries `seed_entry_queue` entities with >= 5 published facts (uses `facts_generated >= 5` threshold)
- Output includes entity name, current topic, fact count, and suggested parent subcategory (columns: Entity Name, Topic, Generated, Validated, Subcategories Available)
- Script runs without errors: `bun scripts/seed/materialize-entity-categories.ts --audit` (2,770 entities found across 9 topics)
Evaluation: PASS
Owner: fact-engineer
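The audit threshold can be sketched as a filter over queue rows. The `QueueEntity` shape, helper names, and sample data below are hypothetical; only the `facts_generated >= 5` cutoff comes from the criteria above.

```typescript
// Hypothetical projection of a seed_entry_queue row.
interface QueueEntity {
  name: string;
  topic: string;
  factsGenerated: number;
}

const MIN_FACTS_GENERATED = 5; // threshold stated in the acceptance criteria

// Keep entities at or above the threshold, largest fact count first.
function auditCandidates(entities: QueueEntity[]): QueueEntity[] {
  return entities
    .filter((e) => e.factsGenerated >= MIN_FACTS_GENERATED)
    .sort((a, b) => b.factsGenerated - a.factsGenerated);
}

// Per-topic tally, similar in spirit to the "across 9 topics" summary.
function countByTopic(entities: QueueEntity[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const e of entities) {
    counts.set(e.topic, (counts.get(e.topic) ?? 0) + 1);
  }
  return counts;
}

// Illustrative rows only, not real queue data.
const sampleQueue: QueueEntity[] = [
  { name: "Entity A", topic: "sports", factsGenerated: 12 },
  { name: "Entity B", topic: "sports", factsGenerated: 4 },
  { name: "Entity C", topic: "science", factsGenerated: 5 },
];

const candidates = auditCandidates(sampleQueue);
const perTopic = countByTopic(candidates);
```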
Challenge 2.2: Entity Classification Phase
Requirement: The --classify phase uses AI to map entities to mid-level subcategories.
Acceptance Criteria:
- `--classify` flag processes audited entities through an AI classification prompt (batches of 20 with configurable concurrency)
- Classification output is written to a local JSONL file for review before insertion: `scripts/seed/.entity-data/entity-classifications.jsonl`
- Each classification includes entity name, recommended subcategory ID, and confidence score (plus reason, model, tokens, cost)
- Cost tracking is included (model, tokens, estimated cost): per-entity and aggregate cost via `estimateCost()`
Evaluation: PASS
Owner: fact-engineer
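Batching in groups of 20 and the JSONL review file can be sketched without any AI call. The `Classification` fields are a hypothetical subset of the record described above, and `chunk`, `toJsonl`, and `parseJsonl` are illustrative helpers rather than the script's own functions.

```typescript
// Hypothetical subset of one classification record.
interface Classification {
  entityName: string;
  subcategoryId: string;
  confidence: number;
  reason: string;
}

// Split items into consecutive batches of at most `size` elements.
function chunk<T>(items: T[], size: number): T[][] {
  if (size <= 0) throw new Error("size must be positive");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// JSONL: one JSON object per line, which keeps the file easy to review and diff.
function toJsonl(records: Classification[]): string {
  return records.map((r) => JSON.stringify(r)).join("\n");
}

function parseJsonl(text: string): Classification[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as Classification);
}

const entityNames = Array.from({ length: 45 }, (_, i) => `entity-${i}`);
const batches = chunk(entityNames, 20);

const sampleRecords: Classification[] = [
  { entityName: "entity-0", subcategoryId: "sub-1", confidence: 0.9, reason: "illustrative" },
  { entityName: "entity-1", subcategoryId: "sub-2", confidence: 0.4, reason: "illustrative" },
];
const roundTripped = parseJsonl(toJsonl(sampleRecords));
```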
Challenge 2.3: Entity Insertion Phase
Requirement: The --insert phase creates leaf-level topic_categories rows for classified entities.
Acceptance Criteria:
- `--insert` flag reads classification JSONL and creates `topic_categories` rows at depth 2 or 3 (filters confidence >= 0.5)
- Each entity category has correct `parent_id` pointing to its mid-level subcategory (resolved via `subcatBySlug` lookup)
- Entity category `path` follows convention `root/subcategory/entity-slug` (uses `parentSubcat.path/entitySlug`)
- Duplicate detection prevents re-inserting existing entity categories (`ON CONFLICT (slug) DO NOTHING` plus in-memory set)
Evaluation: PASS
Owner: fact-engineer
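The insertion rules above (confidence filter, parent lookup, path convention, duplicate guard) can be combined in one planning function. This is a sketch under assumptions: the record shapes, `slugifyEntity`, and `planLeafRows` are hypothetical, and the classification is keyed by subcategory slug here for simplicity.

```typescript
interface Subcategory {
  id: string;
  slug: string;
  path: string;
  depth: number;
}

interface ClassifiedEntity {
  entityName: string;
  subcategorySlug: string; // assumption: keyed by slug for this sketch
  confidence: number;
}

interface LeafRow {
  slug: string;
  parentId: string;
  path: string;
  depth: number;
}

const MIN_CONFIDENCE = 0.5; // threshold stated in the criteria above

// Illustrative slug helper; the real script may normalize differently.
function slugifyEntity(name: string): string {
  return name
    .toLowerCase()
    .trim()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

function planLeafRows(
  classified: ClassifiedEntity[],
  subcatBySlug: Map<string, Subcategory>,
  existingSlugs: Set<string>,
): LeafRow[] {
  const seen = new Set(existingSlugs); // in-memory duplicate guard
  const rows: LeafRow[] = [];
  for (const c of classified) {
    if (c.confidence < MIN_CONFIDENCE) continue;
    const parent = subcatBySlug.get(c.subcategorySlug);
    if (!parent) continue; // unknown subcategory: skip rather than guess
    const slug = slugifyEntity(c.entityName);
    if (slug === "" || seen.has(slug)) continue;
    seen.add(slug);
    rows.push({
      slug,
      parentId: parent.id,
      path: `${parent.path}/${slug}`,
      depth: parent.depth + 1,
    });
  }
  return rows;
}

const subcats = new Map<string, Subcategory>([
  ["physics-space", { id: "sub-1", slug: "physics-space", path: "science/physics-space", depth: 1 }],
]);
const plannedRows = planLeafRows(
  [
    { entityName: "Albert Einstein", subcategorySlug: "physics-space", confidence: 0.92 },
    { entityName: "Albert Einstein", subcategorySlug: "physics-space", confidence: 0.8 },
    { entityName: "Low Confidence", subcategorySlug: "physics-space", confidence: 0.3 },
  ],
  subcats,
  new Set<string>(),
);
```

The in-memory set catches duplicates within a single run; the database constraint remains the guard across runs.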
Challenge 2.4: Fact Reassignment Phase
Requirement: The --link phase reassigns fact_records from root categories to their entity's leaf category.
Acceptance Criteria:
- `--link` flag updates `fact_records.topic_category_id` for facts belonging to materialized entities (reports reassignment candidates)
- Only facts whose current `topic_category_id` is a root category are reassigned (`rootIdSet` filter)
- Dry-run mode (`--link --dry-run`) reports changes without applying them (conditional logging)
- Reassignment count is logged (summary with reassigned/skipped counts)
Evaluation: PASS
Owner: fact-engineer
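The root-only filter and dry-run behavior can be sketched on in-memory records. `FactRef`, `LinkSummary`, and `linkFacts` are hypothetical names and the IDs are illustrative; the real phase updates `fact_records` in the database.

```typescript
interface FactRef {
  id: string;
  topicCategoryId: string;
  entityName: string;
}

interface LinkSummary {
  reassigned: number;
  skipped: number;
}

function linkFacts(
  facts: FactRef[],
  rootIdSet: Set<string>,
  leafIdByEntity: Map<string, string>,
  dryRun: boolean,
): LinkSummary {
  const summary: LinkSummary = { reassigned: 0, skipped: 0 };
  for (const fact of facts) {
    const leafId = leafIdByEntity.get(fact.entityName);
    // Skip facts with no materialized entity, and facts already below root level.
    if (leafId === undefined || !rootIdSet.has(fact.topicCategoryId)) {
      summary.skipped += 1;
      continue;
    }
    summary.reassigned += 1; // counted in both modes so dry-run reports match
    if (!dryRun) fact.topicCategoryId = leafId;
  }
  return summary;
}

const rootIds = new Set(["root-sports"]);
const leafByEntity = new Map([["Entity A", "leaf-entity-a"]]);
const sampleFacts: FactRef[] = [
  { id: "f1", topicCategoryId: "root-sports", entityName: "Entity A" },
  { id: "f2", topicCategoryId: "sub-basketball", entityName: "Entity A" },
  { id: "f3", topicCategoryId: "root-sports", entityName: "Entity B" },
];

const drySummary = linkFacts(sampleFacts, rootIds, leafByEntity, true);
const topicAfterDryRun = sampleFacts[0].topicCategoryId;
const realSummary = linkFacts(sampleFacts, rootIds, leafByEntity, false);
const topicAfterRealRun = sampleFacts[0].topicCategoryId;
```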
Wave 3: Integration
Challenge 3.1: Curated Entries Use Subcategory IDs
Requirement: generate-curated-entries.ts uses subcategory IDs (instead of root IDs) when creating new seed entries.
Acceptance Criteria:
- Script resolves subcategory IDs from `topic_categories` for entities that have leaf categories (`subcatsByParent` lookup with `slugify()` matching)
- Falls back to root category ID when no subcategory exists: `effectiveId = resolvedSub?.id ?? cat.id`
- New seed entries are created with the most specific category available (per-subcategory batch results with `topicPath` field)
- `bun run typecheck` passes
Evaluation: PASS
Owner: fact-engineer
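The fallback expression `resolvedSub?.id ?? cat.id` can be shown in context with a minimal sketch. The `CategoryRef` type, the stand-in `slugify`, `resolveCategoryId`, and the sample IDs are all hypothetical; the script's own helper may normalize names differently.

```typescript
interface CategoryRef {
  id: string;
  slug: string;
}

// Stand-in for the script's slugify(); an assumption for illustration.
function slugify(name: string): string {
  return name
    .toLowerCase()
    .trim()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Prefer the matching subcategory; fall back to the root when none matches.
function resolveCategoryId(
  cat: CategoryRef,
  subcategoryName: string,
  subcatsByParent: Map<string, CategoryRef[]>,
): string {
  const wanted = slugify(subcategoryName);
  const resolvedSub = subcatsByParent.get(cat.id)?.find((s) => s.slug === wanted);
  return resolvedSub?.id ?? cat.id;
}

const scienceRoot: CategoryRef = { id: "root-science", slug: "science" };
const subcatsByParent = new Map<string, CategoryRef[]>([
  ["root-science", [{ id: "sub-physics", slug: "physics-space" }]],
]);

const matchedId = resolveCategoryId(scienceRoot, "Physics & Space", subcatsByParent);
const fallbackId = resolveCategoryId(scienceRoot, "Chemistry", subcatsByParent);
```

Using `??` rather than `||` keeps the fallback tied to a missing match only.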
Quality Tier
Verification Checklist
- No depth-conflict slugs remain (all 12 resolved with documented decisions)
- No orphan root categories without seed entries (all 11 resolved)
- `SELECT count(*) FROM topic_categories WHERE depth > 0` returns 80 or more rows (88 after migration 0129: 76 new plus the 12 existing)
- `bun run typecheck` passes
- `bun run lint` passes
- Feed page loads without visual regression
- Cron/evergreen jobs unaffected (still operate at depth 0)
Evaluation Summary
| Challenge | Result |
|---|---|
| 1.0 Resolve Taxonomy Depth Conflicts | PASS |
| 1.1 Materialize Mid-Level Subcategories | PASS |
| 1.2 Create Schema Entries | PASS |
| 1.3 Propagate Challenge Formats | PASS |
| 1.4 maxDepth for getActiveTopicCategories | PASS |
| 1.5 maxDepth for getActiveTopicCategoriesWithSchemas | PASS |
| 1.6 Feed Category Pills Root Only | PASS |
| 1.7 Migrations Index Passing CI | PASS |
| 2.1 Entity Audit Script | PASS |
| 2.2 Entity Classification Phase | PASS |
| 2.3 Entity Insertion Phase | PASS |
| 2.4 Fact Reassignment Phase | PASS |
| 3.1 Curated Entries Use Subcategory IDs | PASS |
Score: 13/13 PASS — A+
Implementation Notes
- Challenge 1.0 must run first — all subsequent challenges depend on a clean, conflict-free taxonomy. The 12 depth-conflict slugs and 11 orphan roots must be resolved before materializing subcategories.
- Migration 0101 already added 29 root categories — Challenge 1.0 is NOT about seeding missing roots (that's done) but about resolving conflicts between the DB state and CATEGORY_SPECS definitions.
- Wave 1 is safe to execute immediately — it adds rows and query parameters without changing existing behavior.
- Wave 2 requires Wave 1 — entity materialization depends on subcategory rows existing.
- Wave 3 requires Wave 2 — curated entries integration depends on entity categories being materialized.
- The `--audit` phase in Wave 2 is a non-destructive, read-only operation that can be run at any time for planning purposes.
- All AI classification in Wave 2 follows the JSONL pipeline pattern established by challenge content generation: write to local files first, review, then insert.
Related Documents
- Challenge Content Seeding TODO — Tracks the sibling challenge content project
- Taxonomy Coherence — Category alias mapping and unmapped category audit
- Challenge Content Rules — CC-007 through CC-011 govern taxonomy
- Seeding Runbook — Operational procedures for seeding scripts