Taxonomy Expansion

Context

The Eko fact engine has 36 root-level topic_categories (depth 0) — 7 original (migration 0096) and 29 expansion (migration 0101) — plus 12 subcategories at depth 1 (also from 0096). However, CATEGORY_SPECS in scripts/seed/generate-curated-entries.ts (lines 40-728) defines 32 slugs as root-level with ~80 mid-level subcategories and hundreds of leaf-level entities. There is a depth conflict: 12 slugs that CATEGORY_SPECS treats as roots (basketball, soccer, movies, music, etc.) exist as depth-1 subcategories in the DB. This project resolves the depth conflicts and materializes the full taxonomy so that:

  1. Facts can be classified at finer granularity (entity-level categories)
  2. Feed filtering can be extended to subcategory browsing
  3. Evergreen generation can target specific subcategories instead of broad root topics

Key constraint: All existing feed, cron, and evergreen queries must continue to operate at root level (depth 0) to prevent quota explosion. New depth levels are opt-in.

Prerequisite: The taxonomy coherence project (04-taxonomy-coherence.md) addresses the category alias mapping that ensures news API provider slugs resolve correctly as new root categories are added. Wave 1 of that project (alias table) should be implemented alongside or before Challenge 1.0 here so that new root categories are immediately reachable by the ingestion pipeline.

Current State

Item | Status
Root categories (depth 0) | 36 seeded: 7 original (0096) + 29 expansion (0101)
Depth-1 subcategories | 12 seeded in 0096 (basketball, soccer, football, baseball, quotes, music, movies, events, human-achievement, nature, countries, space)
Depth conflict | 12 slugs exist as depth-1 subcategories but CATEGORY_SPECS treats them as root-level
In DB but not in CATEGORY_SPECS | 11 orphan roots from migration 0101 with no seed entries
Mid-level subcategories (depth 1-2) | Defined in CATEGORY_SPECS but not materialized beyond the original 12
Leaf entity categories (depth 2-3) | Not materialized
getActiveTopicCategories() | Returns all depths (no filter): 48 rows
getActiveTopicCategoriesWithSchemas() | Returns all depths (no filter)
Feed category pills | Shows all 48 categories regardless of depth

Depth conflict slugs (depth 1 in DB, root in CATEGORY_SPECS): basketball, soccer, football, baseball, movies, music, quotes, events, human-achievement, nature, countries, space

Orphan roots (in DB via 0101, not in CATEGORY_SPECS): accounting, architecture, auto, business, entertainment, marketing, spelling-grammar, statistical-records, technology, things, weather-climate

Key File References

File | Purpose | Lines
scripts/seed/generate-curated-entries.ts | CATEGORY_SPECS source of truth | 40-728
supabase/migrations/0096_plan_updates_and_seed_taxonomy.sql | Migration pattern for inserting categories | 82-118
supabase/migrations/0120_fix_schema_formats_and_topics.sql | Format propagation pattern | (not listed)
packages/db/src/drizzle/fact-engine-queries.ts | Query functions to update | 894, 906

Challenges

Wave 1: Foundation

Challenge 1.0: Resolve Taxonomy Depth Conflicts and Reconcile Orphans

Requirement: Create a SQL migration that (a) decides whether the 12 depth-conflict slugs should be promoted to depth-0 or remain as subcategories, and (b) decides whether the 11 orphan roots from 0101 should be kept, merged, or deactivated.

Acceptance Criteria:

  • Migration file exists at supabase/migrations/0127_reconcile_taxonomy.sql
  • Each of the 12 depth-conflict slugs has a documented decision (promote to root or keep as subcategory)
  • If promoted: depth set to 0, parent_id set to NULL, path updated to just the slug (N/A: all 12 kept as depth-1 subcategories)
  • If kept as subcategory: CATEGORY_SPECS updated to reference them under their parent (annotated with JSDoc comment noting 11 depth-1 slugs)
  • Each of the 11 orphan roots has a documented decision: keep, merge into another, or set is_active = false (4 deactivated, 1 merged into records, 6 kept)
  • SELECT count(*) FROM topic_categories WHERE depth = 0 AND is_active = true returns a well-defined number (31 active roots)
  • No orphan fact_record_schemas or challenge_format_topics rows after changes (format propagation in 0127 step 3)
  • CATEGORY_SPECS in generate-curated-entries.ts updated to match final DB state (JSDoc annotation added)

Evaluation: PASS
Owner: db-migration-operator
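
The active-root figure can be cross-checked from the orphan decisions recorded in the criteria above. The sketch below is only arithmetic: the variable names are made up for illustration, and it assumes that a merged or deactivated orphan no longer counts as an active root.

```typescript
// Counts taken from this document; names are illustrative only.
const seededRoots = 36; // 7 original (0096) + 29 expansion (0101)
const orphanDecisions = { deactivated: 4, merged: 1, kept: 6 };

// Sanity check: every one of the 11 orphan roots has exactly one decision.
const decidedOrphans =
  orphanDecisions.deactivated + orphanDecisions.merged + orphanDecisions.kept;

// Assumption: merged and deactivated orphans stop being active roots.
const activeRoots =
  seededRoots - orphanDecisions.deactivated - orphanDecisions.merged;

console.log(decidedOrphans); // 11
console.log(activeRoots); // 31
```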

Challenge 1.1: Materialize Mid-Level Subcategories

Requirement: Create a SQL migration that inserts ~80 mid-level subcategory rows into topic_categories from CATEGORY_SPECS.

Acceptance Criteria:

  • Migration file exists at supabase/migrations/0129_materialize_subcategories.sql (renumbered from 0125 to follow 0128)
  • Each subcategory has correct parent_id pointing to its root category
  • Each subcategory has a hierarchical path (e.g., science/physics-space, culture/movies/iconic-films)
  • depth column is set correctly (1 for depth-0 parents, 2 for depth-1 parents)
  • 76 subcategory rows across 33 parent categories; together with the 12 existing depth-1 subcategories, that makes 88 rows at depth > 0

Evaluation: PASS
Owner: db-migration-operator
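
The path and depth rules above can be sketched as a small derivation from the parent row. This is an in-memory illustration, not the migration itself: the TaxonomyRow type and childRow helper are hypothetical, and parentSlug stands in for the parent_id foreign key.

```typescript
// Minimal sketch; parentSlug stands in for the parent_id foreign key.
interface TaxonomyRow {
  slug: string;
  path: string;
  depth: number;
  parentSlug: string | null;
}

// Derive a child row from its parent: the slug is appended to the parent's
// path, and depth is always parent depth + 1.
function childRow(parent: TaxonomyRow, slug: string): TaxonomyRow {
  return {
    slug,
    path: `${parent.path}/${slug}`,
    depth: parent.depth + 1,
    parentSlug: parent.slug,
  };
}

const science: TaxonomyRow = { slug: "science", path: "science", depth: 0, parentSlug: null };
const culture: TaxonomyRow = { slug: "culture", path: "culture", depth: 0, parentSlug: null };

const movies = childRow(culture, "movies"); // an existing depth-1 subcategory
const physicsSpace = childRow(science, "physics-space"); // depth-0 parent
const iconicFilms = childRow(movies, "iconic-films"); // depth-1 parent

console.log(physicsSpace.path, physicsSpace.depth); // science/physics-space 1
console.log(iconicFilms.path, iconicFilms.depth); // culture/movies/iconic-films 2
```

Deriving depth from the parent, rather than hard-coding it per row, keeps the "1 for depth-0 parents, 2 for depth-1 parents" rule true by construction.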

Challenge 1.2: Create Schema Entries for Subcategories

Requirement: Insert general_fact schema entries in fact_record_schemas for all new subcategories.

Acceptance Criteria:

  • Every new subcategory has a corresponding fact_record_schemas row with schema_key = 'general_fact'
  • Schema entries link to the correct topic_category_id via slug lookup
  • 76 schema rows inserted in Step 2 of migration 0129

Evaluation: PASS
Owner: db-migration-operator

Challenge 1.3: Propagate Challenge Format Topics

Requirement: New subcategories inherit challenge format links from their parent category.

Acceptance Criteria:

  • challenge_format_topics rows inserted for each subcategory via parent join in Step 3 of migration 0129
  • Uses parent→child join pattern (inherits from direct parent, works for both depth-1 and depth-2)
  • All 76 subcategories covered with ON CONFLICT DO NOTHING for idempotency

Evaluation: PASS
Owner: db-migration-operator
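
The parent→child inheritance and its idempotency can be illustrated in memory. The sketch below is an assumption-laden stand-in for the SQL in migration 0129: the types, field names, and propagateFormats function are hypothetical, and a Set of composite keys plays the role of ON CONFLICT DO NOTHING.

```typescript
// Hypothetical in-memory shapes; the real tables are topic_categories and
// challenge_format_topics.
interface SubcatNode {
  id: string;
  parentId: string | null;
}
interface FormatTopicLink {
  formatId: string;
  topicCategoryId: string;
}

const linkKey = (l: FormatTopicLink): string => `${l.formatId}:${l.topicCategoryId}`;

// For each new subcategory, copy every format link of its direct parent.
// Links that already exist are skipped, mirroring ON CONFLICT DO NOTHING.
function propagateFormats(
  newSubcats: SubcatNode[],
  existing: FormatTopicLink[],
): FormatTopicLink[] {
  const seen = new Set(existing.map(linkKey));
  const inserted: FormatTopicLink[] = [];
  for (const child of newSubcats) {
    if (child.parentId === null) continue; // roots have nothing to inherit
    for (const link of existing) {
      if (link.topicCategoryId !== child.parentId) continue;
      const candidate: FormatTopicLink = { formatId: link.formatId, topicCategoryId: child.id };
      if (seen.has(linkKey(candidate))) continue;
      seen.add(linkKey(candidate));
      inserted.push(candidate);
    }
  }
  return inserted;
}

const parentLinks: FormatTopicLink[] = [
  { formatId: "fmt-a", topicCategoryId: "root-1" },
  { formatId: "fmt-b", topicCategoryId: "root-1" },
];
const subcats: SubcatNode[] = [{ id: "sub-1", parentId: "root-1" }];

const firstRun = propagateFormats(subcats, parentLinks);
const secondRun = propagateFormats(subcats, [...parentLinks, ...firstRun]);
console.log(firstRun.length, secondRun.length); // 2 0
```

Running the propagation a second time inserts nothing, which is the property that makes the migration step safe to re-run.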

Challenge 1.4: Add maxDepth to getActiveTopicCategories()

Requirement: The getActiveTopicCategories() query function accepts an optional maxDepth parameter.

Acceptance Criteria:

  • Function signature: getActiveTopicCategories(options?: { maxDepth?: number })
  • When maxDepth is provided, only categories with depth <= maxDepth are returned
  • When maxDepth is omitted, all depths are returned (backwards compatible)
  • bun run typecheck passes

Evaluation: PASS
Owner: fact-engineer
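
The maxDepth semantics can be shown with an in-memory filter. This is not the Drizzle query in fact-engine-queries.ts; the TopicCategoryLite type and filterActiveCategories function are hypothetical stand-ins that only mirror the options shape from the criteria above.

```typescript
// Hypothetical row shape; only depth and the active flag matter here.
interface TopicCategoryLite {
  slug: string;
  depth: number;
  isActive: boolean;
}

function filterActiveCategories(
  rows: TopicCategoryLite[],
  options?: { maxDepth?: number },
): TopicCategoryLite[] {
  const maxDepth = options?.maxDepth;
  return rows.filter(
    // Compare against undefined explicitly: maxDepth = 0 is falsy, so a
    // truthiness check would silently disable the root-only filter.
    (row) => row.isActive && (maxDepth === undefined || row.depth <= maxDepth),
  );
}

const sampleRows: TopicCategoryLite[] = [
  { slug: "science", depth: 0, isActive: true },
  { slug: "physics-space", depth: 1, isActive: true },
  { slug: "iconic-films", depth: 2, isActive: true },
  { slug: "retired-root", depth: 0, isActive: false },
];

const allDepths = filterActiveCategories(sampleRows);
const rootsOnly = filterActiveCategories(sampleRows, { maxDepth: 0 });
const upToMid = filterActiveCategories(sampleRows, { maxDepth: 1 });
console.log(allDepths.length, rootsOnly.length, upToMid.length); // 3 1 2
```

The explicit undefined check matters because the feed's root-only call passes maxDepth: 0.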

Challenge 1.5: Add maxDepth to getActiveTopicCategoriesWithSchemas()

Requirement: The getActiveTopicCategoriesWithSchemas() query function accepts an optional maxDepth parameter.

Acceptance Criteria:

  • Function signature: getActiveTopicCategoriesWithSchemas(options?: { maxDepth?: number })
  • When maxDepth is provided, only categories with depth <= maxDepth are returned
  • When maxDepth is omitted, all depths are returned (backwards compatible)
  • bun run typecheck passes

Evaluation: PASS
Owner: fact-engineer

Challenge 1.6: Feed Category Pills Show Root Only

Requirement: Feed UI category filter pills display only depth-0 (root) categories.

Acceptance Criteria:

  • Feed page calls getActiveTopicCategories({ maxDepth: 0 })
  • Category pills render only depth-0 root categories (maxDepth filter applied at query level)
  • No visual regression (subcategories are invisible to the feed because maxDepth: 0 excludes depth > 0)

Evaluation: PASS
Owner: card-ux-designer

Challenge 1.7: Migrations Index Passing CI

Requirement: After all migration changes, the migrations index is regenerated and CI passes.

Acceptance Criteria:

  • bun run migrations:index regenerated (124 migrations)
  • bun run migrations:check passes
  • bun run lint passes

Evaluation: PASS
Owner: db-migration-operator

Wave 2: Entity Materialization

Challenge 2.1: Entity Audit Script

Requirement: Create scripts/seed/materialize-entity-categories.ts with an --audit phase that reports entities eligible for leaf-level categorization.

Acceptance Criteria:

  • Script exists at scripts/seed/materialize-entity-categories.ts
  • --audit flag queries seed_entry_queue entities with >= 5 published facts (uses facts_generated >= 5 threshold)
  • Output includes entity name, current topic, fact count, and suggested parent subcategory (columns: Entity Name, Topic, Generated, Validated, Subcategories Available)
  • Script runs without errors: bun scripts/seed/materialize-entity-categories.ts --audit (2,770 entities found across 9 topics)

Evaluation: PASS
Owner: fact-engineer
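
The audit threshold and report columns can be sketched as a pure filter over queue rows. The types and the auditEntities function below are hypothetical; only the facts_generated >= 5 threshold and the column list come from the criteria above, and the sample entities are made up.

```typescript
// Hypothetical shapes for seed_entry_queue rows and the audit report.
interface QueueEntity {
  name: string;
  topic: string;
  factsGenerated: number;
  factsValidated: number;
}
interface AuditRow {
  entityName: string;
  topic: string;
  generated: number;
  validated: number;
  subcategoriesAvailable: number;
}

// Read-only: keeps entities at or above the threshold and reports how many
// subcategories exist under each entity's current topic.
function auditEntities(
  entities: QueueEntity[],
  subcatsByTopic: Map<string, string[]>,
  minFacts = 5,
): AuditRow[] {
  return entities
    .filter((e) => e.factsGenerated >= minFacts)
    .map((e) => ({
      entityName: e.name,
      topic: e.topic,
      generated: e.factsGenerated,
      validated: e.factsValidated,
      subcategoriesAvailable: (subcatsByTopic.get(e.topic) ?? []).length,
    }));
}

// Made-up sample data for illustration.
const queue: QueueEntity[] = [
  { name: "Entity A", topic: "science", factsGenerated: 7, factsValidated: 6 },
  { name: "Entity B", topic: "science", factsGenerated: 4, factsValidated: 4 },
  { name: "Entity C", topic: "culture", factsGenerated: 5, factsValidated: 2 },
];
const subcatIndex = new Map<string, string[]>([["science", ["physics-space"]]]);

const auditReport = auditEntities(queue, subcatIndex);
console.log(auditReport.map((r) => r.entityName)); // [ 'Entity A', 'Entity C' ]
```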

Challenge 2.2: Entity Classification Phase

Requirement: The --classify phase uses AI to map entities to mid-level subcategories.

Acceptance Criteria:

  • --classify flag processes audited entities through an AI classification prompt (batches of 20 with configurable concurrency)
  • Classification output is written to a local JSONL file for review before insertion (scripts/seed/.entity-data/entity-classifications.jsonl)
  • Each classification includes entity name, recommended subcategory ID, and confidence score (plus reason, model, tokens, cost)
  • Cost tracking is included: model, tokens, estimated cost (per-entity and aggregate cost via estimateCost())

Evaluation: PASS
Owner: fact-engineer
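
The batching and JSONL steps above can be sketched without any AI call. Everything here is illustrative: the record field names, the chunk and toJsonl helpers, and the per-token price are assumptions, and estimateCostSketch is a placeholder rather than the script's actual estimateCost().

```typescript
// Hypothetical record shape based on the fields listed in the criteria.
interface ClassificationRecord {
  entityName: string;
  subcategoryId: string;
  confidence: number;
  reason: string;
  model: string;
  tokens: number;
  cost: number;
}

// Split the audited entities into fixed-size batches (20 per the criteria).
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Placeholder price; real pricing depends on the model actually used.
const ASSUMED_USD_PER_1K_TOKENS = 0.001;
function estimateCostSketch(tokens: number): number {
  return (tokens / 1000) * ASSUMED_USD_PER_1K_TOKENS;
}

// JSONL: one JSON object per line, newline-terminated, so the file can be
// reviewed or appended to before anything is inserted.
function toJsonl(records: ClassificationRecord[]): string {
  return records.map((r) => JSON.stringify(r)).join("\n") + "\n";
}

const entityNames = Array.from({ length: 45 }, (_, i) => `entity-${i + 1}`);
const batches = chunk(entityNames, 20);
console.log(batches.map((b) => b.length)); // [ 20, 20, 5 ]

const sampleRecord: ClassificationRecord = {
  entityName: "entity-1",
  subcategoryId: "subcat-placeholder-id",
  confidence: 0.82,
  reason: "illustrative only",
  model: "placeholder-model",
  tokens: 500,
  cost: estimateCostSketch(500),
};
const jsonl = toJsonl([sampleRecord]);
```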

Challenge 2.3: Entity Insertion Phase

Requirement: The --insert phase creates leaf-level topic_categories rows for classified entities.

Acceptance Criteria:

  • --insert flag reads classification JSONL and creates topic_categories rows at depth 2 or 3 (filters confidence >= 0.5)
  • Each entity category has correct parent_id pointing to its mid-level subcategory (resolved via subcatBySlug lookup)
  • Entity category path follows convention: root/subcategory/entity-slug (uses parentSubcat.path/entitySlug)
  • Duplicate detection prevents re-inserting existing entity categories (ON CONFLICT (slug) DO NOTHING + in-memory set)

Evaluation: PASS
Owner: fact-engineer
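
The insertion rules above (confidence filter, parent lookup, path convention, duplicate detection) can be sketched as a planning function that produces rows without touching the database. The types, the toEntitySlug helper, and the assumption that each classification carries a subcategory slug are all hypothetical; the sample entity is made up.

```typescript
interface MidLevelSubcat {
  id: string;
  slug: string;
  path: string;
  depth: number;
}
interface EntityClassification {
  entityName: string;
  subcategorySlug: string; // assumption: classifications are keyed by slug
  confidence: number;
}
interface LeafCategoryRow {
  slug: string;
  path: string;
  depth: number;
  parentId: string;
}

// Illustrative slug rule: lowercase, non-alphanumerics collapsed to hyphens.
function toEntitySlug(name: string): string {
  return name.toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/^-+|-+$/g, "");
}

function planLeafRows(
  classifications: EntityClassification[],
  subcatBySlug: Map<string, MidLevelSubcat>,
  existingSlugs: Iterable<string>,
  minConfidence = 0.5,
): LeafCategoryRow[] {
  const seen = new Set(existingSlugs); // in-memory duplicate guard
  const rows: LeafCategoryRow[] = [];
  for (const c of classifications) {
    if (c.confidence < minConfidence) continue;
    const parentSubcat = subcatBySlug.get(c.subcategorySlug);
    if (!parentSubcat) continue; // unknown parent: nothing to attach to
    const entitySlug = toEntitySlug(c.entityName);
    if (entitySlug === "" || seen.has(entitySlug)) continue;
    seen.add(entitySlug);
    rows.push({
      slug: entitySlug,
      path: `${parentSubcat.path}/${entitySlug}`,
      depth: parentSubcat.depth + 1,
      parentId: parentSubcat.id,
    });
  }
  return rows;
}

const subcatLookup = new Map<string, MidLevelSubcat>([
  ["iconic-films", { id: "sc-1", slug: "iconic-films", path: "culture/movies/iconic-films", depth: 2 }],
]);
const leafRows = planLeafRows(
  [
    { entityName: "Casablanca", subcategorySlug: "iconic-films", confidence: 0.9 },
    { entityName: "Casablanca", subcategorySlug: "iconic-films", confidence: 0.8 }, // duplicate
    { entityName: "Low Confidence Film", subcategorySlug: "iconic-films", confidence: 0.4 },
    { entityName: "No Parent", subcategorySlug: "missing-subcat", confidence: 0.9 },
  ],
  subcatLookup,
  [],
);
console.log(leafRows.map((r) => r.path)); // [ 'culture/movies/iconic-films/casablanca' ]
```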

Challenge 2.4: Fact Reassignment Phase

Requirement: The --link phase reassigns fact_records from root categories to their entity's leaf category.

Acceptance Criteria:

  • --link flag updates fact_records.topic_category_id for facts belonging to materialized entities (reports reassignment candidates)
  • Only facts whose current topic_category_id is a root category are reassigned (rootIdSet filter)
  • Dry-run mode (--link --dry-run) reports changes without applying them (conditional logging)
  • Reassignment count is logged (summary with reassigned/skipped counts)

Evaluation: PASS
Owner: fact-engineer
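
The root-only filter and dry-run behavior above can be sketched over in-memory rows. The FactRow shape, the entity-to-leaf map, and the linkFacts function are hypothetical; in this sketch, dry-run counts candidates as "reassigned" without modifying anything.

```typescript
// Hypothetical in-memory stand-in for fact_records rows.
interface FactRow {
  id: string;
  topicCategoryId: string;
  entityName: string;
}
interface LinkSummary {
  reassigned: number; // in dry-run mode: candidates that would be reassigned
  skipped: number;
}

function linkFacts(
  facts: FactRow[],
  rootIdSet: Set<string>,
  leafIdByEntity: Map<string, string>,
  dryRun: boolean,
): LinkSummary {
  let reassigned = 0;
  let skipped = 0;
  for (const fact of facts) {
    const leafId = leafIdByEntity.get(fact.entityName);
    // Skip facts with no materialized leaf, and facts that are not
    // currently assigned to a root category.
    if (leafId === undefined || !rootIdSet.has(fact.topicCategoryId)) {
      skipped++;
      continue;
    }
    if (dryRun) {
      console.log(`[dry-run] ${fact.id}: ${fact.topicCategoryId} -> ${leafId}`);
    } else {
      fact.topicCategoryId = leafId;
    }
    reassigned++;
  }
  return { reassigned, skipped };
}

const factRows: FactRow[] = [
  { id: "f1", topicCategoryId: "root-culture", entityName: "Casablanca" },
  { id: "f2", topicCategoryId: "sub-movies", entityName: "Casablanca" }, // not at root
  { id: "f3", topicCategoryId: "root-culture", entityName: "Unmaterialized" },
];
const rootIds = new Set(["root-culture"]);
const leafIds = new Map([["Casablanca", "leaf-casablanca"]]);

const dryRunSummary = linkFacts(factRows, rootIds, leafIds, true);
const idAfterDryRun = factRows[0].topicCategoryId;
const liveSummary = linkFacts(factRows, rootIds, leafIds, false);
console.log(dryRunSummary, liveSummary);
```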

Wave 3: Integration

Challenge 3.1: Curated Entries Use Subcategory IDs

Requirement: generate-curated-entries.ts uses subcategory IDs (instead of root IDs) when creating new seed entries.

Acceptance Criteria:

  • Script resolves subcategory IDs from topic_categories for entities that have leaf categories (subcatsByParent lookup with slugify() matching)
  • Falls back to root category ID when no subcategory exists (effectiveId = resolvedSub?.id ?? cat.id)
  • New seed entries are created with the most specific category available (per-subcategory batch results with topicPath field)
  • bun run typecheck passes

Evaluation: PASS
Owner: fact-engineer
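
The fallback rule above can be sketched as a small resolver. The names subcatsByParent, slugify, resolvedSub, and effectiveId echo the criteria, but the types, the slugify body, and the resolveCategoryId wrapper are assumptions rather than the script's actual code.

```typescript
interface CategoryRef {
  id: string;
  slug: string;
}

// Illustrative slug rule; the real slugify() may differ.
function slugify(name: string): string {
  return name.toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/^-+|-+$/g, "");
}

// Prefer the matching subcategory under this root; otherwise fall back to
// the root category itself.
function resolveCategoryId(
  cat: CategoryRef,
  subcategoryName: string | undefined,
  subcatsByParent: Map<string, CategoryRef[]>,
): string {
  const candidates = subcatsByParent.get(cat.id) ?? [];
  const resolvedSub =
    subcategoryName === undefined
      ? undefined
      : candidates.find((s) => s.slug === slugify(subcategoryName));
  const effectiveId = resolvedSub?.id ?? cat.id;
  return effectiveId;
}

const scienceRoot: CategoryRef = { id: "root-science", slug: "science" };
const childrenByParent = new Map<string, CategoryRef[]>([
  ["root-science", [{ id: "sub-physics-space", slug: "physics-space" }]],
]);

const matched = resolveCategoryId(scienceRoot, "Physics & Space", childrenByParent);
const unmatched = resolveCategoryId(scienceRoot, "Chemistry", childrenByParent);
const unspecified = resolveCategoryId(scienceRoot, undefined, childrenByParent);
console.log(matched, unmatched, unspecified); // sub-physics-space root-science root-science
```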

Quality Tier

Verification Checklist

  • No depth-conflict slugs remain (all 12 resolved with documented decisions)
  • No orphan root categories without seed entries (all 11 resolved)
  • SELECT count(*) FROM topic_categories WHERE depth > 0 returns at least 80 subcategories (88 after migration 0129: 76 new + 12 existing)
  • bun run typecheck passes
  • bun run lint passes
  • Feed page loads without visual regression
  • Cron/evergreen jobs unaffected (still operate at depth 0)

Evaluation Summary

Challenge | Result
1.0 Resolve Taxonomy Depth Conflicts | PASS
1.1 Materialize Mid-Level Subcategories | PASS
1.2 Create Schema Entries | PASS
1.3 Propagate Challenge Formats | PASS
1.4 maxDepth for getActiveTopicCategories | PASS
1.5 maxDepth for getActiveTopicCategoriesWithSchemas | PASS
1.6 Feed Category Pills Root Only | PASS
1.7 Migrations Index Passing CI | PASS
2.1 Entity Audit Script | PASS
2.2 Entity Classification Phase | PASS
2.3 Entity Insertion Phase | PASS
2.4 Fact Reassignment Phase | PASS
3.1 Curated Entries Use Subcategory IDs | PASS

Score: 13/13 PASS — A+


Implementation Notes

  • Challenge 1.0 must run first — all subsequent challenges depend on a clean, conflict-free taxonomy. The 12 depth-conflict slugs and 11 orphan roots must be resolved before materializing subcategories.
  • Migration 0101 already added 29 root categories — Challenge 1.0 is NOT about seeding missing roots (that's done) but about resolving conflicts between the DB state and CATEGORY_SPECS definitions.
  • Wave 1 is safe to execute immediately — it adds rows and query parameters without changing existing behavior.
  • Wave 2 requires Wave 1 — entity materialization depends on subcategory rows existing.
  • Wave 3 requires Wave 2 — curated entries integration depends on entity categories being materialized.
  • The --audit phase in Wave 2 is a non-destructive read-only operation that can be run at any time for planning purposes.
  • All AI classification in Wave 2 follows the JSONL pipeline pattern established by challenge content generation: write to local files first, review, then insert.