Taxonomy Coherence
Context
The Eko fact engine's ingestion pipeline has a taxonomy coherence gap: news API providers use their own category taxonomies (e.g., business, entertainment, health) that do not match our internal topic_categories slugs (e.g., finance, culture, science). When getTopicCategoryBySlug() fails to resolve a slug, the story is silently skipped — no facts are extracted.
This project closes the gap with three layers:
- Category alias normalization — Map external provider categories to internal topic_category IDs
- Unmapped category audit trail — Log unresolvable categories instead of silently dropping them
- Subcategory routing — Route facts to the most specific category available (post-taxonomy expansion)
Key constraint: This project is independent of, but complementary to, the taxonomy expansion project (01-taxonomy-expansion.md). Layers 1 and 2 can be implemented immediately; Layer 3 depends on taxonomy expansion Wave 1.
Current State
| Item | Status |
|---|---|
| Internal root categories (depth 0) | 31 active — 7 original + 29 expansion − 5 deactivated (via migrations 0096, 0101, 0127) |
| NewsAPI.org categories | business, entertainment, general, health, science, sports, technology |
| GNews topics | breaking-news, world, nation, business, technology, entertainment, sports, science, health |
| TheNewsAPI categories | business, entertainment, food, general, health, lifestyle, politics, science, sports, tech, travel, world |
| Slug resolution | 3-tier fallback: exact match → provider-specific alias → universal alias (resolveTopicCategory()) |
| Unmapped category handling | Persistent audit trail via unmapped_category_log table (fire-and-forget INSERT) |
| Cron category dispatch | Depth-bounded: getActiveTopicCategories({ maxDepth: 0 }) — safe for subcategory expansion |
| Alias table | topic_category_aliases seeded with ~15 universal mappings (migration 0126) |
Category Mapping Gap Analysis
| Provider Slug | Eko Match? | Resolved Via |
|---|---|---|
| business | Yes | Universal alias → business (expansion root) |
| entertainment | Yes | Universal alias → entertainment (expansion root) |
| general | Yes | Universal alias → current-events |
| health | Yes | Universal alias → science |
| technology | Yes | Direct match (expansion root) |
| nation | Yes | Universal alias → current-events |
| world | Yes | Universal alias → current-events |
| food | Yes | Universal alias → food-beverage (expansion root) |
| lifestyle | Yes | Universal alias → culture |
| politics | Yes | Universal alias → governments (expansion root) |
| tech | Yes | Universal alias → technology (expansion root) |
| travel | Yes | Direct match (expansion root) |
| breaking-news | Yes | Universal alias → current-events |
| sports | Yes | Direct match |
| science | Yes | Direct match |
Result: All 15 unique provider slugs now resolve — 4 via direct match, 11 via universal aliases in topic_category_aliases. Unmapped slugs are logged to unmapped_category_log for audit.
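The coverage figures above (4 direct matches plus 11 universal aliases, 15 unique slugs in total) can be re-derived with a small in-memory sketch. The slug lists and mappings are copied from the tables in this document. The provider keys (newsapi, gnews, thenewsapi) and the function name summarizeCoverage are hypothetical, and the real lookup goes through topic_category_aliases rather than a hard-coded map.

```typescript
// Provider category lists, copied from the Current State table.
// The provider names used as keys are illustrative, not confirmed identifiers.
const PROVIDERS: { name: string; slugs: string[] }[] = [
  { name: "newsapi", slugs: ["business", "entertainment", "general", "health", "science", "sports", "technology"] },
  { name: "gnews", slugs: ["breaking-news", "world", "nation", "business", "technology", "entertainment", "sports", "science", "health"] },
  { name: "thenewsapi", slugs: ["business", "entertainment", "food", "general", "health", "lifestyle", "politics", "science", "sports", "tech", "travel", "world"] },
];

// Provider slugs that equal an internal topic_categories slug.
const DIRECT_MATCHES = new Set<string>(["technology", "travel", "sports", "science"]);

// Universal aliases (provider slug -> internal slug), from the gap analysis table.
const UNIVERSAL_ALIASES = new Map<string, string>([
  ["business", "business"],
  ["entertainment", "entertainment"],
  ["general", "current-events"],
  ["health", "science"],
  ["nation", "current-events"],
  ["world", "current-events"],
  ["food", "food-beverage"],
  ["lifestyle", "culture"],
  ["politics", "governments"],
  ["tech", "technology"],
  ["breaking-news", "current-events"],
]);

interface Coverage {
  total: number;
  direct: number;
  aliased: number;
  unresolved: string[];
}

// Counts each unique provider slug once and classifies how it resolves.
function summarizeCoverage(): Coverage {
  const seen = new Set<string>();
  const result: Coverage = { total: 0, direct: 0, aliased: 0, unresolved: [] };
  for (const provider of PROVIDERS) {
    for (const slug of provider.slugs) {
      if (seen.has(slug)) continue;
      seen.add(slug);
      result.total += 1;
      if (DIRECT_MATCHES.has(slug)) result.direct += 1;
      else if (UNIVERSAL_ALIASES.has(slug)) result.aliased += 1;
      else result.unresolved.push(slug);
    }
  }
  return result;
}
```

A slug that lands in unresolved here corresponds to one that would be written to unmapped_category_log in the real pipeline.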
Key File References
| File | Purpose | Lines |
|---|---|---|
| apps/worker-facts/src/handlers/extract-facts.ts | Topic resolution fallback chain | 62-81 |
| packages/db/src/drizzle/fact-engine-queries.ts | getTopicCategoryBySlug() exact match | 772-780 |
| packages/db/src/drizzle/fact-engine-queries.ts | getActiveTopicCategories() no depth filter | 896-902 |
| apps/web/app/api/cron/ingest-news/route.ts | Cron dispatches per category per provider | 64-85 |
| apps/worker-ingest/src/handlers/ingest-news.ts | Provider fetch functions and category params | 39-304 |
| packages/db/src/drizzle/schema.ts | stories.category field (free text) | 747 |
Challenges
Wave 1: Category Alias Table
Challenge 1.1: Create topic_category_aliases Migration
Requirement: Create a SQL migration that adds a topic_category_aliases table for mapping external provider slugs to internal topic_category IDs.
Acceptance Criteria:
- Migration file exists at supabase/migrations/0126_add_topic_category_aliases.sql
- Table has columns: id UUID PK, external_slug TEXT NOT NULL, provider TEXT (nullable = universal), topic_category_id UUID NOT NULL REFERENCES topic_categories(id), created_at TIMESTAMPTZ
- Unique constraint on (external_slug, provider) with COALESCE for NULL provider
- RLS enabled: public SELECT for authenticated/anon, service_role ALL
- Seed data maps all known provider slugs to existing topic_categories
- SELECT count(*) FROM topic_category_aliases returns >= 15 rows
Evaluation: PASS
Owner: db-migration-operator
Challenge 1.2: Add Drizzle Schema for Aliases
Requirement: Add topicCategoryAliases table definition to the Drizzle schema.
Acceptance Criteria:
- topicCategoryAliases table defined in packages/db/src/drizzle/schema.ts
- Relations defined: topicCategoryAliases → topicCategories
- bun run typecheck passes
Evaluation: PASS
Owner: db-migration-operator
Challenge 1.3: Create resolveTopicCategory() Query
Requirement: Replace direct getTopicCategoryBySlug() calls with a new resolveTopicCategory() that tries alias fallback.
Acceptance Criteria:
- Function exists in packages/db/src/drizzle/fact-engine-queries.ts
- Signature: resolveTopicCategory(slug: string, options?: { provider?: string; storyId?: string })
- Resolution order: (1) exact slug match, (2) provider-specific alias, (3) universal alias
- Returns the resolved topic_category row or null
- Exported from @eko/db
- bun run typecheck passes
Evaluation: PASS
Owner: ingest-engineer
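The resolution order in the criteria above can be sketched against in-memory arrays. This is a minimal illustration only: the real function queries the database through Drizzle, and the name resolveTopicCategorySketch, the interfaces, and the sample rows (including the gnews-specific alias) are all hypothetical.

```typescript
interface TopicCategory {
  id: string;
  slug: string;
}

interface CategoryAlias {
  externalSlug: string;
  provider: string | null; // null = universal alias
  topicCategoryId: string;
}

// Three-tier lookup: exact slug, then provider-specific alias, then universal alias.
function resolveTopicCategorySketch(
  slug: string,
  categories: TopicCategory[],
  aliases: CategoryAlias[],
  options: { provider?: string } = {},
): TopicCategory | null {
  // (1) Exact match on the internal slug.
  const exact = categories.find((c) => c.slug === slug);
  if (exact) return exact;

  const byId = (id: string): TopicCategory | null =>
    categories.find((c) => c.id === id) ?? null;

  // (2) Alias scoped to the calling provider, if one was supplied.
  if (options.provider) {
    const specific = aliases.find(
      (a) => a.externalSlug === slug && a.provider === options.provider,
    );
    if (specific) return byId(specific.topicCategoryId);
  }

  // (3) Universal alias (provider is null).
  const universal = aliases.find(
    (a) => a.externalSlug === slug && a.provider === null,
  );
  return universal ? byId(universal.topicCategoryId) : null;
}

// Illustrative data; IDs and the gnews-specific row are made up for the example.
const sampleCategories: TopicCategory[] = [
  { id: "c-science", slug: "science" },
  { id: "c-culture", slug: "culture" },
];
const sampleAliases: CategoryAlias[] = [
  { externalSlug: "health", provider: null, topicCategoryId: "c-science" },
  { externalSlug: "health", provider: "gnews", topicCategoryId: "c-culture" },
];
```

With this data, "science" resolves by exact match, "health" resolves to science through the universal alias, and "health" with provider "gnews" resolves to culture because the provider-specific row is checked first.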
Challenge 1.4: Update extract-facts to Use resolveTopicCategory()
Requirement: The EXTRACT_FACTS handler uses resolveTopicCategory() instead of getTopicCategoryBySlug().
Acceptance Criteria:
- extract-facts.ts line 71 calls resolveTopicCategory(story.category, { storyId: story_id }) instead of getTopicCategoryBySlug(story.category)
- Provider information available on the story record (joined through news_sources)
- Existing behavior preserved when alias table is empty
- bun run typecheck passes
Evaluation: PASS
Owner: ingest-engineer
Wave 2: Unmapped Category Audit
Challenge 2.1: Create unmapped_category_log Table
Requirement: Create a table to persistently log categories that could not be resolved.
Acceptance Criteria:
- Table added to the same migration as aliases (0126_add_topic_category_aliases.sql)
- Columns: id UUID PK, external_slug TEXT NOT NULL, provider TEXT, story_id UUID REFERENCES stories(id), logged_at TIMESTAMPTZ DEFAULT NOW()
- Index on (external_slug, provider) for aggregation queries
- RLS: service_role ALL only (internal audit data)
Evaluation: PASS
Owner: db-migration-operator
Challenge 2.2: Log Unmapped Categories in resolveTopicCategory()
Requirement: When resolveTopicCategory() returns null, insert a row into unmapped_category_log.
Acceptance Criteria:
- resolveTopicCategory() accepts optional storyId parameter for audit logging
- On null resolution, inserts audit log row (fire-and-forget, does not block)
- Does not throw on logging failure (.catch(() => {}))
- bun run typecheck passes
Evaluation: PASS
Owner: ingest-engineer
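The fire-and-forget requirement above can be illustrated with a small helper. This is a sketch under assumptions: the InsertUnmapped type and logUnmappedCategory name are hypothetical stand-ins for whatever performs the INSERT into unmapped_category_log, and the row shape mirrors the table columns only loosely.

```typescript
interface UnmappedCategoryRow {
  externalSlug: string;
  provider: string | null;
  storyId: string | null;
}

// Hypothetical insert function; the real one would write to unmapped_category_log.
type InsertUnmapped = (row: UnmappedCategoryRow) => Promise<void>;

// Starts the insert without awaiting it and swallows any failure,
// so audit logging can never block or break category resolution.
function logUnmappedCategory(insert: InsertUnmapped, row: UnmappedCategoryRow): void {
  // Starting from Promise.resolve() routes a synchronous throw inside insert()
  // into the same .catch() that handles asynchronous rejections.
  void Promise.resolve()
    .then(() => insert(row))
    .catch(() => {});
}
```

The trade-off is that a failed insert leaves no trace, so the audit trail may undercount when the database is unavailable.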
Challenge 2.3: Unmapped Category Audit Script
Requirement: Create a script to report unmapped categories for manual alias creation.
Acceptance Criteria:
- Script exists at scripts/audit-unmapped-categories.ts
- Groups by (external_slug, provider) with count and most recent logged_at
- Output sorted by count descending (most-dropped categories first)
- Optionally outputs SQL INSERT statements for quick alias creation
- bun scripts/audit-unmapped-categories.ts runs without errors
Evaluation: PENDING
Owner: ingest-engineer
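Since this challenge is still pending, the following is only one possible shape for the script's core logic, working on rows already fetched from unmapped_category_log. The interfaces and the function names summarizeUnmapped, sqlLiteral, and toAliasInsertSql are hypothetical; the generated statement uses the column names listed for topic_category_aliases and leaves the target category as a placeholder for a human to fill in.

```typescript
interface UnmappedLogEntry {
  externalSlug: string;
  provider: string | null;
  loggedAt: Date;
}

interface UnmappedGroup {
  externalSlug: string;
  provider: string | null;
  count: number;
  lastLoggedAt: Date;
}

// Groups by (external_slug, provider), tracking count and most recent logged_at,
// then sorts by count descending so the most-dropped categories come first.
function summarizeUnmapped(entries: UnmappedLogEntry[]): UnmappedGroup[] {
  const byKey = new Map<string, UnmappedGroup>();
  const groups: UnmappedGroup[] = [];
  for (const entry of entries) {
    const key = JSON.stringify([entry.externalSlug, entry.provider]);
    const existing = byKey.get(key);
    if (existing) {
      existing.count += 1;
      if (entry.loggedAt.getTime() > existing.lastLoggedAt.getTime()) {
        existing.lastLoggedAt = entry.loggedAt;
      }
    } else {
      const group: UnmappedGroup = {
        externalSlug: entry.externalSlug,
        provider: entry.provider,
        count: 1,
        lastLoggedAt: entry.loggedAt,
      };
      byKey.set(key, group);
      groups.push(group);
    }
  }
  return groups.sort((a, b) => b.count - a.count);
}

// Renders a string as a SQL literal, doubling single quotes; null becomes NULL.
function sqlLiteral(value: string | null): string {
  return value === null ? "NULL" : `'${value.replace(/'/g, "''")}'`;
}

// Emits a template INSERT; '<topic_category_id>' must be replaced manually.
function toAliasInsertSql(group: UnmappedGroup): string {
  return (
    "INSERT INTO topic_category_aliases (external_slug, provider, topic_category_id) " +
    `VALUES (${sqlLiteral(group.externalSlug)}, ${sqlLiteral(group.provider)}, '<topic_category_id>');`
  );
}
```

In practice the grouping could equally be done in SQL with GROUP BY; doing it in TypeScript here keeps the sketch independent of the query layer.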
Wave 3: Depth-Aware Routing
Challenge 3.1: Cron Uses maxDepth for Category Dispatch
Requirement: The ingest-news cron route dispatches messages only for root-level (depth 0) categories.
Acceptance Criteria:
- Cron calls getActiveTopicCategories({ maxDepth: 0 }) (ingest-news, topic-quotas, generate-evergreen)
- Message count per cron run stays bounded at ~31 active roots × providers
- Existing ingestion behavior unchanged for current root categories
Evaluation: PASS
Owner: cron-scheduler
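The bound in the criteria above is simple arithmetic: messages per run equal the number of active categories at or below maxDepth multiplied by the number of providers. The sketch below shows why adding subcategories does not change that count when maxDepth is 0. The Category shape and both function names are hypothetical; only the maxDepth option comes from this document.

```typescript
interface Category {
  slug: string;
  depth: number; // 0 = root
  isActive: boolean;
}

// In-memory stand-in for a depth-bounded category query.
function filterActiveByMaxDepth(categories: Category[], maxDepth?: number): Category[] {
  return categories.filter(
    (c) => c.isActive && (maxDepth === undefined || c.depth <= maxDepth),
  );
}

// One message per (category, provider) pair.
function countDispatchMessages(
  categories: Category[],
  providers: string[],
  maxDepth?: number,
): number {
  return filterActiveByMaxDepth(categories, maxDepth).length * providers.length;
}
```

For example, with 31 active roots and the three providers listed under Current State, a maxDepth of 0 yields 31 × 3 = 93 messages per run regardless of how many subcategories exist.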
Challenge 3.2: Provider-Aware Category Dispatch
Requirement: The cron route sends only categories that the specific provider supports, using the alias table in reverse.
Acceptance Criteria:
- New query: getProviderCategories(provider: string) returns slugs this provider understands
- Cron sends INGEST_NEWS messages with provider-native slugs, not internal slugs
- Reduces wasted API calls (e.g., no sending the internal slug records to NewsAPI, which only understands 7 categories)
- bun run typecheck passes
Evaluation: PENDING
Owner: cron-scheduler
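This challenge is pending, so the following is one possible reading of "using the alias table in reverse", not a description of an existing implementation. It assumes the alias table carries provider-specific rows: a universal row (provider is NULL) says nothing about which provider supports a slug, so the sketch ignores those rows. The function names, the row shape, and the message shape are hypothetical.

```typescript
interface ProviderAliasRow {
  externalSlug: string;
  provider: string | null; // null = universal alias
}

// Returns the unique external slugs recorded for one provider, sorted for stable output.
function getProviderCategoriesSketch(aliases: ProviderAliasRow[], provider: string): string[] {
  const slugs: string[] = [];
  for (const alias of aliases) {
    if (alias.provider === provider && slugs.indexOf(alias.externalSlug) === -1) {
      slugs.push(alias.externalSlug);
    }
  }
  return slugs.sort();
}

// Hypothetical message shape; only the INGEST_NEWS type name comes from this document.
interface IngestNewsMessage {
  type: "INGEST_NEWS";
  provider: string;
  category: string; // provider-native slug
}

// Builds one message per provider-native slug instead of one per internal root category.
function buildIngestMessages(aliases: ProviderAliasRow[], providers: string[]): IngestNewsMessage[] {
  const messages: IngestNewsMessage[] = [];
  for (const provider of providers) {
    for (const category of getProviderCategoriesSketch(aliases, provider)) {
      messages.push({ type: "INGEST_NEWS", provider, category });
    }
  }
  return messages;
}
```

If the table keeps only universal rows, as the current seed data does, a separate per-provider list of supported slugs would be needed for this approach to work.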
Challenge 3.3: Subcategory Routing for Extracted Facts
Requirement: After fact extraction, optionally refine topic_category_id from root to most specific matching subcategory.
Acceptance Criteria:
- Post-extraction step checks if entity matches a leaf-level topic_categories row
- Uses entity name matching against topic_categories.slug at depth > 0
- Falls back to root category if no subcategory match found
- Enabled via feature flag (not active by default)
- bun run typecheck passes
Evaluation: PENDING
Owner: fact-engineer
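As this challenge is pending, the routing step can only be sketched under assumptions. The version below normalizes the entity name to a slug, looks for a category at depth > 0 with that slug, and otherwise keeps the root. The slugify rule, the CategoryNode shape, and the function names are hypothetical, and the enabled parameter stands in for the feature flag.

```typescript
interface CategoryNode {
  id: string;
  slug: string;
  depth: number; // 0 = root, > 0 = subcategory
}

// Assumed normalization: lowercase, non-alphanumeric runs become single hyphens.
function slugify(name: string): string {
  return name
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Returns the ID of a matching subcategory, or the root ID when the flag is off
// or no subcategory slug matches the entity name.
function refineCategoryId(
  rootCategoryId: string,
  entityName: string,
  categories: CategoryNode[],
  enabled: boolean = false, // feature flag, off by default
): string {
  if (!enabled) return rootCategoryId;
  const candidate = slugify(entityName);
  const match = categories.find((c) => c.depth > 0 && c.slug === candidate);
  return match ? match.id : rootCategoryId;
}
```

The sketch does not check that the matched subcategory is a descendant of the given root; the criteria above do not say whether that restriction is intended, so a real implementation would need to decide.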
Quality Tier
Verification Checklist
- SELECT count(*) FROM topic_category_aliases returns >= 15 rows
- resolveTopicCategory('business') returns a valid topic_category (not null)
- resolveTopicCategory('nonexistent') returns null and logs to unmapped_category_log
- bun run typecheck passes
- bun run lint passes
- Existing fact extraction for stories with known categories still works
- No increase in silently dropped stories after alias table is populated
Evaluation Summary
| Challenge | Result |
|---|---|
| 1.1 Create topic_category_aliases Migration | PASS |
| 1.2 Add Drizzle Schema for Aliases | PASS |
| 1.3 Create resolveTopicCategory() Query | PASS |
| 1.4 Update extract-facts Handler | PASS |
| 2.1 Create unmapped_category_log Table | PASS |
| 2.2 Log Unmapped Categories | PASS |
| 2.3 Unmapped Category Audit Script | PENDING |
| 3.1 Cron Uses maxDepth | PASS |
| 3.2 Provider-Aware Category Dispatch | PENDING |
| 3.3 Subcategory Routing | PENDING |
Score: 7/10 PASS
Implementation Notes
- Wave 1 COMPLETE: alias table (0126_add_topic_category_aliases.sql), Drizzle schema, resolveTopicCategory(), and extract-facts integration all implemented and pushed to production.
- Wave 2 partially complete: unmapped_category_log table and logging are live; audit script (2.3) still needed for operational visibility.
- Challenge 3.1 COMPLETE: all 3 cron routes use maxDepth: 0, safe for subcategory expansion.
- Challenge 3.2 (provider-aware dispatch): optimization for reducing wasted API calls; not blocking.
- Challenge 3.3 depends on taxonomy expansion Wave 2: subcategory routing requires subcategory rows to exist.
- The alias table should be re-seeded whenever new root categories are added (taxonomy expansion Challenge 1.0) to map provider slugs to more specific categories.
Related Documents
- Taxonomy Expansion — Materializes the full category hierarchy
- Challenge Content Seeding TODO — Parent tracking document
- Challenge Content Rules — CC-010 and CC-011 govern coherence