Taxonomy Coherence

Context

The Eko fact engine's ingestion pipeline has a taxonomy coherence gap: news API providers use their own category taxonomies (e.g., business, entertainment, health) that do not match our internal topic_categories slugs (e.g., finance, culture, science). When getTopicCategoryBySlug() fails to resolve a slug, the story is silently skipped — no facts are extracted.

This project closes the gap with three layers:

  1. Category alias normalization — Map external provider categories to internal topic_category IDs
  2. Unmapped category audit trail — Log unresolvable categories instead of silently dropping them
  3. Subcategory routing — Route facts to the most specific category available (post-taxonomy expansion)

Key constraint: This project is independent of, but complementary to, the taxonomy expansion project (01-taxonomy-expansion.md). Layers 1 and 2 can be implemented immediately; Layer 3 depends on taxonomy expansion Wave 1.

Current State

| Item | Status |
| --- | --- |
| Internal root categories (depth 0) | 31 active: 7 original + 29 expansion − 5 deactivated (via migrations 0096, 0101, 0127) |
| NewsAPI.org categories | business, entertainment, general, health, science, sports, technology |
| GNews topics | breaking-news, world, nation, business, technology, entertainment, sports, science, health |
| TheNewsAPI categories | business, entertainment, food, general, health, lifestyle, politics, science, sports, tech, travel, world |
| Slug resolution | 3-tier fallback: exact match → provider-specific alias → universal alias (resolveTopicCategory()) |
| Unmapped category handling | Persistent audit trail via unmapped_category_log table (fire-and-forget INSERT) |
| Cron category dispatch | Depth-bounded: getActiveTopicCategories({ maxDepth: 0 }), safe for subcategory expansion |
| Alias table | topic_category_aliases seeded with ~15 universal mappings (migration 0126) |

Category Mapping Gap Analysis

| Provider Slug | Eko Match? | Resolved Via |
| --- | --- | --- |
| business | Yes | Universal alias → business (expansion root) |
| entertainment | Yes | Universal alias → entertainment (expansion root) |
| general | Yes | Universal alias → current-events |
| health | Yes | Universal alias → science |
| technology | Yes | Direct match (expansion root) |
| nation | Yes | Universal alias → current-events |
| world | Yes | Universal alias → current-events |
| food | Yes | Universal alias → food-beverage (expansion root) |
| lifestyle | Yes | Universal alias → culture |
| politics | Yes | Universal alias → governments (expansion root) |
| tech | Yes | Universal alias → technology (expansion root) |
| travel | Yes | Direct match (expansion root) |
| breaking-news | Yes | Universal alias → current-events |
| sports | Yes | Direct match |
| science | Yes | Direct match |

Result: All 15 unique provider slugs now resolve — 4 via direct match, 11 via universal aliases in topic_category_aliases. Unmapped slugs are logged to unmapped_category_log for audit.
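The tally above can be re-derived mechanically. The following is a minimal sketch that transcribes the two tables into plain data and counts unique slugs; the provider keys (newsapi, gnews, thenewsapi) and the function name are illustrative assumptions, not identifiers from the codebase.

```typescript
// Provider category lists, transcribed from the Current State table.
// The keys are illustrative identifiers, not names taken from the codebase.
const providerSlugs: Record<string, string[]> = {
  newsapi: ["business", "entertainment", "general", "health", "science", "sports", "technology"],
  gnews: ["breaking-news", "world", "nation", "business", "technology", "entertainment", "sports", "science", "health"],
  thenewsapi: ["business", "entertainment", "food", "general", "health", "lifestyle", "politics", "science", "sports", "tech", "travel", "world"],
};

// How each slug resolves, transcribed from the gap analysis table.
const resolvedVia: Record<string, "direct" | "alias"> = {
  business: "alias", entertainment: "alias", general: "alias", health: "alias",
  technology: "direct", nation: "alias", world: "alias", food: "alias",
  lifestyle: "alias", politics: "alias", tech: "alias", travel: "direct",
  "breaking-news": "alias", sports: "direct", science: "direct",
};

interface CoverageReport {
  total: number;
  direct: number;
  alias: number;
  unmapped: string[];
}

// Hypothetical helper: de-duplicates slugs across providers and tallies them.
function coverageReport(): CoverageReport {
  const unique = new Set<string>();
  for (const slugs of Object.values(providerSlugs)) {
    for (const slug of slugs) unique.add(slug);
  }
  const report: CoverageReport = { total: unique.size, direct: 0, alias: 0, unmapped: [] };
  unique.forEach((slug) => {
    const via = resolvedVia[slug];
    if (via === "direct") report.direct += 1;
    else if (via === "alias") report.alias += 1;
    else report.unmapped.push(slug); // slug missing from the table
  });
  return report;
}
```

Calling coverageReport() on this data yields 15 unique slugs, 4 direct and 11 alias, with nothing unmapped. A check like this could be rerun whenever a provider adds categories.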

Key File References

| File | Purpose | Lines |
| --- | --- | --- |
| apps/worker-facts/src/handlers/extract-facts.ts | Topic resolution fallback chain | 62-81 |
| packages/db/src/drizzle/fact-engine-queries.ts | getTopicCategoryBySlug(), exact match | 772-780 |
| packages/db/src/drizzle/fact-engine-queries.ts | getActiveTopicCategories(), no depth filter | 896-902 |
| apps/web/app/api/cron/ingest-news/route.ts | Cron dispatches per category per provider | 64-85 |
| apps/worker-ingest/src/handlers/ingest-news.ts | Provider fetch functions and category params | 39-304 |
| packages/db/src/drizzle/schema.ts | stories.category field (free text) | 747 |

Challenges

Wave 1: Category Alias Table

Challenge 1.1: Create topic_category_aliases Migration

Requirement: Create a SQL migration that adds a topic_category_aliases table for mapping external provider slugs to internal topic_category IDs.

Acceptance Criteria:

  • Migration file exists at supabase/migrations/0126_add_topic_category_aliases.sql
  • Table has columns: id UUID PK, external_slug TEXT NOT NULL, provider TEXT (nullable = universal), topic_category_id UUID NOT NULL REFERENCES topic_categories(id), created_at TIMESTAMPTZ
  • Unique constraint on (external_slug, provider) with COALESCE for NULL provider
  • RLS enabled: public SELECT for authenticated/anon, service_role ALL
  • Seed data maps all known provider slugs to existing topic_categories
  • SELECT count(*) FROM topic_category_aliases returns >= 15 rows

Evaluation: PASS
Owner: db-migration-operator
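The COALESCE in the unique constraint matters because PostgreSQL treats NULLs as distinct in unique indexes by default, so two universal rows (provider NULL) for the same slug would otherwise both be accepted. The sketch below mimics that uniqueness key in memory; the empty-string sentinel, the row shape, and the function names are assumptions for illustration and may differ from what the migration actually uses.

```typescript
// Illustrative row shape; field names mirror the columns listed above.
interface AliasRow {
  externalSlug: string;
  provider: string | null; // null = universal alias
  topicCategoryId: string;
}

// Mirrors a key like (external_slug, COALESCE(provider, '')).
// The '' sentinel is an assumption, not confirmed by the migration.
function aliasKey(row: Pick<AliasRow, "externalSlug" | "provider">): string {
  return JSON.stringify([row.externalSlug, row.provider ?? ""]);
}

// Hypothetical helper: returns the keys that appear more than once,
// i.e. the rows the unique constraint would reject.
function findDuplicateAliases(rows: AliasRow[]): string[] {
  const seen = new Set<string>();
  const duplicates: string[] = [];
  for (const row of rows) {
    const key = aliasKey(row);
    if (seen.has(key)) duplicates.push(key);
    else seen.add(key);
  }
  return duplicates;
}
```

A universal row and a provider-specific row for the same slug produce different keys and can coexist, which is what the 3-tier fallback relies on.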

Challenge 1.2: Add Drizzle Schema for Aliases

Requirement: Add topicCategoryAliases table definition to the Drizzle schema.

Acceptance Criteria:

  • topicCategoryAliases table defined in packages/db/src/drizzle/schema.ts
  • Relations defined: topicCategoryAliases → topicCategories
  • bun run typecheck passes

Evaluation: PASS
Owner: db-migration-operator

Challenge 1.3: Create resolveTopicCategory() Query

Requirement: Replace direct getTopicCategoryBySlug() calls with a new resolveTopicCategory() that tries alias fallback.

Acceptance Criteria:

  • Function exists in packages/db/src/drizzle/fact-engine-queries.ts
  • Signature: resolveTopicCategory(slug: string, options?: { provider?: string; storyId?: string })
  • Resolution order: (1) exact slug match, (2) provider-specific alias, (3) universal alias
  • Returns the resolved topic_category row or null
  • Exported from @eko/db
  • bun run typecheck passes

Evaluation: PASS
Owner: ingest-engineer
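The resolution order can be illustrated without a database. This is a minimal in-memory sketch of the three tiers, not the production query: resolveInMemory, the row shapes, and the sample data are all hypothetical, and the real function is presumably asynchronous and backed by Drizzle.

```typescript
interface TopicCategory {
  id: string;
  slug: string;
}

interface Alias {
  externalSlug: string;
  provider: string | null; // null = universal alias
  topicCategoryId: string;
}

// Hypothetical in-memory stand-in for the 3-tier fallback.
function resolveInMemory(
  slug: string,
  categories: TopicCategory[],
  aliases: Alias[],
  options: { provider?: string } = {},
): TopicCategory | null {
  // Tier 1: exact slug match against internal categories.
  const exact = categories.find((c) => c.slug === slug);
  if (exact) return exact;

  const byId = (id: string): TopicCategory | null =>
    categories.find((c) => c.id === id) ?? null;

  // Tier 2: alias scoped to the calling provider, if one was given.
  if (options.provider !== undefined) {
    const specific = aliases.find(
      (a) => a.externalSlug === slug && a.provider === options.provider,
    );
    if (specific) return byId(specific.topicCategoryId);
  }

  // Tier 3: universal alias (provider is null).
  const universal = aliases.find((a) => a.externalSlug === slug && a.provider === null);
  return universal ? byId(universal.topicCategoryId) : null;
}
```

With an empty alias list the function reduces to the exact-match lookup, which is the backward-compatible behavior Challenge 1.4 asks for.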

Challenge 1.4: Update extract-facts to Use resolveTopicCategory()

Requirement: The EXTRACT_FACTS handler uses resolveTopicCategory() instead of getTopicCategoryBySlug().

Acceptance Criteria:

  • extract-facts.ts line 71 calls resolveTopicCategory(story.category, { storyId: story_id }) instead of getTopicCategoryBySlug(story.category)
  • Provider information available on the story record (joined through news_sources)
  • Existing behavior preserved when alias table is empty
  • bun run typecheck passes

Evaluation: PASS
Owner: ingest-engineer

Wave 2: Unmapped Category Audit

Challenge 2.1: Create unmapped_category_log Table

Requirement: Create a table to persistently log categories that could not be resolved.

Acceptance Criteria:

  • Table added to the same migration as aliases (0126_add_topic_category_aliases.sql)
  • Columns: id UUID PK, external_slug TEXT NOT NULL, provider TEXT, story_id UUID REFERENCES stories(id), logged_at TIMESTAMPTZ DEFAULT NOW()
  • Index on (external_slug, provider) for aggregation queries
  • RLS: service_role ALL only (internal audit data)

Evaluation: PASS
Owner: db-migration-operator

Challenge 2.2: Log Unmapped Categories in resolveTopicCategory()

Requirement: When resolveTopicCategory() returns null, insert a row into unmapped_category_log.

Acceptance Criteria:

  • resolveTopicCategory() accepts optional storyId parameter for audit logging
  • On null resolution, inserts audit log row (fire-and-forget, does not block)
  • Does not throw on logging failure (.catch(() => {}))
  • bun run typecheck passes

Evaluation: PASS
Owner: ingest-engineer
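The fire-and-forget pattern described above can be sketched as follows. The insert function is injected so the example runs without a database; UnmappedEntry, InsertLog, and both helper names are hypothetical, and the real code presumably issues a Drizzle INSERT instead.

```typescript
interface UnmappedEntry {
  externalSlug: string;
  provider: string | null;
  storyId: string | null;
}

// Stand-in for the real INSERT into unmapped_category_log.
type InsertLog = (entry: UnmappedEntry) => Promise<void>;

// Starts the insert without awaiting it and swallows any failure,
// so a logging outage can never break category resolution.
function logUnmappedFireAndForget(insert: InsertLog, entry: UnmappedEntry): void {
  try {
    void insert(entry).catch(() => {});
  } catch {
    // A synchronous throw from insert() is ignored as well.
  }
}

// Hypothetical wrapper: logs only when resolution produced null.
function resolveWithAudit<T>(
  resolved: T | null,
  insert: InsertLog,
  entry: UnmappedEntry,
): T | null {
  if (resolved === null) logUnmappedFireAndForget(insert, entry);
  return resolved;
}
```

The trade-off is that a failed insert leaves no trace at all, so an outage of the audit table would be invisible unless the catch handler is later extended to emit a metric or log line.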

Challenge 2.3: Unmapped Category Audit Script

Requirement: Create a script to report unmapped categories for manual alias creation.

Acceptance Criteria:

  • Script exists at scripts/audit-unmapped-categories.ts
  • Groups by (external_slug, provider) with count and most recent logged_at
  • Output sorted by count descending (most-dropped categories first)
  • Optionally outputs SQL INSERT statements for quick alias creation
  • bun scripts/audit-unmapped-categories.ts runs without errors

Evaluation: PENDING
Owner: ingest-engineer
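Since this script is still pending, the following is only a sketch of the aggregation it describes, operating on rows already fetched from unmapped_category_log. All type and function names are hypothetical, and the target category ID in the generated INSERT is a placeholder that an operator would have to supply.

```typescript
interface LogRow {
  externalSlug: string;
  provider: string | null;
  loggedAt: Date;
}

interface AuditGroup {
  externalSlug: string;
  provider: string | null;
  count: number;
  lastLoggedAt: Date;
}

// Groups by (external_slug, provider), tracks count and most recent logged_at,
// and sorts by count descending.
function summarizeUnmapped(rows: LogRow[]): AuditGroup[] {
  const groups = new Map<string, AuditGroup>();
  for (const row of rows) {
    const key = JSON.stringify([row.externalSlug, row.provider]);
    const group = groups.get(key);
    if (!group) {
      groups.set(key, {
        externalSlug: row.externalSlug,
        provider: row.provider,
        count: 1,
        lastLoggedAt: row.loggedAt,
      });
    } else {
      group.count += 1;
      if (row.loggedAt.getTime() > group.lastLoggedAt.getTime()) {
        group.lastLoggedAt = row.loggedAt;
      }
    }
  }
  return Array.from(groups.values()).sort((a, b) => b.count - a.count);
}

// Quotes a string as a SQL literal by doubling single quotes.
function sqlQuote(value: string): string {
  return "'" + value.replace(/'/g, "''") + "'";
}

// Emits an INSERT for the alias table; topicCategoryId must be chosen by a human.
function toAliasInsert(group: AuditGroup, topicCategoryId: string): string {
  const provider = group.provider === null ? "NULL" : sqlQuote(group.provider);
  return (
    "INSERT INTO topic_category_aliases (external_slug, provider, topic_category_id) " +
    `VALUES (${sqlQuote(group.externalSlug)}, ${provider}, ${sqlQuote(topicCategoryId)});`
  );
}
```

In a real script the grouping would more likely be pushed into SQL (GROUP BY with COUNT and MAX), using the (external_slug, provider) index from Challenge 2.1; doing it in memory here keeps the example self-contained.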

Wave 3: Depth-Aware Routing

Challenge 3.1: Cron Uses maxDepth for Category Dispatch

Requirement: The ingest-news cron route dispatches messages only for root-level (depth 0) categories.

Acceptance Criteria:

  • Cron calls getActiveTopicCategories({ maxDepth: 0 }) (ingest-news, topic-quotas, generate-evergreen)
  • Message count per cron run stays bounded at ~31 active roots × providers
  • Existing ingestion behavior unchanged for current root categories

Evaluation: PASS
Owner: cron-scheduler
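To show why the depth bound keeps the message count stable as subcategories are added, here is a small in-memory sketch. The Category shape (including the isActive field) and both function names are assumptions; the real getActiveTopicCategories() queries the database.

```typescript
interface Category {
  slug: string;
  depth: number; // 0 = root
  isActive: boolean; // assumed field name
}

// Hypothetical in-memory equivalent of a depth-bounded category query.
function filterActiveByDepth(
  categories: Category[],
  options: { maxDepth?: number } = {},
): Category[] {
  const { maxDepth } = options;
  return categories.filter(
    (c) => c.isActive && (maxDepth === undefined || c.depth <= maxDepth),
  );
}

// One message per (provider, root category) pair.
function buildDispatchPlan(
  categories: Category[],
  providers: string[],
): { provider: string; category: string }[] {
  const roots = filterActiveByDepth(categories, { maxDepth: 0 });
  const plan: { provider: string; category: string }[] = [];
  for (const provider of providers) {
    for (const root of roots) {
      plan.push({ provider, category: root.slug });
    }
  }
  return plan;
}
```

The plan length is always roots × providers, regardless of how many subcategory rows exist. If all three providers listed under Current State are dispatched, 31 active roots would give 31 × 3 = 93 messages per run.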

Challenge 3.2: Provider-Aware Category Dispatch

Requirement: The cron route sends only categories that the specific provider supports, using the alias table in reverse.

Acceptance Criteria:

  • New query: getProviderCategories(provider: string) returns slugs this provider understands
  • Cron sends INGEST_NEWS messages with provider-native slugs, not internal slugs
  • Reduces wasted API calls (e.g., no sending records to NewsAPI, which understands only 7 categories)
  • bun run typecheck passes

Evaluation: PENDING
Owner: cron-scheduler
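One possible reading of "the alias table in reverse" is sketched below. It assumes provider-specific alias rows exist for each provider; the universal rows seeded so far carry no provider, so they cannot say which provider emits a slug and are excluded here. Slugs that match an internal slug directly have no alias row and would need their own rows or a separate list. The function name and row shape are hypothetical.

```typescript
interface AliasRow {
  externalSlug: string;
  provider: string | null; // null = universal alias
  topicCategoryId: string;
}

// Hypothetical reverse lookup: distinct slugs that have an alias row
// scoped to the given provider, sorted for stable output.
function providerSlugsFromAliases(provider: string, aliases: AliasRow[]): string[] {
  const slugs = new Set<string>();
  for (const alias of aliases) {
    if (alias.provider === provider) slugs.add(alias.externalSlug);
  }
  return Array.from(slugs).sort();
}
```

An alternative design would be a static per-provider category list in configuration, which avoids depending on how the alias table is seeded.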

Challenge 3.3: Subcategory Routing for Extracted Facts

Requirement: After fact extraction, optionally refine topic_category_id from root to most specific matching subcategory.

Acceptance Criteria:

  • Post-extraction step checks if entity matches a leaf-level topic_categories row
  • Uses entity name matching against topic_categories.slug at depth > 0
  • Falls back to root category if no subcategory match found
  • Enabled via feature flag (not active by default)
  • bun run typecheck passes

Evaluation: PENDING
Owner: fact-engineer
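Because this challenge is still pending, the sketch below only illustrates the matching rule in the criteria above: slugify the entity name, look for a category at depth > 0 with that slug, and otherwise keep the root. The slugify rules, the Category shape, and the function names are assumptions; the real matching may need to be fuzzier or restricted to descendants of the root.

```typescript
interface Category {
  id: string;
  slug: string;
  depth: number; // 0 = root
}

// Assumed slug convention: lowercase, non-alphanumerics collapsed to hyphens.
function slugify(name: string): string {
  return name
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Hypothetical refinement step. The flag defaults to off, so the root
// category is returned unchanged unless routing is explicitly enabled.
function refineCategory(
  entityName: string,
  root: Category,
  categories: Category[],
  enabled: boolean = false,
): Category {
  if (!enabled) return root;
  const slug = slugify(entityName);
  const match = categories.find((c) => c.depth > 0 && c.slug === slug);
  return match ?? root;
}
```

For example, with a subcategory whose slug is formula-1, the entity name "Formula 1" would be routed to it when the flag is on, while an unmatched entity stays on the root.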

Quality Tier

Verification Checklist

  • SELECT count(*) FROM topic_category_aliases returns >= 15 rows
  • resolveTopicCategory('business') returns a valid topic_category (not null)
  • resolveTopicCategory('nonexistent') returns null and logs to unmapped_category_log
  • bun run typecheck passes
  • bun run lint passes
  • Existing fact extraction for stories with known categories still works
  • No increase in silently dropped stories after alias table is populated

Evaluation Summary

| Challenge | Result |
| --- | --- |
| 1.1 Create topic_category_aliases Migration | PASS |
| 1.2 Add Drizzle Schema for Aliases | PASS |
| 1.3 Create resolveTopicCategory() Query | PASS |
| 1.4 Update extract-facts Handler | PASS |
| 2.1 Create unmapped_category_log Table | PASS |
| 2.2 Log Unmapped Categories | PASS |
| 2.3 Unmapped Category Audit Script | PENDING |
| 3.1 Cron Uses maxDepth | PASS |
| 3.2 Provider-Aware Category Dispatch | PENDING |
| 3.3 Subcategory Routing | PENDING |

Score: 7/10 PASS


Implementation Notes

  • Wave 1 COMPLETE — alias table (0126_add_topic_category_aliases.sql), Drizzle schema, resolveTopicCategory(), and extract-facts integration all implemented and pushed to production.
  • Wave 2 partially complete: unmapped_category_log table and logging are live; audit script (2.3) still needed for operational visibility.
  • Challenge 3.1 COMPLETE — all 3 cron routes use maxDepth: 0, safe for subcategory expansion.
  • Challenge 3.2 (provider-aware dispatch) — optimization for reducing wasted API calls; not blocking.
  • Challenge 3.3 depends on taxonomy expansion Wave 2 — subcategory routing requires subcategory rows to exist.
  • The alias table should be re-seeded whenever new root categories are added (taxonomy expansion Challenge 1.0) to map provider slugs to more specific categories.