Fact Ingestion — Source of Truth Map

Three pipelines produce facts. They share infrastructure but have distinct triggers, AI functions, and operational controls. This document maps the single authoritative source for each component.

Shared Spine

These files are the source of truth (SOT) for concerns that span all three pipelines.

Concern	SOT File	Notes
Queue message schemas	`packages/shared/src/schemas.ts`	Every message type across all pipelines; the contract between triggers, workers, and handlers
Queue routing & creators	`packages/queue/src/index.ts`	Queue names, message factory functions, consumer wiring
Config / env vars	`packages/config/src/index.ts` → `getFactEngineConfig()`	Typed config for API keys, thresholds, feature gates
DB schema	`packages/db/src/drizzle/schema.ts`	Tables: `news_sources`, `stories`, `seed_entry_queue`, `fact_records`, `fact_challenge_content`, `ai_model_tier_config`, `ingestion_runs`
DB queries	`packages/db/src/drizzle/fact-engine-queries.ts`	All insert/read/update functions for the above tables
Validation pipeline	`apps/worker-validate/src/handlers/validate-fact.ts`	4-phase: structural → consistency → cross-model → evidence
Cost tracking	`packages/ai/src/cost-tracker.ts` → `recordCost()`	Fire-and-forget cost logging for all AI calls
Enrichment context	`packages/ai/src/enrichment.ts`	8 free APIs (KG, Wikidata, Wikipedia, GDELT, SportsDB, MusicBrainz, Nominatim, Open Library)
Model tier config	`ai_model_tier_config` DB table	DB-driven model selection, 60s cache, no restart needed
Task → tier mapping	`packages/ai/src/fact-engine.ts` → `TASK_TIER_MAP`	Maps task names to model tiers (default/mid/high)
Observability	`ingestion_runs` table + `packages/observability/`	Per-run tracking with status, counts, errors
Admin controls	`apps/admin/app/(dashboard)/pipeline/actions.ts`	Manual triggers for all 3 pipelines
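
The "60s cache, no restart needed" behavior in the model tier row can be pictured as a small time-to-live cache in front of the `ai_model_tier_config` read. This is a minimal sketch under stated assumptions, not the actual implementation: `createTtlCache`, `TIER_CACHE_TTL_MS`, and the loader shape are hypothetical names, and the clock is injected only to make the behavior easy to check.

```typescript
// Hypothetical sketch: these names are illustrative, not the real module's API.
const TIER_CACHE_TTL_MS = 60 * 1000; // the 60s window named in the table above

// Wraps any loader in a time-to-live cache. `load` stands in for a read of
// the ai_model_tier_config table; `now` is injectable so tests control time.
function createTtlCache<T>(
  load: () => T,
  ttlMs: number,
  now: () => number = Date.now,
): () => T {
  let cached: { value: T; fetchedAt: number } | null = null;

  return function get(): T {
    if (cached !== null && now() - cached.fetchedAt < ttlMs) {
      return cached.value; // still fresh: skip the reload
    }
    const value = load();
    cached = { value, fetchedAt: now() };
    return value;
  };
}
```

Under this sketch, a change to the table becomes visible to a worker at most 60 seconds later, which is why no restart would be needed. `T` can also be a promise, in which case callers inside the window share one in-flight read.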

Pipeline Discriminator

All three pipelines write to `fact_records`. The `source_type` value identifies the producing pipeline and determines whether the record expires:

Pipeline	`source_type`	`expires_at`
News ingestion	`news_extraction`	30 days from creation
Curated seeding	`file_seed`	NULL (permanent)
Evergreen	`ai_generated`	NULL (permanent)
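
The mapping in the table above can be expressed as one function from `source_type` to `expires_at`. This is a sketch only: `SourceType`, `computeExpiresAt`, and `NEWS_FACT_TTL_DAYS` are hypothetical names, and the real code may compute the expiry elsewhere (for example in the database).

```typescript
// Hypothetical sketch of the discriminator table; names are illustrative.
type SourceType = "news_extraction" | "file_seed" | "ai_generated";

const NEWS_FACT_TTL_DAYS = 30;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns the expiry timestamp for a new fact record, or null when the
// record is permanent (curated seeding and evergreen).
function computeExpiresAt(sourceType: SourceType, createdAt: Date): Date | null {
  switch (sourceType) {
    case "news_extraction":
      return new Date(createdAt.getTime() + NEWS_FACT_TTL_DAYS * MS_PER_DAY);
    case "file_seed":
    case "ai_generated":
      return null;
  }
}
```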

News Ingestion

Fetches articles from external news APIs, clusters them into stories, then extracts one structured fact per story.

Flow

Vercel Cron (15 min) → INGEST_NEWS → fetch + normalize → insert news_sources
  → CLUSTER_STORIES → TF-IDF clustering → create/update stories
    → EXTRACT_FACTS (at source_count >= 3) → AI extraction → fact_record
      → RESOLVE_IMAGE → image cascade → VALIDATE_FACT
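
The CLUSTER_STORIES step compares articles as TF-IDF vectors using cosine similarity. The sketch below shows that comparison in isolation. It assumes sparse vectors stored as term-to-weight maps and an inclusive cutoff of 0.6; `TermWeights`, `cosineSimilarity`, and `belongsToStory` are hypothetical names, and how `cluster-stories.ts` actually builds and compares vectors is not shown in this document.

```typescript
// Hypothetical sketch: sparse TF-IDF vectors as term -> weight maps.
type TermWeights = Map<string, number>;

const SIMILARITY_THRESHOLD = 0.6; // cosine similarity threshold from this document

// Cosine similarity: dot(a, b) / (|a| * |b|), or 0 if either vector is empty.
function cosineSimilarity(a: TermWeights, b: TermWeights): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((weight, term) => {
    normA += weight * weight;
    const other = b.get(term);
    if (other !== undefined) dot += weight * other; // only shared terms contribute
  });
  b.forEach((weight) => {
    normB += weight * weight;
  });
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Assumption: the comparison is inclusive (>=); the real handler may differ.
function belongsToStory(article: TermWeights, story: TermWeights): boolean {
  return cosineSimilarity(article, story) >= SIMILARITY_THRESHOLD;
}
```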

SOT Map

Component	SOT File
Cron trigger	`apps/web/app/api/cron/ingest-news/route.ts`
Fetch + normalize	`apps/worker-ingest/src/handlers/ingest-news.ts`
Providers (5)	Same file: `fetchFromNewsApi`, `fetchFromGNews`, `fetchFromTheNewsApi`, `fetchFromNewsdata`, `fetchFromEventRegistry`
Clustering	`apps/worker-ingest/src/handlers/cluster-stories.ts`
Image resolution	`apps/worker-ingest/src/handlers/resolve-image.ts`
Fact extraction	`apps/worker-facts/src/handlers/extract-facts.ts`
Provider API keys	`.env.local` via `getFactEngineConfig()`

Key Thresholds

Threshold	Value	Location
Min article text length	400 chars	`ingest-news.ts`
Cluster batch threshold	5 articles	`ingest-news.ts`
Cosine similarity threshold	0.6	`cluster-stories.ts`
Extraction trigger	source_count >= 3	`cluster-stories.ts`
Notability threshold	0.6	`extract-facts.ts` / `getFactEngineConfig()`
Max article sources per extraction	5	`extract-facts.ts`
Max chars per source	1500	`extract-facts.ts`
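
As a rough illustration of how the thresholds above could combine, the sketch below gathers them into one constant and applies the article filter, the extraction trigger, and the per-extraction source limits. All names are hypothetical, the values are copied from the table, and two details are assumptions: that the minimum length is inclusive, and that sources are taken in their existing order before truncation.

```typescript
// Hypothetical sketch: values come from the thresholds table, names do not.
const NEWS_THRESHOLDS = {
  minArticleChars: 400,
  extractionSourceCount: 3,
  maxSourcesPerExtraction: 5,
  maxCharsPerSource: 1500,
} as const;

// Articles shorter than the minimum are skipped at ingest time.
function isArticleLongEnough(text: string): boolean {
  return text.length >= NEWS_THRESHOLDS.minArticleChars;
}

// A story becomes eligible for extraction once enough sources agree on it.
function shouldTriggerExtraction(sourceCount: number): boolean {
  return sourceCount >= NEWS_THRESHOLDS.extractionSourceCount;
}

// Caps the extraction input: at most 5 sources, each cut to 1500 characters.
function prepareSourcesForExtraction(texts: string[]): string[] {
  return texts
    .slice(0, NEWS_THRESHOLDS.maxSourcesPerExtraction)
    .map((text) => text.slice(0, NEWS_THRESHOLDS.maxCharsPerSource));
}
```

With both caps applied, the article text sent to the extraction model stays under 5 × 1500 = 7500 characters per story.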

Env Vars

NEWS_API_KEY              # newsapi.org
GOOGLE_NEWS_API_KEY       # gnews.io
THENEWS_API_KEY           # thenewsapi.com
NEWSDATA_API_KEY          # newsdata.io
EVENT_REGISTRY_API_KEY    # eventregistry.org / newsapi.ai
UNSPLASH_ACCESS_KEY       # image cascade
PEXELS_API_KEY            # image cascade
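
One way to reason about the five provider keys is as a lookup from provider to env var, with unconfigured providers skipped. The sketch below assumes exactly that. The provider labels and `configuredProviders` are hypothetical, and whether the real `getFactEngineConfig()` skips or rejects missing keys is not stated here.

```typescript
// Hypothetical sketch: env var names are from the list above, labels are not.
const PROVIDER_ENV_KEYS = {
  newsapi: "NEWS_API_KEY",
  gnews: "GOOGLE_NEWS_API_KEY",
  thenewsapi: "THENEWS_API_KEY",
  newsdata: "NEWSDATA_API_KEY",
  eventregistry: "EVENT_REGISTRY_API_KEY",
} as const;

type ProviderName = keyof typeof PROVIDER_ENV_KEYS;
type EnvVars = Record<string, string | undefined>;

// Returns the providers whose API key is present and non-blank.
// The environment is passed in (e.g. process.env) to keep this testable.
function configuredProviders(env: EnvVars): ProviderName[] {
  return (Object.keys(PROVIDER_ENV_KEYS) as ProviderName[]).filter((name) => {
    const value = env[PROVIDER_ENV_KEYS[name]];
    return value !== undefined && value.trim() !== "";
  });
}
```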

Curated Seeding

Generates named entities per topic, explodes each into 10–100 structured facts, then generates challenge content.

Flow

generate-curated-entries.ts → seed_entry_queue (pending)
  → bulk-enqueue.ts → EXPLODE_CATEGORY_ENTRY
    → AI explosion → IMPORT_FACTS → fact_records
      → VALIDATE_FACT
generate-challenge-content.ts → GENERATE_CHALLENGE_CONTENT → fact_challenge_content

SOT Map

Component	SOT File
Topic taxonomy	`packages/ai/src/config/categories.ts` (44 roots, 162 subcategories)
Entity generation	`scripts/seed/generate-curated-entries.ts`
Queue dispatch	`scripts/seed/bulk-enqueue.ts`
Explosion handler	`apps/worker-facts/src/handlers/explode-entry.ts`
AI explosion function	`packages/ai/src/seed-explosion.ts` → `explodeCategoryEntry()`
Richness ranges	Same file, `RICHNESS_RANGES` constant
Fact import handler	`apps/worker-facts/src/handlers/import-facts.ts`
Challenge AI function	`packages/ai/src/challenge-content.ts` → `generateChallengeContent()`
Challenge rules (aggregator)	`packages/ai/src/challenge-content-rules.ts`
Challenge difficulty	`packages/ai/src/config/challenge-difficulty.ts`
Challenge style rules	`packages/ai/src/config/challenge-style-rules.ts`
Challenge style voices	`packages/ai/src/config/challenge-style-voices.ts`
Challenge voice constitution	`packages/ai/src/config/challenge-voice.ts`
Banned patterns + style list	`packages/ai/src/config/challenge-banned-patterns.ts`
Taxonomy voice/rules	`packages/ai/src/config/taxonomy-rules-data.ts` + `taxonomy-voices-data.ts`
Operational directives	`docs/projects/seeding/SEED.md`
Model eligibility results	`scripts/seed/.llm-test-data/eligibility.jsonl`

Richness Tiers

Tier	Facts per Entity	Typical Topics
high	50–100	Entertainment, sports, people
medium	20–50	Geography, science, animals
low	10–20	Business, design, fashion
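
The tiers above amount to a per-tier minimum and maximum fact count. The sketch below assumes that shape and shows how a requested count could be clamped into its tier's range. `RICHNESS_RANGES_SKETCH` and `clampFactCount` are hypothetical, and the actual `RICHNESS_RANGES` constant in `seed-explosion.ts` may be structured differently.

```typescript
// Hypothetical sketch of the richness table; the real constant may differ.
type RichnessTier = "high" | "medium" | "low";

const RICHNESS_RANGES_SKETCH: Record<RichnessTier, { min: number; max: number }> = {
  high: { min: 50, max: 100 },
  medium: { min: 20, max: 50 },
  low: { min: 10, max: 20 },
};

// Clamps a requested fact count into the range allowed for the tier.
function clampFactCount(tier: RichnessTier, requested: number): number {
  const range = RICHNESS_RANGES_SKETCH[tier];
  return Math.min(range.max, Math.max(range.min, requested));
}
```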

Challenge Styles (5 pre-generated)

multiple_choice, fill_the_gap, direct_question, reverse_lookup, free_text

Two additional styles (conversational, progressive_image_reveal) are generated at runtime (CC-003).
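
The split between pre-generated and runtime styles can be captured in types so that batch generation code cannot be handed a runtime-only style by accident. This is an illustrative sketch: the type and function names are hypothetical, and only the style identifiers come from this document.

```typescript
// Hypothetical sketch: style identifiers are from this document, names are not.
const PRE_GENERATED_STYLES = [
  "multiple_choice",
  "fill_the_gap",
  "direct_question",
  "reverse_lookup",
  "free_text",
] as const;

const RUNTIME_STYLES = ["conversational", "progressive_image_reveal"] as const;

type PreGeneratedStyle = (typeof PRE_GENERATED_STYLES)[number];
type RuntimeStyle = (typeof RUNTIME_STYLES)[number];
type ChallengeStyle = PreGeneratedStyle | RuntimeStyle;

// Type guard: true for the five styles stored in fact_challenge_content.
function isPreGeneratedStyle(style: ChallengeStyle): style is PreGeneratedStyle {
  return (PRE_GENERATED_STYLES as readonly string[]).indexOf(style) !== -1;
}
```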

Utility Scripts

Script	Purpose
`generate-curated-entries.ts`	AI-generate entity names
`bulk-enqueue.ts`	Dispatch entries to explosion workers
`generate-challenge-content.ts`	Batch challenge generation
`cleanup-content.ts`	Rewrite titles/context
`backfill-fact-nulls.ts`	Fill NULL metadata
`seed-from-files.ts`	Parse XLSX/DOCX/CSV and seed
`llm-fact-quality-testing.ts`	Model eligibility testing
`materialize-entity-categories.ts`	Create depth-2/3 categories from entities
`regen-voice-pass.ts`	Regenerate voice-pass content
`rewrite-challenge-defects.ts`	Fix challenge quality defects
`cleanup-seed-queue.ts`	Remove garbage entries
`deepen-topic-paths.ts`	AI-reclassify to deeper categories
`remap-topic-paths.ts`	Batch fix malformed topic paths

Evergreen

AI-generated timeless facts, not tied to news events. Currently gated off in production.

Flow

Vercel Cron (daily 3 AM UTC; route exists but is not scheduled in vercel.json) or Admin manual trigger
  → GENERATE_EVERGREEN per active topic
    → AI generation → fact_records (source_type: ai_generated, expires_at: NULL)
      → VALIDATE_FACT (multi_phase)

SOT Map

Component	SOT File
Cron trigger	`apps/web/app/api/cron/generate-evergreen/route.ts` (unscheduled)
Feature gate	`EVERGREEN_ENABLED` env var (default: `false`)
Daily quota	`EVERGREEN_DAILY_QUOTA` env var (default: `20`)
Worker handler	`apps/worker-facts/src/handlers/generate-evergreen.ts`
AI function	`packages/ai/src/fact-engine.ts` → `generateEvergreenFacts()`
Model tier	`TASK_TIER_MAP`: `evergreen_generation → 'mid'`
Active topics query	`fact-engine-queries.ts` → `getActiveTopicCategoriesWithSchemas()`
Admin manual trigger	`apps/admin/app/(dashboard)/pipeline/actions.ts` → `triggerGenerateEvergreen()`

Production Status

Cron route exists but is not in vercel.json — see docs/APP-CONTROL.md
Feature-gated by EVERGREEN_ENABLED=false
Can be triggered manually from the admin pipeline controls page
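
The gate and quota can be read as two checks that run before any evergreen generation is queued. The sketch below assumes the flag is enabled only by the literal string `true` and that the quota is a single daily total; both are assumptions, as are the function names. The defaults (`false` and `20`) are the ones listed in the SOT map.

```typescript
// Hypothetical sketch: defaults are from the SOT map, parsing rules are assumed.
type EnvLike = Record<string, string | undefined>;

interface EvergreenSettings {
  enabled: boolean;
  dailyQuota: number;
}

const DEFAULT_EVERGREEN_DAILY_QUOTA = 20;

// Reads the gate and quota from an env-like object (e.g. process.env).
function readEvergreenSettings(env: EnvLike): EvergreenSettings {
  const enabled = env["EVERGREEN_ENABLED"] === "true"; // unset means disabled
  const parsed = parseInt(env["EVERGREEN_DAILY_QUOTA"] ?? "", 10);
  const dailyQuota = !isNaN(parsed) && parsed >= 0 ? parsed : DEFAULT_EVERGREEN_DAILY_QUOTA;
  return { enabled, dailyQuota };
}

// How many more facts may be generated today; always 0 while gated off.
function remainingEvergreenBudget(settings: EvergreenSettings, generatedToday: number): number {
  if (!settings.enabled) return 0;
  return Math.max(0, settings.dailyQuota - generatedToday);
}
```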
