Fact Ingestion — Source of Truth Map

Three pipelines produce facts. They share infrastructure but have distinct triggers, AI functions, and operational controls. This document maps the single authoritative source for each component.


Shared Spine

These files and tables are the source of truth (SOT) for concerns that span all three pipelines.

| Concern | SOT File | Notes |
|---|---|---|
| Queue message schemas | packages/shared/src/schemas.ts | Every message type across all pipelines; the contract between triggers, workers, and handlers |
| Queue routing & creators | packages/queue/src/index.ts | Queue names, message factory functions, consumer wiring |
| Config / env vars | packages/config/src/index.ts, getFactEngineConfig() | Typed config for API keys, thresholds, feature gates |
| DB schema | packages/db/src/drizzle/schema.ts | Tables: news_sources, stories, seed_entry_queue, fact_records, fact_challenge_content, ai_model_tier_config, ingestion_runs |
| DB queries | packages/db/src/drizzle/fact-engine-queries.ts | All insert/read/update functions for the above tables |
| Validation pipeline | apps/worker-validate/src/handlers/validate-fact.ts | 4-phase: structural → consistency → cross-model → evidence |
| Cost tracking | packages/ai/src/cost-tracker.ts, recordCost() | Fire-and-forget cost logging for all AI calls |
| Enrichment context | packages/ai/src/enrichment.ts | 8 free APIs (KG, Wikidata, Wikipedia, GDELT, SportsDB, MusicBrainz, Nominatim, Open Library) |
| Model tier config | ai_model_tier_config DB table | DB-driven model selection, 60s cache, no restart needed |
| Task → tier mapping | packages/ai/src/fact-engine.ts, TASK_TIER_MAP | Maps task names to model tiers (default/mid/high) |
| Observability | ingestion_runs table + packages/observability/ | Per-run tracking with status, counts, errors |
| Admin controls | apps/admin/app/(dashboard)/pipeline/actions.ts | Manual triggers for all 3 pipelines |
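
The model tier rows above (DB-driven selection, 60s cache, task → tier mapping) can be pictured with a minimal sketch. This is not the code in packages/ai/src/fact-engine.ts: `createTierConfigCache`, `resolveModel`, the `TierConfig` shape, and the fallback to the default tier for unmapped tasks are all assumptions made for illustration. Only the `evergreen_generation` → `mid` mapping comes from this document.

```typescript
// Illustrative sketch only; helper names and the TierConfig shape are hypothetical.
type ModelTier = "default" | "mid" | "high";

// Only the evergreen_generation entry is documented here; other tasks are omitted.
const TASK_TIER_MAP: Record<string, ModelTier> = {
  evergreen_generation: "mid",
};

// Assumed shape: one model identifier per tier, as read from ai_model_tier_config.
type TierConfig = Record<ModelTier, string>;

const CACHE_TTL_MS = 60 * 1000; // the documented 60s cache

// Wraps a loader (standing in for the DB read) with a time-based cache,
// so a changed row is picked up within 60s without a restart.
function createTierConfigCache(
  load: () => Promise<TierConfig>,
  now: () => number = Date.now,
): () => Promise<TierConfig> {
  let cached: { value: TierConfig; fetchedAt: number } | undefined;
  return async function getTierConfig(): Promise<TierConfig> {
    const t = now();
    if (cached && t - cached.fetchedAt < CACHE_TTL_MS) {
      return cached.value;
    }
    const value = await load();
    cached = { value, fetchedAt: t };
    return value;
  };
}

// Resolves a task name to a model id. Falling back to "default" for
// unmapped tasks is an assumption, not documented behavior.
async function resolveModel(
  task: string,
  getTierConfig: () => Promise<TierConfig>,
): Promise<string> {
  const tier: ModelTier = TASK_TIER_MAP[task] ?? "default";
  const config = await getTierConfig();
  return config[tier];
}
```

Injecting the clock and the loader keeps the cache testable without a database; the real implementation may be structured quite differently.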

Pipeline Discriminator

All three pipelines write to fact_records with different source_type values:

| Pipeline | source_type | expires_at |
|---|---|---|
| News ingestion | news_extraction | 30 days from creation |
| Curated seeding | file_seed | NULL (permanent) |
| Evergreen | ai_generated | NULL (permanent) |
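
The expiry rule in the table above can be written as a small pure function. This is a sketch under stated assumptions: `computeExpiresAt` is a hypothetical helper, and where the real pipelines actually compute `expires_at` is not stated in this document.

```typescript
// Illustrative sketch; computeExpiresAt is a hypothetical helper name.
type SourceType = "news_extraction" | "file_seed" | "ai_generated";

const NEWS_TTL_DAYS = 30;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// News facts expire 30 days after creation; seeded and evergreen facts
// are permanent, represented as null (NULL in the fact_records table).
function computeExpiresAt(sourceType: SourceType, createdAt: Date): Date | null {
  if (sourceType === "news_extraction") {
    return new Date(createdAt.getTime() + NEWS_TTL_DAYS * MS_PER_DAY);
  }
  return null;
}
```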

News Ingestion

Fetches articles from external news APIs, clusters them into stories, then extracts one structured fact per story.

Flow

Vercel Cron (15 min) → INGEST_NEWS → fetch + normalize → insert news_sources
  → CLUSTER_STORIES → TF-IDF clustering → create/update stories
    → EXTRACT_FACTS (at source_count >= 3) → AI extraction → fact_record
      → RESOLVE_IMAGE → image cascade → VALIDATE_FACT
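
The CLUSTER_STORIES step compares TF-IDF vectors by cosine similarity against the documented 0.6 threshold. The sketch below illustrates that comparison only; it is not the code in cluster-stories.ts. The `Story` shape, matching against a per-story centroid, picking the best match, and treating the threshold as inclusive are all assumptions, and computing the TF-IDF weights themselves is not shown.

```typescript
// Illustrative sketch; types and helper names are hypothetical.
type TermWeights = Record<string, number>; // term -> TF-IDF weight

const SIMILARITY_THRESHOLD = 0.6; // documented cosine similarity threshold

function cosineSimilarity(a: TermWeights, b: TermWeights): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (const term of Object.keys(a)) {
    normA += a[term] * a[term];
    if (Object.prototype.hasOwnProperty.call(b, term)) {
      dot += a[term] * b[term];
    }
  }
  for (const term of Object.keys(b)) {
    normB += b[term] * b[term];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // empty vectors match nothing
}

interface Story {
  id: string;
  centroid: TermWeights; // assumed: each story keeps a representative vector
}

// Returns the most similar story at or above the threshold, or undefined
// when the article should start a new story.
function findMatchingStory(article: TermWeights, stories: Story[]): Story | undefined {
  let best: Story | undefined;
  let bestScore = SIMILARITY_THRESHOLD;
  for (const story of stories) {
    const score = cosineSimilarity(article, story.centroid);
    if (score >= bestScore) {
      best = story;
      bestScore = score;
    }
  }
  return best;
}
```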

SOT Map

| Component | SOT File |
|---|---|
| Cron trigger | apps/web/app/api/cron/ingest-news/route.ts |
| Fetch + normalize | apps/worker-ingest/src/handlers/ingest-news.ts |
| Providers (5) | Same file: fetchFromNewsApi, fetchFromGNews, fetchFromTheNewsApi, fetchFromNewsdata, fetchFromEventRegistry |
| Clustering | apps/worker-ingest/src/handlers/cluster-stories.ts |
| Image resolution | apps/worker-ingest/src/handlers/resolve-image.ts |
| Fact extraction | apps/worker-facts/src/handlers/extract-facts.ts |
| Provider API keys | .env.local via getFactEngineConfig() |

Key Thresholds

| Threshold | Value | Location |
|---|---|---|
| Min article text length | 400 chars | ingest-news.ts |
| Cluster batch threshold | 5 articles | ingest-news.ts |
| Cosine similarity threshold | 0.6 | cluster-stories.ts |
| Extraction trigger | source_count >= 3 | cluster-stories.ts |
| Notability threshold | 0.6 | extract-facts.ts / getFactEngineConfig() |
| Max article sources per extraction | 5 | extract-facts.ts |
| Max chars per source | 1500 | extract-facts.ts |
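
Several of these thresholds act as simple gates on the way to extraction. The sketch below shows how they might compose; the helper names are hypothetical, and details the table does not state (whether the 400-char minimum is inclusive, which five sources are kept, where truncation happens) are assumptions marked in the comments.

```typescript
// Illustrative sketch; values come from the table above, helper names do not.
const MIN_ARTICLE_CHARS = 400;
const EXTRACTION_SOURCE_COUNT = 3;
const MAX_SOURCES_PER_EXTRACTION = 5;
const MAX_CHARS_PER_SOURCE = 1500;

// Assumption: an article of exactly 400 characters is accepted.
function isLongEnough(articleText: string): boolean {
  return articleText.length >= MIN_ARTICLE_CHARS;
}

// A story qualifies for EXTRACT_FACTS once it has at least 3 sources.
function shouldExtract(sourceCount: number): boolean {
  return sourceCount >= EXTRACTION_SOURCE_COUNT;
}

// Assumption: the first five sources are kept and each is cut from the start.
function prepareSourcesForExtraction(sourceTexts: string[]): string[] {
  return sourceTexts
    .slice(0, MAX_SOURCES_PER_EXTRACTION)
    .map((text) => text.slice(0, MAX_CHARS_PER_SOURCE));
}
```

Under these assumptions the extraction prompt receives at most 5 × 1500 = 7500 characters of source text per story.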

Env Vars

NEWS_API_KEY              # newsapi.org
GOOGLE_NEWS_API_KEY       # gnews.io
THENEWS_API_KEY           # thenewsapi.com
NEWSDATA_API_KEY          # newsdata.io
EVENT_REGISTRY_API_KEY    # eventregistry.org / newsapi.ai
UNSPLASH_ACCESS_KEY       # image cascade
PEXELS_API_KEY            # image cascade
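
As a quick way to see which news providers a given environment can reach, the key names above can be checked against an env object. This is a hypothetical helper, not part of getFactEngineConfig(); whether the ingest handler skips providers with missing keys is an assumption.

```typescript
// Illustrative sketch; configuredProviders is a hypothetical helper.
const NEWS_PROVIDER_ENV: Record<string, string> = {
  NEWS_API_KEY: "newsapi.org",
  GOOGLE_NEWS_API_KEY: "gnews.io",
  THENEWS_API_KEY: "thenewsapi.com",
  NEWSDATA_API_KEY: "newsdata.io",
  EVENT_REGISTRY_API_KEY: "eventregistry.org",
};

// Lists providers whose key is present and non-blank in the given env.
function configuredProviders(env: Record<string, string | undefined>): string[] {
  return Object.keys(NEWS_PROVIDER_ENV)
    .filter((name) => (env[name] ?? "").trim() !== "")
    .map((name) => NEWS_PROVIDER_ENV[name]);
}
```

Passing the environment in as a parameter, rather than reading it globally, keeps the check easy to exercise in tests.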

Curated Seeding

Generates named entities per topic, explodes each into 10–100 structured facts, then generates challenge content.

Flow

generate-curated-entries.ts → seed_entry_queue (pending)
  → bulk-enqueue.ts → EXPLODE_CATEGORY_ENTRY
    → AI explosion → IMPORT_FACTS → fact_records
      → VALIDATE_FACT
generate-challenge-content.ts → GENERATE_CHALLENGE_CONTENT → fact_challenge_content

SOT Map

| Component | SOT File |
|---|---|
| Topic taxonomy | packages/ai/src/config/categories.ts (44 roots, 162 subcategories) |
| Entity generation | scripts/seed/generate-curated-entries.ts |
| Queue dispatch | scripts/seed/bulk-enqueue.ts |
| Explosion handler | apps/worker-facts/src/handlers/explode-entry.ts |
| AI explosion function | packages/ai/src/seed-explosion.ts, explodeCategoryEntry() |
| Richness ranges | Same file, RICHNESS_RANGES constant |
| Fact import handler | apps/worker-facts/src/handlers/import-facts.ts |
| Challenge AI function | packages/ai/src/challenge-content.ts, generateChallengeContent() |
| Challenge rules (aggregator) | packages/ai/src/challenge-content-rules.ts |
| Challenge difficulty | packages/ai/src/config/challenge-difficulty.ts |
| Challenge style rules | packages/ai/src/config/challenge-style-rules.ts |
| Challenge style voices | packages/ai/src/config/challenge-style-voices.ts |
| Challenge voice constitution | packages/ai/src/config/challenge-voice.ts |
| Banned patterns + style list | packages/ai/src/config/challenge-banned-patterns.ts |
| Taxonomy voice/rules | packages/ai/src/config/taxonomy-rules-data.ts + taxonomy-voices-data.ts |
| Operational directives | docs/projects/seeding/SEED.md |
| Model eligibility results | scripts/seed/.llm-test-data/eligibility.jsonl |

Richness Tiers

| Tier | Facts per Entity | Typical Topics |
|---|---|---|
| high | 50–100 | Entertainment, sports, people |
| medium | 20–50 | Geography, science, animals |
| low | 10–20 | Business, design, fashion |
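
The tiers above could be encoded as a lookup keyed by tier name. The actual shape of RICHNESS_RANGES in packages/ai/src/seed-explosion.ts is not shown in this document, so the `{ min, max }` structure and the `isFactCountInRange` helper below are assumptions for illustration.

```typescript
// Illustrative sketch; the real RICHNESS_RANGES constant may be shaped differently.
type RichnessTier = "high" | "medium" | "low";

const RICHNESS_RANGES: Record<RichnessTier, { min: number; max: number }> = {
  high: { min: 50, max: 100 },
  medium: { min: 20, max: 50 },
  low: { min: 10, max: 20 },
};

// Checks whether an explosion produced a fact count inside the tier's range.
// Bounds are treated as inclusive, matching the ranges as written in the table.
function isFactCountInRange(tier: RichnessTier, factCount: number): boolean {
  const { min, max } = RICHNESS_RANGES[tier];
  return factCount >= min && factCount <= max;
}
```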

Challenge Styles (5 pre-generated)

multiple_choice, fill_the_gap, direct_question, reverse_lookup, free_text

Two additional styles (conversational, progressive_image_reveal) are generated at runtime (CC-003).
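
The split between pre-generated and runtime styles can be expressed in types. This is a sketch only; the authoritative style list lives in the challenge config files named in the SOT map, and the identifiers below (`PREGENERATED_STYLES`, `RUNTIME_STYLES`, `isPregeneratedStyle`) are hypothetical.

```typescript
// Illustrative sketch; constant and helper names are hypothetical.
const PREGENERATED_STYLES = [
  "multiple_choice",
  "fill_the_gap",
  "direct_question",
  "reverse_lookup",
  "free_text",
] as const;

const RUNTIME_STYLES = ["conversational", "progressive_image_reveal"] as const;

type PregeneratedStyle = (typeof PREGENERATED_STYLES)[number];
type RuntimeStyle = (typeof RUNTIME_STYLES)[number];
type ChallengeStyle = PregeneratedStyle | RuntimeStyle;

// Type guard: true for styles whose content is generated ahead of time.
function isPregeneratedStyle(style: string): style is PregeneratedStyle {
  return (PREGENERATED_STYLES as readonly string[]).indexOf(style) !== -1;
}
```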

Utility Scripts

| Script | Purpose |
|---|---|
| generate-curated-entries.ts | AI-generate entity names |
| bulk-enqueue.ts | Dispatch entries to explosion workers |
| generate-challenge-content.ts | Batch challenge generation |
| cleanup-content.ts | Rewrite titles/context |
| backfill-fact-nulls.ts | Fill NULL metadata |
| seed-from-files.ts | Parse XLSX/DOCX/CSV and seed |
| llm-fact-quality-testing.ts | Model eligibility testing |
| materialize-entity-categories.ts | Create depth-2/3 categories from entities |
| regen-voice-pass.ts | Regenerate voice-pass content |
| rewrite-challenge-defects.ts | Fix challenge quality defects |
| cleanup-seed-queue.ts | Remove garbage entries |
| deepen-topic-paths.ts | AI-reclassify to deeper categories |
| remap-topic-paths.ts | Batch fix malformed topic paths |

Evergreen

Generates timeless facts with AI, not tied to news events. The pipeline is currently gated off in production.

Flow

Vercel Cron (daily 3 AM UTC; route exists but is not scheduled) or Admin manual trigger
  → GENERATE_EVERGREEN per active topic
    → AI generation → fact_records (source_type: ai_generated, expires_at: NULL)
      → VALIDATE_FACT (multi_phase)

SOT Map

| Component | SOT File |
|---|---|
| Cron trigger | apps/web/app/api/cron/generate-evergreen/route.ts (unscheduled) |
| Feature gate | EVERGREEN_ENABLED env var (default: false) |
| Daily quota | EVERGREEN_DAILY_QUOTA env var (default: 20) |
| Worker handler | apps/worker-facts/src/handlers/generate-evergreen.ts |
| AI function | packages/ai/src/fact-engine.ts, generateEvergreenFacts() |
| Model tier | TASK_TIER_MAP: evergreen_generation → 'mid' |
| Active topics query | fact-engine-queries.ts, getActiveTopicCategoriesWithSchemas() |
| Admin manual trigger | apps/admin/app/(dashboard)/pipeline/actions.ts, triggerGenerateEvergreen() |

Production Status

  • Cron route exists but is not in vercel.json — see docs/APP-CONTROL.md
  • Feature-gated by EVERGREEN_ENABLED=false
  • Can be triggered manually from the admin pipeline controls page
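
The feature gate and daily quota can be sketched as follows. This is a minimal illustration, not the logic in getFactEngineConfig() or generate-evergreen.ts: the helper names are hypothetical, parsing `"true"` as the enabled value is an assumption, and whether the admin manual trigger bypasses the gate is not stated in this document.

```typescript
// Illustrative sketch; helper names and parsing rules are assumptions.
interface EvergreenSettings {
  enabled: boolean;
  dailyQuota: number;
}

const DEFAULT_DAILY_QUOTA = 20; // documented default

// Reads the two documented env vars, applying the documented defaults
// (disabled, quota of 20) when a value is missing or unparseable.
function readEvergreenSettings(env: Record<string, string | undefined>): EvergreenSettings {
  const enabled = env["EVERGREEN_ENABLED"] === "true";
  const parsed = parseInt(env["EVERGREEN_DAILY_QUOTA"] ?? "", 10);
  const dailyQuota = isNaN(parsed) || parsed < 0 ? DEFAULT_DAILY_QUOTA : parsed;
  return { enabled, dailyQuota };
}

// How many more facts a gated run may generate today.
function remainingQuota(settings: EvergreenSettings, generatedToday: number): number {
  if (!settings.enabled) {
    return 0;
  }
  return Math.max(0, settings.dailyQuota - generatedToday);
}
```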