Fact Ingestion — Source of Truth Map
Three pipelines produce facts. They share infrastructure but have distinct triggers, AI functions, and operational controls. This document maps the single authoritative source for each component.
Shared Spine
These files and tables are the source of truth (SOT) for concerns that span all three pipelines.
| Concern | SOT File | Notes |
|---|---|---|
| Queue message schemas | packages/shared/src/schemas.ts | Every message type across all pipelines; the contract between triggers, workers, and handlers |
| Queue routing & creators | packages/queue/src/index.ts | Queue names, message factory functions, consumer wiring |
| Config / env vars | packages/config/src/index.ts → getFactEngineConfig() | Typed config for API keys, thresholds, feature gates |
| DB schema | packages/db/src/drizzle/schema.ts | Tables: news_sources, stories, seed_entry_queue, fact_records, fact_challenge_content, ai_model_tier_config, ingestion_runs |
| DB queries | packages/db/src/drizzle/fact-engine-queries.ts | All insert/read/update functions for the above tables |
| Validation pipeline | apps/worker-validate/src/handlers/validate-fact.ts | 4-phase: structural → consistency → cross-model → evidence |
| Cost tracking | packages/ai/src/cost-tracker.ts → recordCost() | Fire-and-forget cost logging for all AI calls |
| Enrichment context | packages/ai/src/enrichment.ts | 8 free APIs (KG, Wikidata, Wikipedia, GDELT, SportsDB, MusicBrainz, Nominatim, Open Library) |
| Model tier config | ai_model_tier_config DB table | DB-driven model selection, 60s cache, no restart needed |
| Task → tier mapping | packages/ai/src/fact-engine.ts → TASK_TIER_MAP | Maps task names to model tiers (default/mid/high) |
| Observability | ingestion_runs table + packages/observability/ | Per-run tracking with status, counts, errors |
| Admin controls | apps/admin/app/(dashboard)/pipeline/actions.ts | Manual triggers for all 3 pipelines |
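The 60s cache on the model tier config can be pictured as a small time-based cache in front of a DB read. The sketch below is illustrative only and is not the code in packages/ai: `createTtlCache`, `TierConfig`, the injected `now` clock, and the placeholder model names are all hypothetical.

```typescript
// Tier names come from the Task → tier mapping row above.
type Tier = "default" | "mid" | "high";
// Hypothetical shape: one model identifier per tier. The real table columns may differ.
type TierConfig = Record<Tier, string>;

const MODEL_TIER_CACHE_TTL_MS = 60_000; // the 60s cache window

// Wraps a loader with a time-based cache. If the loader returns a Promise
// (as a real DB query would), the Promise itself is cached for the TTL.
function createTtlCache<T>(
  load: () => T,
  ttlMs: number,
  now: () => number = Date.now,
): () => T {
  let cached: { value: T; fetchedAt: number } | null = null;
  return () => {
    if (cached !== null && now() - cached.fetchedAt < ttlMs) {
      return cached.value; // still fresh: skip the load
    }
    const value = load();
    cached = { value, fetchedAt: now() };
    return value;
  };
}

// Usage with a stand-in loader; the model names are placeholders, not real config.
const getTierConfig = createTtlCache<TierConfig>(
  () => ({ default: "model-a", mid: "model-b", high: "model-c" }),
  MODEL_TIER_CACHE_TTL_MS,
);
console.log(getTierConfig().mid);
```

Because stale entries are replaced on the next read after the TTL lapses, a change to the table would be picked up within about a minute, which is consistent with the "no restart needed" note.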
Pipeline Discriminator
All three pipelines write to fact_records with different source_type values:
| Pipeline | source_type | expires_at |
|---|---|---|
| News ingestion | news_extraction | 30 days from creation |
| Curated seeding | file_seed | NULL (permanent) |
| Evergreen | ai_generated | NULL (permanent) |
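As a rough illustration of the table above, the expiry rule can be written as one function of source_type. This is a sketch under assumptions: `computeExpiresAt` is a hypothetical helper name, and the actual insert logic in fact-engine-queries.ts may compute the value differently (for example in SQL).

```typescript
// The three source_type values listed in the table above.
type SourceType = "news_extraction" | "file_seed" | "ai_generated";

const NEWS_TTL_DAYS = 30;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Hypothetical helper: returns the expires_at value for a new fact_records row.
function computeExpiresAt(sourceType: SourceType, createdAt: Date): Date | null {
  switch (sourceType) {
    case "news_extraction":
      // News facts expire 30 days after creation.
      return new Date(createdAt.getTime() + NEWS_TTL_DAYS * MS_PER_DAY);
    case "file_seed":
    case "ai_generated":
      // Curated and evergreen facts are permanent (NULL in the DB).
      return null;
  }
}
```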
News Ingestion
Fetches articles from external news APIs, clusters them into stories, then extracts one structured fact per story.
Flow
Vercel Cron (15 min) → INGEST_NEWS → fetch + normalize → insert news_sources
→ CLUSTER_STORIES → TF-IDF clustering → create/update stories
→ EXTRACT_FACTS (at source_count >= 3) → AI extraction → fact_record
→ RESOLVE_IMAGE → image cascade → VALIDATE_FACT
SOT Map
| Component | SOT File |
|---|---|
| Cron trigger | apps/web/app/api/cron/ingest-news/route.ts |
| Fetch + normalize | apps/worker-ingest/src/handlers/ingest-news.ts |
| Providers (5) | Same file: fetchFromNewsApi, fetchFromGNews, fetchFromTheNewsApi, fetchFromNewsdata, fetchFromEventRegistry |
| Clustering | apps/worker-ingest/src/handlers/cluster-stories.ts |
| Image resolution | apps/worker-ingest/src/handlers/resolve-image.ts |
| Fact extraction | apps/worker-facts/src/handlers/extract-facts.ts |
| Provider API keys | .env.local via getFactEngineConfig() |
Key Thresholds
| Threshold | Value | Location |
|---|---|---|
| Min article text length | 400 chars | ingest-news.ts |
| Cluster batch threshold | 5 articles | ingest-news.ts |
| Cosine similarity threshold | 0.6 | cluster-stories.ts |
| Extraction trigger | source_count >= 3 | cluster-stories.ts |
| Notability threshold | 0.6 | extract-facts.ts / getFactEngineConfig() |
| Max article sources per extraction | 5 | extract-facts.ts |
| Max chars per source | 1500 | extract-facts.ts |
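To make the two clustering thresholds concrete, here is a minimal sketch of cosine similarity over sparse TF-IDF vectors plus the extraction trigger. The function names and the sparse-map representation are assumptions, and whether cluster-stories.ts compares with `>=` or `>` at 0.6 is not stated in this document.

```typescript
// Sparse term → weight vector, one plausible representation of a TF-IDF vector.
type SparseVector = Map<string, number>;

const COSINE_SIMILARITY_THRESHOLD = 0.6; // value from the table above
const EXTRACTION_SOURCE_COUNT = 3; // source_count >= 3 triggers extraction

function cosineSimilarity(a: SparseVector, b: SparseVector): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((weight, term) => {
    normA += weight * weight;
    dot += weight * (b.get(term) ?? 0); // terms missing from b contribute 0
  });
  b.forEach((weight) => {
    normB += weight * weight;
  });
  if (normA === 0 || normB === 0) return 0; // empty vectors match nothing
  return dot / Math.sqrt(normA * normB);
}

// Assumption: an article joins a story when similarity meets the threshold.
function belongsToStory(article: SparseVector, story: SparseVector): boolean {
  return cosineSimilarity(article, story) >= COSINE_SIMILARITY_THRESHOLD;
}

// A story becomes eligible for EXTRACT_FACTS once enough sources agree.
function shouldExtract(sourceCount: number): boolean {
  return sourceCount >= EXTRACTION_SOURCE_COUNT;
}
```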
Env Vars
NEWS_API_KEY # newsapi.org
GOOGLE_NEWS_API_KEY # gnews.io
THENEWS_API_KEY # thenewsapi.com
NEWSDATA_API_KEY # newsdata.io
EVENT_REGISTRY_API_KEY # eventregistry.org / newsapi.ai
UNSPLASH_ACCESS_KEY # image cascade
PEXELS_API_KEY # image cascade
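One way to reason about the provider keys is as a lookup from fetcher to env var. The fetcher and variable names below are the ones listed in this document, but the pairing between them, the `enabledProviders` helper, and the idea that a provider is skipped when its key is missing are all assumptions, not confirmed behaviour of ingest-news.ts.

```typescript
type Env = Record<string, string | undefined>;

// Assumed pairing of fetcher → env var, inferred from the comments in the list above.
const PROVIDER_KEY_VARS = {
  fetchFromNewsApi: "NEWS_API_KEY",
  fetchFromGNews: "GOOGLE_NEWS_API_KEY",
  fetchFromTheNewsApi: "THENEWS_API_KEY",
  fetchFromNewsdata: "NEWSDATA_API_KEY",
  fetchFromEventRegistry: "EVENT_REGISTRY_API_KEY",
} as const;

type ProviderName = keyof typeof PROVIDER_KEY_VARS;

// Hypothetical helper: lists the providers whose key is set and non-blank.
function enabledProviders(env: Env): ProviderName[] {
  return (Object.keys(PROVIDER_KEY_VARS) as ProviderName[]).filter((name) => {
    const value = env[PROVIDER_KEY_VARS[name]];
    return value !== undefined && value.trim() !== "";
  });
}
```

A check like this could run at worker startup, for example `enabledProviders(process.env)`, to log which providers a given environment will actually call.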
Curated Seeding
Generates named entities per topic, explodes each into 10–100 structured facts, then generates challenge content.
Flow
generate-curated-entries.ts → seed_entry_queue (pending)
→ bulk-enqueue.ts → EXPLODE_CATEGORY_ENTRY
→ AI explosion → IMPORT_FACTS → fact_records
→ VALIDATE_FACT
generate-challenge-content.ts → GENERATE_CHALLENGE_CONTENT → fact_challenge_content
SOT Map
| Component | SOT File |
|---|---|
| Topic taxonomy | packages/ai/src/config/categories.ts (44 roots, 162 subcategories) |
| Entity generation | scripts/seed/generate-curated-entries.ts |
| Queue dispatch | scripts/seed/bulk-enqueue.ts |
| Explosion handler | apps/worker-facts/src/handlers/explode-entry.ts |
| AI explosion function | packages/ai/src/seed-explosion.ts → explodeCategoryEntry() |
| Richness ranges | Same file, RICHNESS_RANGES constant |
| Fact import handler | apps/worker-facts/src/handlers/import-facts.ts |
| Challenge AI function | packages/ai/src/challenge-content.ts → generateChallengeContent() |
| Challenge rules (aggregator) | packages/ai/src/challenge-content-rules.ts |
| Challenge difficulty | packages/ai/src/config/challenge-difficulty.ts |
| Challenge style rules | packages/ai/src/config/challenge-style-rules.ts |
| Challenge style voices | packages/ai/src/config/challenge-style-voices.ts |
| Challenge voice constitution | packages/ai/src/config/challenge-voice.ts |
| Banned patterns + style list | packages/ai/src/config/challenge-banned-patterns.ts |
| Taxonomy voice/rules | packages/ai/src/config/taxonomy-rules-data.ts + taxonomy-voices-data.ts |
| Operational directives | docs/projects/seeding/SEED.md |
| Model eligibility results | scripts/seed/.llm-test-data/eligibility.jsonl |
Richness Tiers
| Tier | Facts per Entity | Typical Topics |
|---|---|---|
| high | 50–100 | Entertainment, sports, people |
| medium | 20–50 | Geography, science, animals |
| low | 10–20 | Business, design, fashion |
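The tiers above map naturally onto a constant keyed by tier. The sketch below mirrors the table values only; the real RICHNESS_RANGES constant in seed-explosion.ts may use a different shape, and `isFactCountInRange` and the inclusive bounds are assumptions.

```typescript
type Richness = "high" | "medium" | "low";

interface FactRange {
  min: number;
  max: number;
}

// Values copied from the Richness Tiers table; the shape is hypothetical.
const RICHNESS_RANGES_SKETCH: Record<Richness, FactRange> = {
  high: { min: 50, max: 100 },
  medium: { min: 20, max: 50 },
  low: { min: 10, max: 20 },
};

// Checks whether an explosion produced a fact count inside the tier's range.
// Bounds are treated as inclusive, so 50 satisfies both "high" and "medium".
function isFactCountInRange(richness: Richness, factCount: number): boolean {
  const { min, max } = RICHNESS_RANGES_SKETCH[richness];
  return factCount >= min && factCount <= max;
}
```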
Challenge Styles (5 pre-generated)
multiple_choice, fill_the_gap, direct_question, reverse_lookup, free_text
Two additional styles (conversational, progressive_image_reveal) are generated at runtime (CC-003).
Utility Scripts
| Script | Purpose |
|---|---|
| generate-curated-entries.ts | AI-generate entity names |
| bulk-enqueue.ts | Dispatch entries to explosion workers |
| generate-challenge-content.ts | Batch challenge generation |
| cleanup-content.ts | Rewrite titles/context |
| backfill-fact-nulls.ts | Fill NULL metadata |
| seed-from-files.ts | Parse XLSX/DOCX/CSV and seed |
| llm-fact-quality-testing.ts | Model eligibility testing |
| materialize-entity-categories.ts | Create depth-2/3 categories from entities |
| regen-voice-pass.ts | Regenerate voice-pass content |
| rewrite-challenge-defects.ts | Fix challenge quality defects |
| cleanup-seed-queue.ts | Remove garbage entries |
| deepen-topic-paths.ts | AI-reclassify to deeper categories |
| remap-topic-paths.ts | Batch fix malformed topic paths |
Evergreen
Generates timeless facts with AI, independent of news events. Currently gated off in production.
Flow
Vercel Cron (daily 3 AM UTC; route exists but is not scheduled) or Admin manual trigger
→ GENERATE_EVERGREEN per active topic
→ AI generation → fact_records (source_type: ai_generated, expires_at: NULL)
→ VALIDATE_FACT (multi_phase)
SOT Map
| Component | SOT File |
|---|---|
| Cron trigger | apps/web/app/api/cron/generate-evergreen/route.ts (unscheduled) |
| Feature gate | EVERGREEN_ENABLED env var (default: false) |
| Daily quota | EVERGREEN_DAILY_QUOTA env var (default: 20) |
| Worker handler | apps/worker-facts/src/handlers/generate-evergreen.ts |
| AI function | packages/ai/src/fact-engine.ts → generateEvergreenFacts() |
| Model tier | TASK_TIER_MAP: evergreen_generation → 'mid' |
| Active topics query | fact-engine-queries.ts → getActiveTopicCategoriesWithSchemas() |
| Admin manual trigger | apps/admin/app/(dashboard)/pipeline/actions.ts → triggerGenerateEvergreen() |
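The feature gate and daily quota can be read as two env-driven settings with the defaults shown in the table. This is a sketch, not the handler's code: `readEvergreenSettings` and `remainingQuota` are hypothetical names, treating only the literal string "true" as enabled is an assumption, and the sketch does not model whether the admin manual trigger bypasses the gate.

```typescript
type Env = Record<string, string | undefined>;

interface EvergreenSettings {
  enabled: boolean;
  dailyQuota: number;
}

const DEFAULT_DAILY_QUOTA = 20; // default from the table above

// Hypothetical reader for the two env vars; unset or invalid values fall back to defaults.
function readEvergreenSettings(env: Env): EvergreenSettings {
  // Assumption: only the exact string "true" turns the pipeline on.
  const enabled = env["EVERGREEN_ENABLED"] === "true";
  const parsed = parseInt(env["EVERGREEN_DAILY_QUOTA"] ?? "", 10);
  const dailyQuota = isNaN(parsed) || parsed < 0 ? DEFAULT_DAILY_QUOTA : parsed;
  return { enabled, dailyQuota };
}

// How many more facts a scheduled run may generate today.
function remainingQuota(settings: EvergreenSettings, generatedToday: number): number {
  if (!settings.enabled) return 0; // gated off: generate nothing
  return Math.max(0, settings.dailyQuota - generatedToday);
}
```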
Production Status
- Cron route exists but is not in vercel.json (see docs/APP-CONTROL.md)
- Feature-gated by EVERGREEN_ENABLED=false
- Can be triggered manually from the admin pipeline controls page