News & Fact Engine
How Eko turns the world's news into structured, verified facts delivered as interactive challenges.
What It Does
The fact engine is a multi-stage pipeline that:
- Ingests articles from multiple news APIs every 15 minutes
- Filters articles by content length to prevent LLM hallucination from thin snippets
- Clusters related articles into stories using TF-IDF similarity
- Extracts structured facts from stories using AI
- Validates each fact through multi-tier verification
- Generates challenge content (quiz, recall, free-text) per style
- Resolves images through a priority cascade of free APIs
- Publishes to a blended feed with spaced repetition scheduling
The result: a continuously refreshed stream of verified, image-backed, challenge-ready fact cards.
Pipeline Architecture
```
News APIs                               Seed Files
(Event Registry, Newsdata,              (XLSX, CSV,
 NewsAPI, GNews, TheNewsAPI)             AI super-facts)
         │                                   │
         ▼                                   │
┌────────────────┐                           │
│ news_sources   │                           │
│ (dedup via     │                           │
│ content_hash)  │                           │
│ + fullContent  │                           │
└───────┬────────┘                           │
        │ quality filter (≥400 chars)        │
        │ CLUSTER_STORIES                    │
        ▼                                    │
   ┌─────────┐                               │
   │ stories │                               │
   └────┬────┘                               │
        │ EXTRACT_FACTS                      │ IMPORT_FACTS
        ▼                                    │
┌──────────────┐◄────────────────────────────┘
│ fact_records │
│ (pending →   │──── RESOLVE_IMAGE ──────────► image cascade
│ validated →  │──── VALIDATE_FACT ──────────► tiered verification
│ published)   │──── GENERATE_CHALLENGE_CONTENT
└──────┬───────┘
       │
       ▼
┌────────────────────────┐
│ fact_challenge_content │  Pre-generated per style
│ + RESOLVE_CHALLENGE    │  + per-challenge image resolution
│   _IMAGE               │
└────────────────────────┘
       │
       ▼
User Feed (blended: 40% recent, 30% review-due, 20% evergreen, 10% explore)
```
News Providers
Five external news APIs feed the pipeline. V2 providers deliver full article bodies for higher-quality fact extraction; legacy providers return truncated snippets (~256 chars) and are kept for backwards compatibility.
V2 Providers (Full-Content)
| Provider | API | Auth | Free Tier | Prod Tier |
|---|---|---|---|---|
| Event Registry | newsapi.ai/api/v1 | Query: apiKey | 2,000 tokens (one-time, ~200K articles) | $90/mo 5K plan |
| Newsdata.io | newsdata.io/api/1 | Query: apikey | Title + description only | Paid tiers: full body |
Event Registry is the primary v2 provider — it returns full article bodies (avg 3,000-5,000 chars) on its free tier, enabling significantly better fact extraction than truncated snippets.
Legacy Providers (Truncated)
| Provider | API | Auth | Free Tier | Prod Tier |
|---|---|---|---|---|
| NewsAPI.org | newsapi.org/v2 | Header: X-Api-Key | 100 req/day, 24h delay | Business $449/mo |
| GNews | gnews.io/api/v4 | Query: apikey | 100 req/day, 12h delay | Essential ~$50/mo |
| TheNewsAPI | thenewsapi.com/v1 | Query: api_token | 100 req/day, 3 articles/req | Basic $19/mo |
Provider Selection Logic
The cron route (`apps/web/app/api/cron/ingest-news`) checks which providers have API keys configured and enqueues one `INGEST_NEWS` message per provider per active root-level topic category (queried with `maxDepth: 0` to prevent quota explosion when subcategories exist).
V2 providers are checked first, then legacy providers:
- `EVENT_REGISTRY_API_KEY` present → enqueue `event_registry`
- `NEWSDATA_API_KEY` present → enqueue `newsdata`
- `NEWS_API_KEY` present → enqueue `newsapi`
- `GOOGLE_NEWS_API_KEY` present → enqueue `gnews`
- `THENEWS_API_KEY` present → enqueue `thenewsapi`
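The key-presence check above can be sketched as a simple ordered lookup. This is an illustrative sketch, not the actual cron implementation; the `selectProviders` helper and its signature are assumptions.

```typescript
// Sketch of provider selection: v2 (full-content) providers are checked
// before legacy ones, and each configured provider gets enqueued.
type Provider = "event_registry" | "newsdata" | "newsapi" | "gnews" | "thenewsapi";

// Order matters: v2 providers first, then legacy providers.
const PROVIDER_KEYS: Array<[envVar: string, provider: Provider]> = [
  ["EVENT_REGISTRY_API_KEY", "event_registry"],
  ["NEWSDATA_API_KEY", "newsdata"],
  ["NEWS_API_KEY", "newsapi"],
  ["GOOGLE_NEWS_API_KEY", "gnews"],
  ["THENEWS_API_KEY", "thenewsapi"],
];

// Returns the providers that have an API key configured, in check order.
function selectProviders(env: Record<string, string | undefined>): Provider[] {
  return PROVIDER_KEYS
    .filter(([envVar]) => Boolean(env[envVar]))
    .map(([, provider]) => provider);
}
```

In the real route, each returned provider would then be enqueued once per active root-level topic category.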
Quality Filter
All providers pass through a MIN_ARTICLE_TEXT_LENGTH filter (400 characters, measured as title.length + description.length). Articles below this threshold are discarded before database insertion to prevent the LLM from hallucinating facts from insufficient source material. This is especially important for legacy providers that return truncated content.
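The threshold check is straightforward; a minimal sketch (helper name is illustrative) assuming the length is measured exactly as described:

```typescript
// Quality gate: discard articles whose combined title + description
// length falls below 400 characters, per MIN_ARTICLE_TEXT_LENGTH.
const MIN_ARTICLE_TEXT_LENGTH = 400;

function passesQualityFilter(title: string, description: string | null): boolean {
  const textLength = title.length + (description?.length ?? 0);
  return textLength >= MIN_ARTICLE_TEXT_LENGTH;
}
```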
StandardArticle Contract
Every provider normalizes its API response to a StandardArticle shape before database insertion:
| Field | Type | Description |
|---|---|---|
| `externalId` | string | Provider-unique ID (URL hash for NewsAPI/GNews, UUID for TheNewsAPI) |
| `sourceName` | string \| null | Publication name (e.g., "Reuters") |
| `sourceDomain` | string \| null | Hostname extracted from article URL |
| `title` | string | Article headline |
| `description` | string \| null | Summary or snippet |
| `articleUrl` | string | Canonical article URL |
| `imageUrl` | string \| null | Hero image URL (resolved later if null) |
| `publishedAt` | Date \| null | Publication timestamp |
| `contentHash` | string \| null | Bun wyhash of article content for dedup |
| `fullContent` | string \| null | Full article body (v2 providers only; null for legacy) |
Articles are inserted with ON CONFLICT DO NOTHING on (provider, external_id) for automatic cross-request deduplication.
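The contract above maps directly to a TypeScript interface. This is a sketch of the table's shape, not the actual type in `packages/shared`; the `extractSourceDomain` helper is an illustrative assumption showing how `sourceDomain` can be derived:

```typescript
// The StandardArticle shape every provider normalizes to before insertion.
interface StandardArticle {
  externalId: string;            // provider-unique ID
  sourceName: string | null;     // publication name, e.g. "Reuters"
  sourceDomain: string | null;   // hostname from the article URL
  title: string;
  description: string | null;
  articleUrl: string;
  imageUrl: string | null;       // resolved later if null
  publishedAt: Date | null;
  contentHash: string | null;    // Bun wyhash of content, used for dedup
  fullContent: string | null;    // v2 providers only; null for legacy
}

// Deriving sourceDomain from the canonical URL, as the table describes.
function extractSourceDomain(articleUrl: string): string | null {
  try {
    return new URL(articleUrl).hostname;
  } catch {
    return null;  // malformed URL → no domain
  }
}
```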
Story Clustering
When enough new articles arrive (threshold: 5), they are clustered into stories using TF-IDF cosine similarity within a 24-hour time window. Articles about the same event are grouped into a single story, preventing duplicate fact extraction.
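The similarity test behind clustering can be sketched as follows. This is a simplified illustration: it uses raw term frequencies rather than full TF-IDF weights, and the threshold value is an assumption, not the pipeline's actual setting.

```typescript
// Tokenize into lowercase terms and count occurrences.
function termFreq(text: string): Map<string, number> {
  const tf = new Map<string, number>();
  for (const token of text.toLowerCase().split(/\W+/).filter(Boolean)) {
    tf.set(token, (tf.get(token) ?? 0) + 1);
  }
  return tf;
}

// Cosine similarity between two sparse term-frequency vectors.
function cosineSimilarity(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0, normA = 0, normB = 0;
  for (const [term, wa] of a) {
    dot += wa * (b.get(term) ?? 0);
    normA += wa * wa;
  }
  for (const wb of b.values()) normB += wb * wb;
  return normA && normB ? dot / (Math.sqrt(normA) * Math.sqrt(normB)) : 0;
}

// Two articles belong to the same story when their similarity clears a
// threshold (0.5 here is illustrative) inside the 24-hour window.
const sameStory = (a: string, b: string, threshold = 0.5): boolean =>
  cosineSimilarity(termFreq(a), termFreq(b)) >= threshold;
```

IDF weighting (down-weighting terms common across the whole corpus) is what keeps boilerplate words like "said" or "today" from gluing unrelated articles together.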
AI Fact Extraction
Each story is processed by AI to extract structured facts:
- Model: preferred models are `gemini-2.5-flash`, `gpt-5-mini`, and `claude-haiku-4-5`. Model routing is DB-driven via the `ai_model_tier_config` table for runtime switching without restarts. Each model has a dedicated ModelAdapter (see `packages/ai/src/models/`) that injects per-model prompt optimizations (suffix/prefix/override modes) to exploit strengths and mitigate weaknesses.
- Category Resolution: topic categories are resolved via `resolveTopicCategory()` — a 3-step alias fallback (exact slug match → provider-specific alias in `topic_category_aliases` → universal alias). Unresolved slugs are logged to `unmapped_category_log` for audit.
- Schema: facts conform to topic-specific schemas defined in `fact_record_schemas`.
- Output: title, structured fact key-value pairs, context narrative, challenge title, notability score.
- Expiry: news-derived facts get a 30-day expiry; high-engagement facts auto-promote to enduring.
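The 3-step alias fallback in category resolution can be sketched as below. The lookup tables here stand in for the `topic_category_aliases` table, and the function signature is an illustrative assumption, not the real `resolveTopicCategory()` API:

```typescript
// Resolve a provider-supplied category slug via a 3-step fallback:
// exact match → provider-specific alias → universal alias.
function resolveTopicCategory(
  slug: string,
  provider: string,
  knownSlugs: Set<string>,
  providerAliases: Map<string, string>,   // keyed as `${provider}:${slug}`
  universalAliases: Map<string, string>,
): string | null {
  if (knownSlugs.has(slug)) return slug;                   // 1. exact slug match
  const byProvider = providerAliases.get(`${provider}:${slug}`);
  if (byProvider) return byProvider;                       // 2. provider-specific alias
  const universal = universalAliases.get(slug);
  if (universal) return universal;                         // 3. universal alias
  return null;  // unresolved → would be logged to unmapped_category_log
}
```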
Fact Validation
Every fact passes through a 4-phase validation pipeline before reaching the public feed:
| Phase | Name | What It Does | Cost |
|---|---|---|---|
| 1 | Structural | Schema conformance, type validation, injection detection | $0 (code-only) |
| 2 | Consistency | Internal contradictions, taxonomy rule violations | $0 (code-only) |
| 3 | Cross-Model | AI adversarial verification via Gemini 2.5 Flash | ~$0.001 |
| 4 | Evidence | External API corroboration (Wikipedia, Wikidata) + AI reasoner (Gemini 2.5 Flash) | ~$0.002-0.005 |
Phases 1-2 are free code-only checks that catch ~40% of defective facts before any AI call is made. Phases 3-4 use Gemini 2.5 Flash for cross-provider verification and evidence corroboration with full phase-by-phase audit trails.
Facts progress through statuses: `pending_validation` → `validated` → `published` (or `rejected`).
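The cost-ordering logic, with free phases first and a fail-fast short-circuit, can be sketched like this. The phase functions and return shape are illustrative assumptions:

```typescript
// A validation phase: a name, its per-call cost, and a check function.
type PhaseResult = { ok: boolean; reason?: string };
type Phase = { name: string; cost: number; run: (fact: unknown) => PhaseResult };

// Run phases in order, stopping at the first failure. Because the free
// code-only phases come first, defective facts are rejected before any
// paid AI call, and every phase leaves an audit entry.
function validateFact(
  fact: unknown,
  phases: Phase[],
): { status: "validated" | "rejected"; spentUsd: number; audit: string[] } {
  let spentUsd = 0;
  const audit: string[] = [];
  for (const phase of phases) {
    spentUsd += phase.cost;
    const result = phase.run(fact);
    audit.push(`${phase.name}: ${result.ok ? "pass" : `fail (${result.reason})`}`);
    if (!result.ok) return { status: "rejected", spentUsd, audit };
  }
  return { status: "validated", spentUsd, audit };
}
```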
Image Resolution
Every fact record gets an image through a priority cascade:
| Priority | Source | Coverage | Cost |
|---|---|---|---|
| 1 | Wikipedia PageImages | ~80% of named entities | Free, no key |
| 2 | TheSportsDB | Sports teams, athletes | Free key |
| 3 | Unsplash | Topical photos (landscapes, abstract) | Free key |
| 4 | Pexels | Topical photos (alternative pool) | Free key |
| 5 | null | UI shows placeholder | N/A |
Wikipedia is the primary source because facts are entity-centric and Wikipedia covers most notable entities. Unsplash and Pexels are topical fallbacks for abstract topics (e.g., "quantum computing breakthrough") where no entity-specific image exists.
Challenge-level images are resolved separately via RESOLVE_CHALLENGE_IMAGE, which runs the same cascade but stores results on fact_challenge_content rows rather than the parent fact_records row.
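The cascade itself is a first-hit-wins fallback chain. A minimal sketch, where the resolver functions are placeholders for the real API clients:

```typescript
// One resolver per source, tried in priority order.
type ImageResolver = (query: string) => Promise<string | null>;

// Walk the cascade: return the first non-null URL; a resolver that
// throws (rate limit, network error) simply falls through to the next.
async function resolveImage(
  query: string,
  resolvers: ImageResolver[],
): Promise<string | null> {
  for (const resolve of resolvers) {
    try {
      const url = await resolve(query);
      if (url) return url;      // priority hit wins
    } catch {
      // provider failure → try the next source
    }
  }
  return null;                  // cascade exhausted → UI shows placeholder
}
```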
Attribution Requirements
| Source | Required Attribution |
|---|---|
| Wikipedia | "Image from Wikipedia: {page_title}" |
| TheSportsDB | "Image from TheSportsDB: {team_name}" |
| Unsplash | "Photo by {photographer} on Unsplash" |
| Pexels | "Photo by {photographer} on Pexels" |
Challenge Content Generation
After facts are validated, the GENERATE_CHALLENGE_CONTENT queue produces pre-generated challenge material for each style:
- `statement_blank`: "_____ is the capital of France"
- `direct_question`: "What is the capital of France?"
- `fill_the_gap`: Sentence with masked answer
- `multiple_choice`: Question with 4 options
- `reverse_lookup`: Given the answer, identify the subject
- `free_text`: Open-ended question with AI grading
Each piece of content includes a three-layer emotional arc: `setup_text`, `challenge_text`, and `reveal_correct` / `reveal_wrong` — following the Eko voice constitution.
Challenge content generation uses micro-batching: the worker accumulates up to 5 queue messages over a 500ms window, then makes a single AI call per batch to amortize the ~5,200-token system prompt.
Evergreen Generation
Beyond news-derived facts, the pipeline generates "evergreen" knowledge facts via AI for each topic category:
- Quota: configurable per day (`EVERGREEN_DAILY_QUOTA`, default: 20)
- Topic balance: daily quotas per category prevent content monoculture
- These facts have no expiry and provide a stable content baseline
Feed Algorithm
The user feed blends four content streams:
| Stream | Weight | Source |
|---|---|---|
| Recent validated | 40% | Newly published facts |
| Review-due | 30% | Spaced repetition (SM-2 variant) |
| Evergreen | 20% | Knowledge facts with no expiry |
| Exploration | 10% | Random facts for discovery |
Cards are interleaved round-robin and include a userStatus badge (in-progress, review-due, mastered).
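The blend-then-interleave step can be sketched as below. The quota rounding and function shape are illustrative assumptions; the real feed also handles remainders and per-user state:

```typescript
// Blend weighted streams into one page: take each stream's share of the
// page (weight * pageSize), then interleave the picks round-robin.
function blendFeed<T>(
  streams: Array<{ cards: T[]; weight: number }>,
  pageSize: number,
): T[] {
  // Per-stream quota, proportional to its weight.
  const picks = streams.map((s) => s.cards.slice(0, Math.round(s.weight * pageSize)));
  const out: T[] = [];
  // Round-robin: one card from each stream per pass until the page fills.
  for (let i = 0; out.length < pageSize; i++) {
    const before = out.length;
    for (const pick of picks) {
      if (i < pick.length && out.length < pageSize) out.push(pick[i]);
    }
    if (out.length === before) break;  // all streams exhausted
  }
  return out;
}
```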
Environment Configuration
News API Keys
| Variable | Provider | Required |
|---|---|---|
| `EVENT_REGISTRY_API_KEY` | Event Registry (NewsAPI.ai) | Optional (v2, recommended) |
| `NEWSDATA_API_KEY` | Newsdata.io | Optional (v2) |
| `NEWS_API_KEY` | NewsAPI.org | Optional (legacy) |
| `GOOGLE_NEWS_API_KEY` | GNews | Optional (legacy) |
| `THENEWS_API_KEY` | TheNewsAPI | Optional (legacy) |
Image API Keys
| Variable | Provider | Required |
|---|---|---|
| (none) | Wikipedia | Always available |
| (none) | TheSportsDB | Always available (free key: "3") |
| `UNSPLASH_ACCESS_KEY` | Unsplash | Optional |
| `PEXELS_API_KEY` | Pexels | Optional |
AI Provider Keys
| Variable | Provider | Required |
|---|---|---|
| `OPENAI_API_KEY` | OpenAI | Required (default tier) |
| `ANTHROPIC_API_KEY` | Anthropic | Optional (mid/high tier) |
| `GOOGLE_API_KEY` | Google AI | Required for validation pipeline (Gemini 2.5 Flash) |
Note: Model routing is DB-driven via `ai_model_tier_config`. Each model can have a dedicated ModelAdapter for per-model prompt optimization. See Model Code Isolation.
Processing Settings
| Variable | Default | Description |
|---|---|---|
| `NEWS_INGESTION_INTERVAL_MINUTES` | 15 | Cron polling frequency |
| `FACT_EXTRACTION_BATCH_SIZE` | 10 | Stories per extraction batch |
| `VALIDATION_MIN_SOURCES` | 2 | Minimum sources for multi_source validation |
| `NOTABILITY_THRESHOLD` | 0.6 | Minimum notability score (0-1) |
| `EVERGREEN_DAILY_QUOTA` | 20 | Evergreen facts per day |
| `EVERGREEN_ENABLED` | false | Enable evergreen generation |
Cost Model
Development (Free Tier)
| Component | Monthly Cost |
|---|---|
| Event Registry (free, 2,000 tokens) | $0 |
| Newsdata.io (free) | $0 |
| NewsAPI.org (free) | $0 |
| GNews (free) | $0 |
| TheNewsAPI (free) | $0 |
| All image APIs | $0 |
| Total | $0 |
With Event Registry as the primary provider, the free tier delivers high-quality full-body articles for approximately 13 months before token exhaustion.
Budget Production
| Component | Monthly Cost |
|---|---|
| Event Registry $90/mo plan | $90 |
| All image APIs | $0 |
| Total | ~$90 |
Delivers full-body articles at scale without legacy provider limitations.
Full Production
| Component | Monthly Cost |
|---|---|
| Event Registry $90/mo plan | $90 |
| NewsAPI.org Business | $449 |
| GNews Essential | ~$50 |
| TheNewsAPI Basic | $19 |
| All image APIs | $0 |
| Total | ~$610 |
Worker Architecture
Three dedicated workers process the pipeline:
| Worker | Queue Types | Purpose |
|---|---|---|
| `worker-ingest` | `INGEST_NEWS`, `CLUSTER_STORIES`, `RESOLVE_IMAGE`, `RESOLVE_CHALLENGE_IMAGE` | Fetch, cluster, resolve images |
| `worker-facts` | `EXTRACT_FACTS`, `IMPORT_FACTS`, `GENERATE_EVERGREEN`, `EXPLODE_CATEGORY_ENTRY`, `FIND_SUPER_FACTS`, `GENERATE_CHALLENGE_CONTENT` | AI extraction and content generation |
| `worker-validate` | `VALIDATE_FACT` | Multi-tier fact verification |
All workers use Bun with Upstash Redis queues, exponential backoff polling, and scale-to-zero idle exit.
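A worker poll loop of that shape can be sketched as follows: back off exponentially while the queue is empty, reset on work, and exit after enough consecutive idle polls so the process can scale to zero. All names and constants here are illustrative:

```typescript
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

// Poll until the queue has been empty for maxIdlePolls consecutive
// polls, then return (the process exits → scale-to-zero).
async function pollLoop(
  receive: () => Promise<unknown | null>,   // queue client: next message or null
  handle: (msg: unknown) => Promise<void>,  // message handler
  { baseMs = 250, maxMs = 8000, maxIdlePolls = 10 } = {},
): Promise<void> {
  let delay = baseMs;
  let idlePolls = 0;
  while (idlePolls < maxIdlePolls) {
    const msg = await receive();
    if (msg) {
      await handle(msg);
      delay = baseMs;   // found work: reset backoff and idle counter
      idlePolls = 0;
    } else {
      idlePolls++;
      await sleep(delay);
      delay = Math.min(delay * 2, maxMs);  // empty queue: back off
    }
  }
}
```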
Key Files
| File | Purpose |
|---|---|
| `apps/worker-ingest/src/handlers/ingest-news.ts` | News provider implementations and dispatch |
| `apps/worker-ingest/src/handlers/resolve-image.ts` | Image resolution cascade (fact-level) |
| `apps/worker-ingest/src/handlers/resolve-challenge-image.ts` | Image resolution cascade (challenge-level) |
| `apps/web/app/api/cron/ingest-news/route.ts` | Cron that enqueues `INGEST_NEWS` per provider |
| `packages/shared/src/schemas.ts` | Queue message Zod schemas |
| `packages/queue/src/index.ts` | Queue client, message constructors |
| `packages/config/src/index.ts` | `FactEngineConfig` and API key management |
| `packages/db/src/drizzle/schema.ts` | `news_sources`, `stories`, `fact_records` tables |
| `packages/db/src/drizzle/fact-engine-queries.ts` | Pipeline query functions |