News Pipeline — Detailed Flow
How Eko turns breaking news into verified, structured fact cards every 15 minutes — from external news API to the user's feed.
Overview
The news pipeline is Eko's primary source of timely content. It runs on a 15-minute cron cycle:
- Fetch articles from active news providers
- Deduplicate and quality-filter
- Cluster related articles into stories
- Extract structured facts via AI
- Validate, resolve images, generate challenges
- Publish to the blended feed
News-derived facts have a 30-day expiry — they're perishable content tied to real-world events. High-engagement facts can be auto-promoted to permanent status.
End-to-End Flow
Vercel Cron fires every 15 minutes
(apps/web/app/api/cron/ingest-news/route.ts)
│
▼
┌─ PHASE 1: PROVIDER DETECTION ────────────────────────────────────┐
│ Check which API keys are configured │
│ Query active root-level topic categories (maxDepth: 0) │
│ Enqueue 1 INGEST_NEWS per provider × per category │
└──────────────────────────────────────────────────┬────────────────┘
│
▼
┌─ PHASE 2: FETCH & NORMALIZE ─────────────────────────────────────┐
│ Worker: worker-ingest │
│ Route to provider client (fetchFromEventRegistry, etc.) │
│ Normalize response → StandardArticle contract │
│ Quality filter: title + description ≥ 400 chars │
│ Dedup: ON CONFLICT DO NOTHING on (provider, external_id) │
│ Insert qualifying articles into news_sources │
└──────────────────────────────────────────────────┬────────────────┘
│ ≥ 5 new articles inserted?
▼
┌─ PHASE 3: CLUSTERING ───────────────────────────────────────────┐
│ Worker: worker-ingest │
│ TF-IDF cosine similarity on titles + descriptions │
│ 24-hour time window │
│ Similarity threshold: 0.6 │
│ Result: story records linking multiple news_source rows │
└──────────────────────────────────────────────────┬────────────────┘
│ story.source_count ≥ 3?
▼
┌─ PHASE 4: AI FACT EXTRACTION ────────────────────────────────────┐
│ Worker: worker-facts │
│ Load story + top 5 linked articles (max 1500 chars each) │
│ Resolve topic category (3-step alias fallback) │
│ Load schema keys for topic │
│ Resolve enrichment context (optional, never blocks) │
│ AI extraction → title, facts{}, context, notability_score │
│ Notability ≥ 0.6 → insert fact_record │
└──────────────────────────────────────────────────┬────────────────┘
│
▼
┌─ PHASE 5: VALIDATION ───────────────────────────────────────────┐
│ Worker: worker-validate │
│ Strategy: multi_source (confidence from source count) │
│ 5+ sources → 0.9 | 3-4 → 0.8 | 1-2 → 0.7 │
│ Pass: confidence ≥ 0.7 + no critical flags │
└──────────────────────────┬──────────────────────┬────────────────┘
│ │
▼ ▼
┌─ PHASE 6a: IMAGE ──────────────┐ ┌─ PHASE 6b: CHALLENGES ──────┐
│ Wikipedia → TheSportsDB │ │ 6 pre-generated styles │
│ → Unsplash → Pexels │ │ 5-layer structure per style │
│ → null (UI placeholder) │ │ Quality rules enforced │
└─────────────────────────────────┘ └──────────────────────────────┘
│ │
└──────────┬───────────┘
▼
Fact appears in feed
(40% recent + 10% exploration)
Phase 1: Provider Detection
Active Providers
Only two providers are currently active in production:
| Provider | API | Content Quality | Free Tier | Status |
|---|---|---|---|---|
| Event Registry | newsapi.ai/api/v1 | Full article body (3,000-5,000 chars avg) | 2,000 tokens (~200K articles) | Active |
| Newsdata.io | newsdata.io/api/1 | Title + description (paid: full body) | 30 req/day | Active |
Three legacy providers are deprecated and no longer used:
| Provider | Status | Reason |
|---|---|---|
| NewsAPI.org | Deprecated | Truncated content, expensive ($449/mo for full body) |
| GNews | Deprecated | Truncated content, limited free tier |
| TheNewsAPI | Deprecated | Truncated content, low article count |
Provider Priority
The cron checks API keys in order, enqueuing one INGEST_NEWS message per configured provider per active root-level topic category:
- `EVENT_REGISTRY_API_KEY` → `event_registry`
- `NEWSDATA_API_KEY` → `newsdata`
- `NEWS_API_KEY` → `newsapi` (legacy)
- `GOOGLE_NEWS_API_KEY` → `gnews` (legacy)
- `THENEWS_API_KEY` → `thenewsapi` (legacy)
Only root-level categories (`maxDepth: 0`) are used, to prevent quota explosion when subcategories exist.
Queue Message
```
INGEST_NEWS {
  provider: "event_registry",
  category: "science",
  max_results: 20
}
```
Phase 2: Fetch & Normalize
StandardArticle Contract
Every provider normalizes its response to a common shape:
| Field | Type | Description |
|---|---|---|
| `externalId` | `string` | URL hash (Bun wyhash) or native UUID |
| `sourceName` | `string \| null` | Publication name ("Reuters", "BBC") |
| `sourceDomain` | `string \| null` | Hostname from article URL |
| `title` | `string` | Article headline |
| `description` | `string \| null` | Summary or snippet |
| `articleUrl` | `string` | Canonical article URL |
| `imageUrl` | `string \| null` | Hero image (resolved later if null) |
| `publishedAt` | `Date \| null` | Publication timestamp |
| `contentHash` | `string \| null` | Bun wyhash for cross-provider dedup |
| `fullContent` | `string \| null` | Full article body (v2 providers only) |
Quality Filter
Articles must pass a minimum text length check:
title.length + description.length ≥ 400 characters
Articles below this threshold are discarded to prevent LLM hallucination from thin snippets. This is especially critical for legacy providers that return truncated content (~256 chars).
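The check is simple enough to sketch directly; the 400-character threshold is from this document, and the function name is illustrative:

```typescript
// Minimal sketch of the quality filter: combined title + description length
// must reach 400 characters before an article is kept.
interface ArticleText {
  title: string;
  description: string | null;
}

const MIN_COMBINED_LENGTH = 400;

function passesQualityFilter(article: ArticleText): boolean {
  const combined = article.title.length + (article.description?.length ?? 0);
  return combined >= MIN_COMBINED_LENGTH;
}
```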
Deduplication
Articles are inserted with `ON CONFLICT DO NOTHING` on `(provider, external_id)`. This provides:
- Within-provider dedup: the same article fetched across cron cycles is skipped
- Cross-provider dedup: `contentHash` can detect the same article arriving from different providers
Clustering Trigger
If ≥ 5 new articles were inserted in this batch, a CLUSTER_STORIES message is enqueued.
Phase 3: Story Clustering
Why Cluster?
Without clustering, "SpaceX Launches Starship to Mars" reported by 15 outlets would produce 15 near-identical facts. Clustering groups articles about the same event into a single story.
Algorithm
- Method: TF-IDF cosine similarity on article titles and descriptions
- Time window: 24 hours (articles outside this window start a new cluster)
- Similarity threshold: 0.6 (configurable)
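The similarity test above can be sketched with raw term frequencies; the production code uses full TF-IDF weighting over titles plus descriptions, so this is a simplified illustration and all names are hypothetical:

```typescript
// Build a term-frequency vector from a title.
function termFreq(text: string): Map<string, number> {
  const tf = new Map<string, number>();
  for (const token of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
    tf.set(token, (tf.get(token) ?? 0) + 1);
  }
  return tf;
}

// Standard cosine similarity over sparse vectors.
function cosineSimilarity(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0;
  for (const [term, wa] of a) dot += wa * (b.get(term) ?? 0);
  const norm = (v: Map<string, number>) =>
    Math.sqrt([...v.values()].reduce((s, w) => s + w * w, 0));
  const denom = norm(a) * norm(b);
  return denom === 0 ? 0 : dot / denom;
}

const SIMILARITY_THRESHOLD = 0.6;

// Two articles cluster into the same story when similarity clears 0.6.
function sameStory(titleA: string, titleB: string): boolean {
  return cosineSimilarity(termFreq(titleA), termFreq(titleB)) >= SIMILARITY_THRESHOLD;
}
```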
Story Record
Each story contains:
| Field | Description |
|---|---|
| `headline` | Best-representative headline |
| `summary` | AI-generated or best description |
| `source_count` | Number of linked articles (used for validation confidence) |
| `source_domains[]` | Distinct publication domains |
| `category` | Topic slug from the triggering ingest |
| `status` | `clustering` → `published` → `archived` |
Extraction Trigger
A story becomes eligible for fact extraction when source_count ≥ 3. This threshold ensures the event has been independently reported by multiple outlets before the AI attempts extraction.
Phase 4: AI Fact Extraction
Topic Category Resolution
Before calling the AI, the handler resolves the story's category slug to an internal topic category using a 3-step fallback:
1. Exact slug match against `topic_categories.slug`
2. Provider-specific alias in `topic_category_aliases` (where `provider = providerName`)
3. Universal alias in `topic_category_aliases` (where `provider IS NULL`)

If no match is found, the slug is logged to `unmapped_category_log` for audit and extraction is skipped.
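The fallback chain can be sketched in memory. The row shapes and lookup helpers below are illustrative stand-ins for the actual DB queries:

```typescript
// Illustrative alias row mirroring topic_category_aliases.
interface CategoryAlias {
  alias: string;
  provider: string | null; // null = universal alias
  categorySlug: string;
}

// Hypothetical resolver implementing the 3-step fallback described above.
function resolveCategory(
  slug: string,
  providerName: string,
  categorySlugs: Set<string>,
  aliases: CategoryAlias[],
): string | null {
  // 1. Exact slug match
  if (categorySlugs.has(slug)) return slug;
  // 2. Provider-specific alias
  const providerAlias = aliases.find(
    (a) => a.alias === slug && a.provider === providerName,
  );
  if (providerAlias) return providerAlias.categorySlug;
  // 3. Universal alias (provider IS NULL)
  const universal = aliases.find((a) => a.alias === slug && a.provider === null);
  if (universal) return universal.categorySlug;
  // No match: caller logs to unmapped_category_log and skips extraction.
  return null;
}
```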
Enrichment Context
The handler optionally resolves enrichment data to ground the AI:
| Always | Topic-Routed |
|---|---|
| Knowledge Graph | TheSportsDB (sports/*) |
| Wikidata | MusicBrainz (music/*) |
| Wikipedia | Nominatim (geography/*) |
| | Open Library (books/*) |
All calls use Promise.allSettled() — enrichment never blocks extraction.
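The non-blocking fan-out can be sketched as follows, assuming each enrichment client returns a promise; the names are illustrative. Because `Promise.allSettled()` never rejects, a failed or slow lookup only means less context, never a blocked extraction:

```typescript
// Illustrative enrichment result shape.
type EnrichmentResult = { source: string; data: unknown };

// Run all enrichment lookups concurrently and keep only the ones that
// succeeded; rejections are silently dropped.
async function resolveEnrichment(
  lookups: Array<Promise<EnrichmentResult>>,
): Promise<EnrichmentResult[]> {
  const settled = await Promise.allSettled(lookups);
  return settled
    .filter(
      (r): r is PromiseFulfilledResult<EnrichmentResult> =>
        r.status === "fulfilled",
    )
    .map((r) => r.value);
}
```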
AI Call
extractFactsFromStory() receives:
| Input | Description |
|---|---|
| Story headline + summary | What happened |
| Article texts (top 5, max 1500 chars each) | Source material |
| Schema keys | Required JSONB fields for this topic |
| Topic path | Classification context |
| Entity context | Enrichment data (optional) |
| Published date | Temporal grounding |
| Subcategory hierarchy | For classification |
Output
| Field | Description |
|---|---|
| `title` | Factual, Wikipedia-style label |
| `challengeTitle` | Theatrical, curiosity-provoking hook |
| `facts` | Structured JSONB conforming to schema |
| `context` | 4-8 sentence narrative (Hook → Story → Connection) |
| `notabilityScore` | 0.0-1.0 AI assessment |
| `notabilityReason` | One-sentence justification |
Notability Gate
Facts scoring below 0.6 (configurable via NOTABILITY_THRESHOLD) are logged and discarded. This is the pipeline's first quality filter — it catches mundane or low-signal news.
Model Selection
| Aspect | Detail |
|---|---|
| Task name | fact_extraction |
| Default tier | mid (accuracy is critical for news) |
| Model routing | DB-driven via ai_model_tier_config |
| Preferred models | gemini-2.5-flash, gpt-5-mini, claude-haiku-4-5 |
Fact Insertion
Each fact passing the notability gate is inserted with:
- `status: 'pending_validation'`
- `source_type: 'news_extraction'`
- `expires_at: NOW() + 30 days` (news facts are perishable)
- `VALIDATE_FACT` enqueued with strategy `multi_source`
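The 30-day expiry stamp can be sketched as a small date helper; the function name is illustrative:

```typescript
// News facts are perishable: stamp expires_at 30 days out from insertion.
const NEWS_FACT_TTL_DAYS = 30;

function newsFactExpiry(now: Date): Date {
  const expires = new Date(now);
  // setUTCDate rolls over month boundaries correctly.
  expires.setUTCDate(expires.getUTCDate() + NEWS_FACT_TTL_DAYS);
  return expires;
}
```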
Phase 5: Validation
News facts use the multi_source strategy, where confidence scales with independent source count:
| Source Count | Confidence | Rationale |
|---|---|---|
| 5+ sources | 0.9 | Widely corroborated |
| 3-4 sources | 0.8 | Multiple independent reports |
| 1-2 sources | 0.7 | Minimum threshold |
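The mapping in the table above is a simple step function; a minimal sketch with illustrative names:

```typescript
// Confidence scales with independent source count (multi_source strategy).
function multiSourceConfidence(sourceCount: number): number {
  if (sourceCount >= 5) return 0.9; // widely corroborated
  if (sourceCount >= 3) return 0.8; // multiple independent reports
  return 0.7; // minimum threshold
}

const PASS_THRESHOLD = 0.7;

// Pass requires confidence >= 0.7 and no flag containing "critical".
function passesValidation(sourceCount: number, flags: string[]): boolean {
  const hasCritical = flags.some((f) => f.includes("critical"));
  return multiSourceConfidence(sourceCount) >= PASS_THRESHOLD && !hasCritical;
}
```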
Pass/Fail
- Pass: confidence ≥ 0.7 AND no flags containing "critical"
- On pass: status → `validated`, `published_at` set, fan-out triggered
- On fail: status → `rejected`, excluded from feed
Post-Validation Fan-Out
Two independent queue messages, fired in parallel:
- `RESOLVE_IMAGE` → worker-ingest (image cascade)
- `GENERATE_CHALLENGE_CONTENT` → worker-facts (6 quiz styles)
Phase 6a: Image Resolution
Priority cascade (stops at first hit):
| Priority | Source | Best For | Hit Rate |
|---|---|---|---|
| 1 | Wikipedia PageImages | Named entities | ~80% |
| 2 | TheSportsDB | Sports teams/athletes | Sports only |
| 3 | Unsplash | Abstract/topical subjects | Fallback |
| 4 | Pexels | Alternative photo pool | Fallback |
| 5 | null | UI placeholder | Last resort |
Idempotent: if the fact already has an image_url, the handler exits early.
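The cascade reduces to "try each resolver in order, stop at the first hit". A hedged sketch, where the resolver functions stand in for the real Wikipedia/TheSportsDB/Unsplash/Pexels clients:

```typescript
// A resolver returns an image URL or null on miss.
type ImageResolver = (query: string) => Promise<string | null>;

async function resolveImage(
  query: string,
  resolvers: ImageResolver[],
  existingUrl: string | null,
): Promise<string | null> {
  // Idempotent: exit early if the fact already has an image_url.
  if (existingUrl) return existingUrl;
  for (const resolve of resolvers) {
    const url = await resolve(query);
    if (url) return url; // first hit wins, cascade stops
  }
  return null; // UI renders a placeholder
}
```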
Phase 6b: Challenge Content Generation
6 pre-generated styles per fact, each with a 5-layer structure:
| Layer | Purpose |
|---|---|
| `setup_text` | 2-4 sentences of freely shared context |
| `challenge_text` | Invitation to answer (must contain "you"/"your") |
| `reveal_correct` | Celebrates knowledge, teaches extra detail |
| `reveal_wrong` | Kind teaching, includes correct answer |
| `correct_answer` | 3-6 sentence rich narrative for streaming display |
Uses micro-batching: accumulates up to 5 queue messages over 500ms, single AI call per batch to amortize the ~5,200-token system prompt.
Timing: End-to-End Latency
| Phase | Typical Duration | Bottleneck |
|---|---|---|
| Cron → INGEST_NEWS | 0-15 min | Cron interval |
| Fetch articles | 2-5 sec | API response time |
| Clustering | 1-3 sec | TF-IDF computation |
| AI extraction | 5-15 sec | LLM inference |
| Validation | < 1 sec | Multi-source is fast |
| Image resolution | 2-10 sec | Wikipedia API |
| Challenge generation | 10-30 sec | LLM inference for 6 styles |
| Total | ~1-16 minutes | Dominated by cron interval |
Real-World Example: "Fed Raises Interest Rates"
Phase 1: Cron fires
- Event Registry key configured → enqueue `INGEST_NEWS(event_registry, economics)`
- Newsdata key configured → enqueue `INGEST_NEWS(newsdata, economics)`
Phase 2: Fetch
- Event Registry returns 25 articles about the rate decision (full body)
- Newsdata returns 12 articles (title + description)
- Quality filter removes 3 thin articles
- Dedup removes 5 cross-provider duplicates
- 29 new `news_sources` rows inserted
- ≥ 5 threshold met → enqueue `CLUSTER_STORIES`
Phase 3: Clustering
- TF-IDF groups 18 of the 29 articles into one story: "Federal Reserve Raises Rates to 5.75%"
- Story `source_count: 18`, `source_domains: ["reuters.com", "bbc.com", "cnbc.com", ...]`
- `source_count ≥ 3` → enqueue `EXTRACT_FACTS`
Phase 4: Extraction
- Load story + top 5 articles (Reuters, BBC, CNBC, WSJ, AP)
- Topic resolves: `economics` → "Business & Economics"
- Schema keys: `institution`, `rate`, `previous_rate`, `effective_date`, `vote_split`
- Enrichment: Knowledge Graph returns "Federal Reserve" entity, Wikidata confirms
- AI extracts:
  - `title`: "Federal Reserve Raises Federal Funds Rate to 5.75%"
  - `challengeTitle`: "The Number That Shook Wall Street This Week"
  - `facts`: `{ institution: "Federal Reserve", rate: "5.75%", previous_rate: "5.50%", effective_date: "2026-03-20", vote_split: "10-2" }`
  - `context`: "The Federal Reserve raised its benchmark interest rate by 25 basis points..."
  - `notabilityScore`: 0.92
- Score 0.92 ≥ 0.6 → insert `fact_record`, enqueue `VALIDATE_FACT` (multi_source)
Phase 5: Validation
- 18 independent sources → confidence 0.9
- No critical flags → validated
- Fan-out: `RESOLVE_IMAGE` + `GENERATE_CHALLENGE_CONTENT`
Phase 6: Image + Challenges
- Wikipedia: "Federal Reserve" → building photo ✓
- 6 challenge styles generated:
- Multiple choice: "What rate did the Fed set? A) 5.50% B) 5.75% C) 6.00% D) 5.25%"
- Direct question: "What interest rate did the Federal Reserve set in March 2026?"
- etc.
Result
- Fact appears in feed under "Business & Economics"
- `expires_at: 2026-04-22` (30 days)
- Available to users within the next feed refresh
Environment Variables
News API Keys
| Variable | Provider | Status |
|---|---|---|
| `EVENT_REGISTRY_API_KEY` | Event Registry | Active (recommended) |
| `NEWSDATA_API_KEY` | Newsdata.io | Active |
| `NEWS_API_KEY` | NewsAPI.org | Deprecated |
| `GOOGLE_NEWS_API_KEY` | GNews | Deprecated |
| `THENEWS_API_KEY` | TheNewsAPI | Deprecated |
Processing Thresholds
| Variable | Default | Description |
|---|---|---|
| `NEWS_INGESTION_INTERVAL_MINUTES` | 15 | Cron polling frequency |
| `FACT_EXTRACTION_BATCH_SIZE` | 10 | Stories per extraction batch |
| `VALIDATION_MIN_SOURCES` | 2 | Minimum sources for multi_source |
| `NOTABILITY_THRESHOLD` | 0.6 | Minimum notability score (0-1) |
Cost Model
News API Costs
| Tier | Monthly Cost |
|---|---|
| Development (free tiers) | $0 |
| Budget Production (Event Registry $90/mo) | $90 |
| Full Production (all providers) | ~$610 |
Event Registry free tier: ~200K articles (lasts ~13 months at current volume).
AI Processing Costs (per fact)
| Step | Cost |
|---|---|
| Fact extraction | ~$0.01 |
| Validation (phases 3-4) | ~$0.003 |
| Challenge content | ~$0.006 |
| Image resolution | $0 |
| Total per fact | ~$0.02 |
Key Files
| File | Purpose |
|---|---|
| `apps/web/app/api/cron/ingest-news/route.ts` | Cron trigger, provider detection |
| `apps/worker-ingest/src/handlers/ingest-news.ts` | Provider clients, normalization, quality filter |
| `apps/worker-ingest/src/handlers/cluster-stories.ts` | TF-IDF clustering |
| `apps/worker-ingest/src/handlers/resolve-image.ts` | Image cascade |
| `apps/worker-facts/src/handlers/extract-facts.ts` | AI fact extraction |
| `apps/worker-validate/src/handlers/validate-fact.ts` | Validation + post-validation fan-out |
| `packages/ai/src/fact-engine.ts` | `extractFactsFromStory()` |
| `packages/ai/src/enrichment.ts` | Enrichment orchestrator |
| `packages/shared/src/schemas.ts` | Queue message Zod schemas |
| `packages/config/src/index.ts` | `FactEngineConfig`, thresholds |
Related
- Fact Ingestion — Source of Truth Map — SOT references for all three pipelines
- News-to-Challenge Ingestion Guide — Step-by-step walkthrough
- Evergreen Pipeline — AI-generated timeless facts
- Seeding Pipeline — Entity explosion and bootstrapping
- News Category Strategy — Target categories and provider selection