News Pipeline — Detailed Flow

How Eko turns breaking news into verified, structured fact cards every 15 minutes — from external news API to the user's feed.

Overview

The news pipeline is Eko's primary source of timely content. It runs on a 15-minute cron cycle:

  1. Fetch articles from active news providers
  2. Deduplicate and quality-filter
  3. Cluster related articles into stories
  4. Extract structured facts via AI
  5. Validate, resolve images, generate challenges
  6. Publish to the blended feed

News-derived facts have a 30-day expiry — they're perishable content tied to real-world events. High-engagement facts can auto-promote to permanent.


End-to-End Flow

  Vercel Cron fires every 15 minutes
  (apps/web/app/api/cron/ingest-news/route.ts)
         │
         ▼
  ┌─ PHASE 1: PROVIDER DETECTION ────────────────────────────────────┐
  │  Check which API keys are configured                              │
  │  Query active root-level topic categories (maxDepth: 0)           │
  │  Enqueue 1 INGEST_NEWS per provider × per category                │
  └──────────────────────────────────────────────────┬────────────────┘
         │
         ▼
  ┌─ PHASE 2: FETCH & NORMALIZE ─────────────────────────────────────┐
  │  Worker: worker-ingest                                            │
  │  Route to provider client (fetchFromEventRegistry, etc.)          │
  │  Normalize response → StandardArticle contract                    │
  │  Quality filter: title + description ≥ 400 chars                  │
  │  Dedup: ON CONFLICT DO NOTHING on (provider, external_id)         │
  │  Insert qualifying articles into news_sources                     │
  └──────────────────────────────────────────────────┬────────────────┘
         │  ≥ 5 new articles inserted?
         ▼
  ┌─ PHASE 3: CLUSTERING ───────────────────────────────────────────┐
  │  Worker: worker-ingest                                           │
  │  TF-IDF cosine similarity on titles + descriptions               │
  │  24-hour time window                                             │
  │  Similarity threshold: 0.6                                       │
  │  Result: story records linking multiple news_source rows          │
  └──────────────────────────────────────────────────┬────────────────┘
         │  story.source_count ≥ 3?
         ▼
  ┌─ PHASE 4: AI FACT EXTRACTION ────────────────────────────────────┐
  │  Worker: worker-facts                                             │
  │  Load story + top 5 linked articles (max 1500 chars each)         │
  │  Resolve topic category (3-step alias fallback)                   │
  │  Load schema keys for topic                                       │
  │  Resolve enrichment context (optional, never blocks)              │
  │  AI extraction → title, facts{}, context, notability_score        │
  │  Notability ≥ 0.6 → insert fact_record                           │
  └──────────────────────────────────────────────────┬────────────────┘
         │
         ▼
  ┌─ PHASE 5: VALIDATION ───────────────────────────────────────────┐
  │  Worker: worker-validate                                         │
  │  Strategy: multi_source (confidence from source count)            │
  │  5+ sources → 0.9 | 3-4 → 0.8 | 1-2 → 0.7                      │
  │  Pass: confidence ≥ 0.7 + no critical flags                      │
  └──────────────────────────┬──────────────────────┬────────────────┘
                             │                      │
                             ▼                      ▼
  ┌─ PHASE 6a: IMAGE ──────────────┐  ┌─ PHASE 6b: CHALLENGES ──────┐
  │  Wikipedia → TheSportsDB        │  │  6 pre-generated styles      │
  │  → Unsplash → Pexels            │  │  5-layer structure per style │
  │  → null (UI placeholder)        │  │  Quality rules enforced      │
  └─────────────────────────────────┘  └──────────────────────────────┘
                             │                      │
                             └──────────┬───────────┘
                                        ▼
                                Fact appears in feed
                           (40% recent + 10% exploration)

Phase 1: Provider Detection

Active Providers

Only two providers are currently active in production:

Provider        API                 Content Quality                             Free Tier                      Status
Event Registry  newsapi.ai/api/v1   Full article body (3,000-5,000 chars avg)  2,000 tokens (~200K articles)  Active
Newsdata.io     newsdata.io/api/1   Title + description (paid: full body)      30 req/day                     Active

Three legacy providers are deprecated and no longer used:

Provider     Status      Reason
NewsAPI.org  Deprecated  Truncated content, expensive ($449/mo for full body)
GNews        Deprecated  Truncated content, limited free tier
TheNewsAPI   Deprecated  Truncated content, low article count

Provider Priority

The cron checks API keys in order, enqueuing one INGEST_NEWS message per configured provider per active root-level topic category:

  1. EVENT_REGISTRY_API_KEY → event_registry
  2. NEWSDATA_API_KEY → newsdata
  3. NEWS_API_KEY → newsapi (legacy)
  4. GOOGLE_NEWS_API_KEY → gnews (legacy)
  5. THENEWS_API_KEY → thenewsapi (legacy)

Root-level categories only (maxDepth: 0) to prevent quota explosion when subcategories exist.
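
The fan-out can be sketched in TypeScript as follows. This is a minimal illustration, not the actual cron route: `planIngestMessages`, its parameters, and the `PROVIDERS` list (shown here with only the two active providers) are hypothetical names.

```typescript
// Sketch of Phase 1: one INGEST_NEWS message per configured provider
// per active root-level category. Helper names are illustrative.
type IngestMessage = { provider: string; category: string; max_results: number };

const PROVIDERS = [
  { envKey: "EVENT_REGISTRY_API_KEY", provider: "event_registry" },
  { envKey: "NEWSDATA_API_KEY", provider: "newsdata" },
  // legacy keys (newsapi, gnews, thenewsapi) omitted for brevity
];

function planIngestMessages(
  env: Record<string, string | undefined>,
  rootCategories: string[], // categories queried with maxDepth: 0
): IngestMessage[] {
  const messages: IngestMessage[] = [];
  for (const { envKey, provider } of PROVIDERS) {
    if (!env[envKey]) continue; // skip providers without a configured key
    for (const category of rootCategories) {
      messages.push({ provider, category, max_results: 20 });
    }
  }
  return messages;
}
```

With one key configured and two root categories, this yields two messages — the provider × category product the diagram describes.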

Queue Message

INGEST_NEWS {
  provider: "event_registry",
  category: "science",
  max_results: 20
}

Phase 2: Fetch & Normalize

StandardArticle Contract

Every provider normalizes its response to a common shape:

Field         Type           Description
externalId    string         URL hash (Bun wyhash) or native UUID
sourceName    string | null  Publication name ("Reuters", "BBC")
sourceDomain  string | null  Hostname from article URL
title         string         Article headline
description   string | null  Summary or snippet
articleUrl    string         Canonical article URL
imageUrl      string | null  Hero image (resolved later if null)
publishedAt   Date | null    Publication timestamp
contentHash   string | null  Bun wyhash for cross-provider dedup
fullContent   string | null  Full article body (v2 providers only)
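
As a TypeScript sketch, the contract above transcribes directly to an interface (field names and nullability taken from the table; this is illustrative, not the actual package export):

```typescript
// Transcription of the StandardArticle contract documented above.
interface StandardArticle {
  externalId: string;          // URL hash (Bun wyhash) or native UUID
  sourceName: string | null;   // publication name ("Reuters", "BBC")
  sourceDomain: string | null; // hostname from article URL
  title: string;               // article headline
  description: string | null;  // summary or snippet
  articleUrl: string;          // canonical article URL
  imageUrl: string | null;     // hero image (resolved later if null)
  publishedAt: Date | null;    // publication timestamp
  contentHash: string | null;  // cross-provider dedup hash
  fullContent: string | null;  // full body (v2 providers only)
}

// Example of a normalized article (hypothetical values):
const example: StandardArticle = {
  externalId: "a1b2c3",
  sourceName: "Reuters",
  sourceDomain: "reuters.com",
  title: "Example headline",
  description: null,
  articleUrl: "https://reuters.com/example",
  imageUrl: null,
  publishedAt: null,
  contentHash: null,
  fullContent: null,
};
```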

Quality Filter

Articles must pass a minimum text length check:

title.length + description.length ≥ 400 characters

Articles below this threshold are discarded to prevent LLM hallucination from thin snippets. This is especially critical for legacy providers that return truncated content (~256 chars).
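
The gate itself is a one-line length check; a minimal sketch, assuming the documented 400-character rule (the function name is illustrative):

```typescript
// Quality gate: discard articles too thin to extract facts from safely.
const MIN_TEXT_LENGTH = 400;

function passesQualityFilter(title: string, description: string | null): boolean {
  return title.length + (description?.length ?? 0) >= MIN_TEXT_LENGTH;
}
```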

Deduplication

Articles are inserted with ON CONFLICT DO NOTHING on (provider, external_id). This provides:

  • Within-provider dedup: Same article fetched across cron cycles
  • Cross-provider dedup: contentHash can detect the same article from different providers
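
The semantics of the conflict clause can be illustrated in-memory: each `(provider, external_id)` pair is inserted at most once, while the same `external_id` under a different provider still lands (cross-provider matching relies on `contentHash`, not this key). In production this is a single `INSERT ... ON CONFLICT DO NOTHING`; the sketch below is only a model of its behavior.

```typescript
// In-memory illustration of ON CONFLICT DO NOTHING on (provider, external_id).
function dedupeInsert(
  seen: Set<string>,
  articles: { provider: string; externalId: string }[],
): number {
  let inserted = 0;
  for (const a of articles) {
    const key = `${a.provider}:${a.externalId}`;
    if (seen.has(key)) continue; // conflict → do nothing
    seen.add(key);
    inserted++;
  }
  return inserted; // count drives the ≥ 5 clustering trigger below
}
```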

Clustering Trigger

If ≥ 5 new articles were inserted in this batch, a CLUSTER_STORIES message is enqueued.


Phase 3: Story Clustering

Why Cluster?

Without clustering, "SpaceX Launches Starship to Mars" reported by 15 outlets would produce 15 near-identical facts. Clustering groups articles about the same event into a single story.

Algorithm

  • Method: TF-IDF cosine similarity on article titles and descriptions
  • Time window: 24 hours (articles outside this window start a new cluster)
  • Similarity threshold: 0.6 (configurable)
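
A toy version of the similarity computation, to make the method concrete. This is not the production handler (which adds the 24-hour window and persists story records); it just shows TF-IDF vectors over title + description tokens compared with cosine similarity.

```typescript
// Tokenize, build TF-IDF vectors over a document set, compare with cosine.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

function tfidfVectors(docs: string[]): Map<string, number>[] {
  const tokenized = docs.map(tokenize);
  const df = new Map<string, number>(); // document frequency per term
  for (const tokens of tokenized) {
    for (const t of new Set(tokens)) df.set(t, (df.get(t) ?? 0) + 1);
  }
  return tokenized.map((tokens) => {
    const tf = new Map<string, number>();
    for (const t of tokens) tf.set(t, (tf.get(t) ?? 0) + 1);
    const vec = new Map<string, number>();
    for (const [t, count] of tf) {
      const idf = Math.log(docs.length / df.get(t)!) + 1; // smoothed IDF
      vec.set(t, (count / tokens.length) * idf);
    }
    return vec;
  });
}

function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0, na = 0, nb = 0;
  for (const [t, w] of a) { dot += w * (b.get(t) ?? 0); na += w * w; }
  for (const w of b.values()) nb += w * w;
  return na && nb ? dot / (Math.sqrt(na) * Math.sqrt(nb)) : 0;
}
```

Two headlines about the same launch score well above two unrelated ones; the 0.6 threshold decides whether an article joins an existing cluster.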

Story Record

Each story contains:

Field             Description
headline          Best-representative headline
summary           AI-generated or best description
source_count      Number of linked articles (used for validation confidence)
source_domains[]  Distinct publication domains
category          Topic slug from the triggering ingest
status            clustering → published → archived

Extraction Trigger

A story becomes eligible for fact extraction when source_count ≥ 3. This threshold ensures the event has been independently reported by multiple outlets before the AI attempts extraction.


Phase 4: AI Fact Extraction

Topic Category Resolution

Before calling the AI, the handler resolves the story's category slug to an internal topic category via three lookup steps, falling through to an audit log when none match:

  1. Exact slug match against topic_categories.slug
  2. Provider-specific alias in topic_category_aliases (where provider = providerName)
  3. Universal alias in topic_category_aliases (where provider IS NULL)
  4. No match: log to unmapped_category_log for audit, skip extraction
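
A minimal sketch of the fallback chain, with in-memory lookup tables standing in for the `topic_categories` and `topic_category_aliases` queries (all names in the sketch besides the documented table semantics are illustrative):

```typescript
// Resolve a raw provider category slug to an internal topic slug.
interface AliasTable {
  slugs: Set<string>; // topic_categories.slug values
  aliases: { provider: string | null; alias: string; slug: string }[];
}

function resolveCategory(
  table: AliasTable,
  rawCategory: string,
  providerName: string,
): string | null {
  // 1. Exact slug match
  if (table.slugs.has(rawCategory)) return rawCategory;
  // 2. Provider-specific alias
  const specific = table.aliases.find(
    (a) => a.provider === providerName && a.alias === rawCategory,
  );
  if (specific) return specific.slug;
  // 3. Universal alias (provider IS NULL)
  const universal = table.aliases.find(
    (a) => a.provider === null && a.alias === rawCategory,
  );
  if (universal) return universal.slug;
  // No match: caller logs to unmapped_category_log and skips extraction.
  return null;
}
```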

Enrichment Context

The handler optionally resolves enrichment data to ground the AI:

Always           Topic-Routed
Knowledge Graph  TheSportsDB (sports/*)
Wikidata         MusicBrainz (music/*)
Wikipedia        Nominatim (geography/*)
                 Open Library (books/*)

All calls use Promise.allSettled() — enrichment never blocks extraction.
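
The non-blocking behavior is the point of `allSettled`: a failed or slow source yields a null slot rather than rejecting the whole extraction. A sketch with hypothetical fetcher functions standing in for the real enrichment clients:

```typescript
// Gather enrichment context; sources that reject resolve to null.
async function gatherEnrichment(
  fetchers: Record<string, () => Promise<unknown>>,
): Promise<Record<string, unknown | null>> {
  const names = Object.keys(fetchers);
  const settled = await Promise.allSettled(names.map((n) => fetchers[n]()));
  const out: Record<string, unknown | null> = {};
  settled.forEach((result, i) => {
    out[names[i]] = result.status === "fulfilled" ? result.value : null;
  });
  return out;
}
```

Compare `Promise.all`, which would reject the combined promise on the first failing source and block extraction.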

AI Call

extractFactsFromStory() receives:

Input                                       Description
Story headline + summary                    What happened
Article texts (top 5, max 1500 chars each)  Source material
Schema keys                                 Required JSONB fields for this topic
Topic path                                  Classification context
Entity context                              Enrichment data (optional)
Published date                              Temporal grounding
Subcategory hierarchy                       For classification

Output

Field             Description
title             Factual, Wikipedia-style label
challengeTitle    Theatrical, curiosity-provoking hook
facts             Structured JSONB conforming to schema
context           4-8 sentence narrative (Hook → Story → Connection)
notabilityScore   0.0-1.0 AI assessment
notabilityReason  One-sentence justification

Notability Gate

Facts scoring below 0.6 (configurable via NOTABILITY_THRESHOLD) are logged and discarded. This is the pipeline's first quality filter — it catches mundane or low-signal news.

Model Selection

Aspect            Detail
Task name         fact_extraction
Default tier      mid (accuracy is critical for news)
Model routing     DB-driven via ai_model_tier_config
Preferred models  gemini-2.5-flash, gpt-5-mini, claude-haiku-4-5

Fact Insertion

Each fact passing the notability gate is inserted with:

  • status: 'pending_validation'
  • source_type: 'news_extraction'
  • expires_at: NOW() + 30 days (news facts are perishable)
  • VALIDATE_FACT enqueued with strategy multi_source

Phase 5: Validation

News facts use the multi_source strategy, where confidence scales with independent source count:

Source Count  Confidence  Rationale
5+ sources    0.9         Widely corroborated
3-4 sources   0.8         Multiple independent reports
1-2 sources   0.7         Minimum threshold

Pass/Fail

  • Pass: confidence ≥ 0.7 AND no flags containing "critical"
  • On pass: status → validated, published_at set, fan-out triggered
  • On fail: status → rejected, excluded from feed
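
The confidence mapping and pass rule are simple enough to state as code; a sketch of the documented logic (function names are illustrative, not the validator's actual exports):

```typescript
// multi_source strategy: confidence scales with independent source count.
function multiSourceConfidence(sourceCount: number): number {
  if (sourceCount >= 5) return 0.9;
  if (sourceCount >= 3) return 0.8;
  return 0.7;
}

// Pass: confidence ≥ 0.7 AND no flags containing "critical".
function passesValidation(sourceCount: number, flags: string[]): boolean {
  const hasCritical = flags.some((f) => f.includes("critical"));
  return multiSourceConfidence(sourceCount) >= 0.7 && !hasCritical;
}
```

Note that under this mapping confidence never drops below 0.7, so in practice only a critical flag can fail a news fact.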

Post-Validation Fan-Out

Two independent queue messages, fired in parallel:

  1. RESOLVE_IMAGE → worker-ingest (image cascade)
  2. GENERATE_CHALLENGE_CONTENT → worker-facts (6 quiz styles)

Phase 6a: Image Resolution

Priority cascade (stops at first hit):

Priority  Source                Best For                   Hit Rate
1         Wikipedia PageImages  Named entities             ~80%
2         TheSportsDB           Sports teams/athletes      Sports only
3         Unsplash              Abstract/topical subjects  Fallback
4         Pexels                Alternative photo pool     Fallback
5         null                  UI placeholder             Last resort

Idempotent: if the fact already has an image_url, the handler exits early.
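
The cascade shape — try sources in priority order, stop at the first non-null hit, swallow individual failures — can be sketched as follows. The resolver functions here are hypothetical stand-ins for the Wikipedia/TheSportsDB/Unsplash/Pexels clients.

```typescript
// Priority cascade: first non-null URL wins; a failing source falls through.
type ImageResolver = (query: string) => Promise<string | null>;

async function resolveImage(
  query: string,
  resolvers: ImageResolver[], // ordered by priority
): Promise<string | null> {
  for (const resolve of resolvers) {
    try {
      const url = await resolve(query);
      if (url) return url; // first hit wins
    } catch {
      // one source being down must not break the cascade
    }
  }
  return null; // UI renders a placeholder
}
```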


Phase 6b: Challenge Content Generation

6 pre-generated styles per fact, each with a 5-layer structure:

Layer           Purpose
setup_text      2-4 sentences of freely shared context
challenge_text  Invitation to answer (must contain "you"/"your")
reveal_correct  Celebrates knowledge, teaches extra detail
reveal_wrong    Kind teaching, includes correct answer
correct_answer  3-6 sentence rich narrative for streaming display

Uses micro-batching: up to 5 queue messages are accumulated over a 500 ms window, then processed in a single AI call to amortize the ~5,200-token system prompt across the batch.


Timing: End-to-End Latency

Phase                 Typical Duration  Bottleneck
Cron → INGEST_NEWS    0-15 min          Cron interval
Fetch articles        2-5 sec           API response time
Clustering            1-3 sec           TF-IDF computation
AI extraction         5-15 sec          LLM inference
Validation            < 1 sec           Multi-source is fast
Image resolution      2-10 sec          Wikipedia API
Challenge generation  10-30 sec         LLM inference for 6 styles
Total                 ~1-16 minutes     Dominated by cron interval

Real-World Example: "Fed Raises Interest Rates"

Phase 1: Cron fires

  • Event Registry key configured → enqueue INGEST_NEWS(event_registry, economics)
  • Newsdata key configured → enqueue INGEST_NEWS(newsdata, economics)

Phase 2: Fetch

  • Event Registry returns 25 articles about the rate decision (full body)
  • Newsdata returns 12 articles (title + description)
  • Quality filter removes 3 thin articles
  • Dedup removes 5 cross-provider duplicates
  • 29 new news_sources rows inserted
  • ≥ 5 threshold met → enqueue CLUSTER_STORIES

Phase 3: Clustering

  • TF-IDF groups 18 of the 29 articles into one story: "Federal Reserve Raises Rates to 5.75%"
  • Story source_count: 18, source_domains: ["reuters.com", "bbc.com", "cnbc.com", ...]
  • source_count ≥ 3 → enqueue EXTRACT_FACTS

Phase 4: Extraction

  • Load story + top 5 articles (Reuters, BBC, CNBC, WSJ, AP)
  • Topic resolves: economics → "Business & Economics"
  • Schema keys: institution, rate, previous_rate, effective_date, vote_split
  • Enrichment: Knowledge Graph returns "Federal Reserve" entity, Wikidata confirms
  • AI extracts:
    title: "Federal Reserve Raises Federal Funds Rate to 5.75%"
    challengeTitle: "The Number That Shook Wall Street This Week"
    facts: { institution: "Federal Reserve", rate: "5.75%", previous_rate: "5.50%",
             effective_date: "2026-03-20", vote_split: "10-2" }
    context: "The Federal Reserve raised its benchmark interest rate by 25 basis points..."
    notabilityScore: 0.92
    
  • Score 0.92 ≥ 0.6 → insert fact_record, enqueue VALIDATE_FACT(multi_source)

Phase 5: Validation

  • 18 independent sources → confidence 0.9
  • No critical flags → validated
  • Fan-out: RESOLVE_IMAGE + GENERATE_CHALLENGE_CONTENT

Phase 6: Image + Challenges

  • Wikipedia: "Federal Reserve" → building photo ✓
  • 6 challenge styles generated:
    • Multiple choice: "What rate did the Fed set? A) 5.50% B) 5.75% C) 6.00% D) 5.25%"
    • Direct question: "What interest rate did the Federal Reserve set in March 2026?"
    • etc.

Result

  • Fact appears in feed under "Business & Economics"
  • expires_at: 2026-04-22 (30 days)
  • Available to users within the next feed refresh

Environment Variables

News API Keys

Variable                Provider        Status
EVENT_REGISTRY_API_KEY  Event Registry  Active (recommended)
NEWSDATA_API_KEY        Newsdata.io     Active
NEWS_API_KEY            NewsAPI.org     Deprecated
GOOGLE_NEWS_API_KEY     GNews           Deprecated
THENEWS_API_KEY         TheNewsAPI      Deprecated

Processing Thresholds

Variable                         Default  Description
NEWS_INGESTION_INTERVAL_MINUTES  15       Cron polling frequency
FACT_EXTRACTION_BATCH_SIZE       10       Stories per extraction batch
VALIDATION_MIN_SOURCES           2        Minimum sources for multi_source
NOTABILITY_THRESHOLD             0.6      Minimum notability score (0-1)

Cost Model

News API Costs

Tier                                        Monthly Cost
Development (free tiers)                    $0
Budget Production (Event Registry $90/mo)   $90
Full Production (all providers)             ~$610

Event Registry free tier: ~200K articles (lasts ~13 months at current volume).

AI Processing Costs (per fact)

Step                     Cost
Fact extraction          ~$0.01
Validation (phases 3-4)  ~$0.003
Challenge content        ~$0.006
Image resolution         $0
Total per fact           ~$0.02

Key Files

File                                                Purpose
apps/web/app/api/cron/ingest-news/route.ts          Cron trigger, provider detection
apps/worker-ingest/src/handlers/ingest-news.ts      Provider clients, normalization, quality filter
apps/worker-ingest/src/handlers/cluster-stories.ts  TF-IDF clustering
apps/worker-ingest/src/handlers/resolve-image.ts    Image cascade
apps/worker-facts/src/handlers/extract-facts.ts     AI fact extraction
apps/worker-validate/src/handlers/validate-fact.ts  Validation + post-validation fan-out
packages/ai/src/fact-engine.ts                      extractFactsFromStory()
packages/ai/src/enrichment.ts                       Enrichment orchestrator
packages/shared/src/schemas.ts                      Queue message Zod schemas
packages/config/src/index.ts                        FactEngineConfig, thresholds