News Pipeline — Detailed Flow

How Eko turns breaking news into verified, structured fact cards every 15 minutes — from external news API to the user's feed.

Overview

The news pipeline is Eko's primary source of timely content. It runs on a 15-minute cron cycle:

  1. Fetch articles from active news providers
  2. Deduplicate and quality-filter
  3. Cluster related articles into stories
  4. Extract structured facts via AI
  5. Validate, resolve images, generate challenges
  6. Publish to the blended feed

News-derived facts have a 30-day expiry — they're perishable content tied to real-world events. High-engagement facts can auto-promote to permanent.


End-to-End Flow

  Vercel Cron fires every 15 minutes
  (apps/web/app/api/cron/ingest-news/route.ts)
         │
         ▼
  ┌─ PHASE 1: PROVIDER DETECTION ────────────────────────────────────┐
  │  Check which API keys are configured                              │
  │  Query active root-level topic categories (maxDepth: 0)           │
  │  Enqueue 1 INGEST_NEWS per provider × per category                │
  └──────────────────────────────────────────────────┬────────────────┘
         │
         ▼
  ┌─ PHASE 2: FETCH & NORMALIZE ─────────────────────────────────────┐
  │  Worker: worker-ingest                                            │
  │  Route to provider client (fetchFromEventRegistry, etc.)          │
  │  Normalize response → StandardArticle contract                    │
  │  Quality filter: title + description ≥ 400 chars                  │
  │  Dedup: ON CONFLICT DO NOTHING on (provider, external_id)         │
  │  Insert qualifying articles into news_sources                     │
  └──────────────────────────────────────────────────┬────────────────┘
         │  ≥ 5 new articles inserted?
         ▼
  ┌─ PHASE 3: CLUSTERING ───────────────────────────────────────────┐
  │  Worker: worker-ingest                                           │
  │  TF-IDF cosine similarity on titles + descriptions               │
  │  24-hour time window                                             │
  │  Similarity threshold: 0.6                                       │
  │  Result: story records linking multiple news_source rows          │
  └──────────────────────────────────────────────────┬────────────────┘
         │  story.source_count ≥ 3?
         ▼
  ┌─ PHASE 4: AI FACT EXTRACTION ────────────────────────────────────┐
  │  Worker: worker-facts                                             │
  │  Load story + top 5 linked articles (max 1500 chars each)         │
  │  Resolve topic category (3-step alias fallback)                   │
  │  Load schema keys for topic                                       │
  │  Resolve enrichment context (optional, never blocks)              │
  │  AI extraction → title, facts{}, context, notability_score        │
  │  Notability ≥ 0.6 → insert fact_record                           │
  └──────────────────────────────────────────────────┬────────────────┘
         │
         ▼
  ┌─ PHASE 5: VALIDATION ───────────────────────────────────────────┐
  │  Worker: worker-validate                                         │
  │  Strategy: multi_source (confidence from source count)            │
  │  5+ sources → 0.9 | 3-4 → 0.8 | 1-2 → 0.7                      │
  │  Pass: confidence ≥ 0.7 + no critical flags                      │
  └──────────────────────────┬──────────────────────┬────────────────┘
                             │                      │
                             ▼                      ▼
  ┌─ PHASE 6a: IMAGE ──────────────┐  ┌─ PHASE 6b: CHALLENGES ──────┐
  │  Wikipedia → TheSportsDB        │  │  6 pre-generated styles      │
  │  → Unsplash → Pexels            │  │  5-layer structure per style │
  │  → null (UI placeholder)        │  │  Quality rules enforced      │
  └─────────────────────────────────┘  └──────────────────────────────┘
                             │                      │
                             └──────────┬───────────┘
                                        ▼
                                Fact appears in feed
                           (40% recent + 10% exploration)

Phase 1: Provider Detection

Active Providers

Only two providers are currently active in production:

Provider        API                 Content Quality                             Free Tier                      Status
Event Registry  newsapi.ai/api/v1   Full article body (3,000-5,000 chars avg)  2,000 tokens (~200K articles)  Active
Newsdata.io     newsdata.io/api/1   Title + description (paid: full body)      30 req/day                     Active

Three legacy providers are deprecated and no longer used:

Provider     Status      Reason
NewsAPI.org  Deprecated  Truncated content, expensive ($449/mo for full body)
GNews        Deprecated  Truncated content, limited free tier
TheNewsAPI   Deprecated  Truncated content, low article count

Provider Priority

The cron checks API keys in order, enqueuing one INGEST_NEWS message per configured provider per active root-level topic category:

  1. EVENT_REGISTRY_API_KEY → event_registry
  2. NEWSDATA_API_KEY → newsdata
  3. NEWS_API_KEY → newsapi (legacy)
  4. GOOGLE_NEWS_API_KEY → gnews (legacy)
  5. THENEWS_API_KEY → thenewsapi (legacy)

Root-level categories only (maxDepth: 0) to prevent quota explosion when subcategories exist.
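
The fan-out can be sketched in TypeScript as follows. This is a minimal illustration, not the actual cron route: `planIngestMessages`, its parameters, and the `PROVIDERS` list (shown here with only the two active providers) are hypothetical names.

```typescript
// Sketch of Phase 1: one INGEST_NEWS message per configured provider
// per active root-level category. Helper names are illustrative.
type IngestMessage = { provider: string; category: string; max_results: number };

const PROVIDERS = [
  { envKey: "EVENT_REGISTRY_API_KEY", provider: "event_registry" },
  { envKey: "NEWSDATA_API_KEY", provider: "newsdata" },
  // legacy keys (newsapi, gnews, thenewsapi) omitted for brevity
];

function planIngestMessages(
  env: Record<string, string | undefined>,
  rootCategories: string[], // categories queried with maxDepth: 0
): IngestMessage[] {
  const messages: IngestMessage[] = [];
  for (const { envKey, provider } of PROVIDERS) {
    if (!env[envKey]) continue; // skip providers without a configured key
    for (const category of rootCategories) {
      messages.push({ provider, category, max_results: 20 });
    }
  }
  return messages;
}
```

With one key configured and two root categories, this yields two messages — the provider × category product the diagram describes.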

Queue Message

INGEST_NEWS {
  provider: "event_registry",
  category: "science",
  max_results: 20
}

Phase 2: Fetch & Normalize

StandardArticle Contract

Every provider normalizes its response to a common shape:

Field         Type           Description
externalId    string         URL hash (Bun wyhash) or native UUID
sourceName    string | null  Publication name ("Reuters", "BBC")
sourceDomain  string | null  Hostname from article URL
title         string         Article headline
description   string | null  Summary or snippet
articleUrl    string         Canonical article URL
imageUrl      string | null  Hero image (resolved later if null)
publishedAt   Date | null    Publication timestamp
contentHash   string | null  Bun wyhash for cross-provider dedup
fullContent   string | null  Full article body (v2 providers only)
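
As a TypeScript sketch, the contract above transcribes directly to an interface (field names and nullability taken from the table; this is illustrative, not the actual package export):

```typescript
// Transcription of the StandardArticle contract documented above.
interface StandardArticle {
  externalId: string;          // URL hash (Bun wyhash) or native UUID
  sourceName: string | null;   // publication name ("Reuters", "BBC")
  sourceDomain: string | null; // hostname from article URL
  title: string;               // article headline
  description: string | null;  // summary or snippet
  articleUrl: string;          // canonical article URL
  imageUrl: string | null;     // hero image (resolved later if null)
  publishedAt: Date | null;    // publication timestamp
  contentHash: string | null;  // cross-provider dedup hash
  fullContent: string | null;  // full body (v2 providers only)
}

// Example of a normalized article (hypothetical values):
const example: StandardArticle = {
  externalId: "a1b2c3",
  sourceName: "Reuters",
  sourceDomain: "reuters.com",
  title: "Example headline",
  description: null,
  articleUrl: "https://reuters.com/example",
  imageUrl: null,
  publishedAt: null,
  contentHash: null,
  fullContent: null,
};
```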

Quality Filter

Articles must pass a minimum text length check:

title.length + description.length ≥ 400 characters

Articles below this threshold are discarded to prevent LLM hallucination from thin snippets. This is especially critical for legacy providers that return truncated content (~256 chars).
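
The gate itself is a one-line length check; a minimal sketch, assuming the documented 400-character rule (the function name is illustrative):

```typescript
// Quality gate: discard articles too thin to extract facts from safely.
const MIN_TEXT_LENGTH = 400;

function passesQualityFilter(title: string, description: string | null): boolean {
  return title.length + (description?.length ?? 0) >= MIN_TEXT_LENGTH;
}
```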

Deduplication

Articles are inserted with ON CONFLICT DO NOTHING on (provider, external_id). This provides:

  • Within-provider dedup: Same article fetched across cron cycles
  • Cross-provider dedup: contentHash can detect the same article from different providers
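
The semantics of the conflict clause can be illustrated in-memory: each `(provider, external_id)` pair is inserted at most once, while the same `external_id` under a different provider still lands (cross-provider matching relies on `contentHash`, not this key). In production this is a single `INSERT ... ON CONFLICT DO NOTHING`; the sketch below is only a model of its behavior.

```typescript
// In-memory illustration of ON CONFLICT DO NOTHING on (provider, external_id).
function dedupeInsert(
  seen: Set<string>,
  articles: { provider: string; externalId: string }[],
): number {
  let inserted = 0;
  for (const a of articles) {
    const key = `${a.provider}:${a.externalId}`;
    if (seen.has(key)) continue; // conflict → do nothing
    seen.add(key);
    inserted++;
  }
  return inserted; // count drives the ≥ 5 clustering trigger below
}
```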

Clustering Trigger

If ≥ 5 new articles were inserted in this batch, a CLUSTER_STORIES message is enqueued.


Phase 3: Story Clustering

Why Cluster?

Without clustering, "SpaceX Launches Starship to Mars" reported by 15 outlets would produce 15 near-identical facts. Clustering groups articles about the same event into a single story.

Algorithm

  • Method: TF-IDF cosine similarity on article titles and descriptions
  • Time window: 24 hours (articles outside this window start a new cluster)
  • Similarity threshold: 0.6 (configurable)
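
A toy version of the similarity computation, to make the method concrete. This is not the production handler (which adds the 24-hour window and persists story records); it just shows TF-IDF vectors over title + description tokens compared with cosine similarity.

```typescript
// Tokenize, build TF-IDF vectors over a document set, compare with cosine.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

function tfidfVectors(docs: string[]): Map<string, number>[] {
  const tokenized = docs.map(tokenize);
  const df = new Map<string, number>(); // document frequency per term
  for (const tokens of tokenized) {
    for (const t of new Set(tokens)) df.set(t, (df.get(t) ?? 0) + 1);
  }
  return tokenized.map((tokens) => {
    const tf = new Map<string, number>();
    for (const t of tokens) tf.set(t, (tf.get(t) ?? 0) + 1);
    const vec = new Map<string, number>();
    for (const [t, count] of tf) {
      const idf = Math.log(docs.length / df.get(t)!) + 1; // smoothed IDF
      vec.set(t, (count / tokens.length) * idf);
    }
    return vec;
  });
}

function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0, na = 0, nb = 0;
  for (const [t, w] of a) { dot += w * (b.get(t) ?? 0); na += w * w; }
  for (const w of b.values()) nb += w * w;
  return na && nb ? dot / (Math.sqrt(na) * Math.sqrt(nb)) : 0;
}
```

Two headlines about the same launch score well above two unrelated ones; the 0.6 threshold decides whether an article joins an existing cluster.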

Story Record

Each story contains:

Field             Description
headline          Best-representative headline
summary           AI-generated or best description
source_count      Number of linked articles (used for validation confidence)
source_domains[]  Distinct publication domains
category          Topic slug from the triggering ingest
status            clustering → published → archived

Extraction Trigger

A story becomes eligible for fact extraction when source_count ≥ 3. This threshold ensures the event has been independently reported by multiple outlets before the AI attempts extraction.


Phase 4: AI Fact Extraction

Topic Category Resolution

Before calling the AI, the handler resolves the story's category slug to an internal topic category via three lookup steps, falling through to an audit log when none match:

  1. Exact slug match against topic_categories.slug
  2. Provider-specific alias in topic_category_aliases (where provider = providerName)
  3. Universal alias in topic_category_aliases (where provider IS NULL)
  4. No match: log to unmapped_category_log for audit, skip extraction
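
A minimal sketch of the fallback chain, with in-memory lookup tables standing in for the `topic_categories` and `topic_category_aliases` queries (all names in the sketch besides the documented table semantics are illustrative):

```typescript
// Resolve a raw provider category slug to an internal topic slug.
interface AliasTable {
  slugs: Set<string>; // topic_categories.slug values
  aliases: { provider: string | null; alias: string; slug: string }[];
}

function resolveCategory(
  table: AliasTable,
  rawCategory: string,
  providerName: string,
): string | null {
  // 1. Exact slug match
  if (table.slugs.has(rawCategory)) return rawCategory;
  // 2. Provider-specific alias
  const specific = table.aliases.find(
    (a) => a.provider === providerName && a.alias === rawCategory,
  );
  if (specific) return specific.slug;
  // 3. Universal alias (provider IS NULL)
  const universal = table.aliases.find(
    (a) => a.provider === null && a.alias === rawCategory,
  );
  if (universal) return universal.slug;
  // No match: caller logs to unmapped_category_log and skips extraction.
  return null;
}
```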

Enrichment Context

The handler optionally resolves enrichment data to ground the AI:

Always           Topic-Routed
Knowledge Graph  TheSportsDB (sports/*)
Wikidata         MusicBrainz (music/*)
Wikipedia        Nominatim (geography/*)
                 Open Library (books/*)

All calls use Promise.allSettled() — enrichment never blocks extraction.
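
The non-blocking behavior is the point of `allSettled`: a failed or slow source yields a null slot rather than rejecting the whole extraction. A sketch with hypothetical fetcher functions standing in for the real enrichment clients:

```typescript
// Gather enrichment context; sources that reject resolve to null.
async function gatherEnrichment(
  fetchers: Record<string, () => Promise<unknown>>,
): Promise<Record<string, unknown | null>> {
  const names = Object.keys(fetchers);
  const settled = await Promise.allSettled(names.map((n) => fetchers[n]()));
  const out: Record<string, unknown | null> = {};
  settled.forEach((result, i) => {
    out[names[i]] = result.status === "fulfilled" ? result.value : null;
  });
  return out;
}
```

Compare `Promise.all`, which would reject the combined promise on the first failing source and block extraction.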

AI Call

extractFactsFromStory() receives:

Input                                       Description
Story headline + summary                    What happened
Article texts (top 5, max 1500 chars each)  Source material
Schema keys                                 Required JSONB fields for this topic
Topic path                                  Classification context
Entity context                              Enrichment data (optional)
Published date                              Temporal grounding
Subcategory hierarchy                       For classification

Output

Field             Description
title             Factual, Wikipedia-style label
challengeTitle    Theatrical, curiosity-provoking hook
facts             Structured JSONB conforming to schema
context           4-8 sentence narrative (Hook → Story → Connection)
notabilityScore   0.0-1.0 AI assessment
notabilityReason  One-sentence justification

Notability Gate

Facts scoring below 0.6 (configurable via NOTABILITY_THRESHOLD) are logged and discarded. This is the pipeline's first quality filter — it catches mundane or low-signal news.

Model Selection

Aspect            Detail
Task name         fact_extraction
Default tier      mid (accuracy is critical for news)
Model routing     DB-driven via ai_model_tier_config
Preferred models  gemini-2.5-flash, gpt-5-mini, claude-haiku-4-5

Fact Insertion

Each fact passing the notability gate is inserted with:

  • status: 'pending_validation'
  • source_type: 'news_extraction'
  • expires_at: NOW() + 30 days (news facts are perishable)
  • VALIDATE_FACT enqueued with strategy multi_source

Phase 5: Validation

News facts use the multi_source strategy, where confidence scales with independent source count:

Source Count  Confidence  Rationale
5+ sources    0.9         Widely corroborated
3-4 sources   0.8         Multiple independent reports
1-2 sources   0.7         Minimum threshold

Pass/Fail

  • Pass: confidence ≥ 0.7 AND no flags containing "critical"
  • On pass: status → validated, published_at set, fan-out triggered
  • On fail: status → rejected, excluded from feed
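
The confidence mapping and pass rule are simple enough to state as code; a sketch of the documented logic (function names are illustrative, not the validator's actual exports):

```typescript
// multi_source strategy: confidence scales with independent source count.
function multiSourceConfidence(sourceCount: number): number {
  if (sourceCount >= 5) return 0.9;
  if (sourceCount >= 3) return 0.8;
  return 0.7;
}

// Pass: confidence ≥ 0.7 AND no flags containing "critical".
function passesValidation(sourceCount: number, flags: string[]): boolean {
  const hasCritical = flags.some((f) => f.includes("critical"));
  return multiSourceConfidence(sourceCount) >= 0.7 && !hasCritical;
}
```

Note that under this mapping confidence never drops below 0.7, so in practice only a critical flag can fail a news fact.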

Post-Validation Fan-Out

Two independent queue messages, fired in parallel:

  1. RESOLVE_IMAGE → worker-ingest (image cascade)
  2. GENERATE_CHALLENGE_CONTENT → worker-facts (6 quiz styles)

Phase 6a: Image Resolution

Priority cascade (stops at first hit):

Priority  Source                Best For                   Hit Rate
1         Wikipedia PageImages  Named entities             ~80%
2         TheSportsDB           Sports teams/athletes      Sports only
3         Unsplash              Abstract/topical subjects  Fallback
4         Pexels                Alternative photo pool     Fallback
5         null                  UI placeholder             Last resort

Idempotent: if the fact already has an image_url, the handler exits early.
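
The cascade shape — try sources in priority order, stop at the first non-null hit, swallow individual failures — can be sketched as follows. The resolver functions here are hypothetical stand-ins for the Wikipedia/TheSportsDB/Unsplash/Pexels clients.

```typescript
// Priority cascade: first non-null URL wins; a failing source falls through.
type ImageResolver = (query: string) => Promise<string | null>;

async function resolveImage(
  query: string,
  resolvers: ImageResolver[], // ordered by priority
): Promise<string | null> {
  for (const resolve of resolvers) {
    try {
      const url = await resolve(query);
      if (url) return url; // first hit wins
    } catch {
      // one source being down must not break the cascade
    }
  }
  return null; // UI renders a placeholder
}
```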


Phase 6b: Challenge Content Generation

6 pre-generated styles per fact, each with a 5-layer structure:

Layer           Purpose
setup_text      2-4 sentences of freely shared context
challenge_text  Invitation to answer (must contain "you"/"your")
reveal_correct  Celebrates knowledge, teaches extra detail
reveal_wrong    Kind teaching, includes correct answer
correct_answer  3-6 sentence rich narrative for streaming display

Uses micro-batching: up to 5 queue messages are accumulated over a 500 ms window, then processed in a single AI call to amortize the ~5,200-token system prompt across the batch.


Timing: End-to-End Latency

Phase                 Typical Duration  Bottleneck
Cron → INGEST_NEWS    0-15 min          Cron interval
Fetch articles        2-5 sec           API response time
Clustering            1-3 sec           TF-IDF computation
AI extraction         5-15 sec          LLM inference
Validation            < 1 sec           Multi-source is fast
Image resolution      2-10 sec          Wikipedia API
Challenge generation  10-30 sec         LLM inference for 6 styles
Total                 ~1-16 minutes     Dominated by cron interval

Real-World Example: "Fed Raises Interest Rates"

Phase 1: Cron fires

  • Event Registry key configured → enqueue INGEST_NEWS(event_registry, economics)
  • Newsdata key configured → enqueue INGEST_NEWS(newsdata, economics)

Phase 2: Fetch

  • Event Registry returns 25 articles about the rate decision (full body)
  • Newsdata returns 12 articles (title + description)
  • Quality filter removes 3 thin articles
  • Dedup removes 5 cross-provider duplicates
  • 29 new news_sources rows inserted
  • ≥ 5 threshold met → enqueue CLUSTER_STORIES

Phase 3: Clustering

  • TF-IDF groups 18 of the 29 articles into one story: "Federal Reserve Raises Rates to 5.75%"
  • Story source_count: 18, source_domains: ["reuters.com", "bbc.com", "cnbc.com", ...]
  • source_count ≥ 3 → enqueue EXTRACT_FACTS

Phase 4: Extraction

  • Load story + top 5 articles (Reuters, BBC, CNBC, WSJ, AP)
  • Topic resolves: economics → "Business & Economics"
  • Schema keys: institution, rate, previous_rate, effective_date, vote_split
  • Enrichment: Knowledge Graph returns "Federal Reserve" entity, Wikidata confirms
  • AI extracts:
    title: "Federal Reserve Raises Federal Funds Rate to 5.75%"
    challengeTitle: "The Number That Shook Wall Street This Week"
    facts: { institution: "Federal Reserve", rate: "5.75%", previous_rate: "5.50%",
             effective_date: "2026-03-20", vote_split: "10-2" }
    context: "The Federal Reserve raised its benchmark interest rate by 25 basis points..."
    notabilityScore: 0.92
    
  • Score 0.92 ≥ 0.6 → insert fact_record, enqueue VALIDATE_FACT(multi_source)

Phase 5: Validation

  • 18 independent sources → confidence 0.9
  • No critical flags → validated
  • Fan-out: RESOLVE_IMAGE + GENERATE_CHALLENGE_CONTENT

Phase 6: Image + Challenges

  • Wikipedia: "Federal Reserve" → building photo ✓
  • 6 challenge styles generated:
    • Multiple choice: "What rate did the Fed set? A) 5.50% B) 5.75% C) 6.00% D) 5.25%"
    • Direct question: "What interest rate did the Federal Reserve set in March 2026?"
    • etc.

Result

  • Fact appears in feed under "Business & Economics"
  • expires_at: 2026-04-22 (30 days)
  • Available to users within the next feed refresh

Environment Variables

News API Keys

Variable                Provider        Status
EVENT_REGISTRY_API_KEY  Event Registry  Active (recommended)
NEWSDATA_API_KEY        Newsdata.io     Active
NEWS_API_KEY            NewsAPI.org     Deprecated
GOOGLE_NEWS_API_KEY     GNews           Deprecated
THENEWS_API_KEY         TheNewsAPI      Deprecated

Processing Thresholds

Variable                         Default  Description
NEWS_INGESTION_INTERVAL_MINUTES  15       Cron polling frequency
FACT_EXTRACTION_BATCH_SIZE       10       Stories per extraction batch
VALIDATION_MIN_SOURCES           2        Minimum sources for multi_source
NOTABILITY_THRESHOLD             0.6      Minimum notability score (0-1)

Cost Model

News API Costs

Tier                                        Monthly Cost
Development (free tiers)                    $0
Budget Production (Event Registry $90/mo)   $90
Full Production (all providers)             ~$610

Event Registry free tier: ~200K articles (lasts ~13 months at current volume).

AI Processing Costs (per fact)

Step                     Cost
Fact extraction          ~$0.01
Validation (phases 3-4)  ~$0.003
Challenge content        ~$0.006
Image resolution         $0
Total per fact           ~$0.02

Key Files

File                                                Purpose
apps/web/app/api/cron/ingest-news/route.ts          Cron trigger, provider detection
apps/worker-ingest/src/handlers/ingest-news.ts      Provider clients, normalization, quality filter
apps/worker-ingest/src/handlers/cluster-stories.ts  TF-IDF clustering
apps/worker-ingest/src/handlers/resolve-image.ts    Image cascade
apps/worker-facts/src/handlers/extract-facts.ts     AI fact extraction
apps/worker-validate/src/handlers/validate-fact.ts  Validation + post-validation fan-out
packages/ai/src/fact-engine.ts                      extractFactsFromStory()
packages/ai/src/enrichment.ts                       Enrichment orchestrator
packages/shared/src/schemas.ts                      Queue message Zod schemas
packages/config/src/index.ts                        FactEngineConfig, thresholds