News-to-Challenge Ingestion Guide

A step-by-step walkthrough of how a breaking news article becomes a playable challenge card in Eko. This guide traces the full journey through the pipeline — from external news API to the user's feed.

The Journey at a Glance

  Breaking news published    "SpaceX launches Starship to Mars"
         │
         ▼
  ┌─ PHASE 1: INGEST ──────────────────────────────────────────┐
  │  Cron fires every 15 min → fetches from up to 5 news APIs  │
  │  Articles normalized to StandardArticle, quality-filtered   │
  │  (≥400 chars), deduped by hash, inserted into news_sources  │
  └──────────────────────────────────────────────┬──────────────┘
         │  ≥ 5 new articles?
         ▼
  ┌─ PHASE 2: CLUSTER ─────────────────────────────────────────┐
  │  TF-IDF cosine similarity within 24h window                 │
  │  Multiple articles about the same event → one story         │
  │  Prevents duplicate fact extraction                         │
  └──────────────────────────────────────────────┬──────────────┘
         │
         ▼
  ┌─ PHASE 3: EXTRACT ─────────────────────────────────────────┐
  │  AI reads story + linked articles                           │
  │  Produces: title, facts{}, context, challenge_title         │
  │  Notability score ≥ 0.6 → insert fact_record               │
  └──────────────────────────────────────────────┬──────────────┘
         │
         ▼
  ┌─ PHASE 4: VALIDATE ────────────────────────────────────────┐
  │  Strategy picked by source type (multi_source for news)     │
  │  Confidence ≥ 0.7 + no critical flags → validated           │
  │  published_at set, RESOLVE_IMAGE + GENERATE_CHALLENGE fired │
  └──────────────────────────────────────────────┬──────────────┘
         │                                       │
         ▼                                       ▼
  ┌─ PHASE 5a: IMAGE ──────┐   ┌─ PHASE 5b: CHALLENGE CONTENT ┐
  │  Wikipedia → SportsDB   │   │  AI generates 3-5 styles:     │
  │  → Unsplash → Pexels    │   │  MC, FTG, DQ, SB, RL, FT     │
  │  → null (placeholder)   │   │  Each with 5-layer structure   │
  └─────────────────────────┘   └──────────────────────────────┘
         │                                       │
         └───────────────┬───────────────────────┘
                         ▼
  ┌─ PHASE 6: FEED ────────────────────────────────────────────┐
  │  Blended algorithm: 40% recent, 30% review-due,            │
  │  20% evergreen, 10% exploration                            │
  │  User sees challenge card → plays it → spaced repetition   │
  └────────────────────────────────────────────────────────────┘

Phase 1: News Ingestion

Trigger

A Vercel cron job fires every 15 minutes, hitting apps/web/app/api/cron/ingest-news/route.ts. The route is auth-gated via CRON_SECRET (checked in both Authorization and x-vercel-cron-authorization headers).

Provider Detection

The cron checks which news API keys are configured, prioritizing v2 full-content providers:

Env Variable             Provider         Queue Value      Content
EVENT_REGISTRY_API_KEY   Event Registry   event_registry   Full body (v2)
NEWSDATA_API_KEY         Newsdata.io      newsdata         Full body on paid tiers (v2)
NEWS_API_KEY             NewsAPI.org      newsapi          Truncated (legacy)
GOOGLE_NEWS_API_KEY      GNews            gnews            Truncated (legacy)
THENEWS_API_KEY          TheNewsAPI       thenewsapi       Truncated (legacy)

For each configured provider, it fetches active topic categories from the database using getActiveTopicCategories({ maxDepth: 0 }) and enqueues one INGEST_NEWS message per provider per category. The maxDepth: 0 filter restricts queries to root-level categories only, preventing quota explosion when subcategories are added to the taxonomy.

Queue Message: INGEST_NEWS

{
  type: "INGEST_NEWS",
  payload: {
    provider: "event_registry" | "newsdata" | "newsapi" | "gnews" | "thenewsapi",
    category: "science",     // topic category slug
    max_results: 20
  }
}
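The real message schemas are Zod schemas in packages/shared/src/schemas.ts. As a dependency-free sketch of the same contract (the type names and the guard below are illustrative, not the actual exports):

```typescript
// Illustrative shape of the INGEST_NEWS message; the production schema
// lives in packages/shared/src/schemas.ts as a Zod schema.
type NewsProvider =
  | "event_registry" | "newsdata" | "newsapi" | "gnews" | "thenewsapi";

interface IngestNewsMessage {
  type: "INGEST_NEWS";
  payload: {
    provider: NewsProvider;
    category: string;      // topic category slug, e.g. "science"
    max_results: number;
  };
}

const PROVIDERS: NewsProvider[] = [
  "event_registry", "newsdata", "newsapi", "gnews", "thenewsapi",
];

// Runtime guard for messages pulled off the queue.
function isIngestNewsMessage(msg: unknown): msg is IngestNewsMessage {
  if (typeof msg !== "object" || msg === null) return false;
  const m = msg as Record<string, unknown>;
  if (m.type !== "INGEST_NEWS") return false;
  const p = m.payload as Record<string, unknown> | undefined;
  return (
    !!p &&
    PROVIDERS.includes(p.provider as NewsProvider) &&
    typeof p.category === "string" &&
    typeof p.max_results === "number"
  );
}
```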

Worker: worker-ingest

What the Handler Does

apps/worker-ingest/src/handlers/ingest-news.ts:

  1. Routes to the correct provider client (fetchFromEventRegistry, fetchFromNewsdata, fetchFromNewsApi, fetchFromGNews, fetchFromTheNewsApi)
  2. Each provider normalizes its response to the StandardArticle contract:
    • externalId — stable hash of the article URL (or native UUID for TheNewsAPI)
    • contentHash — Bun wyhash of article content for cross-provider dedup
    • title, description, articleUrl, imageUrl, publishedAt, sourceName, sourceDomain
    • fullContent — full article body (v2 providers only; null for legacy)
  3. Articles are filtered by MIN_ARTICLE_TEXT_LENGTH (400 chars, measured as title + description length). Articles below this threshold are discarded to prevent LLM hallucination from thin snippets
  4. Qualifying articles are bulk-inserted into news_sources with ON CONFLICT DO NOTHING on (provider, external_id), which silently skips duplicates
  5. If ≥ 5 new articles were inserted, a CLUSTER_STORIES message is enqueued
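The filter-and-dedupe steps above can be sketched as follows. StandardArticle is trimmed to the relevant fields, and the helper names are illustrative (the real dedup happens in Postgres via ON CONFLICT DO NOTHING):

```typescript
// Sketch of the quality gate and in-batch dedup described in steps 3-4.
interface StandardArticle {
  externalId: string;          // stable hash of the article URL
  contentHash: string;         // content hash for cross-provider dedup
  title: string;
  description: string;
  fullContent: string | null;  // null for legacy providers
}

const MIN_ARTICLE_TEXT_LENGTH = 400;

// Step 3: discard thin snippets (title + description below the threshold)
// to prevent LLM hallucination during extraction.
function passesQualityFilter(a: StandardArticle): boolean {
  return a.title.length + a.description.length >= MIN_ARTICLE_TEXT_LENGTH;
}

// Step 4 analogue of ON CONFLICT DO NOTHING: keep the first article per
// externalId within a batch, silently skipping duplicates.
function dedupeBatch(articles: StandardArticle[]): StandardArticle[] {
  const seen = new Set<string>();
  const kept: StandardArticle[] = [];
  for (const a of articles) {
    if (seen.has(a.externalId)) continue;
    seen.add(a.externalId);
    kept.push(a);
  }
  return kept;
}
```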

Observability

Every handler call creates an ingestion_runs record tracking status (completed/failed), counts (recordsProcessed, recordsCreated), and timing. Metrics are emitted via @eko/observability (ingest.news_sources_inserted, ingest.news_sources_duplicates).


Phase 2: Story Clustering

Queue Message: CLUSTER_STORIES

{
  type: "CLUSTER_STORIES",
  payload: {
    news_source_ids: ["uuid-1", "uuid-2", ...],
    time_window_hours: 24
  }
}

Worker: worker-ingest

What Happens

Articles about the same real-world event are grouped into a single story using TF-IDF cosine similarity on article titles and descriptions, within a 24-hour time window.

  • Each story gets a headline, summary, source_count, and source_domains[]
  • The source_count is critical later: it determines validation confidence during the multi-source strategy
  • Stories progress through statuses: clustering → published → archived

Why Clustering Matters

Without clustering, "SpaceX Starship Launches" reported by 15 outlets would produce 15 near-identical facts. Clustering deduplicates at the story level so AI extraction runs once per event, not once per article.
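A minimal version of the similarity signal, assuming plain term-frequency TF-IDF over lowercase word tokens with a smoothed IDF (the production tokenizer and weighting may differ):

```typescript
// Illustrative TF-IDF + cosine similarity over article titles/descriptions.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Build one sparse TF-IDF vector per document.
function tfidfVectors(docs: string[]): Map<string, number>[] {
  const tokenized = docs.map(tokenize);
  const df = new Map<string, number>();           // document frequency
  for (const tokens of tokenized) {
    for (const t of new Set(tokens)) df.set(t, (df.get(t) ?? 0) + 1);
  }
  const n = docs.length;
  return tokenized.map((tokens) => {
    const vec = new Map<string, number>();
    for (const t of tokens) vec.set(t, (vec.get(t) ?? 0) + 1);  // raw TF
    for (const [t, tf] of vec) {
      vec.set(t, tf * Math.log(1 + n / (df.get(t) ?? 1)));      // smoothed IDF
    }
    return vec;
  });
}

function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0, na = 0, nb = 0;
  for (const [t, w] of a) { dot += w * (b.get(t) ?? 0); na += w * w; }
  for (const w of b.values()) nb += w * w;
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}
```

Articles whose pairwise cosine clears a similarity threshold within the 24-hour window land in the same story.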


Phase 3: Fact Extraction

Queue Message: EXTRACT_FACTS

{
  type: "EXTRACT_FACTS",
  payload: {
    story_id: "uuid",
    topic_category_id: "uuid"    // optional, resolved from story if absent
  }
}

Worker: worker-facts

What the Handler Does

apps/worker-facts/src/handlers/extract-facts.ts:

  1. Fetches the story with all its linked news_sources

  2. Resolves the topic category using resolveTopicCategory(), a 3-step alias fallback (with an audit path when all three steps miss):

    1. Exact slug match against topic_categories.slug
    2. Provider-specific alias lookup in topic_category_aliases (where provider = providerName)
    3. Universal alias lookup in topic_category_aliases (where provider IS NULL)
    4. If no match found: logs the unresolved slug to unmapped_category_log for audit and skips extraction

    The topic_category_aliases table allows external provider slugs (e.g., GNews's "business" or TheNewsAPI's "tech") to map to Eko's internal taxonomy without requiring 1:1 slug naming. The unmapped_category_log provides visibility into coverage gaps.

  3. Loads the fact_record_schemas for that topic — this defines which JSONB keys the AI must produce (e.g., for a sports topic: team, score, opponent, date)

  4. Concatenates article titles + descriptions into article texts

  5. Calls extractFactsFromStory() from @eko/ai, which returns:

    • title — factual, Wikipedia-style label
    • challengeTitle — theatrical, curiosity-provoking hook
    • facts — structured JSONB conforming to the schema
    • context — 4-8 sentence narrative (Hook → Story → Connection)
    • notabilityScore — 0.0-1.0 AI assessment
    • notabilityReason — one-sentence justification
  6. If notabilityScore ≥ 0.6 (configurable via NOTABILITY_THRESHOLD):

    • Inserts a fact_records row with status: 'pending_validation'
    • Sets expires_at to 30 days from now (news facts are perishable)
    • Enqueues VALIDATE_FACT with strategy multi_source
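The alias fallback in step 2 can be sketched as a pure function over in-memory tables; the row shapes below are simplified stand-ins for topic_categories and topic_category_aliases:

```typescript
// Illustrative sketch of the 3-step alias fallback from step 2.
interface AliasRow {
  alias: string;
  provider: string | null;   // null = universal alias
  categoryId: string;
}

function resolveTopicCategory(
  slug: string,
  providerName: string,
  categories: Map<string, string>,   // slug -> category id
  aliases: AliasRow[],
): string | null {
  // 1. Exact slug match against topic_categories.slug
  const exact = categories.get(slug);
  if (exact) return exact;
  // 2. Provider-specific alias (provider = providerName)
  const scoped = aliases.find((a) => a.alias === slug && a.provider === providerName);
  if (scoped) return scoped.categoryId;
  // 3. Universal alias (provider IS NULL)
  const universal = aliases.find((a) => a.alias === slug && a.provider === null);
  if (universal) return universal.categoryId;
  // No match: caller logs to unmapped_category_log and skips extraction
  return null;
}
```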

The Notability Gate

The 0.6 threshold is the pipeline's first quality filter. Facts below this score are logged and discarded — they represent mundane or low-signal news that wouldn't make a good challenge. The threshold is configurable via NOTABILITY_THRESHOLD in environment config.
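A sketch of the gate and the 30-day expiry stamp; the hard-coded 0.6 default stands in for the env-configurable NOTABILITY_THRESHOLD, and field names follow the extraction result described above:

```typescript
// Sketch of the notability gate and expiry stamping.
interface ExtractionResult {
  notabilityScore: number;   // 0.0-1.0 AI assessment
  notabilityReason: string;  // one-sentence justification
}

const DEFAULT_NOTABILITY_THRESHOLD = 0.6;  // overridable via NOTABILITY_THRESHOLD

function passesNotabilityGate(
  r: ExtractionResult,
  threshold: number = DEFAULT_NOTABILITY_THRESHOLD,
): boolean {
  return r.notabilityScore >= threshold;   // inclusive: exactly 0.6 passes
}

// News facts are perishable: expires_at = now + 30 days.
function newsFactExpiry(now: Date): Date {
  return new Date(now.getTime() + 30 * 24 * 60 * 60 * 1000);
}
```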


Phase 4: Fact Validation

Queue Message: VALIDATE_FACT

{
  type: "VALIDATE_FACT",
  payload: {
    fact_record_id: "uuid",
    strategy: "multi_source",    // for news-extracted facts
    retry_count: 0
  }
}

Worker: worker-validate

Validation Strategy Selection

The strategy is chosen by source type at enqueue time:

Source Type               Strategy           Confidence Range
news_extraction           multi_source       0.7-0.9 (based on source count)
api_import                authoritative_api  0.95 (trusted pass-through)
ai_generated / file_seed  ai_cross_check     0.0-1.0 (AI plausibility check)
Curated databases         curated_database   0.95 (trusted pass-through)

Multi-Source Validation (News Path)

For news-extracted facts, confidence scales with how many independent sources corroborate the story:

  • 5+ sources → confidence 0.9
  • 3-4 sources → confidence 0.8
  • 1-2 sources → confidence 0.7

Pass/Fail Criteria

A fact is validated when:

  • confidence ≥ 0.7 AND
  • No flags containing "critical"

A validated fact gets:

  • status updated to 'validated'
  • published_at set to NOW()
  • validation JSONB populated with strategy, sources, confidence, flags

A rejected fact gets status: 'rejected' and is excluded from the feed.
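The confidence tiers and the pass/fail gate above can be sketched as two small functions (names are illustrative):

```typescript
// Multi-source tiers: confidence scales with corroborating source count.
function multiSourceConfidence(sourceCount: number): number {
  if (sourceCount >= 5) return 0.9;
  if (sourceCount >= 3) return 0.8;
  return 0.7;   // 1-2 sources: floor of the news-path range
}

// Pass requires confidence >= 0.7 AND no flag containing "critical".
function isValidated(confidence: number, flags: string[]): boolean {
  return confidence >= 0.7 && !flags.some((f) => f.includes("critical"));
}
```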

Post-Validation Fan-Out

On validation success, two independent queue messages are enqueued in parallel:

  1. RESOLVE_IMAGE — find a suitable image for the fact
  2. GENERATE_CHALLENGE_CONTENT — pre-generate challenge variants

These are independent: they write to different columns/tables and have no shared state.


Phase 5a: Image Resolution

Queue Message: RESOLVE_IMAGE

{
  type: "RESOLVE_IMAGE",
  payload: {
    fact_record_id: "uuid",
    title: "SpaceX Starship Mars Launch",
    topic_path: "science/space"
  }
}

Worker: worker-ingest

The Cascade

apps/worker-ingest/src/handlers/resolve-image.ts runs a priority cascade, stopping at the first hit:

Priority  Source                How It Works                                        Best For
1         Wikipedia PageImages  MediaWiki API thumbnail lookup by entity name       Named entities (~80% hit rate)
2         TheSportsDB           Team/athlete badge/logo lookup (sports topics only) Sports teams, athletes
3         Unsplash              Photo search by title keywords                      Abstract/topical subjects
4         Pexels                Photo search by title keywords                      Fallback photo pool
5         null                  UI renders a themed placeholder                     When nothing matches

Wikipedia is tried first because facts are entity-centric, and Wikipedia covers most notable entities. The topic_path is used to gate TheSportsDB (only tried if the path contains "sport").

Idempotency

If the fact already has an image_url, the handler exits early without making any API calls.
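The cascade, the sports gating, and the idempotency early-exit can be sketched as follows; the resolver functions are stand-ins for the real Wikipedia/SportsDB/Unsplash/Pexels clients:

```typescript
// Sketch of the priority cascade with the idempotency early-exit.
type ImageResolver = (title: string) => Promise<string | null>;

interface FactImageInput {
  imageUrl: string | null;
  title: string;
  topicPath: string;   // e.g. "science/space"
}

async function resolveImage(
  fact: FactImageInput,
  resolvers: {
    wikipedia: ImageResolver;
    sportsDb: ImageResolver;
    unsplash: ImageResolver;
    pexels: ImageResolver;
  },
): Promise<string | null> {
  // Idempotency: skip all API calls if an image already exists.
  if (fact.imageUrl) return fact.imageUrl;

  const cascade: ImageResolver[] = [resolvers.wikipedia];
  // TheSportsDB is gated on the topic path containing "sport".
  if (fact.topicPath.includes("sport")) cascade.push(resolvers.sportsDb);
  cascade.push(resolvers.unsplash, resolvers.pexels);

  for (const resolve of cascade) {
    const url = await resolve(fact.title);
    if (url) return url;   // stop at the first hit
  }
  return null;             // UI renders a themed placeholder
}
```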


Phase 5b: Challenge Content Generation

Queue Message: GENERATE_CHALLENGE_CONTENT

{
  type: "GENERATE_CHALLENGE_CONTENT",
  payload: {
    fact_record_id: "uuid",
    difficulty: 2              // 1-5 scale
  }
}

Worker: worker-facts

What Gets Generated

apps/worker-facts/src/handlers/generate-challenge-content.ts calls generateChallengeContent() from @eko/ai, which produces content for each of the 6 pre-generated styles:

Style            UI Pattern                  Example
multiple_choice  Pick from A/B/C/D           "Which planet did SpaceX target?"
direct_question  Answer a specific question  "Where is Starship heading?"
fill_the_gap     Complete a sentence         "SpaceX launched Starship to ___"
statement_blank  Fill in a statement         "___ launched the first crewed Mars mission"
reverse_lookup   Identify from a description "Which company launched a Mars mission in 2026?"
free_text        Open-ended response         "Why is the Starship Mars launch significant?"

Two styles are exempt from pre-generation:

  • conversational — generated in real-time during multi-turn dialogue
  • progressive_image_reveal — requires runtime image processing

The 5-Layer Structure

Every generated style includes:

  1. setup_text — 2-4 sentences of freely shared context (the "offer knowledge" layer)
  2. challenge_text — an invitation to answer, always using second-person address ("you")
  3. reveal_correct — celebrates the user's knowledge with an additional teaching detail
  4. reveal_wrong — teaches without punishing, includes the correct answer and context
  5. correct_answer — 3-6 sentence rich narrative for animated streaming display

Results are upserted into fact_challenge_content with a unique constraint on (fact_record_id, challenge_style, target_fact_key, difficulty).

Quality Enforcement

Content is validated at generation time against the Challenge Content Rules (CC and CQ rules). Key checks:

  • No banned patterns ("Trivia", "Quiz", "Correct!", "Wrong!")
  • setup_text must contain specific details (names, dates, numbers)
  • challenge_text must contain "you" or "your"
  • correct_answer must be 100+ characters with narrative depth
  • Multiple choice must have exactly 4 plausible options
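An illustrative subset of those checks as a flag-collecting validator; the real rules live in packages/ai/src/challenge-content-rules.ts, and the field and flag names below are assumptions:

```typescript
// Sketch of a few CC/CQ-style checks over generated challenge content.
const BANNED_PATTERNS = ["Trivia", "Quiz", "Correct!", "Wrong!"];

interface GeneratedStyle {
  setup_text: string;
  challenge_text: string;
  correct_answer: string;
  options?: string[];   // multiple choice only
}

function qualityFlags(c: GeneratedStyle): string[] {
  const flags: string[] = [];
  const all = `${c.setup_text} ${c.challenge_text} ${c.correct_answer}`;
  for (const p of BANNED_PATTERNS) {
    if (all.includes(p)) flags.push(`banned_pattern:${p}`);
  }
  // challenge_text must address the user directly ("you"/"your").
  if (!/\byou(r)?\b/i.test(c.challenge_text)) flags.push("missing_second_person");
  // correct_answer needs narrative depth: 100+ characters.
  if (c.correct_answer.length < 100) flags.push("correct_answer_too_short");
  // Multiple choice must carry exactly 4 options.
  if (c.options && c.options.length !== 4) flags.push("wrong_option_count");
  return flags;
}
```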

Phase 6: Feed Delivery

Once a fact has:

  • status: 'validated'
  • published_at set
  • Challenge content rows in fact_challenge_content (for ≥ 3 styles)
  • An image (or the UI uses a placeholder)

...it enters the feed algorithm.

Feed Blending (Authenticated Users)

Stream            Weight  Description
Recent validated  40%     Newly published facts, freshest content
Review-due        30%     Facts where next_review_at has passed (spaced repetition)
Evergreen         20%     Timeless knowledge facts with no expiry
Exploration       10%     Random facts for discovery across unfamiliar topics

Unauthenticated users see a chronological-only feed.

Cards are interleaved round-robin and tagged with a userStatus badge: new, attempted, due, or mastered (based on streak ≥ 5).
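The weighted blend plus round-robin interleave can be sketched as follows, assuming a fixed page size per refresh (the real algorithm's pagination and tie-breaking may differ):

```typescript
// Sketch of weighted stream blending with round-robin interleave.
const STREAM_WEIGHTS = {
  recent: 0.4,        // newly published
  reviewDue: 0.3,     // next_review_at has passed
  evergreen: 0.2,     // no expiry
  exploration: 0.1,   // random discovery
};

type StreamKey = keyof typeof STREAM_WEIGHTS;

function blendFeed<T>(streams: Record<StreamKey, T[]>, pageSize: number): T[] {
  // Take each stream's weighted share of the page...
  const buckets = (Object.keys(STREAM_WEIGHTS) as StreamKey[]).map((k) =>
    streams[k].slice(0, Math.round(STREAM_WEIGHTS[k] * pageSize)),
  );
  // ...then interleave round-robin until the page is full.
  const out: T[] = [];
  for (let i = 0; out.length < pageSize; i++) {
    const before = out.length;
    for (const b of buckets) if (i < b.length) out.push(b[i]);
    if (out.length === before) break;   // all buckets exhausted
  }
  return out.slice(0, pageSize);
}
```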


Alternative Entry Paths

The pipeline above covers the news path — articles from external APIs. Two other paths feed into the same fact_records table:

Evergreen Generation

GENERATE_EVERGREEN messages produce timeless knowledge facts via AI for each topic category. These facts:

  • Have no expires_at (they never expire)
  • Use ai_cross_check validation (no news sources to corroborate)
  • Are deduped against existing fact titles for the topic
  • Are quota-controlled: EVERGREEN_DAILY_QUOTA per day, balanced across categories

Seed Pipeline (Bulk Import)

The seed pipeline is for manually curated content:

  1. EXPLODE_CATEGORY_ENTRY — takes a seed entry (e.g., "Prince") and AI-generates multiple structured facts, discovers spin-off entities, and identifies super-fact candidates
  2. IMPORT_FACTS — bulk-imports structured facts from any source (seed files, external APIs, manual entry)

Both paths converge at fact_records with status: 'pending_validation', then follow the same validation → image → challenge content pipeline as news-extracted facts.


Queue Architecture

All queue messages flow through Upstash Redis, managed by packages/queue/src/index.ts.

Queue Names

Each message type has a dedicated queue (with optional soak-test suffix):

queue:ingest_news
queue:cluster_stories
queue:extract_facts
queue:import_facts
queue:validate_fact
queue:generate_evergreen
queue:resolve_image
queue:resolve_challenge_image
queue:explode_category_entry
queue:find_super_facts
queue:generate_challenge_content

Failure Handling

  • Messages are retried up to 3 times (MAX_ATTEMPTS)
  • Failed messages are moved to a dead-letter queue (:dlq suffix)
  • Each message has a 2-minute lease duration to prevent double-processing
  • Workers use exponential backoff polling and scale-to-zero idle exit
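The retry/DLQ policy can be sketched as a pair of pure decision functions; the message shape and helper names are illustrative, not the actual packages/queue exports:

```typescript
// Sketch of the retry / dead-letter / lease policy described above.
const MAX_ATTEMPTS = 3;
const LEASE_MS = 2 * 60 * 1000;   // 2-minute lease

interface QueueMessage {
  id: string;
  attempts: number;
  leaseExpiresAt: number;   // epoch ms
}

// Decide where a failed message goes next: back onto the queue, or
// into the dead-letter queue (":dlq" suffix) after MAX_ATTEMPTS.
function onFailure(msg: QueueMessage): { destination: "retry" | "dlq"; attempts: number } {
  const attempts = msg.attempts + 1;
  return attempts >= MAX_ATTEMPTS
    ? { destination: "dlq", attempts }
    : { destination: "retry", attempts };
}

// A message may be handed to a worker only once its lease has expired,
// preventing double-processing.
function isLeaseExpired(msg: QueueMessage, now: number): boolean {
  return now >= msg.leaseExpiresAt;
}

// Acquiring a message stamps a fresh lease.
function newLeaseExpiry(now: number): number {
  return now + LEASE_MS;
}
```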

Worker Assignment

Worker           Queues Consumed
worker-ingest    INGEST_NEWS, CLUSTER_STORIES, RESOLVE_IMAGE, RESOLVE_CHALLENGE_IMAGE
worker-facts     EXTRACT_FACTS, IMPORT_FACTS, GENERATE_EVERGREEN, EXPLODE_CATEGORY_ENTRY, FIND_SUPER_FACTS, GENERATE_CHALLENGE_CONTENT
worker-validate  VALIDATE_FACT

Timing: End-to-End Latency

For a breaking news story, the typical path is:

Phase                 Typical Duration  Bottleneck
Cron → INGEST_NEWS    0-15 min          Cron interval
Fetch articles        2-5 sec           API response time
Clustering            1-3 sec           TF-IDF computation
AI extraction         5-15 sec          LLM inference
Validation            < 1 sec           Multi-source is synchronous
Image resolution      2-10 sec          Wikipedia API (first try)
Challenge generation  10-30 sec         LLM inference for 6 styles
Total                 ~1-16 minutes     Dominated by cron interval

Once a fact hits the feed, it's available to users within the next feed refresh.


Key Files Reference

File                                                          Role in Pipeline
apps/web/app/api/cron/ingest-news/route.ts                    Cron trigger, provider detection, queue dispatch
apps/worker-ingest/src/handlers/ingest-news.ts                Provider clients (5 providers), StandardArticle normalization, quality filter, dedup
apps/worker-ingest/src/handlers/resolve-image.ts              Image cascade for fact records (Wikipedia → SportsDB → Unsplash → Pexels)
apps/worker-ingest/src/handlers/resolve-challenge-image.ts    Image cascade for challenge content rows
apps/worker-facts/src/handlers/extract-facts.ts               AI fact extraction, notability gating
apps/worker-facts/src/handlers/generate-evergreen.ts          AI evergreen fact generation
apps/worker-facts/src/handlers/generate-challenge-content.ts  AI challenge content per style
apps/worker-facts/src/handlers/explode-entry.ts               Seed entry explosion, spinoff discovery
apps/worker-facts/src/handlers/import-facts.ts                Bulk fact import, validation strategy selection
apps/worker-validate/src/handlers/validate-fact.ts            Tiered validation, post-validation fan-out
packages/queue/src/index.ts                                   Queue client, message constructors, DLQ routing
packages/shared/src/schemas.ts                                Zod schemas for all queue message types
packages/config/src/index.ts                                  FactEngineConfig, API key management, thresholds
packages/db/src/drizzle/schema.ts                             Table definitions (news_sources, stories, fact_records, etc.)
packages/db/src/drizzle/fact-engine-queries.ts                Pipeline query functions
packages/ai/src/challenge-content-rules.ts                    Content quality validation, banned patterns
packages/ai/src/challenge-content.ts                          AI challenge content generation function