News-to-Challenge Ingestion Guide

A step-by-step walkthrough of how a breaking news article becomes a playable challenge card in Eko. This guide traces the full journey through the pipeline — from external news API to the user's feed.

The Journey at a Glance

  Breaking news published    "SpaceX launches Starship to Mars"
         │
         ▼
  ┌─ PHASE 1: INGEST ──────────────────────────────────────────┐
  │  Cron fires every 15 min → fetches from up to 5 news APIs  │
  │  Articles normalized to StandardArticle, quality-filtered   │
  │  (≥400 chars), deduped by hash, inserted into news_sources  │
  └──────────────────────────────────────────────┬──────────────┘
         │  ≥ 5 new articles?
         ▼
  ┌─ PHASE 2: CLUSTER ─────────────────────────────────────────┐
  │  TF-IDF cosine similarity within 24h window                 │
  │  Multiple articles about the same event → one story         │
  │  Prevents duplicate fact extraction                         │
  └──────────────────────────────────────────────┬──────────────┘
         │
         ▼
  ┌─ PHASE 3: EXTRACT ─────────────────────────────────────────┐
  │  AI reads story + linked articles                           │
  │  Produces: title, facts{}, context, challenge_title         │
  │  Notability score ≥ 0.6 → insert fact_record               │
  └──────────────────────────────────────────────┬──────────────┘
         │
         ▼
  ┌─ PHASE 4: VALIDATE ────────────────────────────────────────┐
  │  Strategy picked by source type (multi_source for news)     │
  │  Confidence ≥ 0.7 + no critical flags → validated           │
  │  published_at set, RESOLVE_IMAGE + GENERATE_CHALLENGE fired │
  └──────────────────────────────────────────────┬──────────────┘
         │                                       │
         ▼                                       ▼
  ┌─ PHASE 5a: IMAGE ──────┐   ┌─ PHASE 5b: CHALLENGE CONTENT ┐
  │  Wikipedia → SportsDB   │   │  AI generates 3-5 styles:     │
  │  → Unsplash → Pexels    │   │  MC, FTG, DQ, SB, RL, FT     │
  │  → null (placeholder)   │   │  Each with 5-layer structure   │
  └─────────────────────────┘   └──────────────────────────────┘
         │                                       │
         └───────────────┬───────────────────────┘
                         ▼
  ┌─ PHASE 6: FEED ────────────────────────────────────────────┐
  │  Blended algorithm: 40% recent, 30% review-due,            │
  │  20% evergreen, 10% exploration                            │
  │  User sees challenge card → plays it → spaced repetition   │
  └────────────────────────────────────────────────────────────┘

Phase 1: News Ingestion

Trigger

A Vercel cron job fires every 15 minutes, hitting apps/web/app/api/cron/ingest-news/route.ts. The route is auth-gated via CRON_SECRET (checked in both Authorization and x-vercel-cron-authorization headers).

Provider Detection

The cron checks which news API keys are configured, prioritizing v2 full-content providers:

Env Variable             Provider         Queue Value      Content
EVENT_REGISTRY_API_KEY   Event Registry   event_registry   Full body (v2)
NEWSDATA_API_KEY         Newsdata.io      newsdata         Full body on paid tiers (v2)
NEWS_API_KEY             NewsAPI.org      newsapi          Truncated (legacy)
GOOGLE_NEWS_API_KEY      GNews            gnews            Truncated (legacy)
THENEWS_API_KEY          TheNewsAPI       thenewsapi       Truncated (legacy)

For each configured provider, it fetches active topic categories from the database using getActiveTopicCategories({ maxDepth: 0 }) and enqueues one INGEST_NEWS message per provider per category. The maxDepth: 0 filter restricts queries to root-level categories only, preventing quota explosion when subcategories are added to the taxonomy.

Queue Message: INGEST_NEWS

{
  type: "INGEST_NEWS",
  payload: {
    provider: "event_registry" | "newsdata" | "newsapi" | "gnews" | "thenewsapi",
    category: "science",     // topic category slug
    max_results: 20
  }
}
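The real message schemas are Zod schemas in packages/shared/src/schemas.ts. As a dependency-free sketch of the same contract (the type names and the guard below are illustrative, not the actual exports):

```typescript
// Illustrative shape of the INGEST_NEWS message; the production schema
// lives in packages/shared/src/schemas.ts as a Zod schema.
type NewsProvider =
  | "event_registry" | "newsdata" | "newsapi" | "gnews" | "thenewsapi";

interface IngestNewsMessage {
  type: "INGEST_NEWS";
  payload: {
    provider: NewsProvider;
    category: string;      // topic category slug, e.g. "science"
    max_results: number;
  };
}

const PROVIDERS: NewsProvider[] = [
  "event_registry", "newsdata", "newsapi", "gnews", "thenewsapi",
];

// Runtime guard for messages pulled off the queue.
function isIngestNewsMessage(msg: unknown): msg is IngestNewsMessage {
  if (typeof msg !== "object" || msg === null) return false;
  const m = msg as Record<string, unknown>;
  if (m.type !== "INGEST_NEWS") return false;
  const p = m.payload as Record<string, unknown> | undefined;
  return (
    !!p &&
    PROVIDERS.includes(p.provider as NewsProvider) &&
    typeof p.category === "string" &&
    typeof p.max_results === "number"
  );
}
```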

Worker: worker-ingest

What the Handler Does

apps/worker-ingest/src/handlers/ingest-news.ts:

  1. Routes to the correct provider client (fetchFromEventRegistry, fetchFromNewsdata, fetchFromNewsApi, fetchFromGNews, fetchFromTheNewsApi)
  2. Each provider normalizes its response to the StandardArticle contract:
    • externalId — stable hash of the article URL (or native UUID for TheNewsAPI)
    • contentHash — Bun wyhash of article content for cross-provider dedup
    • title, description, articleUrl, imageUrl, publishedAt, sourceName, sourceDomain
    • fullContent — full article body (v2 providers only; null for legacy)
  3. Articles are filtered by MIN_ARTICLE_TEXT_LENGTH (400 chars, measured as title + description length). Articles below this threshold are discarded to prevent LLM hallucination from thin snippets
  4. Qualifying articles are bulk-inserted into news_sources with ON CONFLICT DO NOTHING on (provider, external_id), which silently skips duplicates
  5. If ≥ 5 new articles were inserted, a CLUSTER_STORIES message is enqueued
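The filter-and-dedupe steps above can be sketched as follows. StandardArticle is trimmed to the relevant fields, and the helper names are illustrative (the real dedup happens in Postgres via ON CONFLICT DO NOTHING):

```typescript
// Sketch of the quality gate and in-batch dedup described in steps 3-4.
interface StandardArticle {
  externalId: string;          // stable hash of the article URL
  contentHash: string;         // content hash for cross-provider dedup
  title: string;
  description: string;
  fullContent: string | null;  // null for legacy providers
}

const MIN_ARTICLE_TEXT_LENGTH = 400;

// Step 3: discard thin snippets (title + description below the threshold)
// to prevent LLM hallucination during extraction.
function passesQualityFilter(a: StandardArticle): boolean {
  return a.title.length + a.description.length >= MIN_ARTICLE_TEXT_LENGTH;
}

// Step 4 analogue of ON CONFLICT DO NOTHING: keep the first article per
// externalId within a batch, silently skipping duplicates.
function dedupeBatch(articles: StandardArticle[]): StandardArticle[] {
  const seen = new Set<string>();
  const kept: StandardArticle[] = [];
  for (const a of articles) {
    if (seen.has(a.externalId)) continue;
    seen.add(a.externalId);
    kept.push(a);
  }
  return kept;
}
```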

Observability

Every handler call creates an ingestion_runs record tracking status (completed/failed), counts (recordsProcessed, recordsCreated), and timing. Metrics are emitted via @eko/observability (ingest.news_sources_inserted, ingest.news_sources_duplicates).


Phase 2: Story Clustering

Queue Message: CLUSTER_STORIES

{
  type: "CLUSTER_STORIES",
  payload: {
    news_source_ids: ["uuid-1", "uuid-2", ...],
    time_window_hours: 24
  }
}

Worker: worker-ingest

What Happens

Articles about the same real-world event are grouped into a single story using TF-IDF cosine similarity on article titles and descriptions, within a 24-hour time window.

  • Each story gets a headline, summary, source_count, and source_domains[]
  • The source_count is critical later: it determines validation confidence during the multi-source strategy
  • Stories progress through statuses: clustering → published → archived

Why Clustering Matters

Without clustering, "SpaceX Starship Launches" reported by 15 outlets would produce 15 near-identical facts. Clustering deduplicates at the story level so AI extraction runs once per event, not once per article.
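A minimal version of the similarity signal, assuming plain term-frequency TF-IDF over lowercase word tokens with a smoothed IDF (the production tokenizer and weighting may differ):

```typescript
// Illustrative TF-IDF + cosine similarity over article titles/descriptions.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Build one sparse TF-IDF vector per document.
function tfidfVectors(docs: string[]): Map<string, number>[] {
  const tokenized = docs.map(tokenize);
  const df = new Map<string, number>();           // document frequency
  for (const tokens of tokenized) {
    for (const t of new Set(tokens)) df.set(t, (df.get(t) ?? 0) + 1);
  }
  const n = docs.length;
  return tokenized.map((tokens) => {
    const vec = new Map<string, number>();
    for (const t of tokens) vec.set(t, (vec.get(t) ?? 0) + 1);  // raw TF
    for (const [t, tf] of vec) {
      vec.set(t, tf * Math.log(1 + n / (df.get(t) ?? 1)));      // smoothed IDF
    }
    return vec;
  });
}

function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0, na = 0, nb = 0;
  for (const [t, w] of a) { dot += w * (b.get(t) ?? 0); na += w * w; }
  for (const w of b.values()) nb += w * w;
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}
```

Articles whose pairwise cosine clears a similarity threshold within the 24-hour window land in the same story.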


Phase 3: Fact Extraction

Queue Message: EXTRACT_FACTS

{
  type: "EXTRACT_FACTS",
  payload: {
    story_id: "uuid",
    topic_category_id: "uuid"    // optional, resolved from story if absent
  }
}

Worker: worker-facts

What the Handler Does

apps/worker-facts/src/handlers/extract-facts.ts:

  1. Fetches the story with all its linked news_sources

  2. Resolves the topic category using resolveTopicCategory(), a 3-step alias fallback (with an audit path when all three steps miss):

    1. Exact slug match against topic_categories.slug
    2. Provider-specific alias lookup in topic_category_aliases (where provider = providerName)
    3. Universal alias lookup in topic_category_aliases (where provider IS NULL)
    4. If no match found: logs the unresolved slug to unmapped_category_log for audit and skips extraction

    The topic_category_aliases table allows external provider slugs (e.g., GNews's "business" or TheNewsAPI's "tech") to map to Eko's internal taxonomy without requiring 1:1 slug naming. The unmapped_category_log provides visibility into coverage gaps.

  3. Loads the fact_record_schemas for that topic — this defines which JSONB keys the AI must produce (e.g., for a sports topic: team, score, opponent, date)

  4. Concatenates article titles + descriptions into article texts

  5. Calls extractFactsFromStory() from @eko/ai, which returns:

    • title — factual, Wikipedia-style label
    • challengeTitle — theatrical, curiosity-provoking hook
    • facts — structured JSONB conforming to the schema
    • context — 4-8 sentence narrative (Hook → Story → Connection)
    • notabilityScore — 0.0-1.0 AI assessment
    • notabilityReason — one-sentence justification
  6. If notabilityScore ≥ 0.6 (configurable via NOTABILITY_THRESHOLD):

    • Inserts a fact_records row with status: 'pending_validation'
    • Sets expires_at to 30 days from now (news facts are perishable)
    • Enqueues VALIDATE_FACT with strategy multi_source
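The alias fallback in step 2 can be sketched as a pure function over in-memory tables; the row shapes below are simplified stand-ins for topic_categories and topic_category_aliases:

```typescript
// Illustrative sketch of the 3-step alias fallback from step 2.
interface AliasRow {
  alias: string;
  provider: string | null;   // null = universal alias
  categoryId: string;
}

function resolveTopicCategory(
  slug: string,
  providerName: string,
  categories: Map<string, string>,   // slug -> category id
  aliases: AliasRow[],
): string | null {
  // 1. Exact slug match against topic_categories.slug
  const exact = categories.get(slug);
  if (exact) return exact;
  // 2. Provider-specific alias (provider = providerName)
  const scoped = aliases.find((a) => a.alias === slug && a.provider === providerName);
  if (scoped) return scoped.categoryId;
  // 3. Universal alias (provider IS NULL)
  const universal = aliases.find((a) => a.alias === slug && a.provider === null);
  if (universal) return universal.categoryId;
  // No match: caller logs to unmapped_category_log and skips extraction
  return null;
}
```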

The Notability Gate

The 0.6 threshold is the pipeline's first quality filter. Facts below this score are logged and discarded — they represent mundane or low-signal news that wouldn't make a good challenge. The threshold is configurable via NOTABILITY_THRESHOLD in environment config.
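A sketch of the gate and the 30-day expiry stamp; the hard-coded 0.6 default stands in for the env-configurable NOTABILITY_THRESHOLD, and field names follow the extraction result described above:

```typescript
// Sketch of the notability gate and expiry stamping.
interface ExtractionResult {
  notabilityScore: number;   // 0.0-1.0 AI assessment
  notabilityReason: string;  // one-sentence justification
}

const DEFAULT_NOTABILITY_THRESHOLD = 0.6;  // overridable via NOTABILITY_THRESHOLD

function passesNotabilityGate(
  r: ExtractionResult,
  threshold: number = DEFAULT_NOTABILITY_THRESHOLD,
): boolean {
  return r.notabilityScore >= threshold;   // inclusive: exactly 0.6 passes
}

// News facts are perishable: expires_at = now + 30 days.
function newsFactExpiry(now: Date): Date {
  return new Date(now.getTime() + 30 * 24 * 60 * 60 * 1000);
}
```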


Phase 4: Fact Validation

Queue Message: VALIDATE_FACT

{
  type: "VALIDATE_FACT",
  payload: {
    fact_record_id: "uuid",
    strategy: "multi_source",    // for news-extracted facts
    retry_count: 0
  }
}

Worker: worker-validate

Validation Strategy Selection

The strategy is chosen by source type at enqueue time:

Source Type               Strategy           Confidence Range
news_extraction           multi_source       0.7-0.9 (based on source count)
api_import                authoritative_api  0.95 (trusted pass-through)
ai_generated / file_seed  ai_cross_check     0.0-1.0 (AI plausibility check)
Curated databases         curated_database   0.95 (trusted pass-through)

Multi-Source Validation (News Path)

For news-extracted facts, confidence scales with how many independent sources corroborate the story:

  • 5+ sources → confidence 0.9
  • 3-4 sources → confidence 0.8
  • 1-2 sources → confidence 0.7

Pass/Fail Criteria

A fact is validated when:

  • confidence ≥ 0.7 AND
  • No flags containing "critical"

A validated fact gets:

  • status updated to 'validated'
  • published_at set to NOW()
  • validation JSONB populated with strategy, sources, confidence, flags

A rejected fact gets status: 'rejected' and is excluded from the feed.
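The confidence tiers and the pass/fail gate above can be sketched as two small functions (names are illustrative):

```typescript
// Multi-source tiers: confidence scales with corroborating source count.
function multiSourceConfidence(sourceCount: number): number {
  if (sourceCount >= 5) return 0.9;
  if (sourceCount >= 3) return 0.8;
  return 0.7;   // 1-2 sources: floor of the news-path range
}

// Pass requires confidence >= 0.7 AND no flag containing "critical".
function isValidated(confidence: number, flags: string[]): boolean {
  return confidence >= 0.7 && !flags.some((f) => f.includes("critical"));
}
```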

Post-Validation Fan-Out

On validation success, two independent queue messages are enqueued in parallel:

  1. RESOLVE_IMAGE — find a suitable image for the fact
  2. GENERATE_CHALLENGE_CONTENT — pre-generate challenge variants

These are independent: they write to different columns/tables and have no shared state.


Phase 5a: Image Resolution

Queue Message: RESOLVE_IMAGE

{
  type: "RESOLVE_IMAGE",
  payload: {
    fact_record_id: "uuid",
    title: "SpaceX Starship Mars Launch",
    topic_path: "science/space"
  }
}

Worker: worker-ingest

The Cascade

apps/worker-ingest/src/handlers/resolve-image.ts runs a priority cascade, stopping at the first hit:

Priority  Source                How It Works                                        Best For
1         Wikipedia PageImages  MediaWiki API thumbnail lookup by entity name       Named entities (~80% hit rate)
2         TheSportsDB           Team/athlete badge/logo lookup (sports topics only) Sports teams, athletes
3         Unsplash              Photo search by title keywords                      Abstract/topical subjects
4         Pexels                Photo search by title keywords                      Fallback photo pool
5         null                  UI renders a themed placeholder                     When nothing matches

Wikipedia is tried first because facts are entity-centric, and Wikipedia covers most notable entities. The topic_path is used to gate TheSportsDB (only tried if the path contains "sport").

Idempotency

If the fact already has an image_url, the handler exits early without making any API calls.
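The cascade, the sports gating, and the idempotency early-exit can be sketched as follows; the resolver functions are stand-ins for the real Wikipedia/SportsDB/Unsplash/Pexels clients:

```typescript
// Sketch of the priority cascade with the idempotency early-exit.
type ImageResolver = (title: string) => Promise<string | null>;

interface FactImageInput {
  imageUrl: string | null;
  title: string;
  topicPath: string;   // e.g. "science/space"
}

async function resolveImage(
  fact: FactImageInput,
  resolvers: {
    wikipedia: ImageResolver;
    sportsDb: ImageResolver;
    unsplash: ImageResolver;
    pexels: ImageResolver;
  },
): Promise<string | null> {
  // Idempotency: skip all API calls if an image already exists.
  if (fact.imageUrl) return fact.imageUrl;

  const cascade: ImageResolver[] = [resolvers.wikipedia];
  // TheSportsDB is gated on the topic path containing "sport".
  if (fact.topicPath.includes("sport")) cascade.push(resolvers.sportsDb);
  cascade.push(resolvers.unsplash, resolvers.pexels);

  for (const resolve of cascade) {
    const url = await resolve(fact.title);
    if (url) return url;   // stop at the first hit
  }
  return null;             // UI renders a themed placeholder
}
```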


Phase 5b: Challenge Content Generation

Queue Message: GENERATE_CHALLENGE_CONTENT

{
  type: "GENERATE_CHALLENGE_CONTENT",
  payload: {
    fact_record_id: "uuid",
    difficulty: 2              // 1-5 scale
  }
}

Worker: worker-facts

What Gets Generated

apps/worker-facts/src/handlers/generate-challenge-content.ts calls generateChallengeContent() from @eko/ai, which produces content for each of the 6 pre-generated styles:

Style            UI Pattern                  Example
multiple_choice  Pick from A/B/C/D           "Which planet did SpaceX target?"
direct_question  Answer a specific question  "Where is Starship heading?"
fill_the_gap     Complete a sentence         "SpaceX launched Starship to ___"
statement_blank  Fill in a statement         "___ launched the first crewed Mars mission"
reverse_lookup   Identify from a description "Which company launched a Mars mission in 2026?"
free_text        Open-ended response         "Why is the Starship Mars launch significant?"

Two styles are exempt from pre-generation:

  • conversational — generated in real-time during multi-turn dialogue
  • progressive_image_reveal — requires runtime image processing

The 5-Layer Structure

Every generated style includes:

  1. setup_text — 2-4 sentences of freely shared context (the "offer knowledge" layer)
  2. challenge_text — an invitation to answer, always using second-person address ("you")
  3. reveal_correct — celebrates the user's knowledge with an additional teaching detail
  4. reveal_wrong — teaches without punishing, includes the correct answer and context
  5. correct_answer — 3-6 sentence rich narrative for animated streaming display

Results are upserted into fact_challenge_content with a unique constraint on (fact_record_id, challenge_style, target_fact_key, difficulty).

Quality Enforcement

Content is validated at generation time against the Challenge Content Rules (CC and CQ rules). Key checks:

  • No banned patterns ("Trivia", "Quiz", "Correct!", "Wrong!")
  • setup_text must contain specific details (names, dates, numbers)
  • challenge_text must contain "you" or "your"
  • correct_answer must be 100+ characters with narrative depth
  • Multiple choice must have exactly 4 plausible options
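An illustrative subset of those checks as a flag-collecting validator; the real rules live in packages/ai/src/challenge-content-rules.ts, and the field and flag names below are assumptions:

```typescript
// Sketch of a few CC/CQ-style checks over generated challenge content.
const BANNED_PATTERNS = ["Trivia", "Quiz", "Correct!", "Wrong!"];

interface GeneratedStyle {
  setup_text: string;
  challenge_text: string;
  correct_answer: string;
  options?: string[];   // multiple choice only
}

function qualityFlags(c: GeneratedStyle): string[] {
  const flags: string[] = [];
  const all = `${c.setup_text} ${c.challenge_text} ${c.correct_answer}`;
  for (const p of BANNED_PATTERNS) {
    if (all.includes(p)) flags.push(`banned_pattern:${p}`);
  }
  // challenge_text must address the user directly ("you"/"your").
  if (!/\byou(r)?\b/i.test(c.challenge_text)) flags.push("missing_second_person");
  // correct_answer needs narrative depth: 100+ characters.
  if (c.correct_answer.length < 100) flags.push("correct_answer_too_short");
  // Multiple choice must carry exactly 4 options.
  if (c.options && c.options.length !== 4) flags.push("wrong_option_count");
  return flags;
}
```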

Phase 6: Feed Delivery

Once a fact has:

  • status: 'validated'
  • published_at set
  • Challenge content rows in fact_challenge_content (for ≥ 3 styles)
  • An image (or the UI uses a placeholder)

...it enters the feed algorithm.

Feed Blending (Authenticated Users)

Stream            Weight  Description
Recent validated  40%     Newly published facts, freshest content
Review-due        30%     Facts where next_review_at has passed (spaced repetition)
Evergreen         20%     Timeless knowledge facts with no expiry
Exploration       10%     Random facts for discovery across unfamiliar topics

Unauthenticated users see a chronological-only feed.

Cards are interleaved round-robin and tagged with a userStatus badge: new, attempted, due, or mastered (based on streak ≥ 5).
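The weighted blend plus round-robin interleave can be sketched as follows, assuming a fixed page size per refresh (the real algorithm's pagination and tie-breaking may differ):

```typescript
// Sketch of weighted stream blending with round-robin interleave.
const STREAM_WEIGHTS = {
  recent: 0.4,        // newly published
  reviewDue: 0.3,     // next_review_at has passed
  evergreen: 0.2,     // no expiry
  exploration: 0.1,   // random discovery
};

type StreamKey = keyof typeof STREAM_WEIGHTS;

function blendFeed<T>(streams: Record<StreamKey, T[]>, pageSize: number): T[] {
  // Take each stream's weighted share of the page...
  const buckets = (Object.keys(STREAM_WEIGHTS) as StreamKey[]).map((k) =>
    streams[k].slice(0, Math.round(STREAM_WEIGHTS[k] * pageSize)),
  );
  // ...then interleave round-robin until the page is full.
  const out: T[] = [];
  for (let i = 0; out.length < pageSize; i++) {
    const before = out.length;
    for (const b of buckets) if (i < b.length) out.push(b[i]);
    if (out.length === before) break;   // all buckets exhausted
  }
  return out.slice(0, pageSize);
}
```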


Alternative Entry Paths

The pipeline above covers the news path — articles from external APIs. Two other paths feed into the same fact_records table:

Evergreen Generation

GENERATE_EVERGREEN messages produce timeless knowledge facts via AI for each topic category. These facts:

  • Have no expires_at (they never expire)
  • Use ai_cross_check validation (no news sources to corroborate)
  • Are deduped against existing fact titles for the topic
  • Are quota-controlled: EVERGREEN_DAILY_QUOTA per day, balanced across categories

Seed Pipeline (Bulk Import)

The seed pipeline is for manually curated content:

  1. EXPLODE_CATEGORY_ENTRY — takes a seed entry (e.g., "Prince") and AI-generates multiple structured facts, discovers spin-off entities, and identifies super-fact candidates
  2. IMPORT_FACTS — bulk-imports structured facts from any source (seed files, external APIs, manual entry)

Both paths converge at fact_records with status: 'pending_validation', then follow the same validation → image → challenge content pipeline as news-extracted facts.


Queue Architecture

All queue messages flow through Upstash Redis, managed by packages/queue/src/index.ts.

Queue Names

Each message type has a dedicated queue (with optional soak-test suffix):

queue:ingest_news
queue:cluster_stories
queue:extract_facts
queue:import_facts
queue:validate_fact
queue:generate_evergreen
queue:resolve_image
queue:resolve_challenge_image
queue:explode_category_entry
queue:find_super_facts
queue:generate_challenge_content

Failure Handling

  • Messages are retried up to 3 times (MAX_ATTEMPTS)
  • Failed messages are moved to a dead-letter queue (:dlq suffix)
  • Each message has a 2-minute lease duration to prevent double-processing
  • Workers use exponential backoff polling and scale-to-zero idle exit
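The retry/DLQ policy can be sketched as a pair of pure decision functions; the message shape and helper names are illustrative, not the actual packages/queue exports:

```typescript
// Sketch of the retry / dead-letter / lease policy described above.
const MAX_ATTEMPTS = 3;
const LEASE_MS = 2 * 60 * 1000;   // 2-minute lease

interface QueueMessage {
  id: string;
  attempts: number;
  leaseExpiresAt: number;   // epoch ms
}

// Decide where a failed message goes next: back onto the queue, or
// into the dead-letter queue (":dlq" suffix) after MAX_ATTEMPTS.
function onFailure(msg: QueueMessage): { destination: "retry" | "dlq"; attempts: number } {
  const attempts = msg.attempts + 1;
  return attempts >= MAX_ATTEMPTS
    ? { destination: "dlq", attempts }
    : { destination: "retry", attempts };
}

// A message may be handed to a worker only once its lease has expired,
// preventing double-processing.
function isLeaseExpired(msg: QueueMessage, now: number): boolean {
  return now >= msg.leaseExpiresAt;
}

// Acquiring a message stamps a fresh lease.
function newLeaseExpiry(now: number): number {
  return now + LEASE_MS;
}
```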

Worker Assignment

Worker           Queues Consumed
worker-ingest    INGEST_NEWS, CLUSTER_STORIES, RESOLVE_IMAGE, RESOLVE_CHALLENGE_IMAGE
worker-facts     EXTRACT_FACTS, IMPORT_FACTS, GENERATE_EVERGREEN, EXPLODE_CATEGORY_ENTRY, FIND_SUPER_FACTS, GENERATE_CHALLENGE_CONTENT
worker-validate  VALIDATE_FACT

Timing: End-to-End Latency

For a breaking news story, the typical path is:

Phase                 Typical Duration  Bottleneck
Cron → INGEST_NEWS    0-15 min          Cron interval
Fetch articles        2-5 sec           API response time
Clustering            1-3 sec           TF-IDF computation
AI extraction         5-15 sec          LLM inference
Validation            < 1 sec           Multi-source is synchronous
Image resolution      2-10 sec          Wikipedia API (first try)
Challenge generation  10-30 sec         LLM inference for 6 styles
Total                 ~1-16 minutes     Dominated by cron interval

Once a fact hits the feed, it's available to users within the next feed refresh.


Key Files Reference

File                                                          Role in Pipeline
apps/web/app/api/cron/ingest-news/route.ts                    Cron trigger, provider detection, queue dispatch
apps/worker-ingest/src/handlers/ingest-news.ts                Provider clients (5 providers), StandardArticle normalization, quality filter, dedup
apps/worker-ingest/src/handlers/resolve-image.ts              Image cascade for fact records (Wikipedia → SportsDB → Unsplash → Pexels)
apps/worker-ingest/src/handlers/resolve-challenge-image.ts    Image cascade for challenge content rows
apps/worker-facts/src/handlers/extract-facts.ts               AI fact extraction, notability gating
apps/worker-facts/src/handlers/generate-evergreen.ts          AI evergreen fact generation
apps/worker-facts/src/handlers/generate-challenge-content.ts  AI challenge content per style
apps/worker-facts/src/handlers/explode-entry.ts               Seed entry explosion, spinoff discovery
apps/worker-facts/src/handlers/import-facts.ts                Bulk fact import, validation strategy selection
apps/worker-validate/src/handlers/validate-fact.ts            Tiered validation, post-validation fan-out
packages/queue/src/index.ts                                   Queue client, message constructors, DLQ routing
packages/shared/src/schemas.ts                                Zod schemas for all queue message types
packages/config/src/index.ts                                  FactEngineConfig, API key management, thresholds
packages/db/src/drizzle/schema.ts                             Table definitions (news_sources, stories, fact_records, etc.)
packages/db/src/drizzle/fact-engine-queries.ts                Pipeline query functions
packages/ai/src/challenge-content-rules.ts                    Content quality validation, banned patterns
packages/ai/src/challenge-content.ts                          AI challenge content generation function