News-to-Challenge Ingestion Guide
A step-by-step walkthrough of how a breaking news article becomes a playable challenge card in Eko. This guide traces the full journey through the pipeline — from external news API to the user's feed.
The Journey at a Glance
Breaking news published "SpaceX launches Starship to Mars"
│
▼
┌─ PHASE 1: INGEST ──────────────────────────────────────────┐
│ Cron fires every 15 min → fetches from up to 5 news APIs │
│ Articles normalized to StandardArticle, quality-filtered │
│ (≥400 chars), deduped by hash, inserted into news_sources │
└──────────────────────────────────────────────┬──────────────┘
│ ≥ 5 new articles?
▼
┌─ PHASE 2: CLUSTER ─────────────────────────────────────────┐
│ TF-IDF cosine similarity within 24h window │
│ Multiple articles about the same event → one story │
│ Prevents duplicate fact extraction │
└──────────────────────────────────────────────┬──────────────┘
│
▼
┌─ PHASE 3: EXTRACT ─────────────────────────────────────────┐
│ AI reads story + linked articles │
│ Produces: title, facts{}, context, challenge_title │
│ Notability score ≥ 0.6 → insert fact_record │
└──────────────────────────────────────────────┬──────────────┘
│
▼
┌─ PHASE 4: VALIDATE ────────────────────────────────────────┐
│ Strategy picked by source type (multi_source for news) │
│ Confidence ≥ 0.7 + no critical flags → validated │
│ published_at set, RESOLVE_IMAGE + GENERATE_CHALLENGE fired │
└──────────────────────────────────────────────┬──────────────┘
│ │
▼ ▼
┌─ PHASE 5a: IMAGE ──────┐ ┌─ PHASE 5b: CHALLENGE CONTENT ┐
│ Wikipedia → SportsDB │ │ AI generates 3-5 styles: │
│ → Unsplash → Pexels │ │ MC, FTG, DQ, SB, RL, FT │
│ → null (placeholder) │ │ Each with 5-layer structure │
└─────────────────────────┘ └──────────────────────────────┘
│ │
└───────────────┬───────────────────────┘
▼
┌─ PHASE 6: FEED ────────────────────────────────────────────┐
│ Blended algorithm: 40% recent, 30% review-due, │
│ 20% evergreen, 10% exploration │
│ User sees challenge card → plays it → spaced repetition │
└────────────────────────────────────────────────────────────┘
Phase 1: News Ingestion
Trigger
A Vercel cron job fires every 15 minutes, hitting apps/web/app/api/cron/ingest-news/route.ts. The route is auth-gated via CRON_SECRET (checked in both Authorization and x-vercel-cron-authorization headers).
Provider Detection
The cron checks which news API keys are configured, prioritizing v2 full-content providers:
| Env Variable | Provider | Queue Value | Content |
|---|---|---|---|
| EVENT_REGISTRY_API_KEY | Event Registry | event_registry | Full body (v2) |
| NEWSDATA_API_KEY | Newsdata.io | newsdata | Full body on paid tiers (v2) |
| NEWS_API_KEY | NewsAPI.org | newsapi | Truncated (legacy) |
| GOOGLE_NEWS_API_KEY | GNews | gnews | Truncated (legacy) |
| THENEWS_API_KEY | TheNewsAPI | thenewsapi | Truncated (legacy) |
For each configured provider, it fetches active topic categories from the database using getActiveTopicCategories({ maxDepth: 0 }) and enqueues one INGEST_NEWS message per provider per category. The maxDepth: 0 filter restricts queries to root-level categories only, preventing quota explosion when subcategories are added to the taxonomy.
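The fan-out described above can be sketched as follows. This is an illustrative sketch, not the real route handler: the buildIngestMessages helper and its env-map argument are assumptions, while the provider names, env variable names, and max_results value come from this guide.

```typescript
// Sketch of the cron fan-out: one INGEST_NEWS message per configured
// provider per root-level topic category.
type Provider = "event_registry" | "newsdata" | "newsapi" | "gnews" | "thenewsapi";

const PROVIDER_ENV_KEYS: Record<Provider, string> = {
  event_registry: "EVENT_REGISTRY_API_KEY",
  newsdata: "NEWSDATA_API_KEY",
  newsapi: "NEWS_API_KEY",
  gnews: "GOOGLE_NEWS_API_KEY",
  thenewsapi: "THENEWS_API_KEY",
};

interface IngestNewsMessage {
  type: "INGEST_NEWS";
  payload: { provider: Provider; category: string; max_results: number };
}

function buildIngestMessages(
  env: Record<string, string | undefined>,
  rootCategories: string[], // result of getActiveTopicCategories({ maxDepth: 0 })
): IngestNewsMessage[] {
  const configured = (Object.keys(PROVIDER_ENV_KEYS) as Provider[]).filter(
    (p) => Boolean(env[PROVIDER_ENV_KEYS[p]]),
  );
  // Cartesian product: configured providers × root categories.
  return configured.flatMap((provider) =>
    rootCategories.map((category) => ({
      type: "INGEST_NEWS" as const,
      payload: { provider, category, max_results: 20 },
    })),
  );
}
```

With 2 configured providers and 3 root categories, this enqueues 6 messages; the maxDepth: 0 restriction is what keeps that product small.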
Queue Message: INGEST_NEWS
{
type: "INGEST_NEWS",
payload: {
provider: "event_registry" | "newsdata" | "newsapi" | "gnews" | "thenewsapi",
category: "science", // topic category slug
max_results: 20
}
}
Worker: worker-ingest
What the Handler Does
apps/worker-ingest/src/handlers/ingest-news.ts:
- Routes to the correct provider client (fetchFromEventRegistry, fetchFromNewsdata, fetchFromNewsApi, fetchFromGNews, fetchFromTheNewsApi)
- Each provider normalizes its response to the StandardArticle contract:
  - externalId — stable hash of the article URL (or native UUID for TheNewsAPI)
  - contentHash — Bun wyhash of article content for cross-provider dedup
  - title, description, articleUrl, imageUrl, publishedAt, sourceName, sourceDomain
  - fullContent — full article body (v2 providers only; null for legacy)
- Articles are filtered by MIN_ARTICLE_TEXT_LENGTH (400 chars, measured as title + description length). Articles below this threshold are discarded to prevent LLM hallucination from thin snippets
- Qualifying articles are bulk-inserted into news_sources with ON CONFLICT DO NOTHING on (provider, external_id), which silently skips duplicates
- If ≥ 5 new articles were inserted, a CLUSTER_STORIES message is enqueued
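The quality filter and cross-provider dedup can be sketched like this. The StandardArticle shape and the 400-char threshold follow the guide; the in-memory dedup is a stand-in for the database-side contentHash comparison.

```typescript
// Sketch of the Phase 1 quality filter and content-hash dedup.
interface StandardArticle {
  externalId: string;
  contentHash: string; // stand-in for Bun's wyhash of the content
  title: string;
  description: string;
  fullContent: string | null;
}

const MIN_ARTICLE_TEXT_LENGTH = 400;

function passesQualityFilter(a: StandardArticle): boolean {
  // Length is measured on title + description, not the full body.
  return (a.title + a.description).length >= MIN_ARTICLE_TEXT_LENGTH;
}

function dedupeByContentHash(articles: StandardArticle[]): StandardArticle[] {
  const seen = new Set<string>();
  return articles.filter((a) => {
    if (seen.has(a.contentHash)) return false;
    seen.add(a.contentHash);
    return true;
  });
}
```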
Observability
Every handler call creates an ingestion_runs record tracking status (completed/failed), counts (recordsProcessed, recordsCreated), and timing. Metrics are emitted via @eko/observability (ingest.news_sources_inserted, ingest.news_sources_duplicates).
Phase 2: Story Clustering
Queue Message: CLUSTER_STORIES
{
type: "CLUSTER_STORIES",
payload: {
news_source_ids: ["uuid-1", "uuid-2", ...],
time_window_hours: 24
}
}
Worker: worker-ingest
What Happens
Articles about the same real-world event are grouped into a single story using TF-IDF cosine similarity on article titles and descriptions, within a 24-hour time window.
- Each story gets a
headline,summary,source_count, andsource_domains[] - The
source_countis critical later: it determines validation confidence during the multi-source strategy - Stories progress through statuses:
clustering→published→archived
Why Clustering Matters
Without clustering, "SpaceX Starship Launches" reported by 15 outlets would produce 15 near-identical facts. Clustering deduplicates at the story level so AI extraction runs once per event, not once per article.
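A minimal version of the similarity measure behind clustering looks like this. It is an illustrative sketch of TF-IDF cosine similarity, not the production implementation, and it omits the 24-hour windowing.

```typescript
// TF-IDF cosine similarity over article title + description text.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

function tfidfVectors(docs: string[]): Map<string, number>[] {
  const tokenized = docs.map(tokenize);
  // Document frequency per term, for IDF weighting.
  const df = new Map<string, number>();
  for (const tokens of tokenized) {
    for (const t of new Set(tokens)) df.set(t, (df.get(t) ?? 0) + 1);
  }
  return tokenized.map((tokens) => {
    const tf = new Map<string, number>();
    for (const t of tokens) tf.set(t, (tf.get(t) ?? 0) + 1);
    const vec = new Map<string, number>();
    for (const [t, count] of tf) {
      const idf = Math.log(docs.length / (df.get(t) ?? 1)) + 1; // smoothed IDF
      vec.set(t, (count / tokens.length) * idf);
    }
    return vec;
  });
}

function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0, na = 0, nb = 0;
  for (const [t, w] of a) { dot += w * (b.get(t) ?? 0); na += w * w; }
  for (const w of b.values()) nb += w * w;
  return na && nb ? dot / (Math.sqrt(na) * Math.sqrt(nb)) : 0;
}
```

Articles whose pairwise similarity clears a threshold within the time window join the same story.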
Phase 3: Fact Extraction
Queue Message: EXTRACT_FACTS
{
type: "EXTRACT_FACTS",
payload: {
story_id: "uuid",
topic_category_id: "uuid" // optional, resolved from story if absent
}
}
Worker: worker-facts
What the Handler Does
apps/worker-facts/src/handlers/extract-facts.ts:
- Fetches the story with all its linked news_sources
- Resolves the topic category using resolveTopicCategory(), a 3-step alias fallback:
  - Exact slug match against topic_categories.slug
  - Provider-specific alias lookup in topic_category_aliases (where provider = providerName)
  - Universal alias lookup in topic_category_aliases (where provider IS NULL)
  - If no match is found: logs the unresolved slug to unmapped_category_log for audit and skips extraction

  The topic_category_aliases table allows external provider slugs (e.g., GNews's "business" or TheNewsAPI's "tech") to map to Eko's internal taxonomy without requiring 1:1 slug naming. The unmapped_category_log provides visibility into coverage gaps.
- Loads the fact_record_schemas for that topic — this defines which JSONB keys the AI must produce (e.g., for a sports topic: team, score, opponent, date)
- Concatenates article titles + descriptions into article texts
- Calls extractFactsFromStory() from @eko/ai, which returns:
  - title — factual, Wikipedia-style label
  - challengeTitle — theatrical, curiosity-provoking hook
  - facts — structured JSONB conforming to the schema
  - context — 4-8 sentence narrative (Hook → Story → Connection)
  - notabilityScore — 0.0-1.0 AI assessment
  - notabilityReason — one-sentence justification
- If notabilityScore ≥ 0.6 (configurable via NOTABILITY_THRESHOLD):
  - Inserts a fact_records row with status: 'pending_validation'
  - Sets expires_at to 30 days from now (news facts are perishable)
  - Enqueues VALIDATE_FACT with strategy multi_source
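The 3-step alias fallback can be sketched as a pure function. The table rows here are illustrative stand-ins for the real database lookups, and the function signature is an assumption, not the actual resolveTopicCategory() API.

```typescript
// Sketch of resolveTopicCategory()'s alias fallback order.
interface AliasRow { alias: string; provider: string | null; categoryId: string }

function resolveTopicCategory(
  slug: string,
  provider: string,
  categories: Map<string, string>, // topic_categories: slug -> category id
  aliases: AliasRow[],             // topic_category_aliases rows
): string | null {
  // 1. Exact slug match against topic_categories.slug
  const exact = categories.get(slug);
  if (exact) return exact;
  // 2. Provider-specific alias (provider = providerName)
  const scoped = aliases.find((a) => a.alias === slug && a.provider === provider);
  if (scoped) return scoped.categoryId;
  // 3. Universal alias (provider IS NULL)
  const universal = aliases.find((a) => a.alias === slug && a.provider === null);
  if (universal) return universal.categoryId;
  // No match: caller logs to unmapped_category_log and skips extraction.
  return null;
}
```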
The Notability Gate
The 0.6 threshold is the pipeline's first quality filter. Facts below this score are logged and discarded — they represent mundane or low-signal news that wouldn't make a good challenge. The threshold is configurable via NOTABILITY_THRESHOLD in environment config.
Phase 4: Fact Validation
Queue Message: VALIDATE_FACT
{
type: "VALIDATE_FACT",
payload: {
fact_record_id: "uuid",
strategy: "multi_source", // for news-extracted facts
retry_count: 0
}
}
Worker: worker-validate
Validation Strategy Selection
The strategy is chosen by source type at enqueue time:
| Source Type | Strategy | Confidence Range |
|---|---|---|
| news_extraction | multi_source | 0.7-0.9 (based on source count) |
| api_import | authoritative_api | 0.95 (trusted pass-through) |
| ai_generated / file_seed | ai_cross_check | 0.0-1.0 (AI plausibility check) |
| Curated databases | curated_database | 0.95 (trusted pass-through) |
Multi-Source Validation (News Path)
For news-extracted facts, confidence scales with how many independent sources corroborate the story:
- 5+ sources → confidence 0.9
- 3-4 sources → confidence 0.8
- 1-2 sources → confidence 0.7
Pass/Fail Criteria
A fact is validated when:
- confidence ≥ 0.7 AND
- No flags containing "critical"

A validated fact gets:
- status updated to 'validated'
- published_at set to NOW()
- validation JSONB populated with strategy, sources, confidence, flags
A rejected fact gets status: 'rejected' and is excluded from the feed.
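The confidence tiers and the pass/fail rule reduce to two small functions. The tier boundaries and 0.7 floor come from this guide; treating "critical" as a substring of the flag name is a reading of the rule above, not a confirmed implementation detail.

```typescript
// Sketch of multi-source confidence tiers and the validation gate.
function multiSourceConfidence(sourceCount: number): number {
  if (sourceCount >= 5) return 0.9;
  if (sourceCount >= 3) return 0.8;
  return 0.7; // 1-2 sources
}

function isValidated(confidence: number, flags: string[]): boolean {
  const hasCriticalFlag = flags.some((f) => f.includes("critical"));
  return confidence >= 0.7 && !hasCriticalFlag;
}
```

Note that a 1-source news fact still lands exactly at the 0.7 floor, so it passes unless a critical flag is raised.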
Post-Validation Fan-Out
On validation success, two independent queue messages are enqueued in parallel:
- RESOLVE_IMAGE — find a suitable image for the fact
- GENERATE_CHALLENGE_CONTENT — pre-generate challenge variants
These are independent: they write to different columns/tables and have no shared state.
Phase 5a: Image Resolution
Queue Message: RESOLVE_IMAGE
{
type: "RESOLVE_IMAGE",
payload: {
fact_record_id: "uuid",
title: "SpaceX Starship Mars Launch",
topic_path: "science/space"
}
}
Worker: worker-ingest
The Cascade
apps/worker-ingest/src/handlers/resolve-image.ts runs a priority cascade, stopping at the first hit:
| Priority | Source | How It Works | Best For |
|---|---|---|---|
| 1 | Wikipedia PageImages | MediaWiki API thumbnail lookup by entity name | Named entities (~80% hit rate) |
| 2 | TheSportsDB | Team/athlete badge/logo lookup (sports topics only) | Sports teams, athletes |
| 3 | Unsplash | Photo search by title keywords | Abstract/topical subjects |
| 4 | Pexels | Photo search by title keywords | Fallback photo pool |
| 5 | null | UI renders a themed placeholder | When nothing matches |
Wikipedia is tried first because facts are entity-centric, and Wikipedia covers most notable entities. The topic_path is used to gate TheSportsDB (only tried if the path contains "sport").
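The cascade pattern is simple to state in code: try each resolver in priority order and stop at the first non-null hit. The resolver objects below are hypothetical stand-ins for the real Wikipedia/SportsDB/Unsplash/Pexels clients; only the gating and first-hit-wins behavior come from this guide.

```typescript
// Sketch of the priority cascade in resolve-image.ts.
type ImageResolver = (title: string, topicPath: string) => Promise<string | null>;

interface CascadeStep {
  name: string;
  resolve: ImageResolver;
  gate?: (topicPath: string) => boolean; // e.g. TheSportsDB only for "sport" paths
}

async function resolveImage(
  title: string,
  topicPath: string,
  steps: CascadeStep[],
): Promise<string | null> {
  for (const step of steps) {
    if (step.gate && !step.gate(topicPath)) continue; // skip gated sources
    const url = await step.resolve(title, topicPath);
    if (url) return url; // first hit wins, later sources never called
  }
  return null; // UI renders a themed placeholder
}
```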
Idempotency
If the fact already has an image_url, the handler exits early without making any API calls.
Phase 5b: Challenge Content Generation
Queue Message: GENERATE_CHALLENGE_CONTENT
{
type: "GENERATE_CHALLENGE_CONTENT",
payload: {
fact_record_id: "uuid",
difficulty: 2 // 1-5 scale
}
}
Worker: worker-facts
What Gets Generated
apps/worker-facts/src/handlers/generate-challenge-content.ts calls generateChallengeContent() from @eko/ai, which produces content for each of the 6 pre-generated styles:
| Style | UI Pattern | Example |
|---|---|---|
| multiple_choice | Pick from A/B/C/D | "Which planet did SpaceX target?" |
| direct_question | Answer a specific question | "Where is Starship heading?" |
| fill_the_gap | Complete a sentence | "SpaceX launched Starship to ___" |
| statement_blank | Fill in a statement | "___ launched the first crewed Mars mission" |
| reverse_lookup | Identify from a description | "Which company launched a Mars mission in 2026?" |
| free_text | Open-ended response | "Why is the Starship Mars launch significant?" |
Two styles are exempt from pre-generation:
- conversational — generated in real time during multi-turn dialogue
- progressive_image_reveal — requires runtime image processing
The 5-Layer Structure
Every generated style includes:
- setup_text — 2-4 sentences of freely shared context (the "offer knowledge" layer)
- challenge_text — an invitation to answer, always using second-person address ("you")
- reveal_correct — celebrates the user's knowledge with an additional teaching detail
- reveal_wrong — teaches without punishing, includes the correct answer and context
- correct_answer — 3-6 sentence rich narrative for animated streaming display
Results are upserted into fact_challenge_content with a unique constraint on (fact_record_id, challenge_style, target_fact_key, difficulty).
Quality Enforcement
Content is validated at generation time against the Challenge Content Rules (CC and CQ rules). Key checks:
- No banned patterns ("Trivia", "Quiz", "Correct!", "Wrong!")
- setup_text must contain specific details (names, dates, numbers)
- challenge_text must contain "you" or "your"
- correct_answer must be 100+ characters with narrative depth
- Multiple choice must have exactly 4 plausible options
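The mechanical subset of these checks can be sketched as a validator that returns rule-violation flags. The flag names and the ChallengeContent shape are illustrative; the banned patterns, the "you"/"your" requirement, the 100-character floor, and the 4-option rule come from this section.

```typescript
// Sketch of generation-time quality checks for challenge content.
interface ChallengeContent {
  setup_text: string;
  challenge_text: string;
  correct_answer: string;
  options?: string[]; // multiple choice only
}

const BANNED_PATTERNS = ["Trivia", "Quiz", "Correct!", "Wrong!"];

function qualityFlags(c: ChallengeContent, style: string): string[] {
  const flags: string[] = [];
  const allText = [c.setup_text, c.challenge_text, c.correct_answer].join(" ");
  if (BANNED_PATTERNS.some((p) => allText.includes(p))) flags.push("banned_pattern");
  if (!/\byou(r)?\b/i.test(c.challenge_text)) flags.push("missing_second_person");
  if (c.correct_answer.length < 100) flags.push("answer_too_short");
  if (style === "multiple_choice" && c.options?.length !== 4) flags.push("bad_option_count");
  return flags;
}
```

The specific-details check on setup_text (names, dates, numbers) is deliberately omitted here, since it needs semantic judgment rather than a string test.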
Phase 6: Feed Delivery
Once a fact has:
- status: 'validated'
- published_at set
- Challenge content rows in fact_challenge_content (for ≥ 3 styles)
- An image (or the UI uses a placeholder)
...it enters the feed algorithm.
Feed Blending (Authenticated Users)
| Stream | Weight | Description |
|---|---|---|
| Recent validated | 40% | Newly published facts, freshest content |
| Review-due | 30% | Facts where next_review_at has passed (spaced repetition) |
| Evergreen | 20% | Timeless knowledge facts with no expiry |
| Exploration | 10% | Random facts for discovery across unfamiliar topics |
Unauthenticated users see a chronological-only feed.
Cards are interleaved round-robin and tagged with a userStatus badge: new, attempted, due, or mastered (based on streak ≥ 5).
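The blend can be sketched as per-stream quotas filled round-robin. The weights come from the table above; the quota-and-cursor mechanics below are an assumption about how the interleave might work, not the production algorithm.

```typescript
// Sketch of the blended feed: weighted quotas, round-robin interleave.
const STREAM_WEIGHTS = { recent: 0.4, reviewDue: 0.3, evergreen: 0.2, exploration: 0.1 } as const;
type StreamKey = keyof typeof STREAM_WEIGHTS;

function blendFeed(streams: Record<StreamKey, string[]>, pageSize: number): string[] {
  const keys = Object.keys(STREAM_WEIGHTS) as StreamKey[];
  // Per-stream quota for this page, proportional to weight.
  const quotas = {} as Record<StreamKey, number>;
  const cursors = {} as Record<StreamKey, number>;
  for (const k of keys) {
    quotas[k] = Math.round(pageSize * STREAM_WEIGHTS[k]);
    cursors[k] = 0;
  }
  const picked: string[] = [];
  let progressed = true;
  // Round-robin until quotas are spent, the page is full, or streams run dry.
  while (picked.length < pageSize && progressed) {
    progressed = false;
    for (const k of keys) {
      if (quotas[k] > 0 && cursors[k] < streams[k].length && picked.length < pageSize) {
        picked.push(streams[k][cursors[k]++]);
        quotas[k]--;
        progressed = true;
      }
    }
  }
  return picked;
}
```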
Alternative Entry Paths
The pipeline above covers the news path — articles from external APIs. Two other paths feed into the same fact_records table:
Evergreen Generation
GENERATE_EVERGREEN messages produce timeless knowledge facts via AI for each topic category. These facts:
- Have no expires_at (they never expire)
- Use ai_cross_check validation (no news sources to corroborate)
- Are deduped against existing fact titles for the topic
- Are quota-controlled: EVERGREEN_DAILY_QUOTA per day, balanced across categories
Seed Pipeline (Bulk Import)
The seed pipeline is for manually curated content:
- EXPLODE_CATEGORY_ENTRY — takes a seed entry (e.g., "Prince") and AI-generates multiple structured facts, discovers spin-off entities, and identifies super-fact candidates
- IMPORT_FACTS — bulk-imports structured facts from any source (seed files, external APIs, manual entry)
Both paths converge at fact_records with status: 'pending_validation', then follow the same validation → image → challenge content pipeline as news-extracted facts.
Queue Architecture
All queue messages flow through Upstash Redis, managed by packages/queue/src/index.ts.
Queue Names
Each message type has a dedicated queue (with optional soak-test suffix):
queue:ingest_news
queue:cluster_stories
queue:extract_facts
queue:import_facts
queue:validate_fact
queue:generate_evergreen
queue:resolve_image
queue:resolve_challenge_image
queue:explode_category_entry
queue:find_super_facts
queue:generate_challenge_content
Failure Handling
- Messages are retried up to 3 times (MAX_ATTEMPTS)
- Failed messages are moved to a dead-letter queue (:dlq suffix)
- Each message has a 2-minute lease duration to prevent double-processing
- Workers use exponential backoff polling and scale-to-zero idle exit
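The retry/DLQ policy can be sketched as follows. The in-memory dead-letter array and processWithRetry helper are stand-ins for the Upstash Redis client; only MAX_ATTEMPTS = 3 and the :dlq routing come from this guide.

```typescript
// Sketch of the retry -> dead-letter policy for queue messages.
const MAX_ATTEMPTS = 3;

interface QueueMessage { type: string; payload: unknown; attempts: number }

async function processWithRetry(
  msg: QueueMessage,
  handler: (m: QueueMessage) => Promise<void>,
  deadLetter: QueueMessage[], // stand-in for queue:<name>:dlq
): Promise<"ok" | "retried" | "dead-lettered"> {
  try {
    await handler(msg);
    return "ok";
  } catch {
    msg.attempts += 1;
    if (msg.attempts >= MAX_ATTEMPTS) {
      deadLetter.push(msg);
      return "dead-lettered";
    }
    // Re-enqueued; lease expiry makes it visible to workers again.
    return "retried";
  }
}
```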
Worker Assignment
| Worker | Queues Consumed |
|---|---|
worker-ingest | INGEST_NEWS, CLUSTER_STORIES, RESOLVE_IMAGE, RESOLVE_CHALLENGE_IMAGE |
worker-facts | EXTRACT_FACTS, IMPORT_FACTS, GENERATE_EVERGREEN, EXPLODE_CATEGORY_ENTRY, FIND_SUPER_FACTS, GENERATE_CHALLENGE_CONTENT |
worker-validate | VALIDATE_FACT |
Timing: End-to-End Latency
For a breaking news story, the typical path is:
| Phase | Typical Duration | Bottleneck |
|---|---|---|
| Cron → INGEST_NEWS | 0-15 min | Cron interval |
| Fetch articles | 2-5 sec | API response time |
| Clustering | 1-3 sec | TF-IDF computation |
| AI extraction | 5-15 sec | LLM inference |
| Validation | < 1 sec | Multi-source is synchronous |
| Image resolution | 2-10 sec | Wikipedia API (first try) |
| Challenge generation | 10-30 sec | LLM inference for 6 styles |
| Total | ~1-16 minutes | Dominated by cron interval |
Once a fact hits the feed, it's available to users within the next feed refresh.
Key Files Reference
| File | Role in Pipeline |
|---|---|
apps/web/app/api/cron/ingest-news/route.ts | Cron trigger, provider detection, queue dispatch |
apps/worker-ingest/src/handlers/ingest-news.ts | Provider clients (5 providers), StandardArticle normalization, quality filter, dedup |
apps/worker-ingest/src/handlers/resolve-image.ts | Image cascade for fact records (Wikipedia → SportsDB → Unsplash → Pexels) |
apps/worker-ingest/src/handlers/resolve-challenge-image.ts | Image cascade for challenge content rows |
apps/worker-facts/src/handlers/extract-facts.ts | AI fact extraction, notability gating |
apps/worker-facts/src/handlers/generate-evergreen.ts | AI evergreen fact generation |
apps/worker-facts/src/handlers/generate-challenge-content.ts | AI challenge content per style |
apps/worker-facts/src/handlers/explode-entry.ts | Seed entry explosion, spinoff discovery |
apps/worker-facts/src/handlers/import-facts.ts | Bulk fact import, validation strategy selection |
apps/worker-validate/src/handlers/validate-fact.ts | Tiered validation, post-validation fan-out |
packages/queue/src/index.ts | Queue client, message constructors, DLQ routing |
packages/shared/src/schemas.ts | Zod schemas for all queue message types |
packages/config/src/index.ts | FactEngineConfig, API key management, thresholds |
packages/db/src/drizzle/schema.ts | Table definitions (news_sources, stories, fact_records, etc.) |
packages/db/src/drizzle/fact-engine-queries.ts | Pipeline query functions |
packages/ai/src/challenge-content-rules.ts | Content quality validation, banned patterns |
packages/ai/src/challenge-content.ts | AI challenge content generation function |
Related Documents
- News & Fact Engine — System reference (providers, costs, config)
- Fact-Challenge Anatomy — How facts become challenges (6 concepts, 5 layers)
- Challenge Content Rules — Quality rules for generated content
- Table Flow Diagram — Visual table relationships