News & Fact Engine

How Eko turns the world's news into structured, verified facts delivered as interactive challenges.

What It Does

The fact engine is a multi-stage pipeline that:

  1. Ingests articles from multiple news APIs every 15 minutes
  2. Filters articles by content length to prevent LLM hallucination from thin snippets
  3. Clusters related articles into stories using TF-IDF similarity
  4. Extracts structured facts from stories using AI
  5. Validates each fact through multi-tier verification
  6. Generates challenge content (quiz, recall, free-text) per style
  7. Resolves images through a priority cascade of free APIs
  8. Publishes to a blended feed with spaced repetition scheduling

The result: a continuously refreshed stream of verified, image-backed, challenge-ready fact cards.

Pipeline Architecture

  News APIs                    Seed Files
  (Event Registry,             (XLSX, CSV,
   Newsdata, NewsAPI,           AI super-facts)
   GNews, TheNewsAPI)
       │                            │
       ▼                            │
  ┌─────────────────┐               │
  │ news_sources    │               │
  │ (dedup via      │               │
  │  content_hash)  │               │
  │ + fullContent   │               │
  └────────┬────────┘               │
           │ quality filter         │
           │ (≥400 chars)           │
           │ CLUSTER_STORIES        │
           ▼                        │
      ┌─────────┐                   │
      │ stories │                   │
      └────┬────┘                   │
           │ EXTRACT_FACTS          │ IMPORT_FACTS
           ▼                        │
  ┌─────────────────┐◄──────────────┘
  │ fact_records    │
  │ (pending →      │──── RESOLVE_IMAGE ──────────────► image cascade
  │  validated →    │──── VALIDATE_FACT ──────────────► tiered verification
  │  published)     │──── GENERATE_CHALLENGE_CONTENT
  └────────┬────────┘
           │
           ▼
  ┌────────────────────────┐
  │ fact_challenge_content │  Pre-generated per style
  │ + RESOLVE_CHALLENGE    │  + per-challenge image resolution
  │   _IMAGE               │
  └────────────────────────┘
           │
           ▼
  User Feed (blended: 40% recent, 30% review-due, 20% evergreen, 10% explore)

News Providers

Five external news APIs feed the pipeline. V2 providers deliver full article bodies for higher-quality fact extraction; legacy providers return truncated snippets (~256 chars) and are kept for backwards compatibility.

V2 Providers (Full-Content)

| Provider | API | Auth | Free Tier | Prod Tier |
|---|---|---|---|---|
| Event Registry | newsapi.ai/api/v1 | Query: apiKey | 2,000 tokens (one-time, ~200K articles) | $90/mo 5K plan |
| Newsdata.io | newsdata.io/api/1 | Query: apikey | Title + description only | Paid tiers: full body |

Event Registry is the primary v2 provider — it returns full article bodies (avg 3,000-5,000 chars) on its free tier, enabling significantly better fact extraction than truncated snippets.

Legacy Providers (Truncated)

| Provider | API | Auth | Free Tier | Prod Tier |
|---|---|---|---|---|
| NewsAPI.org | newsapi.org/v2 | Header: X-Api-Key | 100 req/day, 24h delay | Business $449/mo |
| GNews | gnews.io/api/v4 | Query: apikey | 100 req/day, 12h delay | Essential ~$50/mo |
| TheNewsAPI | thenewsapi.com/v1 | Query: api_token | 100 req/day, 3 articles/req | Basic $19/mo |

Provider Selection Logic

The cron route (apps/web/app/api/cron/ingest-news) checks which providers have API keys configured and enqueues one INGEST_NEWS message per provider per active root-level topic category (queried with maxDepth: 0 to prevent quota explosion when subcategories exist).

V2 providers are checked first, then legacy providers:

  1. EVENT_REGISTRY_API_KEY present → enqueue event_registry
  2. NEWSDATA_API_KEY present → enqueue newsdata
  3. NEWS_API_KEY present → enqueue newsapi
  4. GOOGLE_NEWS_API_KEY present → enqueue gnews
  5. THENEWS_API_KEY present → enqueue thenewsapi
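The ordering above can be sketched as a simple table-driven check. This is illustrative only: the env-var names come from this document, but the message shape and helper names are assumptions, not the real queue client.

```typescript
// Sketch of provider selection: v2 providers listed first, then legacy.
type Provider = "event_registry" | "newsdata" | "newsapi" | "gnews" | "thenewsapi";

const PROVIDER_KEYS: Array<[string, Provider]> = [
  ["EVENT_REGISTRY_API_KEY", "event_registry"], // v2
  ["NEWSDATA_API_KEY", "newsdata"],             // v2
  ["NEWS_API_KEY", "newsapi"],                  // legacy
  ["GOOGLE_NEWS_API_KEY", "gnews"],             // legacy
  ["THENEWS_API_KEY", "thenewsapi"],            // legacy
];

function selectProviders(env: Record<string, string | undefined>): Provider[] {
  return PROVIDER_KEYS.filter(([key]) => !!env[key]).map(([, provider]) => provider);
}

// One INGEST_NEWS message per configured provider per root-level category.
function buildMessages(env: Record<string, string | undefined>, rootCategories: string[]) {
  return selectProviders(env).flatMap((provider) =>
    rootCategories.map((category) => ({ type: "INGEST_NEWS", provider, category })),
  );
}
```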

Quality Filter

All providers pass through a MIN_ARTICLE_TEXT_LENGTH filter (400 characters, measured as title.length + description.length). Articles below this threshold are discarded before database insertion to prevent the LLM from hallucinating facts from insufficient source material. This is especially important for legacy providers that return truncated content.
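A minimal sketch of the filter, using the threshold and title-plus-description measurement described above (the helper name is ours):

```typescript
const MIN_ARTICLE_TEXT_LENGTH = 400;

interface ArticleLike {
  title: string;
  description: string | null;
}

// Discard articles whose combined title + description is too thin to
// support reliable fact extraction.
function passesQualityFilter(article: ArticleLike): boolean {
  const textLength = article.title.length + (article.description?.length ?? 0);
  return textLength >= MIN_ARTICLE_TEXT_LENGTH;
}
```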

StandardArticle Contract

Every provider normalizes its API response to a StandardArticle shape before database insertion:

| Field | Type | Description |
|---|---|---|
| externalId | string | Provider-unique ID (URL hash for NewsAPI/GNews, UUID for TheNewsAPI) |
| sourceName | string \| null | Publication name (e.g., "Reuters") |
| sourceDomain | string \| null | Hostname extracted from article URL |
| title | string | Article headline |
| description | string \| null | Summary or snippet |
| articleUrl | string | Canonical article URL |
| imageUrl | string \| null | Hero image URL (resolved later if null) |
| publishedAt | Date \| null | Publication timestamp |
| contentHash | string \| null | Bun wyhash of article content for dedup |
| fullContent | string \| null | Full article body (v2 providers only; null for legacy) |

Articles are inserted with ON CONFLICT DO NOTHING on (provider, external_id) for automatic cross-request deduplication.

Story Clustering

When enough new articles arrive (threshold: 5), they are clustered into stories using TF-IDF cosine similarity within a 24-hour time window. Articles about the same event are grouped into a single story, preventing duplicate fact extraction.
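The clustering step can be sketched with a greedy single-pass grouping over term-frequency vectors. This toy version uses plain TF (no IDF weighting) and ignores the 24-hour window; the threshold value is an assumption for illustration, not the production setting.

```typescript
// Tokenize into lowercase word counts.
function termFreq(text: string): Map<string, number> {
  const tf = new Map<string, number>();
  for (const tok of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
    tf.set(tok, (tf.get(tok) ?? 0) + 1);
  }
  return tf;
}

// Cosine similarity between two sparse term vectors.
function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0, na = 0, nb = 0;
  for (const [term, w] of a) {
    na += w * w;
    if (b.has(term)) dot += w * (b.get(term) ?? 0);
  }
  for (const [, w] of b) nb += w * w;
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}

// Greedy clustering: each article joins the first cluster whose seed
// article is similar enough, otherwise it starts a new cluster.
function clusterArticles(titles: string[], threshold = 0.3): number[][] {
  const vecs = titles.map(termFreq);
  const clusters: number[][] = [];
  for (let i = 0; i < vecs.length; i++) {
    const home = clusters.find((c) => cosine(vecs[c[0]], vecs[i]) >= threshold);
    if (home) home.push(i);
    else clusters.push([i]);
  }
  return clusters;
}
```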

AI Fact Extraction

Each story is processed by AI to extract structured facts:

  • Model: Preferred models are gemini-2.5-flash, gpt-5-mini, and claude-haiku-4-5. Model routing is DB-driven via the ai_model_tier_config table for runtime switching without restarts. Each model has a dedicated ModelAdapter (see packages/ai/src/models/) that injects per-model prompt optimizations (suffix/prefix/override modes) to exploit strengths and mitigate weaknesses
  • Category Resolution: Topic categories are resolved via resolveTopicCategory() — a 3-step alias fallback (exact slug match → provider-specific alias in topic_category_aliases → universal alias). Unresolved slugs are logged to unmapped_category_log for audit
  • Schema: Facts conform to topic-specific schemas defined in fact_record_schemas
  • Output: Title, structured fact key-value pairs, context narrative, challenge title, notability score
  • Expiry: News-derived facts get a 30-day expiry; high-engagement facts auto-promote to enduring
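The 3-step alias fallback in resolveTopicCategory() can be sketched as follows. The lookup-table shapes and the composite key format are our assumptions; the real function reads topic_category_aliases from the database and logs misses to unmapped_category_log.

```typescript
// Resolve a provider-supplied category slug to a canonical topic category.
function resolveTopicCategory(
  slug: string,
  provider: string,
  knownSlugs: Set<string>,                // canonical category slugs
  providerAliases: Map<string, string>,   // key: `${provider}:${slug}`
  universalAliases: Map<string, string>,  // key: slug
): string | null {
  if (knownSlugs.has(slug)) return slug;                    // 1. exact slug match
  const byProvider = providerAliases.get(`${provider}:${slug}`);
  if (byProvider) return byProvider;                        // 2. provider-specific alias
  return universalAliases.get(slug) ?? null;                // 3. universal alias, else unmapped
}
```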

Fact Validation

Every fact passes through a 4-phase validation pipeline before reaching the public feed:

| Phase | Name | What It Does | Cost |
|---|---|---|---|
| 1 | Structural | Schema conformance, type validation, injection detection | $0 (code-only) |
| 2 | Consistency | Internal contradictions, taxonomy rule violations | $0 (code-only) |
| 3 | Cross-Model | AI adversarial verification via Gemini 2.5 Flash | ~$0.001 |
| 4 | Evidence | External API corroboration (Wikipedia, Wikidata) + AI reasoner (Gemini 2.5 Flash) | ~$0.002-0.005 |

Phases 1-2 are free code-only checks that catch ~40% of defective facts before any AI call is made. Phases 3-4 use Gemini 2.5 Flash for cross-provider verification and evidence corroboration with full phase-by-phase audit trails.

Facts progress through statuses: pending_validation → validated → published (or rejected).
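A minimal driver for the phase ordering might look like this. Types and names are illustrative; the key property it demonstrates is that a failure in a free code-only phase short-circuits the pipeline before any paid AI call, while each phase still contributes to the audit trail.

```typescript
type PhaseResult = { pass: boolean; reason?: string };
type Phase = { name: string; costUsd: number; run: (fact: unknown) => PhaseResult };

// Run phases in order; stop at the first failure.
function validateFact(fact: unknown, phases: Phase[]) {
  const audit: Array<{ phase: string; pass: boolean; reason?: string }> = [];
  let spentUsd = 0;
  for (const phase of phases) {
    const result = phase.run(fact);
    audit.push({ phase: phase.name, ...result });
    if (!result.pass) return { status: "rejected" as const, audit, spentUsd };
    spentUsd += phase.costUsd; // only count phases that actually ran to completion
  }
  return { status: "validated" as const, audit, spentUsd };
}
```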

Image Resolution

Every fact record gets an image through a priority cascade:

| Priority | Source | Coverage | Cost |
|---|---|---|---|
| 1 | Wikipedia PageImages | ~80% of named entities | Free, no key |
| 2 | TheSportsDB | Sports teams, athletes | Free key |
| 3 | Unsplash | Topical photos (landscapes, abstract) | Free key |
| 4 | Pexels | Topical photos (alternative pool) | Free key |
| 5 | null | UI shows placeholder | N/A |

Wikipedia is the primary source because facts are entity-centric and Wikipedia covers most notable entities. Unsplash and Pexels are topical fallbacks for abstract topics (e.g., "quantum computing breakthrough") where no entity-specific image exists.

Challenge-level images are resolved separately via RESOLVE_CHALLENGE_IMAGE, which runs the same cascade but stores results on fact_challenge_content rows rather than the parent fact_records row.
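The cascade itself reduces to "try each source in priority order, first hit wins." The sketch below uses stand-in resolver functions rather than the real API clients:

```typescript
type ImageResolver = (query: string) => Promise<string | null>;

// Walk the cascade in priority order; return the first non-null URL.
async function resolveImage(query: string, cascade: ImageResolver[]): Promise<string | null> {
  for (const tryResolver of cascade) {
    const url = await tryResolver(query);
    if (url) return url;   // first hit wins
  }
  return null;             // priority 5: UI shows a placeholder
}
```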

Attribution Requirements

| Source | Required Attribution |
|---|---|
| Wikipedia | "Image from Wikipedia: {page_title}" |
| TheSportsDB | "Image from TheSportsDB: {team_name}" |
| Unsplash | "Photo by {photographer} on Unsplash" |
| Pexels | "Photo by {photographer} on Pexels" |
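A discriminated union keeps the attribution formats from the table exhaustive and type-checked (the type and function names here are illustrative):

```typescript
type ImageSource =
  | { kind: "wikipedia"; pageTitle: string }
  | { kind: "thesportsdb"; teamName: string }
  | { kind: "unsplash"; photographer: string }
  | { kind: "pexels"; photographer: string };

// Build the required attribution string for an image source.
function attributionText(src: ImageSource): string {
  switch (src.kind) {
    case "wikipedia":   return `Image from Wikipedia: ${src.pageTitle}`;
    case "thesportsdb": return `Image from TheSportsDB: ${src.teamName}`;
    case "unsplash":    return `Photo by ${src.photographer} on Unsplash`;
    case "pexels":      return `Photo by ${src.photographer} on Pexels`;
  }
}
```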

Challenge Content Generation

After facts are validated, the GENERATE_CHALLENGE_CONTENT queue produces pre-generated challenge material for each style:

  • statement_blank: "_____ is the capital of France"
  • direct_question: "What is the capital of France?"
  • fill_the_gap: Sentence with masked answer
  • multiple_choice: Question with 4 options
  • reverse_lookup: Given the answer, identify the subject
  • free_text: Open-ended question with AI grading

Each piece of content includes a three-beat emotional arc: setup_text, challenge_text, and a reveal (reveal_correct or reveal_wrong, depending on the answer) — following the Eko voice constitution.

Challenge content generation uses micro-batching: the worker accumulates up to 5 queue messages over a 500ms window, then makes a single AI call per batch to amortize the ~5,200-token system prompt.
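A toy micro-batcher illustrates the accumulate-then-flush pattern: collect messages until either the size cap (5) or the time window (500 ms) is hit, then flush the batch in one call. The API shape is ours; the real worker's batching lives in the queue-consumer loop.

```typescript
function createMicroBatcher<T>(
  flush: (batch: T[]) => void,
  maxSize = 5,
  maxWaitMs = 500,
) {
  let buffer: T[] = [];
  let timer: ReturnType<typeof setTimeout> | null = null;

  const drain = () => {
    if (timer) { clearTimeout(timer); timer = null; }
    if (buffer.length) {
      const batch = buffer;
      buffer = [];
      flush(batch); // one AI call per batch amortizes the large system prompt
    }
  };

  return {
    push(msg: T) {
      buffer.push(msg);
      if (buffer.length >= maxSize) drain();                 // size trigger
      else if (!timer) timer = setTimeout(drain, maxWaitMs); // time trigger
    },
    flushNow: drain,
  };
}
```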

Evergreen Generation

Beyond news-derived facts, the pipeline generates "evergreen" knowledge facts via AI for each topic category:

  • Quota: configurable per-day (EVERGREEN_DAILY_QUOTA, default: 20)
  • Topic balance: daily quotas per category prevent content monoculture
  • These facts have no expiry and provide a stable content baseline

Feed Algorithm

The user feed blends four content streams:

| Stream | Weight | Source |
|---|---|---|
| Recent validated | 40% | Newly published facts |
| Review-due | 30% | Spaced repetition (SM-2 variant) |
| Evergreen | 20% | Knowledge facts with no expiry |
| Exploration | 10% | Random facts for discovery |

Cards are interleaved round-robin and include a userStatus badge (in-progress, review-due, mastered).
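The blend-then-interleave step can be sketched as: allocate slots per stream proportional to weight, then fill the page round-robin. This naive proportional allocation is an assumption for illustration; the production algorithm may handle rounding and exhausted streams differently.

```typescript
interface Stream<T> { name: string; weight: number; items: T[] }

// Blend streams into one feed page of `total` cards.
function blendFeed<T>(streams: Stream<T>[], total: number): T[] {
  const counts = streams.map((s) => Math.round(total * s.weight)); // slot budget per stream
  const cursors = streams.map(() => 0);
  const out: T[] = [];
  let added = true;
  while (out.length < total && added) {
    added = false;
    streams.forEach((s, i) => {
      // Round-robin: take one card per stream per pass while budget and items remain.
      if (out.length < total && counts[i] > 0 && cursors[i] < s.items.length) {
        out.push(s.items[cursors[i]++]);
        counts[i]--;
        added = true;
      }
    });
  }
  return out;
}
```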

Environment Configuration

News API Keys

| Variable | Provider | Required |
|---|---|---|
| EVENT_REGISTRY_API_KEY | Event Registry (NewsAPI.ai) | Optional (v2, recommended) |
| NEWSDATA_API_KEY | Newsdata.io | Optional (v2) |
| NEWS_API_KEY | NewsAPI.org | Optional (legacy) |
| GOOGLE_NEWS_API_KEY | GNews | Optional (legacy) |
| THENEWS_API_KEY | TheNewsAPI | Optional (legacy) |

Image API Keys

| Variable | Provider | Required |
|---|---|---|
| (none) | Wikipedia | Always available |
| (none) | TheSportsDB | Always available (free key: "3") |
| UNSPLASH_ACCESS_KEY | Unsplash | Optional |
| PEXELS_API_KEY | Pexels | Optional |

AI Provider Keys

| Variable | Provider | Required |
|---|---|---|
| OPENAI_API_KEY | OpenAI | Required (default tier) |
| ANTHROPIC_API_KEY | Anthropic | Optional (mid/high tier) |
| GOOGLE_API_KEY | Google AI | Required for validation pipeline (Gemini 2.5 Flash) |

Note: Model routing is DB-driven via ai_model_tier_config. Each model can have a dedicated ModelAdapter for per-model prompt optimization. See Model Code Isolation.

Processing Settings

| Variable | Default | Description |
|---|---|---|
| NEWS_INGESTION_INTERVAL_MINUTES | 15 | Cron polling frequency |
| FACT_EXTRACTION_BATCH_SIZE | 10 | Stories per extraction batch |
| VALIDATION_MIN_SOURCES | 2 | Minimum sources for multi_source validation |
| NOTABILITY_THRESHOLD | 0.6 | Minimum notability score (0-1) |
| EVERGREEN_DAILY_QUOTA | 20 | Evergreen facts per day |
| EVERGREEN_ENABLED | false | Enable evergreen generation |

Cost Model

Development (Free Tier)

| Component | Monthly Cost |
|---|---|
| Event Registry (free, 2,000 tokens) | $0 |
| Newsdata.io (free) | $0 |
| NewsAPI.org (free) | $0 |
| GNews (free) | $0 |
| TheNewsAPI (free) | $0 |
| All image APIs | $0 |
| Total | $0 |

With Event Registry as the primary provider, the free tier delivers high-quality full-body articles for approximately 13 months before token exhaustion.

Budget Production

| Component | Monthly Cost |
|---|---|
| Event Registry $90/mo plan | $90 |
| All image APIs | $0 |
| Total | ~$90 |

Delivers full-body articles at scale without legacy provider limitations.

Full Production

| Component | Monthly Cost |
|---|---|
| Event Registry $90/mo plan | $90 |
| NewsAPI.org Business | $449 |
| GNews Essential | ~$50 |
| TheNewsAPI Basic | $19 |
| All image APIs | $0 |
| Total | ~$610 |

Worker Architecture

Three dedicated workers process the pipeline:

| Worker | Queue Types | Purpose |
|---|---|---|
| worker-ingest | INGEST_NEWS, CLUSTER_STORIES, RESOLVE_IMAGE, RESOLVE_CHALLENGE_IMAGE | Fetch, cluster, resolve images |
| worker-facts | EXTRACT_FACTS, IMPORT_FACTS, GENERATE_EVERGREEN, EXPLODE_CATEGORY_ENTRY, FIND_SUPER_FACTS, GENERATE_CHALLENGE_CONTENT | AI extraction and content generation |
| worker-validate | VALIDATE_FACT | Multi-tier fact verification |

All workers use Bun with Upstash Redis queues, exponential backoff polling, and scale-to-zero idle exit.
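The backoff-and-idle-exit loop can be sketched as follows. The timing constants and function shapes are assumptions for illustration; the real workers poll Upstash Redis queues.

```typescript
// Poll loop: exponential backoff on empty polls, reset on work,
// and exit after a configurable idle period (scale-to-zero).
async function pollLoop<T>(
  receive: () => Promise<T | null>,
  handle: (msg: T) => Promise<void>,
  opts = { baseMs: 250, maxMs: 8_000, idleExitMs: 60_000 },
): Promise<void> {
  let delay = opts.baseMs;
  let idleSince = Date.now();
  while (true) {
    const msg = await receive();
    if (msg !== null) {
      await handle(msg);
      delay = opts.baseMs;        // reset backoff after doing work
      idleSince = Date.now();
    } else {
      if (Date.now() - idleSince >= opts.idleExitMs) return; // scale to zero
      await new Promise((resolve) => setTimeout(resolve, delay));
      delay = Math.min(delay * 2, opts.maxMs); // exponential backoff
    }
  }
}
```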

Key Files

| File | Purpose |
|---|---|
| apps/worker-ingest/src/handlers/ingest-news.ts | News provider implementations and dispatch |
| apps/worker-ingest/src/handlers/resolve-image.ts | Image resolution cascade (fact-level) |
| apps/worker-ingest/src/handlers/resolve-challenge-image.ts | Image resolution cascade (challenge-level) |
| apps/web/app/api/cron/ingest-news/route.ts | Cron that enqueues INGEST_NEWS per provider |
| packages/shared/src/schemas.ts | Queue message Zod schemas |
| packages/queue/src/index.ts | Queue client, message constructors |
| packages/config/src/index.ts | FactEngineConfig and API key management |
| packages/db/src/drizzle/schema.ts | news_sources, stories, fact_records tables |
| packages/db/src/drizzle/fact-engine-queries.ts | Pipeline query functions |