News & Fact Engine

How Eko turns the world's news into structured, verified facts delivered as interactive challenges.

What It Does

The fact engine is a multi-stage pipeline that:

  1. Ingests articles from multiple news APIs every 15 minutes
  2. Filters articles by content length to prevent LLM hallucination from thin snippets
  3. Clusters related articles into stories using TF-IDF similarity
  4. Extracts structured facts from stories using AI
  5. Validates each fact through multi-tier verification
  6. Generates challenge content (quiz, recall, free-text) per style
  7. Resolves images through a priority cascade of free APIs
  8. Publishes to a blended feed with spaced repetition scheduling

The result: a continuously refreshed stream of verified, image-backed, challenge-ready fact cards.

Pipeline Architecture

  News APIs                    Seed Files
  (Event Registry,             (XLSX, CSV,
   Newsdata, NewsAPI,           AI super-facts)
   GNews, TheNewsAPI)
       │                            │
       ▼                            │
  ┌─────────────────┐               │
  │ news_sources    │               │
  │ (dedup via      │               │
  │  content_hash)  │               │
  │ + fullContent   │               │
  └────────┬────────┘               │
           │ quality filter         │
           │ (≥400 chars)           │
           │ CLUSTER_STORIES        │
           ▼                        │
      ┌─────────┐                   │
      │ stories │                   │
      └────┬────┘                   │
           │ EXTRACT_FACTS          │ IMPORT_FACTS
           ▼                        │
  ┌─────────────────┐◄──────────────┘
  │ fact_records    │
  │ (pending →      │──── RESOLVE_IMAGE ──────────────► image cascade
  │  validated →    │──── VALIDATE_FACT ──────────────► tiered verification
  │  published)     │──── GENERATE_CHALLENGE_CONTENT
  └────────┬────────┘
           │
           ▼
  ┌────────────────────────┐
  │ fact_challenge_content │  Pre-generated per style
  │ + RESOLVE_CHALLENGE    │  + per-challenge image resolution
  │   _IMAGE               │
  └────────────────────────┘
           │
           ▼
  User Feed (blended: 40% recent, 30% review-due, 20% evergreen, 10% explore)

News Providers

Five external news APIs feed the pipeline. V2 providers deliver full article bodies for higher-quality fact extraction; legacy providers return truncated snippets (~256 chars) and are kept for backwards compatibility.

V2 Providers (Full-Content)

| Provider | API | Auth | Free Tier | Prod Tier |
|---|---|---|---|---|
| Event Registry | newsapi.ai/api/v1 | Query: apiKey | 2,000 tokens (one-time, ~200K articles) | $90/mo 5K plan |
| Newsdata.io | newsdata.io/api/1 | Query: apikey | Title + description only | Paid tiers: full body |

Event Registry is the primary v2 provider — it returns full article bodies (avg 3,000-5,000 chars) on its free tier, enabling significantly better fact extraction than truncated snippets.

Legacy Providers (Truncated)

| Provider | API | Auth | Free Tier | Prod Tier |
|---|---|---|---|---|
| NewsAPI.org | newsapi.org/v2 | Header: X-Api-Key | 100 req/day, 24h delay | Business $449/mo |
| GNews | gnews.io/api/v4 | Query: apikey | 100 req/day, 12h delay | Essential ~$50/mo |
| TheNewsAPI | thenewsapi.com/v1 | Query: api_token | 100 req/day, 3 articles/req | Basic $19/mo |

Provider Selection Logic

The cron route (apps/web/app/api/cron/ingest-news) checks which providers have API keys configured and enqueues one INGEST_NEWS message per provider per active root-level topic category (queried with maxDepth: 0 to prevent quota explosion when subcategories exist).

V2 providers are checked first, then legacy providers:

  1. EVENT_REGISTRY_API_KEY present → enqueue event_registry
  2. NEWSDATA_API_KEY present → enqueue newsdata
  3. NEWS_API_KEY present → enqueue newsapi
  4. GOOGLE_NEWS_API_KEY present → enqueue gnews
  5. THENEWS_API_KEY present → enqueue thenewsapi
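The ordering above can be sketched as a simple table-driven check. This is illustrative only: the env-var names come from this document, but the message shape and helper names are assumptions, not the real queue client.

```typescript
// Sketch of provider selection: v2 providers listed first, then legacy.
type Provider = "event_registry" | "newsdata" | "newsapi" | "gnews" | "thenewsapi";

const PROVIDER_KEYS: Array<[string, Provider]> = [
  ["EVENT_REGISTRY_API_KEY", "event_registry"], // v2
  ["NEWSDATA_API_KEY", "newsdata"],             // v2
  ["NEWS_API_KEY", "newsapi"],                  // legacy
  ["GOOGLE_NEWS_API_KEY", "gnews"],             // legacy
  ["THENEWS_API_KEY", "thenewsapi"],            // legacy
];

function selectProviders(env: Record<string, string | undefined>): Provider[] {
  return PROVIDER_KEYS.filter(([key]) => !!env[key]).map(([, provider]) => provider);
}

// One INGEST_NEWS message per configured provider per root-level category.
function buildMessages(env: Record<string, string | undefined>, rootCategories: string[]) {
  return selectProviders(env).flatMap((provider) =>
    rootCategories.map((category) => ({ type: "INGEST_NEWS", provider, category })),
  );
}
```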

Quality Filter

All providers pass through a MIN_ARTICLE_TEXT_LENGTH filter (400 characters, measured as title.length + description.length). Articles below this threshold are discarded before database insertion to prevent the LLM from hallucinating facts from insufficient source material. This is especially important for legacy providers that return truncated content.
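A minimal sketch of the filter, using the threshold and title-plus-description measurement described above (the helper name is ours):

```typescript
const MIN_ARTICLE_TEXT_LENGTH = 400;

interface ArticleLike {
  title: string;
  description: string | null;
}

// Discard articles whose combined title + description is too thin to
// support reliable fact extraction.
function passesQualityFilter(article: ArticleLike): boolean {
  const textLength = article.title.length + (article.description?.length ?? 0);
  return textLength >= MIN_ARTICLE_TEXT_LENGTH;
}
```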

StandardArticle Contract

Every provider normalizes its API response to a StandardArticle shape before database insertion:

| Field | Type | Description |
|---|---|---|
| externalId | string | Provider-unique ID (URL hash for NewsAPI/GNews, UUID for TheNewsAPI) |
| sourceName | string \| null | Publication name (e.g., "Reuters") |
| sourceDomain | string \| null | Hostname extracted from article URL |
| title | string | Article headline |
| description | string \| null | Summary or snippet |
| articleUrl | string | Canonical article URL |
| imageUrl | string \| null | Hero image URL (resolved later if null) |
| publishedAt | Date \| null | Publication timestamp |
| contentHash | string \| null | Bun wyhash of article content for dedup |
| fullContent | string \| null | Full article body (v2 providers only; null for legacy) |

Articles are inserted with ON CONFLICT DO NOTHING on (provider, external_id) for automatic cross-request deduplication.

Story Clustering

When enough new articles arrive (threshold: 5), they are clustered into stories using TF-IDF cosine similarity within a 24-hour time window. Articles about the same event are grouped into a single story, preventing duplicate fact extraction.
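The clustering step can be sketched with a greedy single-pass grouping over term-frequency vectors. This toy version uses plain TF (no IDF weighting) and ignores the 24-hour window; the threshold value is an assumption for illustration, not the production setting.

```typescript
// Tokenize into lowercase word counts.
function termFreq(text: string): Map<string, number> {
  const tf = new Map<string, number>();
  for (const tok of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
    tf.set(tok, (tf.get(tok) ?? 0) + 1);
  }
  return tf;
}

// Cosine similarity between two sparse term vectors.
function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0, na = 0, nb = 0;
  for (const [term, w] of a) {
    na += w * w;
    if (b.has(term)) dot += w * (b.get(term) ?? 0);
  }
  for (const [, w] of b) nb += w * w;
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}

// Greedy clustering: each article joins the first cluster whose seed
// article is similar enough, otherwise it starts a new cluster.
function clusterArticles(titles: string[], threshold = 0.3): number[][] {
  const vecs = titles.map(termFreq);
  const clusters: number[][] = [];
  for (let i = 0; i < vecs.length; i++) {
    const home = clusters.find((c) => cosine(vecs[c[0]], vecs[i]) >= threshold);
    if (home) home.push(i);
    else clusters.push([i]);
  }
  return clusters;
}
```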

AI Fact Extraction

Each story is processed by AI to extract structured facts:

  • Model: Preferred models are gemini-2.5-flash, gpt-5-mini, and claude-haiku-4-5. Model routing is DB-driven via the ai_model_tier_config table for runtime switching without restarts. Each model has a dedicated ModelAdapter (see packages/ai/src/models/) that injects per-model prompt optimizations (suffix/prefix/override modes) to exploit strengths and mitigate weaknesses
  • Category Resolution: Topic categories are resolved via resolveTopicCategory() — a 3-step alias fallback (exact slug match → provider-specific alias in topic_category_aliases → universal alias). Unresolved slugs are logged to unmapped_category_log for audit
  • Schema: Facts conform to topic-specific schemas defined in fact_record_schemas
  • Output: Title, structured fact key-value pairs, context narrative, challenge title, notability score
  • Expiry: News-derived facts get a 30-day expiry; high-engagement facts auto-promote to enduring
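The 3-step alias fallback in resolveTopicCategory() can be sketched as follows. The lookup-table shapes and the composite key format are our assumptions; the real function reads topic_category_aliases from the database and logs misses to unmapped_category_log.

```typescript
// Resolve a provider-supplied category slug to a canonical topic category.
function resolveTopicCategory(
  slug: string,
  provider: string,
  knownSlugs: Set<string>,                // canonical category slugs
  providerAliases: Map<string, string>,   // key: `${provider}:${slug}`
  universalAliases: Map<string, string>,  // key: slug
): string | null {
  if (knownSlugs.has(slug)) return slug;                    // 1. exact slug match
  const byProvider = providerAliases.get(`${provider}:${slug}`);
  if (byProvider) return byProvider;                        // 2. provider-specific alias
  return universalAliases.get(slug) ?? null;                // 3. universal alias, else unmapped
}
```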

Fact Validation

Every fact passes through a 4-phase validation pipeline before reaching the public feed:

| Phase | Name | What It Does | Cost |
|---|---|---|---|
| 1 | Structural | Schema conformance, type validation, injection detection | $0 (code-only) |
| 2 | Consistency | Internal contradictions, taxonomy rule violations | $0 (code-only) |
| 3 | Cross-Model | AI adversarial verification via Gemini 2.5 Flash | ~$0.001 |
| 4 | Evidence | External API corroboration (Wikipedia, Wikidata) + AI reasoner (Gemini 2.5 Flash) | ~$0.002-0.005 |

Phases 1-2 are free code-only checks that catch ~40% of defective facts before any AI call is made. Phases 3-4 use Gemini 2.5 Flash for cross-provider verification and evidence corroboration with full phase-by-phase audit trails.

Facts progress through statuses: pending_validation → validated → published (or rejected).
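A minimal driver for the phase ordering might look like this. Types and names are illustrative; the key property it demonstrates is that a failure in a free code-only phase short-circuits the pipeline before any paid AI call, while each phase still contributes to the audit trail.

```typescript
type PhaseResult = { pass: boolean; reason?: string };
type Phase = { name: string; costUsd: number; run: (fact: unknown) => PhaseResult };

// Run phases in order; stop at the first failure.
function validateFact(fact: unknown, phases: Phase[]) {
  const audit: Array<{ phase: string; pass: boolean; reason?: string }> = [];
  let spentUsd = 0;
  for (const phase of phases) {
    const result = phase.run(fact);
    audit.push({ phase: phase.name, ...result });
    if (!result.pass) return { status: "rejected" as const, audit, spentUsd };
    spentUsd += phase.costUsd; // only count phases that actually ran to completion
  }
  return { status: "validated" as const, audit, spentUsd };
}
```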

Image Resolution

Every fact record gets an image through a priority cascade:

| Priority | Source | Coverage | Cost |
|---|---|---|---|
| 1 | Wikipedia PageImages | ~80% of named entities | Free, no key |
| 2 | TheSportsDB | Sports teams, athletes | Free key |
| 3 | Unsplash | Topical photos (landscapes, abstract) | Free key |
| 4 | Pexels | Topical photos (alternative pool) | Free key |
| 5 | null | UI shows placeholder | N/A |

Wikipedia is the primary source because facts are entity-centric and Wikipedia covers most notable entities. Unsplash and Pexels are topical fallbacks for abstract topics (e.g., "quantum computing breakthrough") where no entity-specific image exists.

Challenge-level images are resolved separately via RESOLVE_CHALLENGE_IMAGE, which runs the same cascade but stores results on fact_challenge_content rows rather than the parent fact_records row.
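The cascade itself reduces to "try each source in priority order, first hit wins." The sketch below uses stand-in resolver functions rather than the real API clients:

```typescript
type ImageResolver = (query: string) => Promise<string | null>;

// Walk the cascade in priority order; return the first non-null URL.
async function resolveImage(query: string, cascade: ImageResolver[]): Promise<string | null> {
  for (const tryResolver of cascade) {
    const url = await tryResolver(query);
    if (url) return url;   // first hit wins
  }
  return null;             // priority 5: UI shows a placeholder
}
```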

Attribution Requirements

| Source | Required Attribution |
|---|---|
| Wikipedia | "Image from Wikipedia: {page_title}" |
| TheSportsDB | "Image from TheSportsDB: {team_name}" |
| Unsplash | "Photo by {photographer} on Unsplash" |
| Pexels | "Photo by {photographer} on Pexels" |
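A discriminated union keeps the attribution formats from the table exhaustive and type-checked (the type and function names here are illustrative):

```typescript
type ImageSource =
  | { kind: "wikipedia"; pageTitle: string }
  | { kind: "thesportsdb"; teamName: string }
  | { kind: "unsplash"; photographer: string }
  | { kind: "pexels"; photographer: string };

// Build the required attribution string for an image source.
function attributionText(src: ImageSource): string {
  switch (src.kind) {
    case "wikipedia":   return `Image from Wikipedia: ${src.pageTitle}`;
    case "thesportsdb": return `Image from TheSportsDB: ${src.teamName}`;
    case "unsplash":    return `Photo by ${src.photographer} on Unsplash`;
    case "pexels":      return `Photo by ${src.photographer} on Pexels`;
  }
}
```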

Challenge Content Generation

After facts are validated, the GENERATE_CHALLENGE_CONTENT queue produces pre-generated challenge material for each style:

  • statement_blank: "_____ is the capital of France"
  • direct_question: "What is the capital of France?"
  • fill_the_gap: Sentence with masked answer
  • multiple_choice: Question with 4 options
  • reverse_lookup: Given the answer, identify the subject
  • free_text: Open-ended question with AI grading

Each piece of content includes a three-beat emotional arc: setup_text, challenge_text, and a reveal (reveal_correct or reveal_wrong, depending on the answer) — following the Eko voice constitution.

Challenge content generation uses micro-batching: the worker accumulates up to 5 queue messages over a 500ms window, then makes a single AI call per batch to amortize the ~5,200-token system prompt.
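A toy micro-batcher illustrates the accumulate-then-flush pattern: collect messages until either the size cap (5) or the time window (500 ms) is hit, then flush the batch in one call. The API shape is ours; the real worker's batching lives in the queue-consumer loop.

```typescript
function createMicroBatcher<T>(
  flush: (batch: T[]) => void,
  maxSize = 5,
  maxWaitMs = 500,
) {
  let buffer: T[] = [];
  let timer: ReturnType<typeof setTimeout> | null = null;

  const drain = () => {
    if (timer) { clearTimeout(timer); timer = null; }
    if (buffer.length) {
      const batch = buffer;
      buffer = [];
      flush(batch); // one AI call per batch amortizes the large system prompt
    }
  };

  return {
    push(msg: T) {
      buffer.push(msg);
      if (buffer.length >= maxSize) drain();                 // size trigger
      else if (!timer) timer = setTimeout(drain, maxWaitMs); // time trigger
    },
    flushNow: drain,
  };
}
```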

Evergreen Generation

Beyond news-derived facts, the pipeline generates "evergreen" knowledge facts via AI for each topic category:

  • Quota: configurable per-day (EVERGREEN_DAILY_QUOTA, default: 20)
  • Topic balance: daily quotas per category prevent content monoculture
  • These facts have no expiry and provide a stable content baseline

Feed Algorithm

The user feed blends four content streams:

| Stream | Weight | Source |
|---|---|---|
| Recent validated | 40% | Newly published facts |
| Review-due | 30% | Spaced repetition (SM-2 variant) |
| Evergreen | 20% | Knowledge facts with no expiry |
| Exploration | 10% | Random facts for discovery |

Cards are interleaved round-robin and include a userStatus badge (in-progress, review-due, mastered).
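The blend-then-interleave step can be sketched as: allocate slots per stream proportional to weight, then fill the page round-robin. This naive proportional allocation is an assumption for illustration; the production algorithm may handle rounding and exhausted streams differently.

```typescript
interface Stream<T> { name: string; weight: number; items: T[] }

// Blend streams into one feed page of `total` cards.
function blendFeed<T>(streams: Stream<T>[], total: number): T[] {
  const counts = streams.map((s) => Math.round(total * s.weight)); // slot budget per stream
  const cursors = streams.map(() => 0);
  const out: T[] = [];
  let added = true;
  while (out.length < total && added) {
    added = false;
    streams.forEach((s, i) => {
      // Round-robin: take one card per stream per pass while budget and items remain.
      if (out.length < total && counts[i] > 0 && cursors[i] < s.items.length) {
        out.push(s.items[cursors[i]++]);
        counts[i]--;
        added = true;
      }
    });
  }
  return out;
}
```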

Environment Configuration

News API Keys

| Variable | Provider | Required |
|---|---|---|
| EVENT_REGISTRY_API_KEY | Event Registry (NewsAPI.ai) | Optional (v2, recommended) |
| NEWSDATA_API_KEY | Newsdata.io | Optional (v2) |
| NEWS_API_KEY | NewsAPI.org | Optional (legacy) |
| GOOGLE_NEWS_API_KEY | GNews | Optional (legacy) |
| THENEWS_API_KEY | TheNewsAPI | Optional (legacy) |

Image API Keys

| Variable | Provider | Required |
|---|---|---|
| (none) | Wikipedia | Always available |
| (none) | TheSportsDB | Always available (free key: "3") |
| UNSPLASH_ACCESS_KEY | Unsplash | Optional |
| PEXELS_API_KEY | Pexels | Optional |

AI Provider Keys

| Variable | Provider | Required |
|---|---|---|
| OPENAI_API_KEY | OpenAI | Required (default tier) |
| ANTHROPIC_API_KEY | Anthropic | Optional (mid/high tier) |
| GOOGLE_API_KEY | Google AI | Required for validation pipeline (Gemini 2.5 Flash) |

Note: Model routing is DB-driven via ai_model_tier_config. Each model can have a dedicated ModelAdapter for per-model prompt optimization. See Model Code Isolation.

Processing Settings

| Variable | Default | Description |
|---|---|---|
| NEWS_INGESTION_INTERVAL_MINUTES | 15 | Cron polling frequency |
| FACT_EXTRACTION_BATCH_SIZE | 10 | Stories per extraction batch |
| VALIDATION_MIN_SOURCES | 2 | Minimum sources for multi_source validation |
| NOTABILITY_THRESHOLD | 0.6 | Minimum notability score (0-1) |
| EVERGREEN_DAILY_QUOTA | 20 | Evergreen facts per day |
| EVERGREEN_ENABLED | false | Enable evergreen generation |

Cost Model

Development (Free Tier)

| Component | Monthly Cost |
|---|---|
| Event Registry (free, 2,000 tokens) | $0 |
| Newsdata.io (free) | $0 |
| NewsAPI.org (free) | $0 |
| GNews (free) | $0 |
| TheNewsAPI (free) | $0 |
| All image APIs | $0 |
| Total | $0 |

With Event Registry as the primary provider, the free tier delivers high-quality full-body articles for approximately 13 months before token exhaustion.

Budget Production

| Component | Monthly Cost |
|---|---|
| Event Registry $90/mo plan | $90 |
| All image APIs | $0 |
| Total | ~$90 |

Delivers full-body articles at scale without legacy provider limitations.

Full Production

| Component | Monthly Cost |
|---|---|
| Event Registry $90/mo plan | $90 |
| NewsAPI.org Business | $449 |
| GNews Essential | ~$50 |
| TheNewsAPI Basic | $19 |
| All image APIs | $0 |
| Total | ~$610 |

Worker Architecture

Three dedicated workers process the pipeline:

| Worker | Queue Types | Purpose |
|---|---|---|
| worker-ingest | INGEST_NEWS, CLUSTER_STORIES, RESOLVE_IMAGE, RESOLVE_CHALLENGE_IMAGE | Fetch, cluster, resolve images |
| worker-facts | EXTRACT_FACTS, IMPORT_FACTS, GENERATE_EVERGREEN, EXPLODE_CATEGORY_ENTRY, FIND_SUPER_FACTS, GENERATE_CHALLENGE_CONTENT | AI extraction and content generation |
| worker-validate | VALIDATE_FACT | Multi-tier fact verification |

All workers use Bun with Upstash Redis queues, exponential backoff polling, and scale-to-zero idle exit.
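The backoff-and-idle-exit loop can be sketched as follows. The timing constants and function shapes are assumptions for illustration; the real workers poll Upstash Redis queues.

```typescript
// Poll loop: exponential backoff on empty polls, reset on work,
// and exit after a configurable idle period (scale-to-zero).
async function pollLoop<T>(
  receive: () => Promise<T | null>,
  handle: (msg: T) => Promise<void>,
  opts = { baseMs: 250, maxMs: 8_000, idleExitMs: 60_000 },
): Promise<void> {
  let delay = opts.baseMs;
  let idleSince = Date.now();
  while (true) {
    const msg = await receive();
    if (msg !== null) {
      await handle(msg);
      delay = opts.baseMs;        // reset backoff after doing work
      idleSince = Date.now();
    } else {
      if (Date.now() - idleSince >= opts.idleExitMs) return; // scale to zero
      await new Promise((resolve) => setTimeout(resolve, delay));
      delay = Math.min(delay * 2, opts.maxMs); // exponential backoff
    }
  }
}
```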

Key Files

| File | Purpose |
|---|---|
| apps/worker-ingest/src/handlers/ingest-news.ts | News provider implementations and dispatch |
| apps/worker-ingest/src/handlers/resolve-image.ts | Image resolution cascade (fact-level) |
| apps/worker-ingest/src/handlers/resolve-challenge-image.ts | Image resolution cascade (challenge-level) |
| apps/web/app/api/cron/ingest-news/route.ts | Cron that enqueues INGEST_NEWS per provider |
| packages/shared/src/schemas.ts | Queue message Zod schemas |
| packages/queue/src/index.ts | Queue client, message constructors |
| packages/config/src/index.ts | FactEngineConfig and API key management |
| packages/db/src/drizzle/schema.ts | news_sources, stories, fact_records tables |
| packages/db/src/drizzle/fact-engine-queries.ts | Pipeline query functions |