Eko Product Bible
The "read this first" document for anyone joining the Eko team — engineers, designers, marketers, and business stakeholders.
Eko is a knowledge platform that builds verified, structured fact cards from multiple sources — breaking news, AI-generated evergreen knowledge, and curated seed content. Users learn through interactive challenges — quizzes, recall exercises, and conversational AI sessions — powered by spaced repetition. Think of it like a factory with three assembly lines: one processes raw news, one generates timeless knowledge, and one bootstraps new topic areas — all producing the same high-quality, verified knowledge cards.
The core loop: sources → facts → validation → cards → learning.
Three primary content pipelines feed Eko:
| Pipeline | Source Type | What it produces | Trigger |
|---|---|---|---|
| News | news_extraction | Facts derived from clustered news articles | Cron-driven (every 15 min) |
| Evergreen | ai_generated | Timeless knowledge facts not tied to current events | Cron-driven (daily) |
| Seed | file_seed, spinoff_discovery, ai_super_fact | Bootstrapped facts for new topic categories | Manual / on-demand |
All three pipelines converge at the same point: every fact goes through validation, image resolution, and challenge generation before reaching the feed.
Who uses Eko and why:
| Audience | What they get |
|---|---|
| End users | A daily feed of verified knowledge cards with quizzes, recall, and AI challenges |
| Content team | Seeding tools to bootstrap new topic categories with high-quality facts |
| Engineers | A well-structured pipeline with clear ownership, CI enforcement, and specialized agents |
| Business | Subscription-gated detail pages (Free tier = feed; Eko+ = full card detail and interactions) |
1. The Pipeline — How a Fact Is Born
Facts enter Eko through three pipelines — news, evergreen, and seed — but all converge into the same validation → image → challenge → feed path. Here is the full picture.
News APIs ─┐
├──▶ [INGEST_NEWS] ──▶ worker-ingest ──▶ news_sources table
│ │
│ ┌─────────────────────┘
│ ▼
│ [CLUSTER_STORIES] ──▶ worker-ingest ──▶ story_clusters
│ │
│ ┌─────────────────────┘
│ ▼
│ [EXTRACT_FACTS] ──▶ worker-facts ──▶ fact_records
│ │
│ ┌───────────┬───────────────┘
│ ▼ ▼
│ [VALIDATE_FACT] [RESOLVE_IMAGE]
│ │ │
│ ▼ ▼
│ worker-validate worker-ingest
│ │ │
│ ▼ ▼
│ fact verified image cached
│ │
│ ▼
│ [GENERATE_CHALLENGE_CONTENT] ──▶ worker-facts
│ │
│ ▼
│ fact_challenge_content (6 styles × 5 difficulties)
│
Seed Data ─┤
├──▶ [EXPLODE_CATEGORY_ENTRY] ──▶ worker-facts ──▶ fact_records
├──▶ [FIND_SUPER_FACTS] ──▶ worker-facts ──▶ cross-correlations
└──▶ [GENERATE_CHALLENGE_CONTENT] ──▶ worker-facts ──▶ challenge_content
Evergreen ────▶ [GENERATE_EVERGREEN] ──▶ worker-facts ──▶ fact_records
Pipeline A: News Facts (source_type = news_extraction)
Current-events facts derived from real-time news articles.
1a. Ingestion
Raw articles are fetched from news APIs (NewsAPI, GNews, TheNewsAPI) and stored in news_sources.
- Queue: `INGEST_NEWS` → consumed by `worker-ingest`
- Trigger: `cron/ingest-news` (intended every 15 minutes)
- What happens: The cron dispatches one queue message per provider × active root-level topic category (queried with `maxDepth: 0` to prevent quota explosion when subcategories exist). The worker fetches articles, deduplicates by URL and `content_hash`, and inserts into `news_sources`.
What happens if... the news API is down? The queue message fails, backs off exponentially (5s → 30s → 60s cap, 15% jitter), and retries up to 3 times. After 3 failures it moves to the dead-letter queue (DLQ). No data loss — the next cron cycle dispatches fresh messages.
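The retry policy above (3 attempts, exponential backoff with jitter, then DLQ) can be sketched as follows; function and constant names are illustrative, not the actual queue package API:

```typescript
// Sketch of the retry policy: 5s → 30s → 60s cap with ±15% jitter, DLQ after 3 attempts.
const BASE_DELAYS_MS = [5_000, 30_000, 60_000];
const MAX_ATTEMPTS = 3;
const JITTER = 0.15;

function nextBackoffMs(attempt: number, rand: () => number = Math.random): number {
  // attempt is 1-based; past the end of the table we stay at the 60s cap
  const base = BASE_DELAYS_MS[Math.min(attempt, BASE_DELAYS_MS.length) - 1];
  const jitterFactor = 1 + JITTER * (rand() * 2 - 1); // uniform in [-15%, +15%]
  return Math.round(base * jitterFactor);
}

function shouldDeadLetter(attempt: number): boolean {
  return attempt >= MAX_ATTEMPTS;
}
```

Because the cron dispatches fresh messages each cycle, a dead-lettered message never blocks the pipeline; it only parks the one failed payload for inspection.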
1b. Clustering
Recent articles are grouped into stories using TF-IDF cosine similarity.
- Queue: `CLUSTER_STORIES` → consumed by `worker-ingest`
- Trigger: `cron/cluster-sweep` (intended hourly)
- What happens: Unclustered articles older than 1 hour are batched and clustered. Similar articles get grouped into a single `story_clusters` row, which becomes the input for fact extraction.
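A minimal sketch of the similarity check behind clustering: term-frequency vectors compared with cosine similarity. The production clusterer is described as TF-IDF; the IDF weighting step is omitted here for brevity, so this is an illustration of the comparison, not the real implementation.

```typescript
// Build a term-frequency vector from article text
function termFreq(text: string): Map<string, number> {
  const tf = new Map<string, number>();
  for (const tok of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
    tf.set(tok, (tf.get(tok) ?? 0) + 1);
  }
  return tf;
}

// Cosine similarity between two sparse term vectors (1.0 = identical direction)
function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0, na = 0, nb = 0;
  for (const [term, wa] of a) {
    na += wa * wa;
    const wb = b.get(term);
    if (wb) dot += wa * wb;
  }
  for (const wb of b.values()) nb += wb * wb;
  return na && nb ? dot / (Math.sqrt(na) * Math.sqrt(nb)) : 0;
}
```

Articles whose pairwise similarity clears a threshold end up in the same `story_clusters` row.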
1c. Extraction
AI extracts structured facts from story clusters.
- Queue: `EXTRACT_FACTS` → consumed by `worker-facts`
- Trigger: Automatic after clustering
- What happens: The AI model (routed by the model router via `ai_model_tier_config` — see Section 4) reads the clustered articles (capped at 5 sources × 1,500 chars each to control prompt tokens) and produces structured fact records. Each fact has a title, key-value facts (validated against `fact_record_schemas.fact_keys`), a notability score, narrative context (Hook → Story → Connection, 4-8 sentences), and a theatrical challenge title. Facts are inserted with `source_type = 'news_extraction'` and linked to their source story. Per-model ModelAdapters inject prompt optimizations to exploit model strengths and mitigate weaknesses.
- Category Resolution: Topic categories are resolved via `resolveTopicCategory()` — a 3-step alias fallback (exact slug match → provider-specific alias in `topic_category_aliases` → universal alias). Unresolved slugs are logged to `unmapped_category_log` for audit.
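The 3-step fallback reduces to three ordered lookups. This sketch uses in-memory structures with assumed shapes for illustration; production resolves against the `topic_category_aliases` table.

```typescript
type AliasRow = { provider: string | null; alias: string; categorySlug: string };

function resolveTopicCategory(
  slug: string,
  provider: string,
  categories: Set<string>,
  aliases: AliasRow[],
): string | null {
  // Step 1: exact slug match against internal categories
  if (categories.has(slug)) return slug;
  // Step 2: provider-specific alias
  const providerHit = aliases.find(a => a.provider === provider && a.alias === slug);
  if (providerHit) return providerHit.categorySlug;
  // Step 3: universal alias (provider = null)
  const universalHit = aliases.find(a => a.provider === null && a.alias === slug);
  if (universalHit) return universalHit.categorySlug;
  // Unresolved — production logs this to unmapped_category_log
  return null;
}
```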
Pipeline B: Evergreen Facts (source_type = ai_generated)
Timeless knowledge facts not tied to current events — the kind of content that stays accurate and interesting indefinitely. Examples: "The speed of light is 299,792,458 m/s", "The Eiffel Tower was originally intended to be temporary." Evergreen facts are a co-equal content pillar alongside news facts; they ensure the feed always has high-quality content even when news cycles are slow.
1d. Evergreen Generation
AI generates structured facts for a given topic category, deduplicated against existing titles (capped at 50 to control prompt token growth).
- Queue: `GENERATE_EVERGREEN` → consumed by `worker-facts`
- Trigger: `cron/generate-evergreen` (intended daily at 3AM UTC)
- Model tier: `mid` (higher quality than default, because long-lived content quality matters more)
- What happens:
  - The cron dispatches one message per active topic category with a count (default 20/day, controlled by `EVERGREEN_DAILY_QUOTA`)
  - The handler fetches existing fact titles for the topic to prevent duplicates
  - AI generates structured fact records using the topic's schema keys, taxonomy content rules, taxonomy voice, and domain vocabulary
  - Each generated fact is inserted with `source_type = 'ai_generated'`, `status = 'pending_validation'`
  - Each fact is immediately enqueued for `VALIDATE_FACT` with the `multi_phase` strategy (same rigor as news facts)
- Controls: `EVERGREEN_ENABLED` (master switch, default false), `EVERGREEN_DAILY_QUOTA` (max facts/day, default 20)
- Cost tracking: Total AI cost is split evenly across generated records and stored per-record in `generation_cost_usd`
What happens if... evergreen generation is disabled? The feed still works — it draws from existing validated facts, news-derived content, and spaced repetition reviews. Evergreen is additive, not required.
Pipeline C: Seed Facts (source_types = file_seed, spinoff_discovery, ai_super_fact)
New topic categories start empty. The seed pipeline bootstraps them with high-quality AI-generated content from curated entries.
1e. Seed Explosion
Curated seed entries are "exploded" into many structured facts.
- Queue: `EXPLODE_CATEGORY_ENTRY` → consumed by `worker-facts`
- Trigger: Manual via seed scripts (on-demand)
- What happens: Each curated entry (e.g., a notable person, event, or concept) is expanded into 10-100 structured facts with theatrical titles and rich narrative context. Primary facts get `source_type = 'file_seed'`; discovered tangential facts get `source_type = 'spinoff_discovery'`. All are enqueued for validation via `IMPORT_FACTS`. The seed pipeline receives full taxonomy context — content rules, voice, and domain vocabulary — for domain-aware generation.
1f. Super Fact Discovery
AI finds cross-entry correlations — facts that connect multiple seed entries. Entry summaries are populated with actual fact titles (3 per entry) for real signal.
- Queue: `FIND_SUPER_FACTS` → consumed by `worker-facts`
- Trigger: After seed explosion completes for a batch
- What happens: The AI compares facts across entries to find meaningful connections (e.g., "Both X and Y studied at the same university"). Super facts are inserted with `source_type = 'ai_super_fact'` and linked to their parent entries via `super_fact_links`.
Shared Pipeline: Validation → Image → Challenge → Feed
All three pipelines converge here. Every fact — regardless of source — goes through the same quality gates.
1g. Validation
Every fact goes through multi-phase verification before it reaches the feed.
- Queue: `VALIDATE_FACT` → consumed by `worker-validate`
- Trigger: Automatic after extraction/generation, plus `cron/validation-retry` every 4 hours for stuck facts
- What happens: Four validation phases run in sequence:
| Phase | Name | What It Does | Cost |
|---|---|---|---|
| 1 | Structural | Schema conformance, type validation, injection detection | $0 (code-only) |
| 2 | Consistency | Internal contradictions, taxonomy rule violations | $0 (code-only) |
| 3 | Cross-Model | AI adversarial verification via Gemini 2.5 Flash with recency-aware severity calibration | ~$0.001 |
| 4 | Evidence | External API corroboration (Wikipedia, Wikidata) + AI reasoner (Gemini 2.5 Flash) | ~$0.002-0.005 |
Phases 1-2 are free code-only checks that catch ~40% of defective facts before any AI call is made. Phase 3 uses severity calibration (defaults to info for simplifications, warning for material errors) and recency-aware rules for news articles (unverifiable-due-to-recency = info, lower pass threshold 0.35 vs 0.50). Phase 4 uses multi-strategy Wikipedia entity extraction (possessives, quoted names, proper nouns, topic path hints, MediaWiki search fallback) achieving ~85% lookup success.
- Validation strategy varies by source: News and AI-generated facts use `multi_phase`. API imports use `authoritative_api`. Manual entries use `curated_database`.
- Contamination detection: `isResponseContaminated()` performs entity-name sanity checks with automatic retry on cross-model and evidence phases.
- Graduated penalties: `likely_inaccurate` gets a -0.15 confidence penalty (not a hard fail); `schema_mismatch` warnings are filtered from evidence escalation triggers.
What happens if... validation fails? The fact stays in `pending` status and never reaches the public feed. The validation-retry cron re-enqueues stuck facts every 4 hours. After 3 failed attempts the message goes to the DLQ for manual inspection.
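The graduated-penalty and recency-aware threshold rules combine roughly like this; names and shapes are assumptions for illustration, not the actual validation module:

```typescript
type Finding = { code: string; severity: "info" | "warning" | "error" };

// Graduated penalty: likely_inaccurate reduces confidence instead of hard-failing
function adjustedConfidence(base: number, findings: Finding[]): number {
  let conf = base;
  for (const f of findings) {
    if (f.code === "likely_inaccurate") conf -= 0.15;
  }
  return Math.max(0, conf);
}

// Recency-aware: very recent news is harder to corroborate, so the bar is lower
function passThreshold(isRecentNews: boolean): number {
  return isRecentNews ? 0.35 : 0.5;
}

function passes(base: number, findings: Finding[], isRecentNews: boolean): boolean {
  return adjustedConfidence(base, findings) >= passThreshold(isRecentNews);
}
```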
1h. Image Resolution
Facts get images resolved through a priority cascade of free APIs.
- Queue: `RESOLVE_IMAGE` → consumed by `worker-ingest`
- Trigger: Automatic after extraction/generation
- What happens: The worker searches through a priority cascade:
| Priority | Source | Coverage | Cost |
|---|---|---|---|
| 1 | Wikipedia PageImages | ~80% of named entities | Free, no key |
| 2 | TheSportsDB | Sports teams, athletes | Free key |
| 3 | Unsplash | Topical photos (landscapes, abstract) | Free key |
| 4 | Pexels | Topical photos (alternative pool) | Free key |
| 5 | null | UI shows placeholder | N/A |
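The cascade reduces to "first provider that returns a URL wins". A minimal sketch with stand-in provider functions (the real worker wraps actual API clients):

```typescript
type ImageProvider = (query: string) => Promise<string | null>;

// Try each provider in priority order; a failure or miss falls through to the next.
// Returning null is priority 5: the UI shows a placeholder.
async function resolveImage(query: string, cascade: ImageProvider[]): Promise<string | null> {
  for (const provider of cascade) {
    try {
      const url = await provider(query);
      if (url) return url;
    } catch {
      // A failing provider shouldn't break the cascade
    }
  }
  return null;
}
```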
1i. Challenge Generation
Pre-computed challenge content is generated for each validated fact. Challenge generation is triggered after validation passes (not before), avoiding wasted AI cost on rejected facts.
- Queue: `GENERATE_CHALLENGE_CONTENT` → consumed by `worker-facts`
- Trigger: Automatic after validation passes (enqueued from `validate-fact.ts`)
- What happens: AI generates challenge content for 6 quiz styles, each at up to 5 difficulty levels, using a layered voice system:
Voice Stack (injected in order):
- `CHALLENGE_VOICE_CONSTITUTION` — Universal Eko voice (playful, curious, wonder-driven)
- `TAXONOMY_VOICE` — Per-domain emotional register (e.g., sports = energetic, history = contemplative) for 33+ taxonomies
- `STYLE_VOICE` — Per-format interaction mechanics (gallery guide, dinner companion, co-author, etc.)
- `STYLE_RULES` — Tease-and-hint architecture with ANCHOR → ESCALATION → WITHHOLD arcs
Challenge Styles:
- `fill_the_gap` — Sentence with masked answer
- `direct_question` — "What is the capital of France?"
- `statement_blank` — "_____ is the capital of France"
- `multiple_choice` — Question with 4 options
- `reverse_lookup` — Given the answer, identify the subject
- `free_text` — Open-ended question with AI grading
Each piece of content includes: `setup_text`, `challenge_text`, `correct_answer`, `reveal_correct`, `reveal_wrong`, and typed `style_data`.
Per-challenge titles: Each challenge gets its own theatrical title (moved from fact-level), preventing answer-leak bugs.
Quality enforcement:
- Drift coordinator system (5 pluggable coordinators: structure, schema, voice, taxonomy, difficulty) detects semantic drift in generated content
- CQ-002 validation ensures second-person address in reveals
- Banned pattern detection blocks "trivia", "quiz", "easy one" in reveals
- Post-generation patchers: `patchPassiveVoice()` (36 patterns), `patchGenericReveals()` (21 semantic families, ~75% coverage), `patchTextbookRegister()`, `patchPunctuationSpacing()`
- ModelAdapter-specific guardrails (e.g., Gemini 2.5 Flash has 40+ banned reveal openings, factual accuracy guardrails v6)
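Two of the cheapest gates above (banned reveal patterns and the CQ-002 second-person check) reduce to simple pattern tests. This sketch uses only the patterns cited in this section; the production rule set is larger and lives in `packages/ai`:

```typescript
// Only the three banned terms named above; the real list is longer.
const BANNED_REVEAL_PATTERNS = [/\btrivia\b/i, /\bquiz\b/i, /\beasy one\b/i];

function violatesRevealRules(reveal: string): boolean {
  return BANNED_REVEAL_PATTERNS.some(p => p.test(reveal));
}

// CQ-002-style check: reveals should address the reader in the second person
function usesSecondPerson(reveal: string): boolean {
  return /\byou(r|'re)?\b/i.test(reveal);
}
```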
1j. Feed Display
Validated facts from all pipelines appear in the user's feed with a blended algorithm.
- No queue — served by the `/api/feed` API endpoint
- Blend: 40% recent validated facts, 30% facts due for spaced repetition review, 20% evergreen facts, 10% random exploration
- Gating: The feed itself is public. Full card detail and interactions (quiz, recall, challenges) require a Free or Eko+ subscription.
- Source-agnostic: The feed algorithm doesn't distinguish between news, evergreen, and seed facts. Once validated, they're all equal.
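The 40/30/20/10 blend can be sketched as a per-page allocation. Bucket names are assumptions, and the real endpoint presumably backfills from other buckets when one runs short:

```typescript
const FEED_BLEND = { recent: 0.4, review: 0.3, evergreen: 0.2, explore: 0.1 } as const;

function feedAllocation(pageSize: number): Record<keyof typeof FEED_BLEND, number> {
  const counts = {
    recent: Math.round(pageSize * FEED_BLEND.recent),
    review: Math.round(pageSize * FEED_BLEND.review),
    evergreen: Math.round(pageSize * FEED_BLEND.evergreen),
    explore: Math.round(pageSize * FEED_BLEND.explore),
  };
  // Give any rounding remainder to the largest bucket so totals match the page size
  const total = counts.recent + counts.review + counts.evergreen + counts.explore;
  counts.recent += pageSize - total;
  return counts;
}
```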
2. The Architecture — Apps, Packages, and How They Connect
Think of it like... packages are LEGO bricks; apps are assembled kits. Each package does one thing well, and apps compose them into user-facing products.
Apps
| App | URL | Purpose |
|---|---|---|
| apps/web | app.eko.day | Authenticated app — feed, card detail, challenges, account |
| apps/admin | admin.eko.day | Admin dashboard — content moderation, queue monitoring, billing |
| apps/public | eko.day | Public marketing site — home, pricing, features, about |
| apps/worker-ingest | — | Queue consumer: news ingestion, story clustering, image resolution |
| apps/worker-facts | — | Queue consumer: fact extraction, evergreen, challenges, seeding |
| apps/worker-validate | — | Queue consumer: multi-phase fact validation |
Deprecated workers (stubs only): apps/worker-reel-render, apps/worker-sms
Packages
┌─────────────────────────────────────────────────────┐
│ apps layer │
│ web admin public worker-* │
└────┬───────┬────────┬─────────┬─────────────────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌─────────────────────────────────────────────────────┐
│ packages layer │
│ │
│ shared ◄── schemas, types, utilities │
│ config ◄── env vars, model registry, TS data files │
│ db ◄── Supabase client, Drizzle ORM, queries │
│ ai ◄── extraction, validation, model router, │
│ model adapters, challenge content, │
│ drift coordinators, taxonomy voice │
│ queue ◄── Upstash Redis queue client │
│ email ◄── Resend email templates │
│ stripe ◄── billing integration │
│ r2 ◄── Cloudflare R2 object storage │
│ observability ◄── structured logging │
│ ui ◄── shadcn/ui components (authenticated) │
│ ui-public ◄── public site components │
│ reel-schemas ◄── video schema definitions │
└─────────────────────────────────────────────────────┘
Queue System
Backend: Upstash Redis (REST API). Max 3 attempts before dead-letter queue. Exponential backoff with jitter (5s → 30s → 60s cap, 15% jitter).
| Queue Type | Consumer | Status | Trigger |
|---|---|---|---|
| INGEST_NEWS | worker-ingest | active | cron (every 15m) |
| CLUSTER_STORIES | worker-ingest | active | cron (hourly) |
| RESOLVE_IMAGE | worker-ingest | active | post-extraction |
| EXTRACT_FACTS | worker-facts | active | post-clustering |
| IMPORT_FACTS | worker-facts | stub | cron (not active) |
| GENERATE_EVERGREEN | worker-facts | active | cron (daily) |
| EXPLODE_CATEGORY_ENTRY | worker-facts | active | seed pipeline |
| FIND_SUPER_FACTS | worker-facts | active | seed pipeline |
| GENERATE_CHALLENGE_CONTENT | worker-facts | active | post-validation |
| VALIDATE_FACT | worker-validate | active | post-extraction + cron (4h retry) |
| SEND_SMS | none | deprecated | — |
Database
Supabase (Postgres) with Row-Level Security. 135 migrations (0001-0135) across 7 phases.
Key concept tables:
| Table | Purpose |
|---|---|
| topic_categories | Hierarchical topic taxonomy (33+ root categories, 76+ subcategories) with alias resolution |
| topic_category_aliases | Maps external news API slugs to internal categories (3-step fallback) |
| fact_record_schemas | Per-topic Zod-validated key definitions (fact_keys) — 33+ domain-specific schemas |
| fact_records | The atomic unit — one verified fact with structured key-value data |
| stories | Clustered news articles that facts are extracted from |
| news_sources | Raw articles fetched from news APIs |
| card_interactions | User engagement (views, answers, bookmarks, shares) with continuous 0.0-1.0 scoring |
| fact_challenge_content | Pre-generated AI challenge text per style, difficulty, and target_fact_key |
| challenge_formats | 8 named challenge formats (Big Fan Of, Know A Lot About, etc.) |
| challenge_format_styles | Style-to-format junction (which styles belong to which formats) |
| challenge_format_topics | Topic-to-format eligibility junction |
| challenge_sessions | Multi-turn conversational AI challenge state |
| user_subscriptions | Free/Eko+ subscription status |
| ai_cost_log | Per-call AI spend tracking for budget enforcement |
| ai_cost_tracking | Daily cost aggregation by provider, model, and feature |
| ai_model_tier_config | DB-driven model tier routing (changeable via SQL, no restart) |
| score_disputes | AI-judged score disputes with decision types |
| reward_milestones / user_reward_claims | Engagement rewards (100/500/1000/2000 points → free Eko+ days) |
| seed_entry_queue | Priority-ordered seed entry consumption |
| super_fact_links | Cross-entry correlation junction for super facts |
| unmapped_category_log | Audit log for unresolved news API category slugs |
3. Cascading Effects — What Breaks What
Eko has seven critical dependency chains. Understanding these prevents accidental breakage.
3a. Shared Schemas
packages/shared/src/schemas.ts
├── packages/queue (message validation)
├── packages/ai (extraction output schemas)
├── packages/db (query type safety)
├── apps/web (API request/response validation)
├── apps/admin (content display types)
└── apps/worker-* (message parsing)
What: The @eko/shared package exports Zod schemas for every queue message type, every domain entity, and every API contract. It is imported by every other package and app.
Why it matters: A breaking change to a Zod schema (renaming a field, changing a type) cascades to every consumer. If INGEST_NEWS payload shape changes, packages/queue fails to validate, worker-ingest fails to parse, and the entire ingestion pipeline stops.
Safe change pattern: Add optional fields (non-breaking). For required field changes, update all consumers in the same PR and run bun run typecheck across the monorepo.
What happens if... a schema field is removed? Detection: `bun run typecheck` fails immediately across dependent packages. Recovery: Revert the change or update all consumers before merging.
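The "optional fields are non-breaking" pattern can be sketched in plain TypeScript; the real schemas are Zod, and these message shapes are illustrative, not the actual `@eko/shared` contracts:

```typescript
type IngestNewsV1 = { provider: string; categorySlug: string };
// Non-breaking evolution: the new field is optional, so V1 payloads still type-check
type IngestNewsV2 = IngestNewsV1 & { maxDepth?: number };

function handleIngestNews(msg: IngestNewsV2): string {
  const depth = msg.maxDepth ?? 0; // default preserves old behavior for old producers
  return `${msg.provider}:${msg.categorySlug}:depth=${depth}`;
}
```

Making `maxDepth` required instead would break every producer that hasn't been updated, which is why required-field changes must land with all consumers in the same PR.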
3b. Queue System
packages/queue/src/index.ts
├── apps/worker-ingest (3 queue types)
├── apps/worker-facts (6 queue types)
├── apps/worker-validate (1 queue type)
└── apps/web/app/api/cron/* (enqueue messages)
What: The queue package provides enqueue(), dequeue(), ack(), and nack() operations backed by Upstash Redis. Every cron and every worker depends on it.
Why it matters: If the queue system is misconfigured (bad Redis URL, schema mismatch), all async processing stops. Facts pile up unprocessed, validation stalls, and the feed goes stale.
Safe change pattern: Test queue changes with SOAK_QUEUE_SUFFIX for isolated testing before deploying to production queues.
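Suffix-based isolation reduces to namespacing the Redis keys. The key format here is an assumption for illustration; in practice the suffix would come from `SOAK_QUEUE_SUFFIX`:

```typescript
// An empty suffix yields the production key; a soak suffix yields an isolated key,
// so test traffic never touches production queues.
function queueKey(queueType: string, suffix = ""): string {
  return suffix ? `queue:${queueType}:${suffix}` : `queue:${queueType}`;
}
```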
3c. Topic Taxonomy
topic_categories (database)
+ fact_record_schemas (database)
+ topic_category_aliases (database)
├── apps/worker-facts (extraction schema selection + alias resolution)
├── apps/worker-validate (validation context)
├── apps/worker-facts (challenge generation + taxonomy voice)
├── apps/web/app/api/feed (category filtering)
└── apps/web/components (category chips, filters)
What: The topic_categories table defines the hierarchical taxonomy (sports → basketball → NBA). Each category links to a fact_record_schemas row that defines what structured fields a fact in that category must have. topic_category_aliases maps external news API slugs to internal categories. Subcategory schemas auto-inherit from parent via trg_inherit_parent_schema trigger.
Why it matters: Adding a new topic category requires a migration, schema definition, and propagation to challenge formats. Removing or renaming a category breaks extraction for that topic. The taxonomy voice layer and content rules depend on slug-based lookups.
3d. AI Model Router + Adapters
packages/ai/src/model-router.ts
+ packages/ai/src/models/registry.ts
+ packages/ai/src/models/adapters/*.ts (12 adapters)
├── Fact extraction (worker-facts)
├── Fact validation (worker-validate)
├── Evergreen generation (worker-facts)
├── Challenge content generation (worker-facts)
├── Seed explosion (worker-facts)
└── Conversational challenges (apps/web API)
What: The model router selects which AI model handles each call based on a three-tier system: default (92% of calls — cost-efficient), mid (5% — higher quality), high (1% — top-tier reasoning). Tier-to-model mapping is database-driven via ai_model_tier_config with 60-second caching. Each model has a ModelAdapter that injects per-model prompt optimizations (suffix/prefix/override modes) to exploit strengths and mitigate weaknesses.
6 AI Providers:
| Provider | Models | Status |
|---|---|---|
| OpenAI | gpt-5-mini, gpt-5-nano, gpt-4o-mini | Active |
| Anthropic | claude-haiku-4-5, claude-opus-4-6 | Active |
| Google | gemini-2.5-flash, gemini-2.0-flash-lite, gemini-3-flash-preview | Active |
| xAI | grok-4-1-fast-reasoning, grok-4-1-fast-non-reasoning, grok-4 | Active |
| Mistral | mistral-large-latest, mistral-medium-latest, mistral-small-latest | Active (no adapter) |
| DeepSeek | — | Removed |
Why it matters: If the configured model's API key is missing or the provider is down, AI operations fall back to the default tier. Budget caps ($5/day Anthropic, $3/day Google) provide cost protection with graceful degradation.
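Cap-based degradation is essentially a ledger check before each call. A sketch with assumed shapes; the caps ($5/day Anthropic, $3/day Google) and the GPT-4o-mini fallback are the documented values:

```typescript
type ModelChoice = { provider: string; model: string };
type SpendLedger = Record<string, number>; // provider → USD spent today

function pickModel(
  preferred: ModelChoice,
  fallback: ModelChoice,
  caps: Record<string, number>,
  spend: SpendLedger,
): string {
  const cap = caps[preferred.provider];
  if (cap !== undefined && (spend[preferred.provider] ?? 0) >= cap) {
    return fallback.model; // cap exhausted — degrade gracefully instead of failing
  }
  return preferred.model;
}
```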
3e. Validation Pipeline
worker-validate
└── fact_records.status = 'validated'
└── GENERATE_CHALLENGE_CONTENT (post-validation trigger)
└── fact_challenge_content
└── /api/feed (only shows validated facts with challenges)
└── User's feed
What: The validation pipeline is the gate between extraction and the user's feed. Only facts with status = 'validated' appear in the feed. Challenge content is generated after validation passes.
Why it matters: If worker-validate is down or all validations fail, new facts accumulate in pending status. The feed doesn't break — it just stops showing new content. Existing validated facts continue to display.
3f. Subscription Gating
Stripe webhooks → user_subscriptions
└── /api/cards/[slug] (subscription check)
└── Card detail access
What: Stripe webhook events update user_subscriptions. The card detail API checks subscription status before returning gated content. 14-day trial with CC collection via Stripe Checkout.
3g. Challenge Content Voice Stack
CHALLENGE_VOICE_CONSTITUTION (universal)
+ TAXONOMY_VOICE (per-domain, 33+ categories)
+ STYLE_VOICE (per-format, 6 voices)
+ STYLE_RULES (tease-and-hint architecture)
+ ModelAdapter (per-model prompt optimization)
+ Drift coordinators (5 pluggable quality checks)
└── Generated challenge content
What: Challenge content quality depends on a layered voice stack that builds from universal principles down to model-specific adaptations. The drift coordinator system (structure, schema, voice, taxonomy, difficulty) detects semantic drift in generated content.
Why it matters: Removing or modifying a voice layer without updating downstream layers creates quality drift. The taxonomy voice layer depends on slug-based lookups — any slug changes must propagate. ModelAdapter eligibility is tracked via JSONL with a 97% structural / 90% subjective threshold system (tiered eligibility).
Summary Matrix
| System | Blast Radius | If It Breaks... |
|---|---|---|
| Shared Schemas | Total | Every package fails to compile |
| Queue System | Total | All async processing stops |
| Topic Taxonomy | High | New facts can't be extracted for affected topics |
| AI Model Router + Adapters | High | All AI falls to default tier or fails entirely |
| Validation Pipeline | Medium | Facts queue up but don't reach feed |
| Subscription Gating | Medium | Paying users can't access card details |
| Challenge Voice Stack | Medium | Challenge quality degrades silently |
4. The Agents — Who Owns What
Think of it like... a hospital with specialized departments and a chief of staff. Each agent owns a specific domain, and the architect-steward (chief of staff) ensures they all work together without stepping on each other.
Eko uses a system of 17 specialized Claude Code agents (plus 5 deprecated) to prevent scope creep and ensure clear ownership.
Pipeline Agents
These agents own the data flow from news to card:
ingest-engineer
└── informs → fact-engineer
└── informs → validation-engineer
└── informs → card-ux-designer
| Agent | Owns | Key Files |
|---|---|---|
| ingest-engineer | News fetch, clustering, images | apps/worker-ingest/** |
| fact-engineer | AI extraction, evergreen, challenges, model adapters | apps/worker-facts/**, packages/ai/** |
| validation-engineer | Multi-tier verification | apps/worker-validate/** |
| card-ux-designer | Feed, card detail, quiz UI | apps/web/app/feed/**, packages/ui/** |
Cross-Cutting Agents
| Agent | Role |
|---|---|
| architect-steward | Enforces v2 invariants, routes to correct agent |
| security-reviewer | RLS, SSRF prevention, secrets handling |
| ci-quality-gatekeeper | CI stability, linting, typecheck, builds |
| db-migration-operator | Schema changes, migrations, RLS policies |
| queue-sre | Queue health, DLQ monitoring, backoff tuning |
| cron-scheduler | Cron route creation, scheduling, ingestion_runs |
| platform-config-owner | Environment config, runtime settings |
| observability-analyst | Structured logging, trace correlation |
| subscription-manager | Free/Eko+ plans, Stripe, entitlements |
| admin-operator | Admin dashboard, content moderation |
| docs-librarian | Documentation health, link integrity |
| release-manager | Versioning, changelogs, rollback plans |
Quick Lookup: "I need to change X, which agent do I talk to?"
| If you're changing... | Talk to... |
|---|---|
| A news provider adapter | ingest-engineer |
| AI extraction prompts or model adapters | fact-engineer |
| Validation logic | validation-engineer |
| Feed algorithm or card UI | card-ux-designer |
| Database schema | db-migration-operator |
| Queue configuration | queue-sre |
| Cron schedules | cron-scheduler |
| Environment variables | platform-config-owner |
| Stripe/billing | subscription-manager |
| Admin dashboard | admin-operator |
| Challenge voice/taxonomy rules | fact-engineer |
| Not sure? | architect-steward (routes you to the right agent) |
5. Rules, Quality Gates, and CI
Think of it like... a building's fire code — some rules are alarms (CI blocks merge), some are sprinklers (pre-commit hooks catch issues), and some are inspections (advisory reviews).
The 7 Invariants
These are Eko's non-negotiable constraints:
| ID | Invariant | What it means |
|---|---|---|
| INV-001 | Fact-first | Facts are the atomic unit. Everything flows from structured, schema-validated facts. |
| INV-002 | Verification before publication | No fact reaches the public feed without at least one validation tier pass. |
| INV-003 | Source attribution | Every fact traces back to source articles and validation evidence. |
| INV-004 | Schema conformance | Fact output must validate against fact_record_schemas.fact_keys. |
| INV-005 | Cost-bounded AI | All AI calls have model routing, budget caps ($5/day Anthropic, $3/day Google), and cost tracking. |
| INV-006 | Public feed / gated detail | Feed is public; full card detail and interactions require Free/Eko+ subscription. |
| INV-007 | Topic balance | Daily quotas per topic category prevent content monoculture. |
Tradeoff priority: When invariants conflict, prefer: correctness → auditability → safety → cost control.
What CI Checks
The bun run ci pipeline runs these checks in order:
- `docs:lint:strict` — Frontmatter validation on all markdown files
- `docs:health` — Documentation health score (≥95% threshold)
- `docs:binding-check` — Code path references in doc frontmatter exist
- `prompts:check` — All file paths in prompt code blocks exist on disk
- `agents:routing-check` — No file ownership overlaps between agents
- `rules:check` — Rules index is current
- `scripts:check` — Script index is current
- `migrations:check` — Migrations index is current
- `bible:check` — Product bible references are accurate
- `taxonomy:completeness-check` — Voice coverage, content rules, vocabulary depth across all 32 taxonomy slugs
- `lint` — Biome linting across all packages
- `registry:check` — UI component registry is valid
- `env:check-example` — `.env.example` is complete
- `env:check-typos` — No common env file typos
- `typecheck` — TypeScript type checking
- `test` — Vitest test suite
Pre-Commit Hooks
- Biome lint + format on staged `.ts`, `.tsx`, `.js`, `.jsx`, `.json`, `.md` files
- Plan governance — blocks commits that modify `status: locked` plan files
6. Operations — Crons, Workers, and Environment
Cron Schedule Overview
Scheduled in vercel.json (5):
| Cron | Schedule | Status |
|---|---|---|
| payment-reminders | Daily 9AM UTC | active |
| payment-escalation | Daily 9AM UTC | active |
| account-anniversaries | Daily 9AM UTC | active |
| daily-cost-report | Daily 6AM UTC | active |
| monthly-usage-report | 1st of month | deprecated (stub) |
Not yet scheduled (8) — active code but no vercel.json entry (OPS-004):
| Cron | Intended Schedule | Purpose |
|---|---|---|
| ingest-news | Every 15 min | Primary news pipeline trigger |
| cluster-sweep | Every hour | Cluster unclustered articles |
| generate-evergreen | Daily 3AM UTC | Generate timeless knowledge facts |
| validation-retry | Every 4 hours | Re-enqueue stuck validations |
| archive-content | Daily 2AM UTC | Promote/archive facts by engagement |
| topic-quotas | Daily 6AM UTC | Audit fact counts vs quotas |
| import-facts | Daily 4AM UTC | Stub for structured API imports |
| daily-digest | — | Deprecated stub |
What happens if... crons don't fire? Currently all deployments target Vercel preview, and crons only run on production (OPS-005). This means no automated pipeline processing is active until production deployment. Manual triggering works via `curl -X POST /api/cron/<name> -H "Authorization: Bearer $CRON_SECRET"`.
Worker Health
All workers expose a `/health` endpoint on port 8080, send heartbeats every 30 seconds, and use 2-minute lease durations. Workers implement graceful shutdown via abort signals. The `WORKER_CONCURRENCY` env var controls parallel messages per queue type (default: 1; set 3-5 for seeding).
Environment Controls
Central config lives in packages/config/src/index.ts. Key controls:
| Control | Default | Purpose |
|---|---|---|
| `AI_PROVIDER` | `anthropic` | Primary AI provider |
| `ANTHROPIC_DAILY_SPEND_CAP_USD` | $5.00 | Daily budget cap — falls back to GPT-4o-mini when exhausted |
| `GOOGLE_DAILY_SPEND_CAP_USD` | $3.00 | Daily budget cap for Gemini calls |
| `OPUS_ESCALATION_ENABLED` | `false` | Allow routing to Opus for top-1% complex tasks |
| `OPUS_MAX_DAILY_CALLS` | 20 | Hard cap on Opus invocations |
| `EVERGREEN_ENABLED` | `false` | Master switch for evergreen fact generation |
| `EVERGREEN_DAILY_QUOTA` | 20 | Max evergreen facts per day |
| `WORKER_CONCURRENCY` | 1 | Parallel message handlers per queue (set 3-5 for seeding) |
| `NOTABILITY_THRESHOLD` | 0.6 | Minimum score to retain a fact (0.0-1.0) |
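The pattern behind these controls can be sketched as below. This is an illustration of the env-var-with-documented-default approach, not the actual contents of `packages/config/src/index.ts`; the helper names are assumptions.

```typescript
// Illustrative central-config sketch (helper names are assumptions, not the
// real packages/config source). Each control reads an env var and falls back
// to the documented default from the table above.
function envNumber(name: string, fallback: number): number {
  const raw = process.env[name];
  const parsed = raw === undefined ? NaN : Number(raw);
  return Number.isFinite(parsed) ? parsed : fallback;
}

function envBool(name: string, fallback: boolean): boolean {
  const raw = process.env[name];
  return raw === undefined ? fallback : raw === "true" || raw === "1";
}

const config = {
  aiProvider: process.env.AI_PROVIDER ?? "anthropic",
  anthropicDailySpendCapUsd: envNumber("ANTHROPIC_DAILY_SPEND_CAP_USD", 5.0),
  googleDailySpendCapUsd: envNumber("GOOGLE_DAILY_SPEND_CAP_USD", 3.0),
  opusEscalationEnabled: envBool("OPUS_ESCALATION_ENABLED", false),
  opusMaxDailyCalls: envNumber("OPUS_MAX_DAILY_CALLS", 20),
  evergreenEnabled: envBool("EVERGREEN_ENABLED", false),
  evergreenDailyQuota: envNumber("EVERGREEN_DAILY_QUOTA", 20),
  workerConcurrency: envNumber("WORKER_CONCURRENCY", 1),
  notabilityThreshold: envNumber("NOTABILITY_THRESHOLD", 0.6),
};
```

The real module presumably exports this object so workers and API routes share one source of truth for defaults.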
The Seeding Pipeline
New topic categories are bootstrapped with content using the seed pipeline (see Pipeline C in Section 1 for the full walkthrough):
- Curated entries are generated (AI-assisted) for the topic
- `EXPLODE_CATEGORY_ENTRY` queue messages expand each entry into 10-100 structured facts (`file_seed` + `spinoff_discovery`)
- `FIND_SUPER_FACTS` discovers cross-entry correlations (`ai_super_fact`)
- `VALIDATE_FACT` verifies the generated facts (same pipeline as news and evergreen)
- `GENERATE_CHALLENGE_CONTENT` creates quiz content for each validated fact
The seed pipeline uses higher WORKER_CONCURRENCY (3-5) for throughput.
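Per the glossary, queue messages are Zod-validated JSON payloads. A sketch of what an `EXPLODE_CATEGORY_ENTRY` payload and its validation might look like; the field names are assumptions (not the real schema), and a hand-rolled check stands in for Zod to keep the sketch dependency-free.

```typescript
// Illustrative seed-pipeline queue message (field names are assumptions).
// The real pipeline validates payloads with Zod; this hand-rolled check
// just demonstrates the validate-before-process shape.
type ExplodeCategoryEntryMessage = {
  type: "EXPLODE_CATEGORY_ENTRY";
  seedEntryId: string;
  topicCategory: string;
  maxFacts: number; // each entry expands into 10-100 structured facts
};

function parseExplodeMessage(payload: unknown): ExplodeCategoryEntryMessage {
  const msg = payload as Partial<ExplodeCategoryEntryMessage>;
  if (
    msg?.type !== "EXPLODE_CATEGORY_ENTRY" ||
    typeof msg.seedEntryId !== "string" ||
    typeof msg.topicCategory !== "string" ||
    typeof msg.maxFacts !== "number" ||
    msg.maxFacts < 10 ||
    msg.maxFacts > 100
  ) {
    throw new Error("invalid EXPLODE_CATEGORY_ENTRY payload");
  }
  return msg as ExplodeCategoryEntryMessage;
}
```

A worker would call the parser first and let invalid messages fail fast toward the DLQ rather than processing a malformed payload.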
7. Feature Improvements — Where Eko Can Level Up
Pipeline
- Schedule the 8 unscheduled crons (OPS-004) — highest priority operational gap. The news pipeline, clustering, evergreen generation, and validation retry all have working code but no scheduler entry.
- Deploy to production (OPS-005) — crons only fire on Vercel production deployments. Currently on preview.
- Remove deprecated stubs (OPS-001/002/003/006/007) — monthly-usage-report, daily-digest, twilio webhook, and SEND_SMS queue type are all dead code.
- Add retry visibility — surface DLQ counts in admin dashboard alerts so operators know when messages are failing.
UX
- Offline card access — cache validated facts for offline review.
- Progress dashboard — show user's learning streaks, topic coverage, and spaced repetition stats.
- Social sharing — share fact cards to social media with OG images.
- Onboarding tutorial — guided first-time experience explaining challenges.
- Entity browsing ✅ — `/entity/[id]` detail pages showing all FCGs for an entity (e.g., "Babe Ruth"), linked entities, and an `/explore` search/browse page. Cross-FCG title leak detection prevents sibling FCG titles from revealing each other's challenge answers.
AI & Content Quality
- Feedback loop — use user dispute data to improve extraction prompts.
- Multi-language facts — extract facts in multiple languages for broader audience.
- Difficulty calibration — use answer accuracy data to auto-calibrate challenge difficulty.
- Expand model adapter coverage — Mistral adapters, further Gemini optimization.
Operations
- Alerting integration — connect Sentry/PagerDuty for worker failures and budget overruns.
- Queue dashboard improvements — real-time processing rates, historical throughput graphs.
- Automated canary deploys — deploy to a subset of users before full rollout.
Business
- Team plans — enable shared Eko+ accounts for classrooms and organizations.
8. Glossary
| Term | Definition |
|---|---|
| Fact record | The atomic unit of Eko — a structured, schema-validated piece of knowledge with key-value facts, a title, and a notability score |
| Entity | The real-world subject driving an FCG — a person, place, event, or concept (e.g., "Babe Ruth", "2008 Global Financial Crisis"). Stored as seed_entry_queue entries, linked to facts via fact_records.seed_entry_id. Entities have detail pages (/entity/[id]) showing all their FCGs. |
| Topic category | A node in the hierarchical taxonomy (e.g., sports → basketball → NBA). Each has its own fact schema. 33+ root categories, 76+ subcategories. |
| Schema key | A typed field definition in fact_record_schemas.fact_keys (e.g., player_name: text, career_points: number) — domain-specific per category |
| Notability score | A 0.0-1.0 score indicating how noteworthy a fact is. Below the threshold (default 0.6), facts are discarded. |
| Challenge format | One of 8 named quiz formats (Big Fan Of, Know A Lot About, Repeat After Me, Good With Dates, Degrees of Separation, Used To Work There, Partial Pictures, Originators) |
| Challenge style | The UI mechanic for a challenge: fill_the_gap, direct_question, statement_blank, reverse_lookup, free_text, multiple_choice, progressive_image_reveal, or conversational |
| Challenge title | A theatrical, per-challenge title generated to avoid answer-leak bugs |
| Spaced repetition | SM-2 variant algorithm scheduling review intervals [4h, 1d, 3d, 7d, 14d, 30d] based on answer streak |
| DLQ | Dead-letter queue — where messages go after 3 failed processing attempts. Requires manual inspection. |
| Cron | A scheduled task triggered at fixed intervals by Vercel's cron system (production only) |
| Queue message | A Zod-validated JSON payload sent via Upstash Redis to trigger async work in a worker |
| Validation tier | One of four phases: structural → internal_consistency → cross_model → evidence_corroboration |
| Model tier | AI model quality level: default (cheap/fast), mid (balanced), high (top-tier reasoning) |
| ModelAdapter | Per-model prompt customization with suffix/prefix/override modes and eligibility tracking (97% structural / 90% subjective threshold) |
| Drift coordinator | Pluggable quality checker (structure, schema, voice, taxonomy, difficulty) that detects semantic drift in AI-generated challenge content |
| Taxonomy voice | Per-domain emotional register injected between universal voice and per-format voice in challenge generation |
| Taxonomy content rules | Per-domain formatting and factual conventions injected into extraction and challenge prompts |
| Domain vocabulary | Per-category expert terms and phrases auto-generated via taxonomy CLI |
| Evergreen fact | A co-equal content pillar alongside news facts — timeless knowledge not tied to current events (source_type ai_generated). Generated daily via the GENERATE_EVERGREEN queue at mid model tier for quality. |
| News fact | A fact derived from clustered news articles (source_type news_extraction). Tied to current events and extracted via the ingestion → clustering → extraction pipeline. |
| Seed fact | A fact generated during topic bootstrapping (source_type file_seed or spinoff_discovery). Created by "exploding" curated entries into structured facts. |
| Super fact | A cross-entry correlation discovered by comparing facts across multiple seed entries (source_type ai_super_fact) |
| Source type | The origin of a fact record: news_extraction, ai_generated, file_seed, spinoff_discovery, ai_super_fact, or api_import |
| Story cluster | A group of news articles about the same event, clustered by TF-IDF cosine similarity |
| Seed pipeline | The process of bootstrapping a new topic category with AI-generated facts from curated entries |
| Blast radius | How many systems are affected when a component breaks |
| Worker | A Bun-based background process that consumes queue messages and processes them |
| Alias resolution | 3-step category lookup: exact slug → provider-specific alias → universal alias |
| Contamination detection | Entity-name sanity check on AI validation responses to prevent cross-model response mixing |
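The spaced-repetition entry above can be sketched as a small scheduling function. The interval ladder [4h, 1d, 3d, 7d, 14d, 30d] indexed by answer streak comes from the glossary; the reset-to-zero-on-miss behavior is an assumption about the SM-2 variant, not confirmed by this document.

```typescript
// Sketch of the SM-2-variant scheduling from the glossary. The ladder is
// [4h, 1d, 3d, 7d, 14d, 30d] in hours; the reset-on-miss rule is assumed.
const INTERVALS_HOURS = [4, 24, 72, 168, 336, 720];

function nextReviewHours(
  streak: number,
  correct: boolean
): { streak: number; hours: number } {
  // A correct answer climbs the ladder (capped at the top rung);
  // a miss resets the streak to zero.
  const next = correct ? Math.min(streak + 1, INTERVALS_HOURS.length) : 0;
  // Streak n (>= 1) corresponds to rung n-1; a reset card reviews at the first rung.
  const rung = Math.max(next - 1, 0);
  return { streak: next, hours: INTERVALS_HOURS[rung] };
}
```

So a card answered correctly for the first time comes back in 4 hours, a six-streak card waits 30 days, and a missed card drops back to the 4-hour rung.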
References
- App Control Manifest — Operational details for every cron, worker, queue, and API
- Agent Catalog — Full agent system with ownership boundaries
- Rules Index — All rules, conventions, and enforcement levels
- Seed Control — Seeding pipeline directives and cost estimates
- Model Code Isolation — ModelAdapter pattern and per-model prompt optimization