Eko Product Bible

The "read this first" document for anyone joining the Eko team — engineers, designers, marketers, and business stakeholders.

Eko is a knowledge platform that builds verified, structured fact cards from multiple sources — breaking news, AI-generated evergreen knowledge, and curated seed content. Users learn through interactive challenges — quizzes, recall exercises, and conversational AI sessions — powered by spaced repetition. Think of it like a factory with three assembly lines: one processes raw news, one generates timeless knowledge, and one bootstraps new topic areas — all producing the same high-quality, verified knowledge cards.

The core loop: sources → facts → validation → cards → learning.

Three primary content pipelines feed Eko:

| Pipeline | Source Type | What it produces | Trigger |
|---|---|---|---|
| News | news_extraction | Facts derived from clustered news articles | Cron-driven (every 15 min) |
| Evergreen | ai_generated | Timeless knowledge facts not tied to current events | Cron-driven (daily) |
| Seed | file_seed, spinoff_discovery, ai_super_fact | Bootstrapped facts for new topic categories | Manual / on-demand |

All three pipelines converge at the same point: every fact goes through validation, image resolution, and challenge generation before reaching the feed.

Who uses Eko and why:

| Audience | What they get |
|---|---|
| End users | A daily feed of verified knowledge cards with quizzes, recall, and AI challenges |
| Content team | Seeding tools to bootstrap new topic categories with high-quality facts |
| Engineers | A well-structured pipeline with clear ownership, CI enforcement, and specialized agents |
| Business | Subscription-gated detail pages (Free tier = feed; Eko+ = full card detail and interactions) |

1. The Pipeline — How a Fact Is Born

Facts enter Eko through three pipelines — news, evergreen, and seed — but all converge into the same validation → image → challenge → feed path. Here is the full picture.

News APIs ─┐
           ├──▶ [INGEST_NEWS] ──▶ worker-ingest ──▶ news_sources table
           │                                              │
           │                        ┌─────────────────────┘
           │                        ▼
           │              [CLUSTER_STORIES] ──▶ worker-ingest ──▶ story_clusters
           │                                              │
           │                        ┌─────────────────────┘
           │                        ▼
           │              [EXTRACT_FACTS] ──▶ worker-facts ──▶ fact_records
           │                                              │
           │                  ┌───────────┬───────────────┘
           │                  ▼           ▼
           │         [VALIDATE_FACT]  [RESOLVE_IMAGE]
           │              │               │
           │              ▼               ▼
           │         worker-validate  worker-ingest
           │              │               │
           │              ▼               ▼
           │         fact verified    image cached
           │              │
           │              ▼
           │    [GENERATE_CHALLENGE_CONTENT] ──▶ worker-facts
           │              │
           │              ▼
           │    fact_challenge_content (6 styles × 5 difficulties)
           │
Seed Data ─┤
           ├──▶ [EXPLODE_CATEGORY_ENTRY] ──▶ worker-facts ──▶ fact_records
           ├──▶ [FIND_SUPER_FACTS] ──▶ worker-facts ──▶ cross-correlations
           └──▶ [GENERATE_CHALLENGE_CONTENT] ──▶ worker-facts ──▶ challenge_content

Evergreen ────▶ [GENERATE_EVERGREEN] ──▶ worker-facts ──▶ fact_records

Pipeline A: News Facts (source_type = news_extraction)

Current-events facts derived from real-time news articles.

1a. Ingestion

Raw articles are fetched from news APIs (NewsAPI, GNews, TheNewsAPI) and stored in news_sources.

  • Queue: INGEST_NEWS → consumed by worker-ingest
  • Trigger: cron/ingest-news (intended every 15 minutes)
  • What happens: The cron dispatches one queue message per provider × active root-level topic category (queried with maxDepth: 0 to prevent quota explosion when subcategories exist). The worker fetches articles, deduplicates by URL and content_hash, and inserts into news_sources.

What happens if... the news API is down? The queue message fails, backs off exponentially (5s → 30s → 60s cap, 15% jitter), and retries up to 3 times. After 3 failures it moves to the dead-letter queue (DLQ). No data loss — the next cron cycle dispatches fresh messages.
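
The retry policy above can be sketched in a few lines (a minimal illustration of the stated numbers — 5s → 30s → 60s cap, ±15% jitter, 3 attempts; the function name and shape are not the actual @eko/queue API):

```typescript
// Base delays per attempt, capped at 60s; ±15% jitter; 3 attempts before DLQ.
const BASE_DELAYS_MS = [5_000, 30_000, 60_000];
const MAX_ATTEMPTS = 3;
const JITTER = 0.15;

// Returns the delay before the next retry, or null when the message
// should move to the dead-letter queue.
function backoffDelayMs(attempt: number, rand: () => number = Math.random): number | null {
  if (attempt >= MAX_ATTEMPTS) return null; // exhausted -> DLQ
  const base = BASE_DELAYS_MS[Math.min(attempt, BASE_DELAYS_MS.length - 1)];
  const jitter = base * JITTER * (rand() * 2 - 1); // uniform in ±15% of base
  return Math.round(base + jitter);
}
```

The jitter matters: without it, all messages dispatched by the same cron cycle would retry in lockstep and hammer a recovering provider.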

1b. Clustering

Recent articles are grouped into stories using TF-IDF cosine similarity.

  • Queue: CLUSTER_STORIES → consumed by worker-ingest
  • Trigger: cron/cluster-sweep (intended hourly)
  • What happens: Unclustered articles older than 1 hour are batched and clustered. Similar articles get grouped into a single story_clusters row, which becomes the input for fact extraction.
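
The core similarity measure can be sketched as follows (a simplified illustration: plain term-frequency vectors, whereas the real worker also applies IDF weighting, batching, and a similarity cutoff):

```typescript
// Tokenize into lowercase word counts (term frequencies).
function termFreq(text: string): Map<string, number> {
  const tf = new Map<string, number>();
  for (const tok of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
    tf.set(tok, (tf.get(tok) ?? 0) + 1);
  }
  return tf;
}

// Cosine similarity between two sparse term vectors: 1 = identical, 0 = disjoint.
function cosineSimilarity(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0, na = 0, nb = 0;
  for (const [term, wa] of a) {
    na += wa * wa;
    const wb = b.get(term);
    if (wb) dot += wa * wb;
  }
  for (const wb of b.values()) nb += wb * wb;
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}
```

Articles whose pairwise similarity clears a threshold end up in the same story_clusters row.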

1c. Extraction

AI extracts structured facts from story clusters.

  • Queue: EXTRACT_FACTS → consumed by worker-facts
  • Trigger: Automatic after clustering
  • What happens: The AI model (routed by the model router via ai_model_tier_config — see Section 4) reads the clustered articles (capped at 5 sources × 1,500 chars each to control prompt tokens) and produces structured fact records. Each fact has a title, key-value facts (validated against fact_record_schemas.fact_keys), a notability score, narrative context (Hook→Story→Connection, 4-8 sentences), and a theatrical challenge title. Facts are inserted with source_type = 'news_extraction' and linked to their source story. Per-model ModelAdapters inject prompt optimizations to exploit model strengths and mitigate weaknesses.
  • Category Resolution: Topic categories are resolved via resolveTopicCategory() — a 3-step alias fallback (exact slug match → provider-specific alias in topic_category_aliases → universal alias). Unresolved slugs are logged to unmapped_category_log for audit.
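
The 3-step fallback can be sketched like this (data shapes are illustrative; the real implementation queries topic_categories and topic_category_aliases):

```typescript
interface AliasRow { alias: string; provider: string | null; slug: string }

function resolveTopicCategory(
  raw: string,
  provider: string,
  knownSlugs: Set<string>,
  aliases: AliasRow[],
  logUnmapped: (slug: string) => void,
): string | null {
  // Step 1: exact slug match
  if (knownSlugs.has(raw)) return raw;
  // Step 2: provider-specific alias
  const byProvider = aliases.find((a) => a.alias === raw && a.provider === provider);
  if (byProvider) return byProvider.slug;
  // Step 3: universal alias (provider = null)
  const universal = aliases.find((a) => a.alias === raw && a.provider === null);
  if (universal) return universal.slug;
  logUnmapped(raw); // audit trail -> unmapped_category_log
  return null;
}
```

The audit log in the final branch is what makes unmapped slugs visible for later alias additions instead of silently dropping them.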

Pipeline B: Evergreen Facts (source_type = ai_generated)

Timeless knowledge facts not tied to current events — the kind of content that stays accurate and interesting indefinitely. Examples: "The speed of light is 299,792,458 m/s", "The Eiffel Tower was originally intended to be temporary." Evergreen facts are a co-equal content pillar alongside news facts; they ensure the feed always has high-quality content even when news cycles are slow.

1d. Evergreen Generation

AI generates structured facts for a given topic category, deduplicating against existing fact titles (the title list passed to the prompt is capped at 50 to control token growth).

  • Queue: GENERATE_EVERGREEN → consumed by worker-facts
  • Trigger: cron/generate-evergreen (intended daily at 3AM UTC)
  • Model tier: mid (higher quality than default, because long-lived content quality matters more)
  • What happens:
    1. The cron dispatches one message per active topic category with a count (default 20/day controlled by EVERGREEN_DAILY_QUOTA)
    2. The handler fetches existing fact titles for the topic to prevent duplicates
    3. AI generates structured fact records using the topic's schema keys, taxonomy content rules, taxonomy voice, and domain vocabulary
    4. Each generated fact is inserted with source_type = 'ai_generated', status = 'pending_validation'
    5. Each fact is immediately enqueued for VALIDATE_FACT with the multi_phase strategy (same rigor as news facts)
  • Controls: EVERGREEN_ENABLED (master switch, default false), EVERGREEN_DAILY_QUOTA (max facts/day, default 20)
  • Cost tracking: Total AI cost is split evenly across generated records and stored per-record in generation_cost_usd
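
The even cost split in the last bullet can be sketched as follows (an illustrative helper, assuming the split is done in code before writing generation_cost_usd; rounding to micro-dollars keeps the per-record values summing back to the total):

```typescript
// Split one AI call's cost evenly across the records it produced.
function splitCostEvenly(totalUsd: number, recordCount: number): number[] {
  if (recordCount <= 0) return [];
  const microTotal = Math.round(totalUsd * 1e6); // work in integer micro-dollars
  const base = Math.floor(microTotal / recordCount);
  const remainder = microTotal - base * recordCount;
  // Distribute the leftover micro-dollars one each to the first records.
  return Array.from({ length: recordCount }, (_, i) =>
    (base + (i < remainder ? 1 : 0)) / 1e6,
  );
}
```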

What happens if... evergreen generation is disabled? The feed still works — it draws from existing validated facts, news-derived content, and spaced repetition reviews. Evergreen is additive, not required.

Pipeline C: Seed Facts (source_types = file_seed, spinoff_discovery, ai_super_fact)

New topic categories start empty. The seed pipeline bootstraps them with high-quality AI-generated content from curated entries.

1e. Seed Explosion

Curated seed entries are "exploded" into many structured facts.

  • Queue: EXPLODE_CATEGORY_ENTRY → consumed by worker-facts
  • Trigger: Manual via seed scripts (on-demand)
  • What happens: Each curated entry (e.g., a notable person, event, or concept) is expanded into 10-100 structured facts with theatrical titles and rich narrative context. Primary facts get source_type = 'file_seed'; discovered tangential facts get source_type = 'spinoff_discovery'. All are enqueued for validation via IMPORT_FACTS. The seed pipeline receives full taxonomy context — content rules, voice, and domain vocabulary — for domain-aware generation.

1f. Super Fact Discovery

AI finds cross-entry correlations — facts that connect multiple seed entries. Entry summaries are populated with actual fact titles (3 per entry) so the model compares real signal.

  • Queue: FIND_SUPER_FACTS → consumed by worker-facts
  • Trigger: After seed explosion completes for a batch
  • What happens: The AI compares facts across entries to find meaningful connections (e.g., "Both X and Y studied at the same university"). Super facts are inserted with source_type = 'ai_super_fact' and linked to their parent entries via super_fact_links.

Shared Pipeline: Validation → Image → Challenge → Feed

All three pipelines converge here. Every fact — regardless of source — goes through the same quality gates.

1g. Validation

Every fact goes through multi-phase verification before it reaches the feed.

  • Queue: VALIDATE_FACT → consumed by worker-validate
  • Trigger: Automatic after extraction/generation, plus cron/validation-retry every 4 hours for stuck facts
  • What happens: Four validation phases run in sequence:
| Phase | Name | What It Does | Cost |
|---|---|---|---|
| 1 | Structural | Schema conformance, type validation, injection detection | $0 (code-only) |
| 2 | Consistency | Internal contradictions, taxonomy rule violations | $0 (code-only) |
| 3 | Cross-Model | AI adversarial verification via Gemini 2.5 Flash with recency-aware severity calibration | ~$0.001 |
| 4 | Evidence | External API corroboration (Wikipedia, Wikidata) + AI reasoner (Gemini 2.5 Flash) | ~$0.002-0.005 |

Phases 1-2 are free code-only checks that catch ~40% of defective facts before any AI call is made. Phase 3 uses severity calibration (defaults to info for simplifications, warning for material errors) and recency-aware rules for news articles (unverifiable-due-to-recency = info, lower pass threshold 0.35 vs 0.50). Phase 4 uses multi-strategy Wikipedia entity extraction (possessives, quoted names, proper nouns, topic path hints, MediaWiki search fallback) achieving ~85% lookup success.

  • Validation strategy varies by source: News and AI-generated facts use multi_phase. API imports use authoritative_api. Manual entries use curated_database.
  • Contamination detection: isResponseContaminated() performs entity-name sanity checks with automatic retry on cross-model and evidence phases.
  • Graduated penalties: likely_inaccurate gets -0.15 confidence penalty (not hard fail); schema_mismatch warnings are filtered from evidence escalation triggers.
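
Two of the calibration rules above can be sketched in isolation (a minimal illustration of the stated numbers — 0.35 vs 0.50 pass threshold, -0.15 penalty; function names are not the actual worker-validate API):

```typescript
const DEFAULT_PASS_THRESHOLD = 0.5;
const RECENT_NEWS_PASS_THRESHOLD = 0.35;

// Fresh news is hard to corroborate externally, so the bar is lower.
function passThreshold(sourceType: string, isRecentNews: boolean): number {
  return sourceType === "news_extraction" && isRecentNews
    ? RECENT_NEWS_PASS_THRESHOLD
    : DEFAULT_PASS_THRESHOLD;
}

// likely_inaccurate docks confidence by 0.15 instead of hard-failing the fact.
function applyGraduatedPenalty(confidence: number, findings: string[]): number {
  return findings.includes("likely_inaccurate")
    ? Math.max(0, confidence - 0.15)
    : confidence;
}

function crossModelPasses(confidence: number, sourceType: string, isRecentNews: boolean): boolean {
  return confidence >= passThreshold(sourceType, isRecentNews);
}
```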

What happens if... validation fails? The fact stays in pending status and never reaches the public feed. The validation-retry cron re-enqueues stuck facts every 4 hours. After 3 failed attempts the message goes to DLQ for manual inspection.

1h. Image Resolution

Facts get images resolved through a priority cascade of free APIs.

  • Queue: RESOLVE_IMAGE → consumed by worker-ingest
  • Trigger: Automatic after extraction/generation
  • What happens: The worker searches through a priority cascade:
| Priority | Source | Coverage | Cost |
|---|---|---|---|
| 1 | Wikipedia PageImages | ~80% of named entities | Free, no key |
| 2 | TheSportsDB | Sports teams, athletes | Free key |
| 3 | Unsplash | Topical photos (landscapes, abstract) | Free key |
| 4 | Pexels | Topical photos (alternative pool) | Free key |
| 5 | null | UI shows placeholder | N/A |
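
The cascade reduces to a simple first-hit-wins loop. In this sketch the lookups are synchronous stand-ins for what are really async HTTP calls to the providers above; names and shapes are illustrative:

```typescript
type ImageLookup = (query: string) => string | null;

// Walk the cascade in priority order; a miss or a provider error
// falls through to the next source.
function resolveImage(query: string, cascade: ImageLookup[]): string | null {
  for (const lookup of cascade) {
    try {
      const url = lookup(query);
      if (url) return url; // first hit wins
    } catch {
      // a failing provider just falls through to the next priority
    }
  }
  return null; // priority 5: null -> UI shows placeholder
}
```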

1i. Challenge Generation

Pre-computed challenge content is generated for each validated fact. Challenge generation is triggered after validation passes (not before), avoiding wasted AI cost on rejected facts.

  • Queue: GENERATE_CHALLENGE_CONTENT → consumed by worker-facts
  • Trigger: Automatic after validation passes (enqueued from validate-fact.ts)
  • What happens: AI generates challenge content for 6 quiz styles, each at up to 5 difficulty levels, using a layered voice system:

Voice Stack (injected in order):

  1. CHALLENGE_VOICE_CONSTITUTION — Universal Eko voice (playful, curious, wonder-driven)
  2. TAXONOMY_VOICE — Per-domain emotional register (e.g., sports = energetic, history = contemplative) for 33+ taxonomies
  3. STYLE_VOICE — Per-format interaction mechanics (gallery guide, dinner companion, co-author, etc.)
  4. STYLE_RULES — Tease-and-hint architecture with ANCHOR→ESCALATION→WITHHOLD arcs

Challenge Styles:

  • fill_the_gap — Sentence with masked answer
  • direct_question — "What is the capital of France?"
  • statement_blank — "_____ is the capital of France"
  • multiple_choice — Question with 4 options
  • reverse_lookup — Given the answer, identify the subject
  • free_text — Open-ended question with AI grading

Each piece of content includes: setup_text, challenge_text, correct_answer, reveal_correct, reveal_wrong, and typed style_data.
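
An illustrative typing of one content row's payload, based on the field list above (style_data varies per style and is typed per style in the real schema; this example and its wording are hypothetical):

```typescript
type ChallengeStyle =
  | "fill_the_gap" | "direct_question" | "statement_blank"
  | "multiple_choice" | "reverse_lookup" | "free_text";

interface ChallengeContent {
  style: ChallengeStyle;
  difficulty: 1 | 2 | 3 | 4 | 5;
  setup_text: string;
  challenge_text: string;
  correct_answer: string;
  reveal_correct: string;
  reveal_wrong: string;
  style_data: Record<string, unknown>; // typed per style in the real schema
}

// A hypothetical multiple_choice payload:
const example: ChallengeContent = {
  style: "multiple_choice",
  difficulty: 2,
  setup_text: "Paris has worn many crowns over the centuries...",
  challenge_text: "Which city is the capital of France?",
  correct_answer: "Paris",
  reveal_correct: "You saw straight through it.",
  reveal_wrong: "Close, but the crown belongs to Paris.",
  style_data: { options: ["Paris", "Lyon", "Marseille", "Nice"] },
};
```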

Per-challenge titles: Each challenge gets its own theatrical title (moved from fact-level), preventing answer-leak bugs.

Quality enforcement:

  • Drift coordinator system (5 pluggable coordinators: structure, schema, voice, taxonomy, difficulty) detects semantic drift in generated content
  • CQ-002 validation ensures second-person address in reveals
  • Banned pattern detection blocks "trivia", "quiz", "easy one" in reveals
  • Post-generation patchers: patchPassiveVoice() (36 patterns), patchGenericReveals() (21 semantic families, ~75% coverage), patchTextbookRegister(), patchPunctuationSpacing()
  • ModelAdapter-specific guardrails (e.g., Gemini 2.5 Flash has 40+ banned reveal openings, factual accuracy guardrails v6)

1j. Feed Display

Validated facts from all pipelines appear in the user's feed with a blended algorithm.

  • No queue — served by the /api/feed API endpoint
  • Blend: 40% recent validated facts, 30% facts due for spaced repetition review, 20% evergreen facts, 10% random exploration
  • Gating: The feed itself is public. Full card detail and interactions (quiz, recall, challenges) require a Free or Eko+ subscription.
  • Source-agnostic: The feed algorithm doesn't distinguish between news, evergreen, and seed facts. Once validated, they're all equal.
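
The 40/30/20/10 blend can be sketched as bucket counts per page (counts only — the real endpoint also handles dedup, ordering, and shortfall when a bucket is underfilled; the function name is illustrative):

```typescript
const BLEND = { recent: 0.4, review: 0.3, evergreen: 0.2, explore: 0.1 } as const;

// How many cards each bucket contributes to a page of the feed.
function blendCounts(pageSize: number): Record<keyof typeof BLEND, number> {
  const counts = {
    recent: Math.round(pageSize * BLEND.recent),
    review: Math.round(pageSize * BLEND.review),
    evergreen: Math.round(pageSize * BLEND.evergreen),
    explore: Math.round(pageSize * BLEND.explore),
  };
  // Give any rounding remainder (positive or negative) to the largest bucket.
  const total = counts.recent + counts.review + counts.evergreen + counts.explore;
  counts.recent += pageSize - total;
  return counts;
}
```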

2. The Architecture — Apps, Packages, and How They Connect

Think of it like... packages are LEGO bricks; apps are assembled kits. Each package does one thing well, and apps compose them into user-facing products.

Apps

| App | URL | Purpose |
|---|---|---|
| apps/web | app.eko.day | Authenticated app — feed, card detail, challenges, account |
| apps/admin | admin.eko.day | Admin dashboard — content moderation, queue monitoring, billing |
| apps/public | eko.day | Public marketing site — home, pricing, features, about |
| apps/worker-ingest | | Queue consumer: news ingestion, story clustering, image resolution |
| apps/worker-facts | | Queue consumer: fact extraction, evergreen, challenges, seeding |
| apps/worker-validate | | Queue consumer: multi-phase fact validation |

Deprecated workers (stubs only): apps/worker-reel-render, apps/worker-sms

Packages

┌─────────────────────────────────────────────────────┐
│                    apps layer                        │
│   web    admin    public    worker-*                  │
└────┬───────┬────────┬─────────┬─────────────────────┘
     │       │        │         │
     ▼       ▼        ▼         ▼
┌─────────────────────────────────────────────────────┐
│                 packages layer                       │
│                                                      │
│  shared ◄── schemas, types, utilities                │
│  config ◄── env vars, model registry, TS data files  │
│  db     ◄── Supabase client, Drizzle ORM, queries   │
│  ai     ◄── extraction, validation, model router,    │
│             model adapters, challenge content,        │
│             drift coordinators, taxonomy voice        │
│  queue  ◄── Upstash Redis queue client               │
│  email  ◄── Resend email templates                   │
│  stripe ◄── billing integration                      │
│  r2     ◄── Cloudflare R2 object storage             │
│  observability ◄── structured logging                │
│  ui     ◄── shadcn/ui components (authenticated)     │
│  ui-public ◄── public site components                │
│  reel-schemas ◄── video schema definitions           │
└─────────────────────────────────────────────────────┘

Queue System

Backend: Upstash Redis (REST API). Max 3 attempts before dead-letter queue. Exponential backoff with jitter (5s → 30s → 60s cap, 15% jitter).

| Queue Type | Consumer | Status | Trigger |
|---|---|---|---|
| INGEST_NEWS | worker-ingest | active | cron (every 15m) |
| CLUSTER_STORIES | worker-ingest | active | cron (hourly) |
| RESOLVE_IMAGE | worker-ingest | active | post-extraction |
| EXTRACT_FACTS | worker-facts | active | post-clustering |
| IMPORT_FACTS | worker-facts | stub | cron (not active) |
| GENERATE_EVERGREEN | worker-facts | active | cron (daily) |
| EXPLODE_CATEGORY_ENTRY | worker-facts | active | seed pipeline |
| FIND_SUPER_FACTS | worker-facts | active | seed pipeline |
| GENERATE_CHALLENGE_CONTENT | worker-facts | active | post-validation |
| VALIDATE_FACT | worker-validate | active | post-extraction + cron (4h retry) |
| SEND_SMS | none | deprecated | |

Database

Supabase (Postgres) with Row-Level Security. 135 migrations (0001-0135) across 7 phases.

Key concept tables:

| Table | Purpose |
|---|---|
| topic_categories | Hierarchical topic taxonomy (33+ root categories, 76+ subcategories) with alias resolution |
| topic_category_aliases | Maps external news API slugs to internal categories (3-step fallback) |
| fact_record_schemas | Per-topic Zod-validated key definitions (fact_keys) — 33+ domain-specific schemas |
| fact_records | The atomic unit — one verified fact with structured key-value data |
| stories | Clustered news articles that facts are extracted from |
| news_sources | Raw articles fetched from news APIs |
| card_interactions | User engagement (views, answers, bookmarks, shares) with continuous 0.0-1.0 scoring |
| fact_challenge_content | Pre-generated AI challenge text per style, difficulty, and target_fact_key |
| challenge_formats | 8 named challenge formats (Big Fan Of, Know A Lot About, etc.) |
| challenge_format_styles | Style-to-format junction (which styles belong to which formats) |
| challenge_format_topics | Topic-to-format eligibility junction |
| challenge_sessions | Multi-turn conversational AI challenge state |
| user_subscriptions | Free/Eko+ subscription status |
| ai_cost_log | Per-call AI spend tracking for budget enforcement |
| ai_cost_tracking | Daily cost aggregation by provider, model, and feature |
| ai_model_tier_config | DB-driven model tier routing (changeable via SQL, no restart) |
| score_disputes | AI-judged score disputes with decision types |
| reward_milestones / user_reward_claims | Engagement rewards (100/500/1000/2000 points → free Eko+ days) |
| seed_entry_queue | Priority-ordered seed entry consumption |
| super_fact_links | Cross-entry correlation junction for super facts |
| unmapped_category_log | Audit log for unresolved news API category slugs |

3. Cascading Effects — What Breaks What

Eko has seven critical dependency chains. Understanding these prevents accidental breakage.

3a. Shared Schemas

packages/shared/src/schemas.ts
  ├── packages/queue (message validation)
  ├── packages/ai (extraction output schemas)
  ├── packages/db (query type safety)
  ├── apps/web (API request/response validation)
  ├── apps/admin (content display types)
  └── apps/worker-* (message parsing)

What: The @eko/shared package exports Zod schemas for every queue message type, every domain entity, and every API contract. It is imported by every other package and app.

Why it matters: A breaking change to a Zod schema (renaming a field, changing a type) cascades to every consumer. If INGEST_NEWS payload shape changes, packages/queue fails to validate, worker-ingest fails to parse, and the entire ingestion pipeline stops.

Safe change pattern: Add optional fields (non-breaking). For required field changes, update all consumers in the same PR and run bun run typecheck across the monorepo.

What happens if... a schema field is removed? Detection: bun run typecheck fails immediately across dependent packages. Recovery: Revert the change or update all consumers before merging.
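
The "add optional fields" pattern can be shown with a hand-rolled parser (the real schemas use Zod; the message shape here is illustrative, not the actual INGEST_NEWS payload):

```typescript
interface IngestNewsV1 { provider: string; categorySlug: string }
// Adding an optional field is non-breaking: old producers omit it,
// old consumers ignore it.
interface IngestNewsV2 extends IngestNewsV1 { maxArticles?: number }

function parseIngestNews(raw: unknown): IngestNewsV2 | null {
  if (typeof raw !== "object" || raw === null) return null;
  const msg = raw as Record<string, unknown>;
  if (typeof msg.provider !== "string" || typeof msg.categorySlug !== "string") return null;
  // Optional field: absent is fine, but a wrong type still fails validation.
  if (msg.maxArticles !== undefined && typeof msg.maxArticles !== "number") return null;
  return msg as unknown as IngestNewsV2;
}
```

Renaming or removing a required field, by contrast, makes old messages unparseable — which is why those changes need all consumers updated in the same PR.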

3b. Queue System

packages/queue/src/index.ts
  ├── apps/worker-ingest (3 queue types)
  ├── apps/worker-facts (6 queue types)
  ├── apps/worker-validate (1 queue type)
  └── apps/web/app/api/cron/* (enqueue messages)

What: The queue package provides enqueue(), dequeue(), ack(), and nack() operations backed by Upstash Redis. Every cron and every worker depends on it.

Why it matters: If the queue system is misconfigured (bad Redis URL, schema mismatch), all async processing stops. Facts pile up unprocessed, validation stalls, and the feed goes stale.

Safe change pattern: Test queue changes with SOAK_QUEUE_SUFFIX for isolated testing before deploying to production queues.
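
One plausible shape for that isolation is suffix-based key naming (a sketch under the assumption that SOAK_QUEUE_SUFFIX is appended to the Redis key; the function name is illustrative):

```typescript
// With a suffix set (e.g. read from SOAK_QUEUE_SUFFIX), messages land on an
// isolated key that a soak-test worker can consume without touching the
// production queue of the same type.
function queueKey(queueType: string, soakSuffix?: string): string {
  return soakSuffix ? `${queueType}:${soakSuffix}` : queueType;
}
```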

3c. Topic Taxonomy

topic_categories (database)
  + fact_record_schemas (database)
  + topic_category_aliases (database)
    ├── apps/worker-facts (extraction schema selection + alias resolution)
    ├── apps/worker-validate (validation context)
    ├── apps/worker-facts (challenge generation + taxonomy voice)
    ├── apps/web/app/api/feed (category filtering)
    └── apps/web/components (category chips, filters)

What: The topic_categories table defines the hierarchical taxonomy (sports → basketball → NBA). Each category links to a fact_record_schemas row that defines what structured fields a fact in that category must have. topic_category_aliases maps external news API slugs to internal categories. Subcategory schemas auto-inherit from parent via trg_inherit_parent_schema trigger.

Why it matters: Adding a new topic category requires a migration, schema definition, and propagation to challenge formats. Removing or renaming a category breaks extraction for that topic. The taxonomy voice layer and content rules depend on slug-based lookups.

3d. AI Model Router + Adapters

packages/ai/src/model-router.ts
  + packages/ai/src/models/registry.ts
  + packages/ai/src/models/adapters/*.ts (12 adapters)
    ├── Fact extraction (worker-facts)
    ├── Fact validation (worker-validate)
    ├── Evergreen generation (worker-facts)
    ├── Challenge content generation (worker-facts)
    ├── Seed explosion (worker-facts)
    └── Conversational challenges (apps/web API)

What: The model router selects which AI model handles each call based on a three-tier system: default (92% of calls — cost-efficient), mid (5% — higher quality), high (1% — top-tier reasoning). Tier-to-model mapping is database-driven via ai_model_tier_config with 60-second caching. Each model has a ModelAdapter that injects per-model prompt optimizations (suffix/prefix/override modes) to exploit strengths and mitigate weaknesses.

6 AI Providers:

| Provider | Models | Status |
|---|---|---|
| OpenAI | gpt-5-mini, gpt-5-nano, gpt-4o-mini | Active |
| Anthropic | claude-haiku-4-5, claude-opus-4-6 | Active |
| Google | gemini-2.5-flash, gemini-2.0-flash-lite, gemini-3-flash-preview | Active |
| xAI | grok-4-1-fast-reasoning, grok-4-1-fast-non-reasoning, grok-4 | Active |
| Mistral | mistral-large-latest, mistral-medium-latest, mistral-small-latest | Active (no adapter) |
| DeepSeek | | Removed |

Why it matters: If the configured model's API key is missing or the provider is down, AI operations fall back to the default tier. Budget caps ($5/day Anthropic, $3/day Google) provide cost protection with graceful degradation.

3e. Validation Pipeline

worker-validate
  └── fact_records.status = 'validated'
        └── GENERATE_CHALLENGE_CONTENT (post-validation trigger)
              └── fact_challenge_content
                    └── /api/feed (only shows validated facts with challenges)
                          └── User's feed

What: The validation pipeline is the gate between extraction and the user's feed. Only facts with status = 'validated' appear in the feed. Challenge content is generated after validation passes.

Why it matters: If worker-validate is down or all validations fail, new facts accumulate in pending status. The feed doesn't break — it just stops showing new content. Existing validated facts continue to display.

3f. Subscription Gating

Stripe webhooks → user_subscriptions
  └── /api/cards/[slug] (subscription check)
        └── Card detail access

What: Stripe webhook events update user_subscriptions. The card detail API checks subscription status before returning gated content. 14-day trial with CC collection via Stripe Checkout.

3g. Challenge Content Voice Stack

CHALLENGE_VOICE_CONSTITUTION (universal)
  + TAXONOMY_VOICE (per-domain, 33+ categories)
    + STYLE_VOICE (per-format, 6 voices)
      + STYLE_RULES (tease-and-hint architecture)
        + ModelAdapter (per-model prompt optimization)
          + Drift coordinators (5 pluggable quality checks)
            └── Generated challenge content

What: Challenge content quality depends on a layered voice stack that builds from universal principles down to model-specific adaptations. The drift coordinator system (structure, schema, voice, taxonomy, difficulty) detects semantic drift in generated content.

Why it matters: Removing or modifying a voice layer without updating downstream layers creates quality drift. The taxonomy voice layer depends on slug-based lookups — any slug changes must propagate. ModelAdapter eligibility is tracked via JSONL with a 97% structural / 90% subjective threshold system (tiered eligibility).

Summary Matrix

| System | Blast Radius | If It Breaks... |
|---|---|---|
| Shared Schemas | Total | Every package fails to compile |
| Queue System | Total | All async processing stops |
| Topic Taxonomy | High | New facts can't be extracted for affected topics |
| AI Model Router + Adapters | High | All AI falls to default tier or fails entirely |
| Validation Pipeline | Medium | Facts queue up but don't reach feed |
| Subscription Gating | Medium | Paying users can't access card details |
| Challenge Voice Stack | Medium | Challenge quality degrades silently |

4. The Agents — Who Owns What

Think of it like... a hospital with specialized departments and a chief of staff. Each agent owns a specific domain, and the architect-steward (chief of staff) ensures they all work together without stepping on each other.

Eko uses a system of 17 specialized Claude Code agents (plus 5 deprecated) to prevent scope creep and ensure clear ownership.

Pipeline Agents

These agents own the data flow from news to card:

ingest-engineer
  └── informs → fact-engineer
                  └── informs → validation-engineer
                                  └── informs → card-ux-designer
| Agent | Owns | Key Files |
|---|---|---|
| ingest-engineer | News fetch, clustering, images | apps/worker-ingest/** |
| fact-engineer | AI extraction, evergreen, challenges, model adapters | apps/worker-facts/**, packages/ai/** |
| validation-engineer | Multi-tier verification | apps/worker-validate/** |
| card-ux-designer | Feed, card detail, quiz UI | apps/web/app/feed/**, packages/ui/** |

Cross-Cutting Agents

| Agent | Role |
|---|---|
| architect-steward | Enforces v2 invariants, routes to correct agent |
| security-reviewer | RLS, SSRF prevention, secrets handling |
| ci-quality-gatekeeper | CI stability, linting, typecheck, builds |
| db-migration-operator | Schema changes, migrations, RLS policies |
| queue-sre | Queue health, DLQ monitoring, backoff tuning |
| cron-scheduler | Cron route creation, scheduling, ingestion_runs |
| platform-config-owner | Environment config, runtime settings |
| observability-analyst | Structured logging, trace correlation |
| subscription-manager | Free/Eko+ plans, Stripe, entitlements |
| admin-operator | Admin dashboard, content moderation |
| docs-librarian | Documentation health, link integrity |
| release-manager | Versioning, changelogs, rollback plans |

Quick Lookup: "I need to change X, which agent do I talk to?"

| If you're changing... | Talk to... |
|---|---|
| A news provider adapter | ingest-engineer |
| AI extraction prompts or model adapters | fact-engineer |
| Validation logic | validation-engineer |
| Feed algorithm or card UI | card-ux-designer |
| Database schema | db-migration-operator |
| Queue configuration | queue-sre |
| Cron schedules | cron-scheduler |
| Environment variables | platform-config-owner |
| Stripe/billing | subscription-manager |
| Admin dashboard | admin-operator |
| Challenge voice/taxonomy rules | fact-engineer |
| Not sure? | architect-steward (routes you to the right agent) |

5. Rules, Quality Gates, and CI

Think of it like... a building's fire code — some rules are alarms (CI blocks merge), some are sprinklers (pre-commit hooks catch issues), and some are inspections (advisory reviews).

The 7 Invariants

These are Eko's non-negotiable constraints:

| ID | Invariant | What it means |
|---|---|---|
| INV-001 | Fact-first | Facts are the atomic unit. Everything flows from structured, schema-validated facts. |
| INV-002 | Verification before publication | No fact reaches the public feed without at least one validation tier pass. |
| INV-003 | Source attribution | Every fact traces back to source articles and validation evidence. |
| INV-004 | Schema conformance | Fact output must validate against fact_record_schemas.fact_keys. |
| INV-005 | Cost-bounded AI | All AI calls have model routing, budget caps ($5/day Anthropic, $3/day Google), and cost tracking. |
| INV-006 | Public feed / gated detail | Feed is public; full card detail and interactions require Free/Eko+ subscription. |
| INV-007 | Topic balance | Daily quotas per topic category prevent content monoculture. |

Tradeoff priority: When invariants conflict, prefer: correctness → auditability → safety → cost control.

What CI Checks

The bun run ci pipeline runs these checks in order:

  1. docs:lint:strict — Frontmatter validation on all markdown files
  2. docs:health — Documentation health score (≥95% threshold)
  3. docs:binding-check — Code path references in doc frontmatter exist
  4. prompts:check — All file paths in prompt code blocks exist on disk
  5. agents:routing-check — No file ownership overlaps between agents
  6. rules:check — Rules index is current
  7. scripts:check — Script index is current
  8. migrations:check — Migrations index is current
  9. bible:check — Product bible references are accurate
  10. taxonomy:completeness-check — Voice coverage, content rules, vocabulary depth across all 32 taxonomy slugs
  11. lint — Biome linting across all packages
  12. registry:check — UI component registry is valid
  13. env:check-example — .env.example is complete
  14. env:check-typos — No common env file typos
  15. typecheck — TypeScript type checking
  16. test — Vitest test suite

Pre-Commit Hooks

  • Biome lint + format on staged .ts, .tsx, .js, .jsx, .json, .md files
  • Plan governance — blocks commits that modify status: locked plan files

6. Operations — Crons, Workers, and Environment

Cron Schedule Overview

Scheduled in vercel.json (5):

| Cron | Schedule | Status |
|---|---|---|
| payment-reminders | Daily 9AM UTC | active |
| payment-escalation | Daily 9AM UTC | active |
| account-anniversaries | Daily 9AM UTC | active |
| daily-cost-report | Daily 6AM UTC | active |
| monthly-usage-report | 1st of month | deprecated (stub) |

Not yet scheduled (8) — active code but no vercel.json entry (OPS-004):

| Cron | Intended Schedule | Purpose |
|---|---|---|
| ingest-news | Every 15 min | Primary news pipeline trigger |
| cluster-sweep | Every hour | Cluster unclustered articles |
| generate-evergreen | Daily 3AM UTC | Generate timeless knowledge facts |
| validation-retry | Every 4 hours | Re-enqueue stuck validations |
| archive-content | Daily 2AM UTC | Promote/archive facts by engagement |
| topic-quotas | Daily 6AM UTC | Audit fact counts vs quotas |
| import-facts | Daily 4AM UTC | Stub for structured API imports |
| daily-digest | | Deprecated stub |

What happens if... crons don't fire? Currently all deployments target Vercel preview, and crons only run on production (OPS-005). This means no automated pipeline processing is active until production deployment. Manual triggering works via curl -X POST /api/cron/<name> -H "Authorization: Bearer $CRON_SECRET".

Worker Health

All workers expose a /health endpoint on port 8080, send heartbeats every 30 seconds, and use 2-minute lease durations. Workers implement graceful shutdown via abort signals. WORKER_CONCURRENCY env var controls parallel messages per queue type (default: 1, set 3-5 for seeding).
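
The heartbeat/lease arithmetic implied above can be sketched as follows (constants from the text; the expiry check is an illustrative stand-in for however the real lease store decides liveness):

```typescript
const HEARTBEAT_INTERVAL_MS = 30_000; // workers heartbeat every 30s
const LEASE_DURATION_MS = 120_000;    // 2-minute lease

// A worker that goes silent for longer than its lease is considered dead,
// and its in-flight message becomes eligible for redelivery.
function leaseExpired(lastHeartbeatMs: number, nowMs: number): boolean {
  return nowMs - lastHeartbeatMs > LEASE_DURATION_MS;
}
```

With these numbers, a worker can miss up to 4 consecutive heartbeats before losing its lease.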

Environment Controls

Central config lives in packages/config/src/index.ts. Key controls:

Control                          Default      Purpose
AI_PROVIDER                      anthropic    Primary AI provider
ANTHROPIC_DAILY_SPEND_CAP_USD    $5.00        Daily budget cap — falls back to GPT-4o-mini when exhausted
GOOGLE_DAILY_SPEND_CAP_USD       $3.00        Daily budget cap for Gemini calls
OPUS_ESCALATION_ENABLED          false        Allow routing to Opus for top-1% complex tasks
OPUS_MAX_DAILY_CALLS             20           Hard cap on Opus invocations
EVERGREEN_ENABLED                false        Master switch for evergreen fact generation
EVERGREEN_DAILY_QUOTA            20           Max evergreen facts per day
WORKER_CONCURRENCY               1            Parallel message handlers per queue (set 3-5 for seeding)
NOTABILITY_THRESHOLD             0.6          Minimum score to retain a fact (0.0-1.0)
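
The spend-cap fallback behavior can be sketched like this (hypothetical function and type names — the real logic lives in packages/config and the AI routing layer):

```typescript
type Provider = "anthropic" | "openai-fallback";

// Assumed shape: daily spend is tracked per provider and compared to its cap.
// Once the Anthropic daily cap ($5.00 by default) is exhausted, requests
// fall back to GPT-4o-mini rather than failing outright.
function pickProvider(dailySpendUsd: number, capUsd: number): Provider {
  return dailySpendUsd < capUsd ? "anthropic" : "openai-fallback";
}
```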

The Seeding Pipeline

New topic categories are bootstrapped with content using the seed pipeline (see Pipeline C in Section 1 for the full walkthrough):

  1. Curated entries are generated (AI-assisted) for the topic
  2. EXPLODE_CATEGORY_ENTRY queue messages expand each entry into 10-100 structured facts (file_seed + spinoff_discovery)
  3. FIND_SUPER_FACTS discovers cross-entry correlations (ai_super_fact)
  4. VALIDATE_FACT verifies the generated facts (same pipeline as news and evergreen)
  5. GENERATE_CHALLENGE_CONTENT creates quiz content for each validated fact

The seed pipeline uses higher WORKER_CONCURRENCY (3-5) for throughput.
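
At the queue level, step 2 amounts to fanning one curated topic out into many fact-generation messages. A minimal sketch (the message shape and field names are assumptions for illustration, not the real Zod schema):

```typescript
type ExplodeMessage = {
  type: "EXPLODE_CATEGORY_ENTRY";
  entryId: string;
  categorySlug: string;
};

// One message per curated entry; the worker consuming each message expands
// the entry into 10-100 structured facts (file_seed + spinoff_discovery).
function buildExplodeMessages(categorySlug: string, entryIds: string[]): ExplodeMessage[] {
  return entryIds.map((entryId) => ({
    type: "EXPLODE_CATEGORY_ENTRY",
    entryId,
    categorySlug,
  }));
}
```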


7. Feature Improvements — Where Eko Can Level Up

Pipeline

  1. Schedule the 8 unscheduled crons (OPS-004) — highest priority operational gap. The news pipeline, clustering, evergreen generation, and validation retry all have working code but no scheduler entry.
  2. Deploy to production (OPS-005) — crons only fire on Vercel production deployments. Currently on preview.
  3. Remove deprecated stubs (OPS-001/002/003/006/007) — monthly-usage-report, daily-digest, the Twilio webhook, and the SEND_SMS queue type are all dead code.
  4. Add retry visibility — surface DLQ counts in admin dashboard alerts so operators know when messages are failing.

UX

  1. Offline card access — cache validated facts for offline review.
  2. Progress dashboard — show user's learning streaks, topic coverage, and spaced repetition stats.
  3. Social sharing — share fact cards to social media with OG images.
  4. Onboarding tutorial — guided first-time experience explaining challenges.
  5. Entity browsing ✅ — /entity/[id] detail pages showing all FCGs for an entity (e.g., "Babe Ruth"), linked entities, and an /explore search/browse page. Cross-FCG title leak detection prevents sibling FCG titles from revealing each other's challenge answers.

AI & Content Quality

  1. Feedback loop — use user dispute data to improve extraction prompts.
  2. Multi-language facts — extract facts in multiple languages for broader audience.
  3. Difficulty calibration — use answer accuracy data to auto-calibrate challenge difficulty.
  4. Expand model adapter coverage — Mistral adapters, further Gemini optimization.

Operations

  1. Alerting integration — connect Sentry/PagerDuty for worker failures and budget overruns.
  2. Queue dashboard improvements — real-time processing rates, historical throughput graphs.
  3. Automated canary deploys — deploy to a subset of users before full rollout.

Business

  1. Team plans — enable shared Eko+ accounts for classrooms and organizations.

8. Glossary

Fact record — The atomic unit of Eko: a structured, schema-validated piece of knowledge with key-value facts, a title, and a notability score
Entity — The real-world subject driving an FCG: a person, place, event, or concept (e.g., "Babe Ruth", "2008 Global Financial Crisis"). Stored as seed_entry_queue entries, linked to facts via fact_records.seed_entry_id. Entities have detail pages (/entity/[id]) showing all their FCGs.
Topic category — A node in the hierarchical taxonomy (e.g., sports → basketball → NBA). Each has its own fact schema. 33+ root categories, 76+ subcategories.
Schema key — A typed field definition in fact_record_schemas.fact_keys (e.g., player_name: text, career_points: number), domain-specific per category
Notability score — A 0.0-1.0 score indicating how noteworthy a fact is. Below the threshold (default 0.6), facts are discarded.
Challenge format — One of 8 named quiz formats (Big Fan Of, Know A Lot About, Repeat After Me, Good With Dates, Degrees of Separation, Used To Work There, Partial Pictures, Originators)
Challenge style — The UI mechanic for a challenge: fill_the_gap, direct_question, statement_blank, reverse_lookup, free_text, multiple_choice, progressive_image_reveal, or conversational
Challenge title — A theatrical, per-challenge title generated to avoid answer-leak bugs
Spaced repetition — SM-2 variant algorithm scheduling review intervals [4h, 1d, 3d, 7d, 14d, 30d] based on answer streak
DLQ — Dead-letter queue: where messages go after 3 failed processing attempts. Requires manual inspection.
Cron — A scheduled task triggered at fixed intervals by Vercel's cron system (production only)
Queue message — A Zod-validated JSON payload sent via Upstash Redis to trigger async work in a worker
Validation tier — One of four phases: structural → internal_consistency → cross_model → evidence_corroboration
Model tier — AI model quality level: default (cheap/fast), mid (balanced), high (top-tier reasoning)
ModelAdapter — Per-model prompt customization with suffix/prefix/override modes and eligibility tracking (97% structural / 90% subjective threshold)
Drift coordinator — Pluggable quality checker (structure, schema, voice, taxonomy, difficulty) that detects semantic drift in AI-generated challenge content
Taxonomy voice — Per-domain emotional register injected between universal voice and per-format voice in challenge generation
Taxonomy content rules — Per-domain formatting and factual conventions injected into extraction and challenge prompts
Domain vocabulary — Per-category expert terms and phrases auto-generated via the taxonomy CLI
Evergreen fact — A co-equal content pillar alongside news facts: timeless knowledge not tied to current events (source_type ai_generated). Generated daily via the GENERATE_EVERGREEN queue at mid model tier for quality.
News fact — A fact derived from clustered news articles (source_type news_extraction). Tied to current events and extracted via the ingestion → clustering → extraction pipeline.
Seed fact — A fact generated during topic bootstrapping (source_type file_seed or spinoff_discovery). Created by "exploding" curated entries into structured facts.
Super fact — A cross-entry correlation discovered by comparing facts across multiple seed entries (source_type ai_super_fact)
Source type — The origin of a fact record: news_extraction, ai_generated, file_seed, spinoff_discovery, ai_super_fact, or api_import
Story cluster — A group of news articles about the same event, clustered by TF-IDF cosine similarity
Seed pipeline — The process of bootstrapping a new topic category with AI-generated facts from curated entries
Blast radius — How many systems are affected when a component breaks
Worker — A Bun-based background process that consumes queue messages and processes them
Alias resolution — 3-step category lookup: exact slug → provider-specific alias → universal alias
Contamination detection — Entity-name sanity check on AI validation responses to prevent cross-model response mixing
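
The spaced-repetition entry above implies a simple streak-to-interval mapping. A sketch under stated assumptions (the real SM-2 variant may also weight answer quality; the reset-on-miss behavior is assumed):

```typescript
// Review intervals from the glossary: [4h, 1d, 3d, 7d, 14d, 30d], in hours.
const INTERVALS_HOURS = [4, 24, 72, 168, 336, 720];

// The answer streak indexes into the schedule; long streaks stay at the
// maximum interval, and (assumed) a wrong answer resets the streak to 0.
function nextReviewHours(streak: number): number {
  const i = Math.min(Math.max(streak, 0), INTERVALS_HOURS.length - 1);
  return INTERVALS_HOURS[i];
}
```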

References