Eko Product Bible
The "read this first" document for anyone joining the Eko team — engineers, designers, marketers, and business stakeholders.
Eko is a knowledge platform that builds verified, structured fact cards from multiple sources — breaking news, AI-generated evergreen knowledge, and curated seed content. Users learn through interactive challenges — quizzes, recall exercises, and conversational AI sessions — powered by spaced repetition. Think of it like a factory with three assembly lines: one processes raw news, one generates timeless knowledge, and one bootstraps new topic areas — all producing the same high-quality, verified knowledge cards.
The core loop: sources → facts → validation → cards → learning.
Three primary content pipelines feed Eko:
| Pipeline | Source Type | What it produces | Trigger |
|---|---|---|---|
| News | news_extraction | Facts derived from clustered news articles | Cron-driven (every 15 min) |
| Evergreen | ai_generated | Timeless knowledge facts not tied to current events | Cron-driven (daily) |
| Seed | file_seed, spinoff_discovery, ai_super_fact | Bootstrapped facts for new topic categories | Manual / on-demand |
All three pipelines converge at the same point: every fact goes through validation, image resolution, and challenge generation before reaching the feed.
Who uses Eko and why:
| Audience | What they get |
|---|---|
| End users | A daily feed of verified knowledge cards with quizzes, recall, and AI challenges |
| Content team | Seeding tools to bootstrap new topic categories with high-quality facts |
| Engineers | A well-structured pipeline with clear ownership, CI enforcement, and specialized agents |
| Business | Subscription-gated detail pages (Free tier = feed; Eko+ = full card detail and interactions) |
1. The Pipeline — How a Fact Is Born
Facts enter Eko through three pipelines — news, evergreen, and seed — but all converge into the same validation → image → challenge → feed path. Here is the full picture.
News APIs ─┐
├──▶ [INGEST_NEWS] ──▶ worker-ingest ──▶ news_sources table
│ │
│ ┌─────────────────────┘
│ ▼
│ [CLUSTER_STORIES] ──▶ worker-ingest ──▶ story_clusters
│ │
│ ┌─────────────────────┘
│ ▼
│ [EXTRACT_FACTS] ──▶ worker-facts ──▶ fact_records
│ │
│ ┌───────────┬───────────────┘
│ ▼ ▼
│ [VALIDATE_FACT] [RESOLVE_IMAGE]
│ │ │
│ ▼ ▼
│ worker-validate worker-ingest
│ │ │
│ ▼ ▼
│ fact verified image cached
│ │
│ ▼
│ [GENERATE_CHALLENGE_CONTENT] ──▶ worker-facts
│ │
│ ▼
│ fact_challenge_content (6 styles × 5 difficulties)
│
Seed Data ─┤
├──▶ [EXPLODE_CATEGORY_ENTRY] ──▶ worker-facts ──▶ fact_records
├──▶ [FIND_SUPER_FACTS] ──▶ worker-facts ──▶ cross-correlations
└──▶ [GENERATE_CHALLENGE_CONTENT] ──▶ worker-facts ──▶ challenge_content
Evergreen ────▶ [GENERATE_EVERGREEN] ──▶ worker-facts ──▶ fact_records
Pipeline A: News Facts (source_type = news_extraction)
Current-events facts derived from real-time news articles.
1a. Ingestion
Raw articles are fetched from news APIs (NewsAPI, GNews, TheNewsAPI) and stored in news_sources.
- Queue: `INGEST_NEWS` → consumed by `worker-ingest`
- Trigger: `cron/ingest-news` (intended every 15 minutes)
- What happens: The cron dispatches one queue message per provider × active root-level topic category (queried with `maxDepth: 0` to prevent quota explosion when subcategories exist). The worker fetches articles, deduplicates by URL and `content_hash`, and inserts into `news_sources`.
What happens if... the news API is down? The queue message fails, backs off exponentially (5s → 30s → 60s cap, 15% jitter), and retries up to 3 times. After 3 failures it moves to the dead-letter queue (DLQ). No data loss — the next cron cycle dispatches fresh messages.
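The retry policy above (3 attempts, exponential backoff with jitter, then DLQ) can be sketched as follows; function and constant names are illustrative, not the actual queue package API:

```typescript
// Sketch of the retry policy: 5s → 30s → 60s cap with ±15% jitter, DLQ after 3 attempts.
const BASE_DELAYS_MS = [5_000, 30_000, 60_000];
const MAX_ATTEMPTS = 3;
const JITTER = 0.15;

function nextBackoffMs(attempt: number, rand: () => number = Math.random): number {
  // attempt is 1-based; past the end of the table we stay at the 60s cap
  const base = BASE_DELAYS_MS[Math.min(attempt, BASE_DELAYS_MS.length) - 1];
  const jitterFactor = 1 + JITTER * (rand() * 2 - 1); // uniform in [-15%, +15%]
  return Math.round(base * jitterFactor);
}

function shouldDeadLetter(attempt: number): boolean {
  return attempt >= MAX_ATTEMPTS;
}
```

Because the cron dispatches fresh messages each cycle, a dead-lettered message never blocks the pipeline; it only parks the one failed payload for inspection.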
1b. Clustering
Recent articles are grouped into stories using TF-IDF cosine similarity.
- Queue: `CLUSTER_STORIES` → consumed by `worker-ingest`
- Trigger: `cron/cluster-sweep` (intended hourly)
- What happens: Unclustered articles older than 1 hour are batched and clustered. Similar articles get grouped into a single `story_clusters` row, which becomes the input for fact extraction.
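A minimal sketch of the similarity check behind clustering: term-frequency vectors compared with cosine similarity. The production clusterer is described as TF-IDF; the IDF weighting step is omitted here for brevity, so this is an illustration of the comparison, not the real implementation.

```typescript
// Build a term-frequency vector from article text
function termFreq(text: string): Map<string, number> {
  const tf = new Map<string, number>();
  for (const tok of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
    tf.set(tok, (tf.get(tok) ?? 0) + 1);
  }
  return tf;
}

// Cosine similarity between two sparse term vectors (1.0 = identical direction)
function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0, na = 0, nb = 0;
  for (const [term, wa] of a) {
    na += wa * wa;
    const wb = b.get(term);
    if (wb) dot += wa * wb;
  }
  for (const wb of b.values()) nb += wb * wb;
  return na && nb ? dot / (Math.sqrt(na) * Math.sqrt(nb)) : 0;
}
```

Articles whose pairwise similarity clears a threshold end up in the same `story_clusters` row.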
1c. Extraction
AI extracts structured facts from story clusters.
- Queue: `EXTRACT_FACTS` → consumed by `worker-facts`
- Trigger: Automatic after clustering
- What happens: The AI model (routed by the model router via `ai_model_tier_config` — see Section 4) reads the clustered articles (capped at 5 sources × 1,500 chars each to control prompt tokens) and produces structured fact records. Each fact has a title, key-value facts (validated against `fact_record_schemas.fact_keys`), a notability score, narrative context (Hook → Story → Connection, 4-8 sentences), and a theatrical challenge title. Facts are inserted with `source_type = 'news_extraction'` and linked to their source story. Per-model ModelAdapters inject prompt optimizations to exploit model strengths and mitigate weaknesses.
- Category Resolution: Topic categories are resolved via `resolveTopicCategory()` — a 3-step alias fallback (exact slug match → provider-specific alias in `topic_category_aliases` → universal alias). Unresolved slugs are logged to `unmapped_category_log` for audit.
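The 3-step fallback reduces to three ordered lookups. This sketch uses in-memory structures with assumed shapes for illustration; production resolves against the `topic_category_aliases` table.

```typescript
type AliasRow = { provider: string | null; alias: string; categorySlug: string };

function resolveTopicCategory(
  slug: string,
  provider: string,
  categories: Set<string>,
  aliases: AliasRow[],
): string | null {
  // Step 1: exact slug match against internal categories
  if (categories.has(slug)) return slug;
  // Step 2: provider-specific alias
  const providerHit = aliases.find(a => a.provider === provider && a.alias === slug);
  if (providerHit) return providerHit.categorySlug;
  // Step 3: universal alias (provider = null)
  const universalHit = aliases.find(a => a.provider === null && a.alias === slug);
  if (universalHit) return universalHit.categorySlug;
  // Unresolved — production logs this to unmapped_category_log
  return null;
}
```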
Pipeline B: Evergreen Facts (source_type = ai_generated)
Timeless knowledge facts not tied to current events — the kind of content that stays accurate and interesting indefinitely. Examples: "The speed of light is 299,792,458 m/s", "The Eiffel Tower was originally intended to be temporary." Evergreen facts are a co-equal content pillar alongside news facts; they ensure the feed always has high-quality content even when news cycles are slow.
1d. Evergreen Generation
AI generates structured facts for a given topic category, deduplicated against existing titles (capped at 50 to control prompt token growth).
- Queue: `GENERATE_EVERGREEN` → consumed by `worker-facts`
- Trigger: `cron/generate-evergreen` (intended daily at 3AM UTC)
- Model tier: `mid` (higher quality than default, because long-lived content quality matters more)
- What happens:
  - The cron dispatches one message per active topic category with a count (default 20/day, controlled by `EVERGREEN_DAILY_QUOTA`)
  - The handler fetches existing fact titles for the topic to prevent duplicates
  - AI generates structured fact records using the topic's schema keys, taxonomy content rules, taxonomy voice, and domain vocabulary
  - Each generated fact is inserted with `source_type = 'ai_generated'`, `status = 'pending_validation'`
  - Each fact is immediately enqueued for `VALIDATE_FACT` with the `multi_phase` strategy (same rigor as news facts)
- Controls: `EVERGREEN_ENABLED` (master switch, default false), `EVERGREEN_DAILY_QUOTA` (max facts/day, default 20)
- Cost tracking: Total AI cost is split evenly across generated records and stored per-record in `generation_cost_usd`
What happens if... evergreen generation is disabled? The feed still works — it draws from existing validated facts, news-derived content, and spaced repetition reviews. Evergreen is additive, not required.
Pipeline C: Seed Facts (source_types = file_seed, spinoff_discovery, ai_super_fact)
New topic categories start empty. The seed pipeline bootstraps them with high-quality AI-generated content from curated entries.
1e. Seed Explosion
Curated seed entries are "exploded" into many structured facts.
- Queue: `EXPLODE_CATEGORY_ENTRY` → consumed by `worker-facts`
- Trigger: Manual via seed scripts (on-demand)
- What happens: Each curated entry (e.g., a notable person, event, or concept) is expanded into 10-100 structured facts with theatrical titles and rich narrative context. Primary facts get `source_type = 'file_seed'`; discovered tangential facts get `source_type = 'spinoff_discovery'`. All are enqueued for validation via `IMPORT_FACTS`. The seed pipeline receives full taxonomy context — content rules, voice, and domain vocabulary — for domain-aware generation.
1f. Super Fact Discovery
AI finds cross-entry correlations — facts that connect multiple seed entries. Entry summaries are populated with actual fact titles (3 per entry) for real signal.
- Queue: `FIND_SUPER_FACTS` → consumed by `worker-facts`
- Trigger: After seed explosion completes for a batch
- What happens: The AI compares facts across entries to find meaningful connections (e.g., "Both X and Y studied at the same university"). Super facts are inserted with `source_type = 'ai_super_fact'` and linked to their parent entries via `super_fact_links`.
Shared Pipeline: Validation → Image → Challenge → Feed
All three pipelines converge here. Every fact — regardless of source — goes through the same quality gates.
1g. Validation
Every fact goes through multi-phase verification before it reaches the feed.
- Queue: `VALIDATE_FACT` → consumed by `worker-validate`
- Trigger: Automatic after extraction/generation, plus `cron/validation-retry` every 4 hours for stuck facts
- What happens: Four validation phases run in sequence:
| Phase | Name | What It Does | Cost |
|---|---|---|---|
| 1 | Structural | Schema conformance, type validation, injection detection | $0 (code-only) |
| 2 | Consistency | Internal contradictions, taxonomy rule violations | $0 (code-only) |
| 3 | Cross-Model | AI adversarial verification via Gemini 2.5 Flash with recency-aware severity calibration | ~$0.001 |
| 4 | Evidence | External API corroboration (Wikipedia, Wikidata) + AI reasoner (Gemini 2.5 Flash) | ~$0.002-0.005 |
Phases 1-2 are free code-only checks that catch ~40% of defective facts before any AI call is made. Phase 3 uses severity calibration (defaults to info for simplifications, warning for material errors) and recency-aware rules for news articles (unverifiable-due-to-recency = info, lower pass threshold 0.35 vs 0.50). Phase 4 uses multi-strategy Wikipedia entity extraction (possessives, quoted names, proper nouns, topic path hints, MediaWiki search fallback) achieving ~85% lookup success.
- Validation strategy varies by source: News and AI-generated facts use `multi_phase`. API imports use `authoritative_api`. Manual entries use `curated_database`.
- Contamination detection: `isResponseContaminated()` performs entity-name sanity checks with automatic retry on cross-model and evidence phases.
- Graduated penalties: `likely_inaccurate` gets a -0.15 confidence penalty (not a hard fail); `schema_mismatch` warnings are filtered from evidence escalation triggers.
What happens if... validation fails? The fact stays in `pending` status and never reaches the public feed. The validation-retry cron re-enqueues stuck facts every 4 hours. After 3 failed attempts the message goes to the DLQ for manual inspection.
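The graduated-penalty and recency-aware threshold rules combine roughly like this; names and shapes are assumptions for illustration, not the actual validation module:

```typescript
type Finding = { code: string; severity: "info" | "warning" | "error" };

// Graduated penalty: likely_inaccurate reduces confidence instead of hard-failing
function adjustedConfidence(base: number, findings: Finding[]): number {
  let conf = base;
  for (const f of findings) {
    if (f.code === "likely_inaccurate") conf -= 0.15;
  }
  return Math.max(0, conf);
}

// Recency-aware: very recent news is harder to corroborate, so the bar is lower
function passThreshold(isRecentNews: boolean): number {
  return isRecentNews ? 0.35 : 0.5;
}

function passes(base: number, findings: Finding[], isRecentNews: boolean): boolean {
  return adjustedConfidence(base, findings) >= passThreshold(isRecentNews);
}
```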
1h. Image Resolution
Facts get images resolved through a priority cascade of free APIs.
- Queue: `RESOLVE_IMAGE` → consumed by `worker-ingest`
- Trigger: Automatic after extraction/generation
- What happens: The worker searches through a priority cascade:
| Priority | Source | Coverage | Cost |
|---|---|---|---|
| 1 | Wikipedia PageImages | ~80% of named entities | Free, no key |
| 2 | TheSportsDB | Sports teams, athletes | Free key |
| 3 | Unsplash | Topical photos (landscapes, abstract) | Free key |
| 4 | Pexels | Topical photos (alternative pool) | Free key |
| 5 | null | UI shows placeholder | N/A |
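The cascade reduces to "first provider that returns a URL wins". A minimal sketch with stand-in provider functions (the real worker wraps actual API clients):

```typescript
type ImageProvider = (query: string) => Promise<string | null>;

// Try each provider in priority order; a failure or miss falls through to the next.
// Returning null is priority 5: the UI shows a placeholder.
async function resolveImage(query: string, cascade: ImageProvider[]): Promise<string | null> {
  for (const provider of cascade) {
    try {
      const url = await provider(query);
      if (url) return url;
    } catch {
      // A failing provider shouldn't break the cascade
    }
  }
  return null;
}
```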
1i. Challenge Generation
Pre-computed challenge content is generated for each validated fact. Challenge generation is triggered after validation passes (not before), avoiding wasted AI cost on rejected facts.
- Queue: `GENERATE_CHALLENGE_CONTENT` → consumed by `worker-facts`
- Trigger: Automatic after validation passes (enqueued from `validate-fact.ts`)
- What happens: AI generates challenge content for 6 quiz styles, each at up to 5 difficulty levels, using a layered voice system:
Voice Stack (injected in order):
- `CHALLENGE_VOICE_CONSTITUTION` — Universal Eko voice (playful, curious, wonder-driven)
- `TAXONOMY_VOICE` — Per-domain emotional register (e.g., sports = energetic, history = contemplative) for 33+ taxonomies
- `STYLE_VOICE` — Per-format interaction mechanics (gallery guide, dinner companion, co-author, etc.)
- `STYLE_RULES` — Tease-and-hint architecture with ANCHOR → ESCALATION → WITHHOLD arcs
Challenge Styles:
- `fill_the_gap` — Sentence with masked answer
- `direct_question` — "What is the capital of France?"
- `statement_blank` — "_____ is the capital of France"
- `multiple_choice` — Question with 4 options
- `reverse_lookup` — Given the answer, identify the subject
- `free_text` — Open-ended question with AI grading
Each piece of content includes: `setup_text`, `challenge_text`, `correct_answer`, `reveal_correct`, `reveal_wrong`, and typed `style_data`.
Per-challenge titles: Each challenge gets its own theatrical title (moved from fact-level), preventing answer-leak bugs.
Quality enforcement:
- Drift coordinator system (5 pluggable coordinators: structure, schema, voice, taxonomy, difficulty) detects semantic drift in generated content
- CQ-002 validation ensures second-person address in reveals
- Banned pattern detection blocks "trivia", "quiz", "easy one" in reveals
- Post-generation patchers: `patchPassiveVoice()` (36 patterns), `patchGenericReveals()` (21 semantic families, ~75% coverage), `patchTextbookRegister()`, `patchPunctuationSpacing()`
- ModelAdapter-specific guardrails (e.g., Gemini 2.5 Flash has 40+ banned reveal openings, factual accuracy guardrails v6)
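Two of the cheapest gates above (banned reveal patterns and the CQ-002 second-person check) reduce to simple pattern tests. This sketch uses only the patterns cited in this section; the production rule set is larger and lives in `packages/ai`:

```typescript
// Only the three banned terms named above; the real list is longer.
const BANNED_REVEAL_PATTERNS = [/\btrivia\b/i, /\bquiz\b/i, /\beasy one\b/i];

function violatesRevealRules(reveal: string): boolean {
  return BANNED_REVEAL_PATTERNS.some(p => p.test(reveal));
}

// CQ-002-style check: reveals should address the reader in the second person
function usesSecondPerson(reveal: string): boolean {
  return /\byou(r|'re)?\b/i.test(reveal);
}
```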
1j. Feed Display
Validated facts from all pipelines appear in the user's feed with a blended algorithm.
- No queue — served by the `/api/feed` API endpoint
- Blend: 40% recent validated facts, 30% facts due for spaced repetition review, 20% evergreen facts, 10% random exploration
- Gating: The feed itself is public. Full card detail and interactions (quiz, recall, challenges) require a Free or Eko+ subscription.
- Source-agnostic: The feed algorithm doesn't distinguish between news, evergreen, and seed facts. Once validated, they're all equal.
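The 40/30/20/10 blend can be sketched as a per-page allocation. Bucket names are assumptions, and the real endpoint presumably backfills from other buckets when one runs short:

```typescript
const FEED_BLEND = { recent: 0.4, review: 0.3, evergreen: 0.2, explore: 0.1 } as const;

function feedAllocation(pageSize: number): Record<keyof typeof FEED_BLEND, number> {
  const counts = {
    recent: Math.round(pageSize * FEED_BLEND.recent),
    review: Math.round(pageSize * FEED_BLEND.review),
    evergreen: Math.round(pageSize * FEED_BLEND.evergreen),
    explore: Math.round(pageSize * FEED_BLEND.explore),
  };
  // Give any rounding remainder to the largest bucket so totals match the page size
  const total = counts.recent + counts.review + counts.evergreen + counts.explore;
  counts.recent += pageSize - total;
  return counts;
}
```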
2. The Architecture — Apps, Packages, and How They Connect
Think of it like... packages are LEGO bricks; apps are assembled kits. Each package does one thing well, and apps compose them into user-facing products.
Apps
| App | URL | Purpose |
|---|---|---|
| apps/web | app.eko.day | Authenticated app — feed, card detail, challenges, account |
| apps/admin | admin.eko.day | Admin dashboard — content moderation, queue monitoring, billing |
| apps/public | eko.day | Public marketing site — home, pricing, features, about |
| apps/worker-ingest | — | Queue consumer: news ingestion, story clustering, image resolution |
| apps/worker-facts | — | Queue consumer: fact extraction, evergreen, challenges, seeding |
| apps/worker-validate | — | Queue consumer: multi-phase fact validation |
Deprecated workers (stubs only): apps/worker-reel-render, apps/worker-sms
Packages
┌─────────────────────────────────────────────────────┐
│ apps layer │
│ web admin public worker-* │
└────┬───────┬────────┬─────────┬─────────────────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌─────────────────────────────────────────────────────┐
│ packages layer │
│ │
│ shared ◄── schemas, types, utilities │
│ config ◄── env vars, model registry, TS data files │
│ db ◄── Supabase client, Drizzle ORM, queries │
│ ai ◄── extraction, validation, model router, │
│ model adapters, challenge content, │
│ drift coordinators, taxonomy voice │
│ queue ◄── Upstash Redis queue client │
│ email ◄── Resend email templates │
│ stripe ◄── billing integration │
│ r2 ◄── Cloudflare R2 object storage │
│ observability ◄── structured logging │
│ ui ◄── shadcn/ui components (authenticated) │
│ ui-public ◄── public site components │
│ reel-schemas ◄── video schema definitions │
└─────────────────────────────────────────────────────┘
Queue System
Backend: Upstash Redis (REST API). Max 3 attempts before dead-letter queue. Exponential backoff with jitter (5s → 30s → 60s cap, 15% jitter).
| Queue Type | Consumer | Status | Trigger |
|---|---|---|---|
| INGEST_NEWS | worker-ingest | active | cron (every 15m) |
| CLUSTER_STORIES | worker-ingest | active | cron (hourly) |
| RESOLVE_IMAGE | worker-ingest | active | post-extraction |
| EXTRACT_FACTS | worker-facts | active | post-clustering |
| IMPORT_FACTS | worker-facts | stub | cron (not active) |
| GENERATE_EVERGREEN | worker-facts | active | cron (daily) |
| EXPLODE_CATEGORY_ENTRY | worker-facts | active | seed pipeline |
| FIND_SUPER_FACTS | worker-facts | active | seed pipeline |
| GENERATE_CHALLENGE_CONTENT | worker-facts | active | post-validation |
| VALIDATE_FACT | worker-validate | active | post-extraction + cron (4h retry) |
| SEND_SMS | none | deprecated | — |
Database
Supabase (Postgres) with Row-Level Security. 135 migrations (0001-0135) across 7 phases.
Key concept tables:
| Table | Purpose |
|---|---|
| topic_categories | Hierarchical topic taxonomy (33+ root categories, 76+ subcategories) with alias resolution |
| topic_category_aliases | Maps external news API slugs to internal categories (3-step fallback) |
| fact_record_schemas | Per-topic Zod-validated key definitions (fact_keys) — 33+ domain-specific schemas |
| fact_records | The atomic unit — one verified fact with structured key-value data |
| stories | Clustered news articles that facts are extracted from |
| news_sources | Raw articles fetched from news APIs |
| card_interactions | User engagement (views, answers, bookmarks, shares) with continuous 0.0-1.0 scoring |
| fact_challenge_content | Pre-generated AI challenge text per style, difficulty, and target_fact_key |
| challenge_formats | 8 named challenge formats (Big Fan Of, Know A Lot About, etc.) |
| challenge_format_styles | Style-to-format junction (which styles belong to which formats) |
| challenge_format_topics | Topic-to-format eligibility junction |
| challenge_sessions | Multi-turn conversational AI challenge state |
| user_subscriptions | Free/Eko+ subscription status |
| ai_cost_log | Per-call AI spend tracking for budget enforcement |
| ai_cost_tracking | Daily cost aggregation by provider, model, and feature |
| ai_model_tier_config | DB-driven model tier routing (changeable via SQL, no restart) |
| score_disputes | AI-judged score disputes with decision types |
| reward_milestones / user_reward_claims | Engagement rewards (100/500/1000/2000 points → free Eko+ days) |
| seed_entry_queue | Priority-ordered seed entry consumption |
| super_fact_links | Cross-entry correlation junction for super facts |
| unmapped_category_log | Audit log for unresolved news API category slugs |
3. Cascading Effects — What Breaks What
Eko has seven critical dependency chains. Understanding these prevents accidental breakage.
3a. Shared Schemas
packages/shared/src/schemas.ts
├── packages/queue (message validation)
├── packages/ai (extraction output schemas)
├── packages/db (query type safety)
├── apps/web (API request/response validation)
├── apps/admin (content display types)
└── apps/worker-* (message parsing)
What: The @eko/shared package exports Zod schemas for every queue message type, every domain entity, and every API contract. It is imported by every other package and app.
Why it matters: A breaking change to a Zod schema (renaming a field, changing a type) cascades to every consumer. If INGEST_NEWS payload shape changes, packages/queue fails to validate, worker-ingest fails to parse, and the entire ingestion pipeline stops.
Safe change pattern: Add optional fields (non-breaking). For required field changes, update all consumers in the same PR and run bun run typecheck across the monorepo.
What happens if... a schema field is removed? Detection: `bun run typecheck` fails immediately across dependent packages. Recovery: Revert the change or update all consumers before merging.
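The "optional fields are non-breaking" pattern can be sketched in plain TypeScript; the real schemas are Zod, and these message shapes are illustrative, not the actual `@eko/shared` contracts:

```typescript
type IngestNewsV1 = { provider: string; categorySlug: string };
// Non-breaking evolution: the new field is optional, so V1 payloads still type-check
type IngestNewsV2 = IngestNewsV1 & { maxDepth?: number };

function handleIngestNews(msg: IngestNewsV2): string {
  const depth = msg.maxDepth ?? 0; // default preserves old behavior for old producers
  return `${msg.provider}:${msg.categorySlug}:depth=${depth}`;
}
```

Making `maxDepth` required instead would break every producer that hasn't been updated, which is why required-field changes must land with all consumers in the same PR.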
3b. Queue System
packages/queue/src/index.ts
├── apps/worker-ingest (3 queue types)
├── apps/worker-facts (6 queue types)
├── apps/worker-validate (1 queue type)
└── apps/web/app/api/cron/* (enqueue messages)
What: The queue package provides enqueue(), dequeue(), ack(), and nack() operations backed by Upstash Redis. Every cron and every worker depends on it.
Why it matters: If the queue system is misconfigured (bad Redis URL, schema mismatch), all async processing stops. Facts pile up unprocessed, validation stalls, and the feed goes stale.
Safe change pattern: Test queue changes with SOAK_QUEUE_SUFFIX for isolated testing before deploying to production queues.
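Suffix-based isolation reduces to namespacing the Redis keys. The key format here is an assumption for illustration; in practice the suffix would come from `SOAK_QUEUE_SUFFIX`:

```typescript
// An empty suffix yields the production key; a soak suffix yields an isolated key,
// so test traffic never touches production queues.
function queueKey(queueType: string, suffix = ""): string {
  return suffix ? `queue:${queueType}:${suffix}` : `queue:${queueType}`;
}
```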
3c. Topic Taxonomy
topic_categories (database)
+ fact_record_schemas (database)
+ topic_category_aliases (database)
├── apps/worker-facts (extraction schema selection + alias resolution)
├── apps/worker-validate (validation context)
├── apps/worker-facts (challenge generation + taxonomy voice)
├── apps/web/app/api/feed (category filtering)
└── apps/web/components (category chips, filters)
What: The topic_categories table defines the hierarchical taxonomy (sports → basketball → NBA). Each category links to a fact_record_schemas row that defines what structured fields a fact in that category must have. topic_category_aliases maps external news API slugs to internal categories. Subcategory schemas auto-inherit from parent via trg_inherit_parent_schema trigger.
Why it matters: Adding a new topic category requires a migration, schema definition, and propagation to challenge formats. Removing or renaming a category breaks extraction for that topic. The taxonomy voice layer and content rules depend on slug-based lookups.
3d. AI Model Router + Adapters
packages/ai/src/model-router.ts
+ packages/ai/src/models/registry.ts
+ packages/ai/src/models/adapters/*.ts (12 adapters)
├── Fact extraction (worker-facts)
├── Fact validation (worker-validate)
├── Evergreen generation (worker-facts)
├── Challenge content generation (worker-facts)
├── Seed explosion (worker-facts)
└── Conversational challenges (apps/web API)
What: The model router selects which AI model handles each call based on a three-tier system: default (92% of calls — cost-efficient), mid (5% — higher quality), high (1% — top-tier reasoning). Tier-to-model mapping is database-driven via ai_model_tier_config with 60-second caching. Each model has a ModelAdapter that injects per-model prompt optimizations (suffix/prefix/override modes) to exploit strengths and mitigate weaknesses.
6 AI Providers:
| Provider | Models | Status |
|---|---|---|
| OpenAI | gpt-5-mini, gpt-5-nano, gpt-4o-mini | Active |
| Anthropic | claude-haiku-4-5, claude-opus-4-6 | Active |
| Google | gemini-2.5-flash, gemini-2.0-flash-lite, gemini-3-flash-preview | Active |
| xAI | grok-4-1-fast-reasoning, grok-4-1-fast-non-reasoning, grok-4 | Active |
| Mistral | mistral-large-latest, mistral-medium-latest, mistral-small-latest | Active (no adapter) |
| DeepSeek | — | Removed |
Why it matters: If the configured model's API key is missing or the provider is down, AI operations fall back to the default tier. Budget caps ($5/day Anthropic, $3/day Google) provide cost protection with graceful degradation.
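Cap-based degradation is essentially a ledger check before each call. A sketch with assumed shapes; the caps ($5/day Anthropic, $3/day Google) and the GPT-4o-mini fallback are the documented values:

```typescript
type ModelChoice = { provider: string; model: string };
type SpendLedger = Record<string, number>; // provider → USD spent today

function pickModel(
  preferred: ModelChoice,
  fallback: ModelChoice,
  caps: Record<string, number>,
  spend: SpendLedger,
): string {
  const cap = caps[preferred.provider];
  if (cap !== undefined && (spend[preferred.provider] ?? 0) >= cap) {
    return fallback.model; // cap exhausted — degrade gracefully instead of failing
  }
  return preferred.model;
}
```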
3e. Validation Pipeline
worker-validate
└── fact_records.status = 'validated'
└── GENERATE_CHALLENGE_CONTENT (post-validation trigger)
└── fact_challenge_content
└── /api/feed (only shows validated facts with challenges)
└── User's feed
What: The validation pipeline is the gate between extraction and the user's feed. Only facts with status = 'validated' appear in the feed. Challenge content is generated after validation passes.
Why it matters: If worker-validate is down or all validations fail, new facts accumulate in pending status. The feed doesn't break — it just stops showing new content. Existing validated facts continue to display.
3f. Subscription Gating
Stripe webhooks → user_subscriptions
└── /api/cards/[slug] (subscription check)
└── Card detail access
What: Stripe webhook events update user_subscriptions. The card detail API checks subscription status before returning gated content. 14-day trial with CC collection via Stripe Checkout.
3g. Challenge Content Voice Stack
CHALLENGE_VOICE_CONSTITUTION (universal)
+ TAXONOMY_VOICE (per-domain, 33+ categories)
+ STYLE_VOICE (per-format, 6 voices)
+ STYLE_RULES (tease-and-hint architecture)
+ ModelAdapter (per-model prompt optimization)
+ Drift coordinators (5 pluggable quality checks)
└── Generated challenge content
What: Challenge content quality depends on a layered voice stack that builds from universal principles down to model-specific adaptations. The drift coordinator system (structure, schema, voice, taxonomy, difficulty) detects semantic drift in generated content.
Why it matters: Removing or modifying a voice layer without updating downstream layers creates quality drift. The taxonomy voice layer depends on slug-based lookups — any slug changes must propagate. ModelAdapter eligibility is tracked via JSONL with a 97% structural / 90% subjective threshold system (tiered eligibility).
Summary Matrix
| System | Blast Radius | If It Breaks... |
|---|---|---|
| Shared Schemas | Total | Every package fails to compile |
| Queue System | Total | All async processing stops |
| Topic Taxonomy | High | New facts can't be extracted for affected topics |
| AI Model Router + Adapters | High | All AI falls to default tier or fails entirely |
| Validation Pipeline | Medium | Facts queue up but don't reach feed |
| Subscription Gating | Medium | Paying users can't access card details |
| Challenge Voice Stack | Medium | Challenge quality degrades silently |
4. The Agents — Who Owns What
Think of it like... a hospital with specialized departments and a chief of staff. Each agent owns a specific domain, and the architect-steward (chief of staff) ensures they all work together without stepping on each other.
Eko uses a system of 17 specialized Claude Code agents (plus 5 deprecated) to prevent scope creep and ensure clear ownership.
Pipeline Agents
These agents own the data flow from news to card:
ingest-engineer
└── informs → fact-engineer
└── informs → validation-engineer
└── informs → card-ux-designer
| Agent | Owns | Key Files |
|---|---|---|
| ingest-engineer | News fetch, clustering, images | apps/worker-ingest/** |
| fact-engineer | AI extraction, evergreen, challenges, model adapters | apps/worker-facts/**, packages/ai/** |
| validation-engineer | Multi-tier verification | apps/worker-validate/** |
| card-ux-designer | Feed, card detail, quiz UI | apps/web/app/feed/**, packages/ui/** |
Cross-Cutting Agents
| Agent | Role |
|---|---|
| architect-steward | Enforces v2 invariants, routes to correct agent |
| security-reviewer | RLS, SSRF prevention, secrets handling |
| ci-quality-gatekeeper | CI stability, linting, typecheck, builds |
| db-migration-operator | Schema changes, migrations, RLS policies |
| queue-sre | Queue health, DLQ monitoring, backoff tuning |
| cron-scheduler | Cron route creation, scheduling, ingestion_runs |
| platform-config-owner | Environment config, runtime settings |
| observability-analyst | Structured logging, trace correlation |
| subscription-manager | Free/Eko+ plans, Stripe, entitlements |
| admin-operator | Admin dashboard, content moderation |
| docs-librarian | Documentation health, link integrity |
| release-manager | Versioning, changelogs, rollback plans |
Quick Lookup: "I need to change X, which agent do I talk to?"
| If you're changing... | Talk to... |
|---|---|
| A news provider adapter | ingest-engineer |
| AI extraction prompts or model adapters | fact-engineer |
| Validation logic | validation-engineer |
| Feed algorithm or card UI | card-ux-designer |
| Database schema | db-migration-operator |
| Queue configuration | queue-sre |
| Cron schedules | cron-scheduler |
| Environment variables | platform-config-owner |
| Stripe/billing | subscription-manager |
| Admin dashboard | admin-operator |
| Challenge voice/taxonomy rules | fact-engineer |
| Not sure? | architect-steward (routes you to the right agent) |
5. Rules, Quality Gates, and CI
Think of it like... a building's fire code — some rules are alarms (CI blocks merge), some are sprinklers (pre-commit hooks catch issues), and some are inspections (advisory reviews).
The 7 Invariants
These are Eko's non-negotiable constraints:
| ID | Invariant | What it means |
|---|---|---|
| INV-001 | Fact-first | Facts are the atomic unit. Everything flows from structured, schema-validated facts. |
| INV-002 | Verification before publication | No fact reaches the public feed without at least one validation tier pass. |
| INV-003 | Source attribution | Every fact traces back to source articles and validation evidence. |
| INV-004 | Schema conformance | Fact output must validate against fact_record_schemas.fact_keys. |
| INV-005 | Cost-bounded AI | All AI calls have model routing, budget caps ($5/day Anthropic, $3/day Google), and cost tracking. |
| INV-006 | Public feed / gated detail | Feed is public; full card detail and interactions require Free/Eko+ subscription. |
| INV-007 | Topic balance | Daily quotas per topic category prevent content monoculture. |
Tradeoff priority: When invariants conflict, prefer: correctness → auditability → safety → cost control.
What CI Checks
The bun run ci pipeline runs these checks in order:
- `docs:lint:strict` — Frontmatter validation on all markdown files
- `docs:health` — Documentation health score (≥95% threshold)
- `docs:binding-check` — Code path references in doc frontmatter exist
- `prompts:check` — All file paths in prompt code blocks exist on disk
- `agents:routing-check` — No file ownership overlaps between agents
- `rules:check` — Rules index is current
- `scripts:check` — Script index is current
- `migrations:check` — Migrations index is current
- `bible:check` — Product bible references are accurate
- `taxonomy:completeness-check` — Voice coverage, content rules, vocabulary depth across all 32 taxonomy slugs
- `lint` — Biome linting across all packages
- `registry:check` — UI component registry is valid
- `env:check-example` — `.env.example` is complete
- `env:check-typos` — No common env file typos
- `typecheck` — TypeScript type checking
- `test` — Vitest test suite
Pre-Commit Hooks
- Biome lint + format on staged `.ts`, `.tsx`, `.js`, `.jsx`, `.json`, `.md` files
- Plan governance — blocks commits that modify `status: locked` plan files
6. Operations — Crons, Workers, and Environment
Cron Schedule Overview
Scheduled in vercel.json (5):
| Cron | Schedule | Status |
|---|---|---|
| payment-reminders | Daily 9AM UTC | active |
| payment-escalation | Daily 9AM UTC | active |
| account-anniversaries | Daily 9AM UTC | active |
| daily-cost-report | Daily 6AM UTC | active |
| monthly-usage-report | 1st of month | deprecated (stub) |
Not yet scheduled (8) — active code but no vercel.json entry (OPS-004):
| Cron | Intended Schedule | Purpose |
|---|---|---|
| ingest-news | Every 15 min | Primary news pipeline trigger |
| cluster-sweep | Every hour | Cluster unclustered articles |
| generate-evergreen | Daily 3AM UTC | Generate timeless knowledge facts |
| validation-retry | Every 4 hours | Re-enqueue stuck validations |
| archive-content | Daily 2AM UTC | Promote/archive facts by engagement |
| topic-quotas | Daily 6AM UTC | Audit fact counts vs quotas |
| import-facts | Daily 4AM UTC | Stub for structured API imports |
| daily-digest | — | Deprecated stub |
What happens if... crons don't fire? Currently all deployments target Vercel preview, and crons only run on production (OPS-005). This means no automated pipeline processing is active until production deployment. Manual triggering works via `curl -X POST /api/cron/<name> -H "Authorization: Bearer $CRON_SECRET"`.
Worker Health
All workers expose a `/health` endpoint on port 8080, send heartbeats every 30 seconds, and use 2-minute lease durations. Workers implement graceful shutdown via abort signals. The `WORKER_CONCURRENCY` env var controls parallel messages per queue type (default: 1; set 3-5 for seeding).
Environment Controls
Central config lives in packages/config/src/index.ts. Key controls:
| Control | Default | Purpose |
|---|---|---|
| `AI_PROVIDER` | `anthropic` | Primary AI provider |
| `ANTHROPIC_DAILY_SPEND_CAP_USD` | $5.00 | Daily budget cap — falls back to GPT-4o-mini when exhausted |
| `GOOGLE_DAILY_SPEND_CAP_USD` | $3.00 | Daily budget cap for Gemini calls |
| `OPUS_ESCALATION_ENABLED` | `false` | Allow routing to Opus for top-1% complex tasks |
| `OPUS_MAX_DAILY_CALLS` | 20 | Hard cap on Opus invocations |
| `EVERGREEN_ENABLED` | `false` | Master switch for evergreen fact generation |
| `EVERGREEN_DAILY_QUOTA` | 20 | Max evergreen facts per day |
| `WORKER_CONCURRENCY` | 1 | Parallel message handlers per queue (set 3-5 for seeding) |
| `NOTABILITY_THRESHOLD` | 0.6 | Minimum score to retain a fact (0.0-1.0) |
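The pattern behind these controls can be sketched as below. This is an illustration of the env-var-with-documented-default approach, not the actual contents of `packages/config/src/index.ts`; the helper names are assumptions.

```typescript
// Illustrative central-config sketch (helper names are assumptions, not the
// real packages/config source). Each control reads an env var and falls back
// to the documented default from the table above.
function envNumber(name: string, fallback: number): number {
  const raw = process.env[name];
  const parsed = raw === undefined ? NaN : Number(raw);
  return Number.isFinite(parsed) ? parsed : fallback;
}

function envBool(name: string, fallback: boolean): boolean {
  const raw = process.env[name];
  return raw === undefined ? fallback : raw === "true" || raw === "1";
}

const config = {
  aiProvider: process.env.AI_PROVIDER ?? "anthropic",
  anthropicDailySpendCapUsd: envNumber("ANTHROPIC_DAILY_SPEND_CAP_USD", 5.0),
  googleDailySpendCapUsd: envNumber("GOOGLE_DAILY_SPEND_CAP_USD", 3.0),
  opusEscalationEnabled: envBool("OPUS_ESCALATION_ENABLED", false),
  opusMaxDailyCalls: envNumber("OPUS_MAX_DAILY_CALLS", 20),
  evergreenEnabled: envBool("EVERGREEN_ENABLED", false),
  evergreenDailyQuota: envNumber("EVERGREEN_DAILY_QUOTA", 20),
  workerConcurrency: envNumber("WORKER_CONCURRENCY", 1),
  notabilityThreshold: envNumber("NOTABILITY_THRESHOLD", 0.6),
};
```

The real module presumably exports this object so workers and API routes share one source of truth for defaults.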
The Seeding Pipeline
New topic categories are bootstrapped with content using the seed pipeline (see Pipeline C in Section 1 for the full walkthrough):
- Curated entries are generated (AI-assisted) for the topic
- `EXPLODE_CATEGORY_ENTRY` queue messages expand each entry into 10-100 structured facts (`file_seed` + `spinoff_discovery`)
- `FIND_SUPER_FACTS` discovers cross-entry correlations (`ai_super_fact`)
- `VALIDATE_FACT` verifies the generated facts (same pipeline as news and evergreen)
- `GENERATE_CHALLENGE_CONTENT` creates quiz content for each validated fact
The seed pipeline uses higher WORKER_CONCURRENCY (3-5) for throughput.
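Per the glossary, queue messages are Zod-validated JSON payloads. A sketch of what an `EXPLODE_CATEGORY_ENTRY` payload and its validation might look like; the field names are assumptions (not the real schema), and a hand-rolled check stands in for Zod to keep the sketch dependency-free.

```typescript
// Illustrative seed-pipeline queue message (field names are assumptions).
// The real pipeline validates payloads with Zod; this hand-rolled check
// just demonstrates the validate-before-process shape.
type ExplodeCategoryEntryMessage = {
  type: "EXPLODE_CATEGORY_ENTRY";
  seedEntryId: string;
  topicCategory: string;
  maxFacts: number; // each entry expands into 10-100 structured facts
};

function parseExplodeMessage(payload: unknown): ExplodeCategoryEntryMessage {
  const msg = payload as Partial<ExplodeCategoryEntryMessage>;
  if (
    msg?.type !== "EXPLODE_CATEGORY_ENTRY" ||
    typeof msg.seedEntryId !== "string" ||
    typeof msg.topicCategory !== "string" ||
    typeof msg.maxFacts !== "number" ||
    msg.maxFacts < 10 ||
    msg.maxFacts > 100
  ) {
    throw new Error("invalid EXPLODE_CATEGORY_ENTRY payload");
  }
  return msg as ExplodeCategoryEntryMessage;
}
```

A worker would call the parser first and let invalid messages fail fast toward the DLQ rather than processing a malformed payload.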
7. Feature Improvements — Where Eko Can Level Up
Pipeline
- Schedule the 8 unscheduled crons (OPS-004) — highest priority operational gap. The news pipeline, clustering, evergreen generation, and validation retry all have working code but no scheduler entry.
- Deploy to production (OPS-005) — crons only fire on Vercel production deployments. Currently on preview.
- Remove deprecated stubs (OPS-001/002/003/006/007) — monthly-usage-report, daily-digest, twilio webhook, and SEND_SMS queue type are all dead code.
- Add retry visibility — surface DLQ counts in admin dashboard alerts so operators know when messages are failing.
UX
- Offline card access — cache validated facts for offline review.
- Progress dashboard — show user's learning streaks, topic coverage, and spaced repetition stats.
- Social sharing — share fact cards to social media with OG images.
- Onboarding tutorial — guided first-time experience explaining challenges.
- Entity browsing ✅ — `/entity/[id]` detail pages showing all FCGs for an entity (e.g., "Babe Ruth"), linked entities, and an `/explore` search/browse page. Cross-FCG title leak detection prevents sibling FCG titles from revealing each other's challenge answers.
AI & Content Quality
- Feedback loop — use user dispute data to improve extraction prompts.
- Multi-language facts — extract facts in multiple languages for broader audience.
- Difficulty calibration — use answer accuracy data to auto-calibrate challenge difficulty.
- Expand model adapter coverage — Mistral adapters, further Gemini optimization.
Operations
- Alerting integration — connect Sentry/PagerDuty for worker failures and budget overruns.
- Queue dashboard improvements — real-time processing rates, historical throughput graphs.
- Automated canary deploys — deploy to a subset of users before full rollout.
Business
- Team plans — enable shared Eko+ accounts for classrooms and organizations.
8. Glossary
| Term | Definition |
|---|---|
| Fact record | The atomic unit of Eko — a structured, schema-validated piece of knowledge with key-value facts, a title, and a notability score |
| Entity | The real-world subject driving an FCG — a person, place, event, or concept (e.g., "Babe Ruth", "2008 Global Financial Crisis"). Stored as seed_entry_queue entries, linked to facts via fact_records.seed_entry_id. Entities have detail pages (/entity/[id]) showing all their FCGs. |
| Topic category | A node in the hierarchical taxonomy (e.g., sports → basketball → NBA). Each has its own fact schema. 33+ root categories, 76+ subcategories. |
| Schema key | A typed field definition in fact_record_schemas.fact_keys (e.g., player_name: text, career_points: number) — domain-specific per category |
| Notability score | A 0.0-1.0 score indicating how noteworthy a fact is. Below the threshold (default 0.6), facts are discarded. |
| Challenge format | One of 8 named quiz formats (Big Fan Of, Know A Lot About, Repeat After Me, Good With Dates, Degrees of Separation, Used To Work There, Partial Pictures, Originators) |
| Challenge style | The UI mechanic for a challenge: fill_the_gap, direct_question, statement_blank, reverse_lookup, free_text, multiple_choice, progressive_image_reveal, or conversational |
| Challenge title | A theatrical, per-challenge title generated to avoid answer-leak bugs |
| Spaced repetition | SM-2 variant algorithm scheduling review intervals [4h, 1d, 3d, 7d, 14d, 30d] based on answer streak |
| DLQ | Dead-letter queue — where messages go after 3 failed processing attempts. Requires manual inspection. |
| Cron | A scheduled task triggered at fixed intervals by Vercel's cron system (production only) |
| Queue message | A Zod-validated JSON payload sent via Upstash Redis to trigger async work in a worker |
| Validation tier | One of four phases: structural → internal_consistency → cross_model → evidence_corroboration |
| Model tier | AI model quality level: default (cheap/fast), mid (balanced), high (top-tier reasoning) |
| ModelAdapter | Per-model prompt customization with suffix/prefix/override modes and eligibility tracking (97% structural / 90% subjective threshold) |
| Drift coordinator | Pluggable quality checker (structure, schema, voice, taxonomy, difficulty) that detects semantic drift in AI-generated challenge content |
| Taxonomy voice | Per-domain emotional register injected between universal voice and per-format voice in challenge generation |
| Taxonomy content rules | Per-domain formatting and factual conventions injected into extraction and challenge prompts |
| Domain vocabulary | Per-category expert terms and phrases auto-generated via taxonomy CLI |
| Evergreen fact | A co-equal content pillar alongside news facts — timeless knowledge not tied to current events (source_type ai_generated). Generated daily via the GENERATE_EVERGREEN queue at mid model tier for quality. |
| News fact | A fact derived from clustered news articles (source_type news_extraction). Tied to current events and extracted via the ingestion → clustering → extraction pipeline. |
| Seed fact | A fact generated during topic bootstrapping (source_type file_seed or spinoff_discovery). Created by "exploding" curated entries into structured facts. |
| Super fact | A cross-entry correlation discovered by comparing facts across multiple seed entries (source_type ai_super_fact) |
| Source type | The origin of a fact record: news_extraction, ai_generated, file_seed, spinoff_discovery, ai_super_fact, or api_import |
| Story cluster | A group of news articles about the same event, clustered by TF-IDF cosine similarity |
| Seed pipeline | The process of bootstrapping a new topic category with AI-generated facts from curated entries |
| Blast radius | How many systems are affected when a component breaks |
| Worker | A Bun-based background process that consumes queue messages and processes them |
| Alias resolution | 3-step category lookup: exact slug → provider-specific alias → universal alias |
| Contamination detection | Entity-name sanity check on AI validation responses to prevent cross-model response mixing |
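The spaced-repetition entry above can be sketched as a small scheduling function. The interval ladder [4h, 1d, 3d, 7d, 14d, 30d] indexed by answer streak comes from the glossary; the reset-to-zero-on-miss behavior is an assumption about the SM-2 variant, not confirmed by this document.

```typescript
// Sketch of the SM-2-variant scheduling from the glossary. The ladder is
// [4h, 1d, 3d, 7d, 14d, 30d] in hours; the reset-on-miss rule is assumed.
const INTERVALS_HOURS = [4, 24, 72, 168, 336, 720];

function nextReviewHours(
  streak: number,
  correct: boolean
): { streak: number; hours: number } {
  // A correct answer climbs the ladder (capped at the top rung);
  // a miss resets the streak to zero.
  const next = correct ? Math.min(streak + 1, INTERVALS_HOURS.length) : 0;
  // Streak n (>= 1) corresponds to rung n-1; a reset card reviews at the first rung.
  const rung = Math.max(next - 1, 0);
  return { streak: next, hours: INTERVALS_HOURS[rung] };
}
```

So a card answered correctly for the first time comes back in 4 hours, a six-streak card waits 30 days, and a missed card drops back to the 4-hour rung.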
References
- App Control Manifest — Operational details for every cron, worker, queue, and API
- Agent Catalog — Full agent system with ownership boundaries
- Rules Index — All rules, conventions, and enforcement levels
- Seed Control — Seeding pipeline directives and cost estimates
- Model Code Isolation — ModelAdapter pattern and per-model prompt optimization