Seeding Control Prompt
Drop this file into a Claude Code session to direct seeding operations. Edit the sections below to control what gets seeded, how much, and at what quality level.
How it works: A Claude session reads this file, interprets your directives, and executes the appropriate scripts and commands. You control the what and how much — the system handles the how.
Quick Start
# In a Claude Code session, say:
# "Read docs/projects/seeding/SEED.md and execute my seeding directives"
Seeding Mode
Pick one. Uncomment the mode you want.
# mode: news-only
# Only automated news ingestion runs. No manual seeding.
# Good for: steady-state operations, low cost.
mode: curated-seed
# Generate curated entity entries for specified topics, explode into facts,
# validate, and generate challenge content.
# Good for: building deep topic coverage.
# mode: evergreen-boost
# Enable daily AI-generated evergreen facts in addition to news.
# Good for: supplementing news with timeless content.
# mode: full-pipeline
# News + curated seeding + evergreen + cleanup. Everything runs.
# Good for: initial corpus build or major expansion.
Topic Directives
Specify which topics to seed and how deeply. Only topics listed here will be included in manual seeding runs. News ingestion always covers all active root topics.
topics:
# Format: slug: count (entities to generate per subcategory)
# Higher count = more entities = more facts = higher cost
# Typical: 50-150 per subcategory
science:
enabled: true
priority: high # high | medium | low (controls processing order)
subcategories:
- name: "Physics & Space"
count: 100
- name: "Biology & Medicine"
count: 100
- name: "Chemistry & Materials"
count: 75
- name: "Scientific Instruments & Methods"
count: 50
- name: "Nobel Prize Winners & Discoveries"
count: 75
history:
enabled: true
priority: high
subcategories:
- name: "Ancient Civilizations"
count: 150
- name: "Medieval & Renaissance"
count: 100
- name: "Modern History"
count: 150
- name: "Post-War & Contemporary"
count: 100
- name: "Historic Figures"
count: 200
# geography:
# enabled: true
# priority: medium
# subcategories:
# - name: "Countries & Capitals"
# count: 200
# - name: "Natural Wonders"
# count: 80
# culture:
# enabled: false # Skip this topic entirely
# --- Depth-2 targeting (future) ---
# When depth-2 pipeline support is wired, you can target leaf subcategories:
# science:
# subcategories:
# - name: "Physics & Space"
# depth2:
# - name: "Quantum Mechanics"
# count: 30
# - name: "Astrophysics"
# count: 30
44 topics are available. The full list of topic slugs and their subcategories is defined in
packages/ai/src/config/categories.ts (source of truth).
4-Level Category Hierarchy
The topic taxonomy supports 4 levels of depth (0–3):
| Depth | Role | Count | Example |
|---|---|---|---|
| 0 | root | 44 | science |
| 1 | subcategory | 162 | Physics & Space |
| 2 | leaf | DB-derived | Quantum Mechanics |
| 3 | sub-leaf | DB-derived | (materialized from entities) |
Current behavior: Seeding targets depth 0→1. AI extraction then classifies each fact to the deepest matching level (depth 0–3), driven by explicit prompt instructions. The descendant query (getDescendantCategoriesForParent) walks 3 levels deep (children → children → children). Depth-2 and depth-3 categories are materialized in the database via materialize-entity-categories.ts; they are not defined in categories.ts.
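The 3-level descendant walk can be sketched in memory as follows. The types and helper name here are illustrative assumptions; the real getDescendantCategoriesForParent queries the database.

```typescript
// In-memory sketch of a 3-level descendant walk (children ->
// grandchildren -> great-grandchildren). Hypothetical shapes;
// the production query runs against the DB.
interface Category {
  id: string;
  parentId: string | null;
  depth: number; // 0 = root ... 3 = sub-leaf
}

function getDescendants(all: Category[], parentId: string): Category[] {
  const out: Category[] = [];
  let frontier = [parentId];
  for (let level = 0; level < 3; level++) {
    const next = all.filter(
      (c) => c.parentId !== null && frontier.includes(c.parentId),
    );
    out.push(...next);
    frontier = next.map((c) => c.id);
  }
  return out;
}
```

Starting from a depth-0 root, three hops reach depth 3, which matches the 0–3 hierarchy above.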
Full taxonomy: See docs/spreadsheets/category-taxonomy.md for all categories (note: taxonomy doc may lag behind categories.ts — regenerate with bun run taxonomy:completeness-check to verify).
Topic Presets
Instead of listing individual topics, you can use a preset:
# preset: all-active
# Seeds all active root topics using CATEGORIES defaults.
# Equivalent to: bun scripts/seed/generate-curated-entries.ts --insert
# preset: news-popular
# Seeds only topics that appear frequently in news APIs:
# current-events, technology, business, sports, science, entertainment
# preset: trivia-deep
# Deep seeding of knowledge/trivia-heavy topics:
# history, science, geography, culture, records, sports, animals, space
Volume Controls
volume:
# Richness tier overrides (controls facts generated per entity)
# high: 50-100 facts per entity (entertainment, sports, people)
# medium: 20-50 facts per entity (geography, science, animals)
# low: 10-20 facts per entity (business, design, fashion)
richness_tier: medium # default tier for all topics
# Maximum entities to seed in this run
max_entities: 500 # null = no limit (use CATEGORIES counts)
# Maximum total facts to generate (safety cap)
max_facts: 25000 # null = no limit
# Difficulty level for challenge content (0-5)
# 0 = balanced spread (generates all 5 levels evenly)
# 1 = easy (script default), 2 = moderate, 3 = hard, 4 = very hard, 5 = expert
challenge_difficulty: 0 # 0=balanced spread across all levels
# Which challenge styles to generate per fact (default: all 5)
# mc = multiple_choice (4 options, most engaging)
# dq = direct_question (classic quiz format)
# ftg = fill_the_gap (cloze-style, good for learning)
# rl = reverse_lookup ("what am I describing?")
# ft = free_text (open-ended, hardest to auto-grade)
# Fewer styles = lower cost. 3 core styles (mc,dq,ftg) cover most use cases.
challenge_styles: mc,dq,ftg,rl,ft # all 5 (default)
# challenge_styles: mc,dq,ftg # budget-friendly: 3 core styles (~40% less cost)
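The ~40% saving quoted above follows directly from the style count, assuming challenge-generation cost scales roughly linearly with styles generated per fact:

```typescript
// Cost scales roughly linearly with the number of challenge styles
// generated per fact, so 3 of 5 styles costs ~60% of the full set.
function styleCostFactor(styles: string[], allStyles = 5): number {
  return styles.length / allStyles;
}

const savings = 1 - styleCostFactor(["mc", "dq", "ftg"]); // 0.4 = ~40% less
```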
Quality Controls
quality:
# Minimum notability score for keeping a fact (0.0 - 1.0)
notability_threshold: 0.6 # default: 0.6
# Validation strategy for AI-generated facts
# Uses 4-phase pipeline: structural → consistency → cross-model → evidence
validation_strategy: multi_phase # multi_phase (default) | legacy_ai_cross_check
# Run content cleanup after seeding?
cleanup_after_seed: false # true = rewrite titles/context for consistency
# Generate challenge content automatically?
generate_challenges: true # true = generate 5 quiz styles per fact
# CQ-002 enforcement (second-person address in challenges)
patch_cq002: true # true = auto-patch "you/your" into challenges
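The 4-phase ordering (structural → consistency → cross-model → evidence) can be sketched as below. The individual checks and the short-circuit-on-first-failure behavior are illustrative assumptions, not the production implementation:

```typescript
// Illustrative sketch of the multi-phase validation order. Only the
// phase names and ordering come from this doc; checks are stubs.
type Phase = { name: string; check: (fact: string) => boolean };

function validateFact(
  fact: string,
  phases: Phase[],
): { valid: boolean; failedPhase: string | null } {
  // Assumed behavior: phases run in order, first failure stops the run.
  for (const { name, check } of phases) {
    if (!check(fact)) return { valid: false, failedPhase: name };
  }
  return { valid: true, failedPhase: null };
}

const phases: Phase[] = [
  { name: "structural", check: (f) => f.length > 0 },
  { name: "consistency", check: (f) => !f.includes("??") },
  { name: "cross-model", check: () => true }, // stub
  { name: "evidence", check: () => true },    // stub
];
```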
Evergreen Controls
Only applies when mode is evergreen-boost or full-pipeline.
evergreen:
enabled: false # Set true to enable daily AI fact generation
daily_quota: 20 # Facts per day across all topics
distribution: # Optional: override per-topic share
science: 25% # 5 facts/day
history: 25% # 5 facts/day
geography: 15% # 3 facts/day
# Remaining 35% split equally among other active topics
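Assuming the percentages resolve as in the comments above, the daily split could be computed like this (helper name and per-topic rounding are assumptions):

```typescript
// Sketch: resolve explicit percentage shares against the daily quota,
// then split the remainder equally among other active topics.
// Rounding may leave the totals off-by-one versus the quota.
function splitQuota(
  dailyQuota: number,
  shares: Record<string, number>, // explicit shares, in percent
  otherTopics: string[],          // remaining active topics
): Record<string, number> {
  const out: Record<string, number> = {};
  let usedPct = 0;
  for (const [topic, pct] of Object.entries(shares)) {
    out[topic] = Math.round((pct / 100) * dailyQuota);
    usedPct += pct;
  }
  const remainder = ((100 - usedPct) / 100) * dailyQuota;
  for (const topic of otherTopics) {
    out[topic] = Math.round(remainder / otherTopics.length);
  }
  return out;
}
```

With daily_quota: 20 and the shares above, science and history get 5 facts/day each and geography gets 3, matching the comments.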
Execution Controls
execution:
# Concurrency for AI calls (per script instance)
concurrency: 5 # 1-10, higher = faster but more API pressure
# Number of parallel partitions for large runs
partitions: 4 # 1-8, each partition runs independently
# Dry run first?
dry_run_first: true # true = preview before executing
# Auto-upload to DB after generation?
auto_upload: false # true = automatically upsert to DB
# false = generate to JSONL, wait for manual upload
# AI model preference (DB-driven via ai_model_tier_config)
# Eligible models (97% threshold passed): run llm-fact-quality-testing.ts to verify
# Default: gemini-3-flash-preview (promoted Feb 25, 2026; regressed Mar 4 — retest before use)
# Note: gpt-5-mini fails on voice/style dimensions — not eligible for production seeding
# Model routing is DB-driven — change via SQL UPDATE, no restart needed
# Results: scripts/seed/.llm-test-data/eligibility.jsonl
model: gemini-3-flash-preview # or any model in ai_model_tier_config
News Ingestion Controls
Controls for the automated news pipeline. Changes here affect environment config.
news:
# Which providers to use (requires API keys in .env.local)
providers:
- newsapi # NEWS_API_KEY
- gnews # GOOGLE_NEWS_API_KEY
- thenewsapi # THENEWS_API_KEY
- newsdata # NEWSDATA_API_KEY
- event_registry # EVENT_REGISTRY_API_KEY
# Max articles per provider per category per run
max_results: 20 # default: 20
# Ingestion interval (cron frequency in minutes)
interval_minutes: 15 # default: 15
Enrichment Orchestrator
In addition to the primary news providers, the enrichment orchestrator (packages/ai/src/enrichment.ts) injects context from 8 free API sources during fact extraction. This is context injection, not primary article fetching.
| Source | Routing | Purpose |
|---|---|---|
| Knowledge Graph | Always | Entity identification, notability signals |
| Wikidata | Always | Structured facts, identifiers |
| Wikipedia | Always | Summary context, descriptions |
| GDELT 2.0 | Always | Global event context, media mentions |
| TheSportsDB | sports/* topics | Team/player data |
| MusicBrainz | music/* topics | Artist/album metadata |
| Nominatim | geography/* topics | Location geocoding |
| Open Library | books/* topics | Book/author data |
All enrichment calls use Promise.allSettled() — a failing API never blocks extraction.
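A minimal sketch of that pattern, with illustrative source names (the real orchestrator lives in packages/ai/src/enrichment.ts):

```typescript
// Every enrichment source is queried in parallel; a rejected call
// contributes nothing instead of failing the whole extraction.
async function gatherEnrichment(
  sources: { name: string; fetch: () => Promise<string> }[],
): Promise<Record<string, string>> {
  const settled = await Promise.allSettled(sources.map((s) => s.fetch()));
  const context: Record<string, string> = {};
  settled.forEach((result, i) => {
    if (result.status === "fulfilled") {
      context[sources[i].name] = result.value;
    }
    // Rejections are dropped; extraction proceeds with partial context.
  });
  return context;
}
```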
Cost Estimates
Approximate costs per operation (using GPT-5 Mini / Gemini 3 Flash Preview):
| Operation | Per-Unit Cost | Example |
|---|---|---|
| Entity generation | ~$0.002/entity | 500 entities = ~$1.00 |
| Fact explosion | ~$0.01/entity | 500 entities = ~$5.00 |
| Challenge content | ~$0.006/fact | 10,000 facts = ~$60.00 |
| Content cleanup | ~$0.004/fact | 10,000 facts = ~$40.00 |
| News extraction | ~$0.003/story | 100 stories/day = ~$0.30/day |
| Evergreen generation | ~$0.005/fact | 20 facts/day = ~$0.10/day |
| Entity category materialization | ~$0.002/entity | 500 entities = ~$1.00 |
| Voice-pass regeneration | ~$0.003/record | 10,000 records = ~$30.00 |
| Model eligibility testing | ~$0.50/model | Per full 5-phase run |
Typical curated-seed run (500 entities, medium richness):
- Entity generation: ~$1
- Explosion: ~$5
- Validation: ~$2
- Challenge content: ~$60
- Total: ~$68
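The line items above can be sanity-checked with a quick estimator. The ~$0.004/entity validation rate is back-derived from the ~$2 example and is an assumption, not a documented figure:

```typescript
// Rough pre-run cost estimate from the per-unit figures above.
// validation: $0.004/entity is an assumption (back-derived from the
// ~$2 line item for 500 entities), not a documented rate.
const RATES = {
  entityGen: 0.002,  // $/entity
  explosion: 0.01,   // $/entity
  validation: 0.004, // $/entity (assumed)
  challenge: 0.006,  // $/fact
};

function estimateRunCost(entities: number, facts: number): number {
  const total =
    entities * (RATES.entityGen + RATES.explosion + RATES.validation) +
    facts * RATES.challenge;
  return Math.round(total * 100) / 100; // round to cents
}
```

For the typical run above (500 entities, ~10,000 facts), this reproduces the ~$68 total.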
Daily Production Budget ($20/day) — Split Routing
The pipeline uses split model routing to maximize throughput within a $20/day AI budget. GPT-5.4 Nano handles cost-efficient bulk generation; topics where it produces poor quality are routed to Gemini 3 Flash Preview instead.
Per-model costs:
| Model | Input $/MTok | Output $/MTok | Cost/Fact | Cost/Challenge |
|---|---|---|---|---|
| gpt-5.4-nano | $0.20 | $1.25 | $0.006 | $0.0048 |
| gemini-3-flash-preview | $0.50 | $3.00 | $0.015 | $0.0106 |
Topic routing (as of March 30, 2026):
| Topics | % of entities | Model | Reason |
|---|---|---|---|
| Sports, Music, Science | ~60% | gemini-3-flash-preview | Nano fabricates sports stats, struggles with music nuance, fails science cross-challenge isolation |
| All other topics | ~40% | gpt-5.4-nano | Good quality at 2.2x lower cost |
Throughput at $20/day:
| Scenario | Weighted cost/challenge | Challenges/day | Facts/day |
|---|---|---|---|
| All Gemini | $0.0106 | ~1,887 | ~86 |
| Split (sports+music→Gemini) | $0.0076 | ~2,635 | ~120 |
| Split (sports+music+science→Gemini) | $0.0083 | ~2,415 | ~110 |
Adding science to the Gemini handoff reduces throughput by ~8% (~10 fewer facts/day) but eliminates wasted budget on science content that fails validation (46.7% → 85% validation rate improvement).
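The throughput figures follow from simple budget division. Constants are taken from the tables above; the exact Gemini share behind the $0.0076 row is not stated here, so weightedCost only illustrates the formula:

```typescript
// Daily throughput is budget divided by weighted per-challenge cost.
function challengesPerDay(dailyBudget: number, costPerChallenge: number): number {
  return Math.round(dailyBudget / costPerChallenge);
}

// Weighted cost for a given fraction of entities routed to Gemini;
// per-challenge costs are from the per-model table above.
function weightedCost(
  geminiShare: number,
  geminiCost = 0.0106,
  nanoCost = 0.0048,
): number {
  return geminiShare * geminiCost + (1 - geminiShare) * nanoCost;
}
```

At a ~60% Gemini share this gives a weighted cost of ~$0.0083 and ~2,415 challenges/day, matching the third scenario row.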
Model Eligibility
Before a model can be used for production seeding, it must pass the eligibility gate. The test pipeline (scripts/seed/llm-fact-quality-testing.ts) evaluates models across 7 quality dimensions and writes results to scripts/seed/.llm-test-data/eligibility.jsonl.
Threshold: All 7 dimensions must score 97% or higher.
Currently eligible: Verify before each run — gemini-3-flash-preview passed Feb 25 but regressed Mar 4; gemini-2.5-flash has mixed results. Run llm-fact-quality-testing.ts --all --models <model> to confirm. Models like gpt-5-mini fail on voice/style adherence dimensions.
| Dimension | What it measures |
|---|---|
| validation | Facts pass multi-phase validation pipeline |
| evidence | Facts corroborated by external evidence |
| challenges | Challenge content passes CQ rules |
| schema_adherence | Output conforms to topic-specific schemas |
| voice_adherence | Matches Eko voice constitution and taxonomy voice |
| style_adherence | Follows per-style rules |
| token_efficiency | Output stays within token budget |
Each model has a dedicated ModelAdapter (packages/ai/src/models/adapters/) that provides per-model prompt optimizations and signoff guidance for the quality reviewer. See Model Code Isolation for details.
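The gate described above amounts to an all-dimensions check against the 97% threshold. The record shape below is an assumption about the eligibility.jsonl format:

```typescript
// Sketch of the eligibility gate: a model passes only if every one of
// the 7 dimensions scores at or above 0.97. Field names are assumptions
// about the eligibility.jsonl record shape.
const THRESHOLD = 0.97;

const DIMENSIONS = [
  "validation", "evidence", "challenges", "schema_adherence",
  "voice_adherence", "style_adherence", "token_efficiency",
] as const;

function isEligible(scores: Record<string, number>): boolean {
  return DIMENSIONS.every((d) => (scores[d] ?? 0) >= THRESHOLD);
}
```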
Execution Checklist
When a Claude session reads this file, it should:
- Parse the directives above
- Verify prerequisites (`OPENAI_API_KEY` or `GOOGLE_API_KEY`, and `DATABASE_URL` in `.env.local`)
- Verify model eligibility — confirm the configured model has `eligible: true` in `scripts/seed/.llm-test-data/eligibility.jsonl`. If not, run `llm-fact-quality-testing.ts --all --models <model>` first.
- If `dry_run_first: true`, run with `--dry-run` and report estimates
- Wait for user confirmation before proceeding
- Execute in order:
  a. Generate curated entries (if mode includes curated seeding)
  b. Bulk enqueue to workers
  c. Monitor explosion progress
  d. Run validation workers
  e. Generate challenge content (if `generate_challenges: true`)
  f. Run cleanup (if `cleanup_after_seed: true`)
  g. Upload to DB (if `auto_upload: true`)
  h. (Optional) Materialize entity categories (`materialize-entity-categories.ts --audit`, then `--classify --insert --link`)
- Report final stats: facts generated, challenges created, cost, errors
Reference: Available Scripts
| Script | Purpose | Key Flags |
|---|---|---|
| generate-curated-entries.ts | AI-generate entity names | --insert, --category <slug> |
| bulk-enqueue.ts | Dispatch entries to explosion workers | (no flags) |
| generate-challenge-content.ts | Batch challenge generation | --audit, --export, --generate, --upload, --validate, --recover, --dry-run, --limit N, --concurrency N, --partition N/M, --output-suffix NAME, --difficulty N, --styles mc,dq,ftg, --format <slug>, --drift-check, --export-all |
| cleanup-content.ts | Rewrite titles/context | --audit, --export, --fix, --upload, --validate, --dry-run, --limit N, --concurrency N, --partition N/M, --output-suffix NAME |
| backfill-fact-nulls.ts | Fill NULL metadata | --audit, --notability, --challenge-content, --all, --dry-run |
| seed-from-files.ts | Parse XLSX/DOCX/CSV and seed | --parse, --explode, --explode-spinoffs, --super-facts, --stats, --all, --dry-run, --topic <slug>, --budget <dollars>, --resume, --batch-size <n> |
| llm-fact-quality-testing.ts | Model eligibility testing | --all, --generate, --validate, --challenge, --signoff, --report, --models <csv>, --limit N, --concurrency N, --output-dir <name>, --merge-dirs <csv>, --signoff-model <id> |
| materialize-entity-categories.ts | Create leaf categories from entities | --audit, --classify, --insert, --link, --concurrency N, --dry-run |
| regen-voice-pass.ts | Regenerate voice-pass content | --dry-run, --limit N, --partition M/N, --concurrency N, --presplit |
| rewrite-challenge-defects.ts | Fix challenge quality defects | --dry-run, --limit N, --partition M/N, --concurrency N, --presplit |
| presplit-defects.ts | Pre-split defective records into partitions | --partitions N |
| upsert-rewritten-challenges.ts | Upload rewritten challenges to DB | --dry-run |
| cleanup-seed-queue.ts | Remove garbage entries from seed queue | (one-time) |
| deepen-topic-paths.ts | AI-reclassify entries to deeper categories | --dry-run, --limit N, --concurrency N |
| remap-topic-paths.ts | Batch fix malformed topic paths | --dry-run, --limit N |
| improve-titles.ts | Deprecated — use cleanup-content.ts | |
See runbook.md for detailed operational procedures.
Related Documents
- APP-CONTROL.md — App control manifest (crons, workers, queues, APIs)
- Category Taxonomy — Full 4-level category hierarchy (44 roots + 162 subcategories + DB-derived leaves and sub-leaves)
- categories.ts — Source of truth for topic slugs and subcategories
- Ops Logs — Operational event logging (parallel to seed logs)
- README.md — Seeding pipeline documentation
- runbook.md — Operational procedures
- seeding-best-practices.md — Strategies and cost management
Recent Seed Logs
All seed jobs are logged in logs/ with structured frontmatter for tracking.
- Latest logs: February 2026
- Log template: logs/README.md
After each seeding run, create a log file using the template and update the monthly index.