# Seeding Best Practices

Companion guide to SEED.md. Covers strategies, common patterns, pitfalls, and worked examples for operating the Eko seeding system.
## Table of Contents
- Mental Model
- Golden Rules
- Choosing a Seeding Mode
- Topic Strategy
- Volume Tuning
- Cost Management
- Quality Assurance
- Worked Examples
- Common Pitfalls
- Monitoring & Observability
- Recovery Procedures
## Mental Model

The seeding system has three layers. Understanding them prevents confusion:

```text
Layer 1: ENTITIES         Layer 2: FACTS               Layer 3: CHALLENGES
(seed_entry_queue)        (fact_records)               (fact_challenge_content)

"Julius Caesar"  ──AI──>  "Caesar crossed the  ──AI──>  6 quiz styles:
"Cleopatra"               Rubicon in 49 BC"               multiple_choice
"Ancient Egypt"           "Cleopatra ruled Egypt          direct_question
                          from 51-30 BC"                  fill_the_gap
                          ...50-100 facts each            statement_blank
                                                          reverse_lookup
                                                          free_text
```

**Key insight:** You control Layer 1 (what entities to seed). Layers 2 and 3 are generated automatically. More entities = more facts = more challenges = more cost. The richness tier controls the Layer 1 → Layer 2 expansion ratio.
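The expansion ratios can be sketched as quick arithmetic. The facts-per-entity midpoints below are illustrative values taken from the richness tier guide later in this document; actual counts are decided by the AI:

```typescript
// Rough layer-expansion math: entities -> facts -> challenges.
// FACTS_PER_ENTITY uses illustrative tier midpoints, not exact pipeline output.
const FACTS_PER_ENTITY = { high: 75, medium: 35, low: 15 } as const;

function expandLayers(
  entities: number,
  tier: keyof typeof FACTS_PER_ENTITY,
  styles = 6, // all 6 quiz styles by default
) {
  const facts = entities * FACTS_PER_ENTITY[tier];
  return { facts, challenges: facts * styles };
}

console.log(expandLayers(100, "medium")); // { facts: 3500, challenges: 21000 }
```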
## Golden Rules

- **Always dry-run first.** Every script supports `--dry-run`. Use it. A 500-entity seeding run costs ~$68 — previewing costs nothing.
- **Seed deep, not wide.** 200 entities across 3 topics produces better content than 50 entities across 12 topics. Depth creates interconnected facts that make better quizzes.
- **Let the pipeline finish.** Don't start new seeding runs while workers are still processing. Check `seed_entry_queue` status before starting.
- **JSONL is your checkpoint.** All batch scripts write to local JSONL files before touching the database. If anything goes wrong, your progress is saved. Re-running resumes from where it left off.
- **Validate before upload.** Always run `--validate` after `--generate` and before `--upload`. Catch quality issues before they reach the database.
- **Use difficulty 0 for balanced coverage.** Setting `challenge_difficulty: 0` generates challenges across all 5 difficulty levels in a single run (sequentially). This is ideal for initial corpus builds. For incremental expansion, use a single level (1-5) to control cost.
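The difficulty rule in the last bullet can be pictured as a small helper. This is an illustrative sketch of the semantics only, not the real CLI code:

```typescript
// challenge_difficulty: 0 expands to all five levels (run sequentially);
// 1-5 selects a single level for cheaper incremental runs.
function difficultyLevels(setting: number): number[] {
  if (setting === 0) return [1, 2, 3, 4, 5];
  if (Number.isInteger(setting) && setting >= 1 && setting <= 5) return [setting];
  throw new Error(`invalid challenge_difficulty: ${setting}`);
}

console.log(difficultyLevels(0)); // [1, 2, 3, 4, 5]
console.log(difficultyLevels(3)); // [3]
```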
## Choosing a Seeding Mode

### Decision Tree

```text
Is the system already running with enough content?
  YES → mode: news-only (autopilot, ~$0.30/day)
  NO ↓

Do you need timeless/trivia content or news-driven content?
  NEWS → mode: news-only (let crons handle it)
  TIMELESS ↓

Is this a first-time build or incremental expansion?
  FIRST TIME → mode: full-pipeline
  EXPANSION → mode: curated-seed

Do you also want daily AI-generated facts?
  YES → mode: evergreen-boost (add to above)
  NO → stick with curated-seed
```
### Mode Comparison

| Mode | Cost/Day | Operator Effort | Content Type | Best For |
|---|---|---|---|---|
| `news-only` | ~$0.30 | None (automated) | Current events | Steady state |
| `curated-seed` | One-time batch | Medium (run scripts) | Deep knowledge | Topic expansion |
| `evergreen-boost` | ~$0.40 | Low (toggle on) | Timeless AI facts | Supplementing news |
| `full-pipeline` | One-time + daily | High (monitor everything) | All types | Initial build |
## Topic Strategy

### Which Topics to Seed First
Prioritize topics that are:
- Quiz-friendly — facts with clear, verifiable answers (history, science, geography)
- Entity-rich — many distinct things to learn about (sports players, countries, animals)
- Evergreen — facts that don't expire (records, nature, space)
Avoid starting with topics that are:
- Rapidly changing — current events are better served by news ingestion
- Subjective — opinions, rankings, "best of" lists make poor quiz questions
- Narrow — very niche topics produce thin, repetitive content
### Recommended First-Seed Priority

```yaml
# Tier 1: Seed these first (highest quiz potential)
topics:
  history:   { priority: high, count: 700 }   # Deep, entity-rich, evergreen
  science:   { priority: high, count: 500 }   # Verifiable, educational
  geography: { priority: high, count: 400 }   # Clear answers, visual

  # Tier 2: Seed after Tier 1 is validated
  sports:  { priority: medium, count: 300 }   # Popular, stat-heavy
  culture: { priority: medium, count: 300 }   # Broad appeal
  animals: { priority: medium, count: 200 }   # Fun facts, educational

  # Tier 3: Fill in later
  technology:    { priority: low, count: 150 }
  food-beverage: { priority: low, count: 100 }
  space:         { priority: low, count: 100 }
```
### Subcategory Balance

Within a topic, distribute counts unevenly — give more to subcategories with richer entity pools:

```yaml
# Good: weighted by entity richness
history:
  subcategories:
    - name: "Historic Figures"
      count: 200   # Huge entity pool (thousands of notable people)
    - name: "Ancient Civilizations"
      count: 150   # Rich but bounded
    - name: "Modern History"
      count: 100   # Events > entities
    - name: "Post-War & Contemporary"
      count: 50    # Smaller, more recent
```

```yaml
# Bad: equal distribution ignores entity density
history:
  subcategories:
    - name: "Historic Figures"
      count: 125
    - name: "Ancient Civilizations"
      count: 125
    - name: "Modern History"
      count: 125
    - name: "Post-War & Contemporary"
      count: 125   # Will produce thin, repetitive content
```
## Volume Tuning

### Richness Tier Guide

The richness tier controls how many facts the AI generates per entity. Choose based on the topic:

| Tier | Facts/Entity | Best For | Example |
|---|---|---|---|
| `high` | 50-100 | Entities with deep, varied factual content | "Albert Einstein", "Ancient Rome", "Michael Jordan" |
| `medium` | 20-50 | Entities with moderate factual depth | "Uranium", "Costa Rica", "Bluetooth" |
| `low` | 10-20 | Entities where facts are limited or repetitive | "Helvetica font", "Quinoa", "Podcasting" |

**Rule of thumb:** If you can think of 20+ interesting facts about a typical entity in the topic, use `high`. If you'd struggle past 10, use `low`.
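The rule of thumb, expressed as code. This is a hypothetical helper (the tier names match the table above; the thresholds are the guide's 20-facts/10-facts heuristic):

```typescript
type Tier = "high" | "medium" | "low";

// How many interesting facts could you list for a typical entity in the topic?
function suggestTier(estimatedFactsPerEntity: number): Tier {
  if (estimatedFactsPerEntity >= 20) return "high";  // e.g. "Albert Einstein"
  if (estimatedFactsPerEntity > 10) return "medium"; // e.g. "Costa Rica"
  return "low";                                      // e.g. "Quinoa"
}

console.log(suggestTier(40)); // "high"
console.log(suggestTier(15)); // "medium"
console.log(suggestTier(8));  // "low"
```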
### Batch Size Planning

```text
Entities × Facts/Entity      = Total Facts
Total Facts × Styles         = Total Challenges
Total Facts × $0.001/style   = Challenge Generation Cost
```

Example (all 6 styles):

```text
500 entities × 35 facts (medium) = 17,500 facts
17,500 × 6      = 105,000 challenges
17,500 × $0.006 = ~$105 challenge cost
+ ~$6 explosion cost + ~$2 validation ≈ $113 total
```

Example (3 core styles — mc, dq, ftg):

```text
500 entities × 35 facts = 17,500 facts
17,500 × 3      = 52,500 challenges
17,500 × $0.003 = ~$52 challenge cost
+ ~$6 explosion cost + ~$2 validation ≈ $60 total
```
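The planning formulas translate directly into a sketch you can run before committing to a batch. The $0.001/style rate and the ~$6 + ~$2 overhead are the ballpark figures used in this guide (at 500-entity scale), not billed API prices:

```typescript
// Estimate a seeding run from the batch-size formulas above.
function estimateRun(entities: number, factsPerEntity: number, styles: number) {
  const facts = entities * factsPerEntity;
  const challenges = facts * styles;
  const challengeCost = challenges / 1000; // $0.001 per challenge style
  const overhead = 6 + 2;                  // ~$6 explosion + ~$2 validation
  return { facts, challenges, totalCost: challengeCost + overhead };
}

console.log(estimateRun(500, 35, 6)); // { facts: 17500, challenges: 105000, totalCost: 113 }
console.log(estimateRun(500, 35, 3)); // { facts: 17500, challenges: 52500, totalCost: 60.5 }
```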
### Incremental Seeding

Don't try to seed everything at once. Build incrementally:

```text
Week 1: 200 entities across history + science (Tier 1)
        → Validate quality, check quiz UX
        → Estimated: ~7,000 facts, ~$45

Week 2: 300 entities across geography + sports + culture (Tier 2)
        → Build on learnings from Week 1
        → Estimated: ~10,500 facts, ~$68

Week 3: Enable evergreen-boost (20 facts/day ongoing)
        → Automated, low-cost supplement
        → Estimated: ~$0.10/day

Week 4: Expand weak topics, backfill challenge content gaps
        → Use --audit to find gaps
        → Targeted, efficient
```
## Cost Management

### Cost Breakdown by Phase

For a typical 500-entity medium-richness run:

```text
Phase               Cost      % of Total   Duration
─────────────────────────────────────────────────────────
Entity generation   ~$1.00       1.5%      5 min
Fact explosion      ~$5.00       7.4%      2-4 hours
Validation          ~$2.00       2.9%      30 min
Challenge content  ~$60.00      88.2%      8-12 hours
Content cleanup     ~$0.00       0.0%      (skip unless needed)
─────────────────────────────────────────────────────────
Total              ~$68.00       100%      ~12-16 hours
```

**Challenge content dominates cost.** This is because each fact generates 6 detailed challenge styles with setup text, challenge text, reveal text, and a 3-6 sentence correct answer. If budget is tight, use the strategies below.
### Cost Reduction Strategies

- **Skip challenge generation initially.** Seed entities and facts first, generate challenges later when budget allows. Facts are useful without challenges (they still appear on cards).
- **Generate fewer styles.** Set `challenge_styles: mc,dq,ftg` in SEED.md (or `--styles mc,dq,ftg` on the CLI) to generate only the 3 core styles. This halves challenge cost. You can add the remaining styles later — the DB composite key supports incremental style addition.
- **Use lower richness tiers.** Dropping from `medium` (35 facts/entity) to `low` (15 facts/entity) cuts all downstream costs by ~60%.
- **Partition across days.** Run 125 entities per day over 4 days instead of 500 at once. This spreads API cost and lets you catch issues early.
### Budget Templates

```yaml
# Budget: $25 (starter)
volume:
  max_entities: 150
  richness_tier: low
  challenge_difficulty: 1   # Single difficulty to control cost
# Expected: ~2,250 facts, ~13,500 challenges
```

```yaml
# Budget: $75 (standard)
volume:
  max_entities: 500
  richness_tier: medium
  challenge_difficulty: 1   # Start with easy, expand later
# Expected: ~17,500 facts, ~105,000 challenges
```

```yaml
# Budget: $200 (comprehensive)
volume:
  max_entities: 1500
  richness_tier: medium
  challenge_difficulty: 0   # Balanced spread — all 5 difficulty levels
quality:
  cleanup_after_seed: true
# Expected: ~52,500 facts, ~315,000 challenges
```
## Quality Assurance

### Pre-Seed Checklist

Before starting a seeding run:

- Run `bun scripts/seed/generate-challenge-content.ts --audit` — check current coverage
- Verify `.env.local` has `OPENAI_API_KEY` and `DATABASE_URL`
- Check `seed_entry_queue` for stuck entries: no `status = 'processing'` rows older than 10 min
- Check Redis queue depth: no large backlog from previous runs
- Run `--dry-run` first and review the output
### Post-Seed Validation

After a seeding run completes:

```bash
# 1. Check fact quality
bun scripts/seed/generate-challenge-content.ts --validate

# 2. Check for NULL metadata
bun scripts/seed/backfill-fact-nulls.ts --audit

# 3. Sample review (manually read 10-20 facts)
bun -e "
import { getDrizzleClient } from './packages/db/src/drizzle/client.ts';
import { sql } from 'drizzle-orm';
const db = getDrizzleClient();
const sample = await db.execute(sql\`
  SELECT title, challenge_title, notability_score, topic_category_id
  FROM fact_records
  WHERE source_type = 'file_seed' AND status = 'validated'
  ORDER BY RANDOM() LIMIT 10
\`);
for (const row of sample) console.log(row.title, '|', row.notability_score);
process.exit(0);
"
```
### Quality Signals to Watch

| Signal | Healthy | Warning | Action |
|---|---|---|---|
| Notability scores | 0.6-0.9 average | Below 0.5 average | Run `--recover` to reprocess weak facts |
| CQ-002 patch rate | < 30% patched | > 50% patched | AI prompt may need tuning |
| Error rate | < 1% | > 5% | Check API rate limits, reduce concurrency |
| Duplicate facts | < 2% | > 5% | Check dedup logic; may need manual cleanup |
| Challenge coverage | Expected styles per fact | < expected styles | Run `--recover` for missing styles |
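The warning thresholds from the table can be wired into a simple triage helper when reviewing run metrics. The metric names and messages here are illustrative; feed in values from your own queries:

```typescript
interface RunMetrics {
  avgNotability: number;   // mean notability_score across new facts
  cq002PatchRate: number;  // fraction of challenges patched by CQ-002, 0-1
  errorRate: number;       // failed attempts / total attempts, 0-1
  duplicateRate: number;   // duplicate facts / total facts, 0-1
}

// Returns a list of warnings based on the table's thresholds.
function triage(m: RunMetrics): string[] {
  const warnings: string[] = [];
  if (m.avgNotability < 0.5) warnings.push("weak facts: run --recover");
  if (m.cq002PatchRate > 0.5) warnings.push("high patch rate: tune the AI prompt");
  if (m.errorRate > 0.05) warnings.push("errors: check rate limits, reduce concurrency");
  if (m.duplicateRate > 0.05) warnings.push("duplicates: check dedup logic");
  return warnings;
}

// A healthy run produces no warnings:
console.log(triage({ avgNotability: 0.75, cq002PatchRate: 0.28, errorRate: 0.004, duplicateRate: 0.01 })); // []
```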
## Worked Examples

### Example 1: First-Time Science Seeding

**Goal:** Build a deep science knowledge base from scratch.

SEED.md configuration:
```yaml
mode: curated-seed

topics:
  science:
    enabled: true
    priority: high
    subcategories:
      - name: "Physics & Space"
        count: 100
      - name: "Biology & Medicine"
        count: 80
      - name: "Chemistry & Materials"
        count: 60
      - name: "Earth & Environmental"
        count: 60

volume:
  richness_tier: medium
  max_entities: 300
  challenge_difficulty: 1

quality:
  generate_challenges: true
  cleanup_after_seed: false

execution:
  concurrency: 5
  partitions: 4
  dry_run_first: true
  auto_upload: false
```
Expected outcome:
- ~300 entities (Newton, DNA, Periodic Table, Black Holes, etc.)
- ~10,500 facts (35 avg per entity at medium tier)
- ~63,000 challenges (6 styles per fact)
- Cost: ~$70
- Duration: ~10-14 hours
Session commands:

```text
"Read SEED.md and execute. Start with dry-run."
  → Claude previews: 300 entities, 4 subcategories, ~$70 estimate
  → You confirm

"Proceed with generation."
  → Claude runs: generate-curated-entries, bulk-enqueue, monitors explosion
  → Reports: 10,342 facts generated, 23 errors

"Generate challenge content."
  → Claude runs: generate-challenge-content --generate --partition 1/4 (×4)
  → Reports: 62,052 challenges, $63.20 cost

"Validate and upload."
  → Claude runs: --validate (quality check), --upload (DB upsert)
  → Reports: 100% coverage, 28% CQ-002 patched, 0 skipped
```
### Example 2: Targeted Topic Expansion

**Goal:** Add 2,000 geography facts to supplement an existing base.

SEED.md configuration:
```yaml
mode: curated-seed

topics:
  geography:
    enabled: true
    priority: high
    subcategories:
      - name: "Countries & Capitals"
        count: 200
      - name: "Natural Wonders"
        count: 80
      - name: "Rivers, Mountains & Oceans"
        count: 60
      - name: "Cultural Geography"
        count: 40

volume:
  richness_tier: low      # Geography facts are concise
  max_entities: 380
  max_facts: 6000         # Cap to control cost
  challenge_difficulty: 1

execution:
  concurrency: 5
  partitions: 2           # Smaller run, 2 partitions enough
  dry_run_first: true
```
Expected outcome:
- ~380 entities
- ~5,700 facts (15 avg at low tier)
- ~34,200 challenges
- Cost: ~$38
- Duration: ~6-8 hours
### Example 3: Enabling Evergreen for Steady Growth

**Goal:** Add 20 AI-generated timeless facts per day to keep content fresh.

SEED.md configuration:
```yaml
mode: evergreen-boost

evergreen:
  enabled: true
  daily_quota: 20
  distribution:
    science: 20%
    history: 20%
    geography: 15%
    culture: 15%
    sports: 10%
    animals: 10%
    records: 10%
```
What Claude does:

- Sets `EVERGREEN_ENABLED=true` in `.env.local`
- Sets `EVERGREEN_DAILY_QUOTA=20`
- Updates `topic_categories.percent_target` for the listed topics
- Verifies the `generate-evergreen` cron is active
- Reports: "Evergreen enabled. 20 facts/day, ~$0.10/day, distributed across 7 topics."
Expected outcome:
- 20 new facts/day (600/month)
- Fully automated (no operator intervention)
- Cost: ~$3/month
- Challenge content generated automatically via queue
### Example 4: Emergency Content Boost

**Goal:** You need 5,000 facts across entertainment and sports in 24 hours for a launch.

SEED.md configuration:
```yaml
mode: curated-seed

topics:
  entertainment:
    enabled: true
    priority: high
    subcategories:
      - name: "Movies & Directors"
        count: 150
      - name: "Music Artists & Albums"
        count: 150
      - name: "TV Shows"
        count: 100
  sports:
    enabled: true
    priority: high
    subcategories:
      - name: "Football (American)"
        count: 80
      - name: "Basketball"
        count: 80
      - name: "Soccer"
        count: 80
      - name: "Olympic Sports"
        count: 60

volume:
  richness_tier: high     # Maximum facts per entity
  max_facts: 5000
  challenge_difficulty: 1

execution:
  concurrency: 8          # Aggressive concurrency
  partitions: 8           # Maximum parallelism
  dry_run_first: false    # Skip preview, move fast
  auto_upload: true       # Auto-push to DB
```
**Estimated cost:** ~$35 (explosion) + ~$30 (challenges) = ~$65

**Estimated duration:** ~6-8 hours with 8 partitions at concurrency 8
## Common Pitfalls

### 1. Starting a new run while workers are still processing

**Symptom:** `seed_entry_queue` has thousands of `status = 'processing'` rows.

**Fix:** Wait for the current run to finish, or reset stuck entries:

```sql
UPDATE seed_entry_queue SET status = 'pending'
WHERE status = 'processing' AND updated_at < NOW() - INTERVAL '10 minutes';
```
### 2. Running out of API budget mid-run

**Symptom:** Errors spike to 100%; all messages are rate limit errors.

**Fix:** Reduce concurrency and wait. The scripts are resume-safe — just re-run after the rate limit window resets.
### 3. Generating challenges before validation completes

**Symptom:** `generate-challenge-content --export` finds 0 facts to process.

**Why:** Facts must be `status = 'validated'` before challenge content can be generated. If validation workers haven't run, facts are still `pending_validation`.

**Fix:** Run `bun run dev:worker-validate` and wait for validation to complete before generating challenges.
### 4. Forgetting to upload after generation

**Symptom:** Challenges exist in JSONL files but not in the database; the feed shows facts without quiz content.

**Fix:** Run `--upload` to push JSONL data to the database:

```bash
bun scripts/seed/generate-challenge-content.ts --upload
```
### 5. Seeding topics that aren't in the taxonomy

**Symptom:** Entity generation succeeds but explosion fails with "topic_category_id not found."

**Fix:** Check that the topic slug exists in `topic_categories` and is `is_active = true`. New topics need a migration first.
### 6. Duplicate content across runs

**Symptom:** The same facts appear multiple times on the feed.

**Why:** The dedup check uses `title` + `topic_category_id`. If you re-run entity generation with slightly different prompts, the AI may produce entities with different names that generate overlapping facts.

**Prevention:** Don't re-run `generate-curated-entries.ts --insert` for topics that already have entries. Use `--stats` to check first.
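Why renamed entities slip past dedup follows from the shape of the key. This is a sketch of the idea only; whether the real check normalizes case and whitespace is an assumption, and the actual logic lives in the seeding scripts:

```typescript
// Dedup keys on title + topic_category_id, so a differently phrased
// title for the same entity is treated as brand new.
function dedupKey(title: string, topicCategoryId: number): string {
  return `${title.trim().toLowerCase()}::${topicCategoryId}`;
}

// Same phrasing -> collision caught:
console.log(dedupKey("Julius Caesar", 3) === dedupKey(" julius caesar", 3)); // true
// Different phrasing for the same person -> slips through as "new":
console.log(dedupKey("Julius Caesar", 3) === dedupKey("Gaius Julius Caesar", 3)); // false
```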
## Monitoring & Observability

### During a Seeding Run

```bash
# Check seed entry queue status
bun -e "
import { getDrizzleClient } from './packages/db/src/drizzle/client.ts';
import { sql } from 'drizzle-orm';
const db = getDrizzleClient();
const result = await db.execute(sql\`
  SELECT status, COUNT(*) as count
  FROM seed_entry_queue
  GROUP BY status ORDER BY count DESC
\`);
console.table(result);
process.exit(0);
"
```

```bash
# Check fact generation progress
bun -e "
import { getDrizzleClient } from './packages/db/src/drizzle/client.ts';
import { sql } from 'drizzle-orm';
const db = getDrizzleClient();
const result = await db.execute(sql\`
  SELECT source_type, status, COUNT(*) as count
  FROM fact_records
  WHERE created_at > NOW() - INTERVAL '24 hours'
  GROUP BY source_type, status ORDER BY count DESC
\`);
console.table(result);
process.exit(0);
"
```
### After a Seeding Run

```bash
# Full audit
bun scripts/seed/generate-challenge-content.ts --audit

# Check topic distribution
bun -e "
import { getDrizzleClient } from './packages/db/src/drizzle/client.ts';
import { sql } from 'drizzle-orm';
const db = getDrizzleClient();
const result = await db.execute(sql\`
  SELECT tc.name, COUNT(fr.id) as facts
  FROM fact_records fr
  JOIN topic_categories tc ON fr.topic_category_id = tc.id
  WHERE fr.status = 'validated'
  GROUP BY tc.name ORDER BY facts DESC
\`);
console.table(result);
process.exit(0);
"
```
## Recovery Procedures

### Script Crashed Mid-Run

All scripts are resume-safe. Just re-run the same command:

```bash
# Picks up where it left off (reads JSONL to find completed IDs)
bun scripts/seed/generate-challenge-content.ts --generate --concurrency 5
```
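The resume behavior follows the standard JSONL checkpoint pattern, which can be sketched as follows. The file name and record shape are illustrative, not the scripts' actual format:

```typescript
import { appendFileSync, existsSync, readFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Append one completed record per line as work finishes.
function checkpoint(jsonlPath: string, record: { factId: string }) {
  appendFileSync(jsonlPath, JSON.stringify(record) + "\n");
}

// On restart, read back completed IDs and skip them.
function loadCompletedIds(jsonlPath: string): Set<string> {
  if (!existsSync(jsonlPath)) return new Set();
  const lines = readFileSync(jsonlPath, "utf8").split("\n").filter(Boolean);
  return new Set(lines.map((line) => JSON.parse(line).factId as string));
}

// Demo: two checkpoints survive a "crash" (re-reading the file).
const demoPath = join(tmpdir(), `seed-demo-${Date.now()}.jsonl`);
checkpoint(demoPath, { factId: "fact-1" });
checkpoint(demoPath, { factId: "fact-2" });
console.log(loadCompletedIds(demoPath).size); // 2
```

Because completion is recorded only after a line is fully written, a re-run at worst redoes the last in-flight item; it never loses finished work.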
### Bad Content Generated

If a batch of facts has quality issues:

```bash
# 1. Identify the problem batch
bun scripts/seed/generate-challenge-content.ts --validate

# 2. Delete bad challenge content
#    (fact_records remain; only challenges are regenerated)
bun scripts/seed/generate-challenge-content.ts --recover

# 3. Or run a full cleanup pass
bun scripts/seed/cleanup-content.ts --fix --concurrency 5
```
### Need to Undo an Upload

If bad data was uploaded to the database:

```sql
-- Archive (soft-delete) facts from a specific batch
UPDATE fact_records
SET status = 'archived'
WHERE source_type = 'file_seed'
  AND created_at > '2026-02-18T00:00:00Z'
  AND topic_category_id = (SELECT id FROM topic_categories WHERE slug = 'bad-topic');
```

Challenge content is automatically excluded when the parent fact is archived.
## Related Documents

- `SEED.md` — Seeding control prompt (edit this to direct seeding)
- `runbook.md` — Detailed operational procedures
- `manual-seeding-guide.md` — File-based seeding from XLSX/DOCX
- `01-taxonomy-expansion.md` — Adding new topic categories
- `04-taxonomy-coherence.md` — Category alias mapping
- `../../rules/challenge-content.md` — Quality rules (CC/CQ)
- `logs/` — Seed job logs (per-run records with costs, errors, results)