Seeding Best Practices

Companion guide to SEED.md. Covers strategies, common patterns, pitfalls, and worked examples for operating the Eko seeding system.


Mental Model

The seeding system has three layers. Understanding them prevents confusion:

Layer 1: ENTITIES           Layer 2: FACTS              Layer 3: CHALLENGES
(seed_entry_queue)          (fact_records)              (fact_challenge_content)

"Julius Caesar"    ──AI──>  "Caesar crossed the         ──AI──>  6 quiz styles:
"Cleopatra"                  Rubicon in 49 BC"                   multiple_choice
"Ancient Egypt"              "Cleopatra ruled Egypt               direct_question
                              from 51-30 BC"                     fill_the_gap
                             ...50-100 facts each                statement_blank
                                                                 reverse_lookup
                                                                 free_text

Key insight: You control Layer 1 (what entities to seed). Layers 2 and 3 are generated automatically. More entities = more facts = more challenges = more cost. The richness tier controls the Layer 1 → Layer 2 expansion ratio.


Golden Rules

  1. Always dry-run first. Every script supports --dry-run. Use it. A 500-entity seeding run costs ~$68 — previewing costs nothing.

  2. Seed deep, not wide. 200 entities across 3 topics produces better content than 50 entities across 12 topics. Depth creates interconnected facts that make better quizzes.

  3. Let the pipeline finish. Don't start new seeding runs while workers are still processing. Check seed_entry_queue status before starting.

  4. JSONL is your checkpoint. All batch scripts write to local JSONL files before touching the database. If anything goes wrong, your progress is saved. Re-running resumes from where it left off.

  5. Validate before upload. Always run --validate after --generate and before --upload. Catch quality issues before they reach the database.

  6. Use difficulty 0 for balanced coverage. Setting challenge_difficulty: 0 generates challenges across all 5 difficulty levels in a single run (sequentially). This is ideal for initial corpus builds. For incremental expansion, use a single level (1-5) to control cost.


Choosing a Seeding Mode

Decision Tree

Is the system already running with enough content?
  YES → mode: news-only (autopilot, ~$0.30/day)
  NO  ↓

Do you need timeless/trivia content or news-driven content?
  NEWS → mode: news-only (let crons handle it)
  TIMELESS ↓

Is this a first-time build or incremental expansion?
  FIRST TIME → mode: full-pipeline
  EXPANSION  → mode: curated-seed

Do you also want daily AI-generated facts?
  YES → mode: evergreen-boost (add to above)
  NO  → stick with curated-seed
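The decision tree above can be expressed as a small helper. This is an illustrative sketch only — the mode names come from this guide, but the `Situation` shape and `pickModes` function are hypothetical, not part of any seeding script:

```typescript
type SeedMode = "news-only" | "curated-seed" | "evergreen-boost" | "full-pipeline";

interface Situation {
  hasEnoughContent: boolean;  // system already running with enough content?
  wantsNews: boolean;         // news-driven rather than timeless content?
  firstTimeBuild: boolean;    // first build vs incremental expansion
  wantsDailyAiFacts: boolean; // also want daily AI-generated facts?
}

// Mirrors the decision tree above, branch for branch.
function pickModes(s: Situation): SeedMode[] {
  if (s.hasEnoughContent || s.wantsNews) return ["news-only"];
  const base: SeedMode = s.firstTimeBuild ? "full-pipeline" : "curated-seed";
  return s.wantsDailyAiFacts ? [base, "evergreen-boost"] : [base];
}

console.log(pickModes({
  hasEnoughContent: false, wantsNews: false,
  firstTimeBuild: false, wantsDailyAiFacts: true,
})); // ["curated-seed", "evergreen-boost"]
```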

Mode Comparison

Mode              Cost/Day           Operator Effort             Content Type        Best For
─────────────────────────────────────────────────────────────────────────────────────────────────
news-only         ~$0.30             None (automated)            Current events      Steady state
curated-seed      One-time batch     Medium (run scripts)        Deep knowledge      Topic expansion
evergreen-boost   ~$0.40             Low (toggle on)             Timeless AI facts   Supplementing news
full-pipeline     One-time + daily   High (monitor everything)   All types           Initial build

Topic Strategy

Which Topics to Seed First

Prioritize topics that are:

  • Quiz-friendly — facts with clear, verifiable answers (history, science, geography)
  • Entity-rich — many distinct things to learn about (sports players, countries, animals)
  • Evergreen — facts that don't expire (records, nature, space)

Avoid starting with topics that are:

  • Rapidly changing — current events are better served by news ingestion
  • Subjective — opinions, rankings, "best of" lists make poor quiz questions
  • Narrow — very niche topics produce thin, repetitive content

# Tier 1: Seed these first (highest quiz potential)
topics:
  history: { priority: high, count: 700 }     # Deep, entity-rich, evergreen
  science: { priority: high, count: 500 }     # Verifiable, educational
  geography: { priority: high, count: 400 }   # Clear answers, visual

# Tier 2: Seed after Tier 1 is validated
  sports: { priority: medium, count: 300 }    # Popular, stat-heavy
  culture: { priority: medium, count: 300 }   # Broad appeal
  animals: { priority: medium, count: 200 }   # Fun facts, educational

# Tier 3: Fill in later
  technology: { priority: low, count: 150 }
  food-beverage: { priority: low, count: 100 }
  space: { priority: low, count: 100 }

Subcategory Balance

Within a topic, distribute counts unevenly — give more to subcategories with richer entity pools:

# Good: weighted by entity richness
history:
  subcategories:
    - name: "Historic Figures"
      count: 200      # Huge entity pool (thousands of notable people)
    - name: "Ancient Civilizations"
      count: 150      # Rich but bounded
    - name: "Modern History"
      count: 100      # Events > entities
    - name: "Post-War & Contemporary"
      count: 50       # Smaller, more recent

# Bad: equal distribution ignores entity density
history:
  subcategories:
    - name: "Historic Figures"
      count: 125
    - name: "Ancient Civilizations"
      count: 125
    - name: "Modern History"
      count: 125
    - name: "Post-War & Contemporary"
      count: 125      # Will produce thin, repetitive content
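One way to produce a weighted split like the good example above is to allocate the topic's total proportionally to a rough entity-richness weight per subcategory. This is an illustrative helper, not a system input — the weights are operator judgment:

```typescript
// Allocate a topic's entity budget across subcategories by weight.
// Weights reflect estimated entity-pool richness (operator judgment).
function allocateCounts(total: number, weights: number[]): number[] {
  const sum = weights.reduce((a, b) => a + b, 0);
  // Floor each share, then hand the remainder to the largest weights
  const counts = weights.map((w) => Math.floor((total * w) / sum));
  let remainder = total - counts.reduce((a, b) => a + b, 0);
  const order = weights.map((_, i) => i).sort((a, b) => weights[b] - weights[a]);
  for (const i of order) {
    if (remainder === 0) break;
    counts[i] += 1;
    remainder -= 1;
  }
  return counts;
}

// History: 500 entities, weights 4:3:2:1 across the four subcategories
console.log(allocateCounts(500, [4, 3, 2, 1])); // [200, 150, 100, 50]
```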

Volume Tuning

Richness Tier Guide

The richness tier controls how many facts the AI generates per entity. Choose based on the topic:

Tier     Facts/Entity   Best For                                         Example
──────────────────────────────────────────────────────────────────────────────────────────────────────────
high     50-100         Entities with deep, varied factual content       "Albert Einstein", "Ancient Rome", "Michael Jordan"
medium   20-50          Entities with moderate factual depth             "Uranium", "Costa Rica", "Bluetooth"
low      10-20          Entities where facts are limited or repetitive   "Helvetica font", "Quinoa", "Podcasting"

Rule of thumb: If you can think of 20+ interesting facts about a typical entity in the topic, use high. If you'd struggle past 10, use low.

Batch Size Planning

Entities × Facts/Entity = Total Facts
Total Facts × Styles = Total Challenges
Total Facts × Styles × $0.001 = Challenge Generation Cost

Example (all 6 styles):
  500 entities × 35 facts (medium) = 17,500 facts
  17,500 × 6 = 105,000 challenges
  17,500 × $0.006 = ~$105 challenge cost
  + ~$6 explosion cost + ~$2 validation ≈ $113 total

Example (3 core styles — mc,dq,ftg):
  500 entities × 35 facts = 17,500 facts
  17,500 × 3 = 52,500 challenges
  17,500 × $0.003 = ~$52 challenge cost
  + ~$6 explosion cost + ~$2 validation ≈ $60 total
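The arithmetic above can be sketched as a small helper. The $0.001-per-fact-per-style rate is this guide's estimate, not exact API billing, and `estimateBatch` is a hypothetical function, not one of the seeding scripts:

```typescript
// Rough cost estimator mirroring the batch-size math above.
// The $0.001/style rate is this guide's estimate, not exact API billing.
const COST_PER_FACT_PER_STYLE = 0.001;

function estimateBatch(entities: number, factsPerEntity: number, styles: number) {
  const totalFacts = entities * factsPerEntity;
  const totalChallenges = totalFacts * styles;
  const challengeCost = totalChallenges * COST_PER_FACT_PER_STYLE;
  return { totalFacts, totalChallenges, challengeCost };
}

// 500 entities at medium richness (~35 facts/entity), all 6 styles:
console.log(estimateBatch(500, 35, 6)); // 17,500 facts, 105,000 challenges, ≈ $105

// Same run with only the 3 core styles (mc, dq, ftg):
console.log(estimateBatch(500, 35, 3)); // 17,500 facts, 52,500 challenges, ≈ $52.50
```

Remember to add the roughly fixed explosion (~$6) and validation (~$2) overhead on top of the challenge cost.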

Incremental Seeding

Don't try to seed everything at once. Build incrementally:

Week 1: 200 entities across history + science (Tier 1)
  → Validate quality, check quiz UX
  → Estimated: ~7,000 facts, ~$45

Week 2: 300 entities across geography + sports + culture (Tier 2)
  → Build on learnings from Week 1
  → Estimated: ~10,000 facts, ~$68

Week 3: Enable evergreen-boost (20 facts/day ongoing)
  → Automated, low-cost supplement
  → Estimated: ~$0.10/day

Week 4: Expand weak topics, backfill challenge content gaps
  → Use --audit to find gaps
  → Targeted, efficient

Cost Management

Cost Breakdown by Phase

For a typical 500-entity medium-richness run:

Phase               Cost        % of Total    Duration
─────────────────────────────────────────────────────────
Entity generation   ~$1.00      1.5%          5 min
Fact explosion      ~$5.00      7.4%          2-4 hours
Validation          ~$2.00      2.9%          30 min
Challenge content   ~$60.00     88.2%         8-12 hours
Content cleanup     ~$0.00      0.0%          (skip unless needed)
─────────────────────────────────────────────────────────
Total               ~$68.00     100%          ~12-16 hours

Challenge content dominates cost because each fact generates 6 detailed challenge styles, each with setup text, challenge text, reveal text, and a 3-6 sentence correct answer. If budget is tight, apply the strategies below.

Cost Reduction Strategies

  1. Skip challenge generation initially. Seed entities and facts first, generate challenges later when budget allows. Facts are useful without challenges (they still appear on cards).

  2. Generate fewer styles. Set challenge_styles: mc,dq,ftg in SEED.md (or --styles mc,dq,ftg on the CLI) to generate only the 3 core styles. This halves challenge cost. You can add the remaining styles later — the DB composite key supports incremental style addition.

  3. Use lower richness tiers. Dropping from medium (35 facts/entity) to low (15 facts/entity) cuts all downstream costs by ~60%.

  4. Partition across days. Run 125 entities per day over 4 days instead of 500 at once. Spreads API cost and lets you catch issues early.

Budget Templates

# Budget: $25 (starter)
volume:
  max_entities: 150
  richness_tier: low
  challenge_difficulty: 1       # Single difficulty to control cost
# Expected: ~2,250 facts, ~13,500 challenges

# Budget: $75 (standard)
volume:
  max_entities: 500
  richness_tier: medium
  challenge_difficulty: 1       # Start with easy, expand later
# Expected: ~17,500 facts, ~105,000 challenges

# Budget: $200 (comprehensive)
volume:
  max_entities: 1500
  richness_tier: medium
  challenge_difficulty: 0       # Balanced spread — all 5 difficulty levels
quality:
  cleanup_after_seed: true
# Expected: ~52,500 facts, ~315,000 challenges

Quality Assurance

Pre-Seed Checklist

Before starting a seeding run:

  • bun scripts/seed/generate-challenge-content.ts --audit — check current coverage
  • Verify .env.local has OPENAI_API_KEY and DATABASE_URL
  • Check seed_entry_queue for stuck entries: no status = 'processing' older than 10 min
  • Check Redis queue depth: no large backlog from previous runs
  • Run --dry-run first and review the output

Post-Seed Validation

After a seeding run completes:

# 1. Check fact quality
bun scripts/seed/generate-challenge-content.ts --validate

# 2. Check for NULL metadata
bun scripts/seed/backfill-fact-nulls.ts --audit

# 3. Sample review (manually read 10-20 facts)
bun -e "
import { getDrizzleClient } from './packages/db/src/drizzle/client.ts';
import { sql } from 'drizzle-orm';
const db = getDrizzleClient();
const sample = await db.execute(sql\`
  SELECT title, challenge_title, notability_score, topic_category_id
  FROM fact_records
  WHERE source_type = 'file_seed' AND status = 'validated'
  ORDER BY RANDOM() LIMIT 10
\`);
for (const row of sample) console.log(row.title, '|', row.notability_score);
process.exit(0);
"

Quality Signals to Watch

Signal               Healthy                    Warning             Action
─────────────────────────────────────────────────────────────────────────────────────────────────
Notability scores    0.6-0.9 average            Below 0.5 average   Run --recover to reprocess weak facts
CQ-002 patch rate    < 30% patched              > 50% patched       AI prompt may need tuning
Error rate           < 1%                       > 5%                Check API rate limits, reduce concurrency
Duplicate facts      < 2%                       > 5%                Check dedup logic, may need manual cleanup
Challenge coverage   Expected styles per fact   < expected styles   Run --recover for missing styles
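These thresholds are easy to automate. A sketch follows — the threshold values come from the table above, but the metric names and `qualityWarnings` function are illustrative, not fields any seeding script actually emits:

```typescript
interface QualityMetrics {
  notabilityAvg: number; // mean notability score across the batch
  patchRate: number;     // fraction of facts patched by CQ-002
  errorRate: number;     // fraction of failed generations
  dupRate: number;       // fraction of duplicate facts
}

// Return a warning string for each metric past the table's warning threshold.
function qualityWarnings(m: QualityMetrics): string[] {
  const warnings: string[] = [];
  if (m.notabilityAvg < 0.5) warnings.push("low notability: run --recover");
  if (m.patchRate > 0.5) warnings.push("high CQ-002 patch rate: tune prompt");
  if (m.errorRate > 0.05) warnings.push("high error rate: reduce concurrency");
  if (m.dupRate > 0.05) warnings.push("duplicates: check dedup logic");
  return warnings;
}

console.log(qualityWarnings({
  notabilityAvg: 0.72, patchRate: 0.28, errorRate: 0.004, dupRate: 0.01,
})); // [] — a healthy run
```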

Worked Examples

Example 1: First-Time Science Seeding

Goal: Build a deep science knowledge base from scratch.

SEED.md configuration:

mode: curated-seed

topics:
  science:
    enabled: true
    priority: high
    subcategories:
      - name: "Physics & Space"
        count: 100
      - name: "Biology & Medicine"
        count: 80
      - name: "Chemistry & Materials"
        count: 60
      - name: "Earth & Environmental"
        count: 60

volume:
  richness_tier: medium
  max_entities: 300
  challenge_difficulty: 1

quality:
  generate_challenges: true
  cleanup_after_seed: false

execution:
  concurrency: 5
  partitions: 4
  dry_run_first: true
  auto_upload: false

Expected outcome:

  • ~300 entities (Newton, DNA, Periodic Table, Black Holes, etc.)
  • ~10,500 facts (35 avg per entity at medium tier)
  • ~63,000 challenges (6 styles per fact)
  • Cost: ~$70
  • Duration: ~10-14 hours

Session commands:

"Read SEED.md and execute. Start with dry-run."
→ Claude previews: 300 entities, 4 subcategories, ~$70 estimate
→ You confirm

"Proceed with generation."
→ Claude runs: generate-curated-entries, bulk-enqueue, monitors explosion
→ Reports: 10,342 facts generated, 23 errors

"Generate challenge content."
→ Claude runs: generate-challenge-content --generate --partition 1/4 (x4)
→ Reports: 62,052 challenges, $63.20 cost

"Validate and upload."
→ Claude runs: --validate (quality check), --upload (DB upsert)
→ Reports: 100% coverage, 28% CQ-002 patched, 0 skipped

Example 2: Targeted Topic Expansion

Goal: Add 2,000 geography facts to supplement an existing base.

SEED.md configuration:

mode: curated-seed

topics:
  geography:
    enabled: true
    priority: high
    subcategories:
      - name: "Countries & Capitals"
        count: 200
      - name: "Natural Wonders"
        count: 80
      - name: "Rivers, Mountains & Oceans"
        count: 60
      - name: "Cultural Geography"
        count: 40

volume:
  richness_tier: low       # Geography facts are concise
  max_entities: 380
  max_facts: 6000          # Cap to control cost
  challenge_difficulty: 1

execution:
  concurrency: 5
  partitions: 2            # Smaller run, 2 partitions enough
  dry_run_first: true

Expected outcome:

  • ~380 entities
  • ~5,700 facts (15 avg at low tier)
  • ~34,200 challenges
  • Cost: ~$38
  • Duration: ~6-8 hours

Example 3: Enabling Evergreen for Steady Growth

Goal: Add 20 AI-generated timeless facts per day to keep content fresh.

SEED.md configuration:

mode: evergreen-boost

evergreen:
  enabled: true
  daily_quota: 20
  distribution:
    science: 20%
    history: 20%
    geography: 15%
    culture: 15%
    sports: 10%
    animals: 10%
    records: 10%

What Claude does:

  1. Sets EVERGREEN_ENABLED=true in .env.local
  2. Sets EVERGREEN_DAILY_QUOTA=20
  3. Updates topic_categories.percent_target for listed topics
  4. Verifies the generate-evergreen cron is active
  5. Reports: "Evergreen enabled. 20 facts/day, ~$0.10/day, distributed across 7 topics."

Expected outcome:

  • 20 new facts/day (600/month)
  • Fully automated (no operator intervention)
  • Cost: ~$3/month
  • Challenge content generated automatically via queue

Example 4: Emergency Content Boost

Goal: You need 5,000 facts across entertainment and sports in 24 hours for a launch.

SEED.md configuration:

mode: curated-seed

topics:
  entertainment:
    enabled: true
    priority: high
    subcategories:
      - name: "Movies & Directors"
        count: 150
      - name: "Music Artists & Albums"
        count: 150
      - name: "TV Shows"
        count: 100
  sports:
    enabled: true
    priority: high
    subcategories:
      - name: "Football (American)"
        count: 80
      - name: "Basketball"
        count: 80
      - name: "Soccer"
        count: 80
      - name: "Olympic Sports"
        count: 60

volume:
  richness_tier: high       # Maximum facts per entity
  max_facts: 5000
  challenge_difficulty: 1

execution:
  concurrency: 8            # Aggressive concurrency
  partitions: 8             # Maximum parallelism
  dry_run_first: false      # Skip preview, move fast
  auto_upload: true         # Auto-push to DB

Estimated cost: ~$35 (explosion) + ~$30 (challenges) = ~$65
Estimated duration: ~6-8 hours with 8 partitions at concurrency 8


Common Pitfalls

1. Starting a new run while workers are still processing

Symptom: seed_entry_queue has thousands of status = 'processing' rows.

Fix: Wait for the current run to finish, or reset stuck entries:

UPDATE seed_entry_queue SET status = 'pending'
WHERE status = 'processing' AND updated_at < NOW() - INTERVAL '10 minutes';

2. Running out of API budget mid-run

Symptom: Errors spike to 100%, all messages are rate limit errors.

Fix: Reduce concurrency and wait. The scripts are resume-safe — just re-run after the rate limit window resets.

3. Generating challenges before validation completes

Symptom: generate-challenge-content --export finds 0 facts to process.

Why: Facts must be status = 'validated' before challenge content can be generated. If validation workers haven't run, facts are still pending_validation.

Fix: Run bun run dev:worker-validate and wait for validation to complete before generating challenges.

4. Forgetting to upload after generation

Symptom: Challenges exist in JSONL files but not in the database. The feed shows facts without quiz content.

Fix: Run --upload to push JSONL data to the database:

bun scripts/seed/generate-challenge-content.ts --upload

5. Seeding topics that aren't in the taxonomy

Symptom: Entity generation succeeds but explosion fails with "topic_category_id not found."

Fix: Check that the topic slug exists in topic_categories and is is_active = true. New topics need a migration first.

6. Duplicate content across runs

Symptom: Same facts appearing multiple times on the feed.

Why: The dedup check uses title + topic_category_id. If you re-run entity generation with slightly different prompts, the AI may produce entities with different names that generate overlapping facts.

Prevention: Don't re-run generate-curated-entries.ts --insert for topics that already have entries. Use --stats to check first.
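The dedup identity can be pictured like this. This is a sketch of the concept only — the actual implementation's normalization (casing, trimming) may differ, and `dedupKey` is a hypothetical helper:

```typescript
// Sketch of the dedup identity: title + topic_category_id.
// Real normalization details may differ from this illustration.
function dedupKey(title: string, topicCategoryId: number): string {
  return `${title.trim().toLowerCase()}::${topicCategoryId}`;
}

// Same entity, same phrasing → caught as a duplicate:
console.log(dedupKey("Julius Caesar", 3) === dedupKey("julius caesar ", 3)); // true

// Same entity, different phrasing → slips past the check:
console.log(dedupKey("Julius Caesar", 3) === dedupKey("Gaius Julius Caesar", 3)); // false
```

This is why re-running entity generation with a reworded prompt can yield overlapping facts under non-matching keys.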


Monitoring & Observability

During a Seeding Run

# Check seed entry queue status
bun -e "
import { getDrizzleClient } from './packages/db/src/drizzle/client.ts';
import { sql } from 'drizzle-orm';
const db = getDrizzleClient();
const result = await db.execute(sql\`
  SELECT status, COUNT(*) as count
  FROM seed_entry_queue
  GROUP BY status ORDER BY count DESC
\`);
console.table(result);
process.exit(0);
"

# Check fact generation progress
bun -e "
import { getDrizzleClient } from './packages/db/src/drizzle/client.ts';
import { sql } from 'drizzle-orm';
const db = getDrizzleClient();
const result = await db.execute(sql\`
  SELECT source_type, status, COUNT(*) as count
  FROM fact_records
  WHERE created_at > NOW() - INTERVAL '24 hours'
  GROUP BY source_type, status ORDER BY count DESC
\`);
console.table(result);
process.exit(0);
"

After a Seeding Run

# Full audit
bun scripts/seed/generate-challenge-content.ts --audit

# Check topic distribution
bun -e "
import { getDrizzleClient } from './packages/db/src/drizzle/client.ts';
import { sql } from 'drizzle-orm';
const db = getDrizzleClient();
const result = await db.execute(sql\`
  SELECT tc.name, COUNT(fr.id) as facts
  FROM fact_records fr
  JOIN topic_categories tc ON fr.topic_category_id = tc.id
  WHERE fr.status = 'validated'
  GROUP BY tc.name ORDER BY facts DESC
\`);
console.table(result);
process.exit(0);
"

Recovery Procedures

Script Crashed Mid-Run

All scripts are resume-safe. Just re-run the same command:

# Picks up where it left off (reads JSONL to find completed IDs)
bun scripts/seed/generate-challenge-content.ts --generate --concurrency 5
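The resume mechanism can be pictured like this — a sketch of the pattern, assuming one JSON object with an `id` field per JSONL line; the real scripts' field names and file handling may differ:

```typescript
// Resume-from-JSONL pattern: collect IDs already written to the
// checkpoint file, then skip them on the next run.
function completedIds(jsonl: string): Set<string> {
  const ids = new Set<string>();
  for (const line of jsonl.split("\n")) {
    if (!line.trim()) continue; // tolerate blank lines
    try {
      const record = JSON.parse(line);
      if (record.id) ids.add(record.id);
    } catch {
      // A half-written final line from a crash is skipped, not fatal
    }
  }
  return ids;
}

const checkpoint = '{"id":"fact-1"}\n{"id":"fact-2"}\n{"id":"fact-3"';
const done = completedIds(checkpoint); // truncated fact-3 line is not counted
const pending = ["fact-1", "fact-2", "fact-3", "fact-4"].filter((id) => !done.has(id));
console.log(pending); // ["fact-3", "fact-4"]
```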

Bad Content Generated

If a batch of facts has quality issues:

# 1. Identify the problem batch
bun scripts/seed/generate-challenge-content.ts --validate

# 2. Delete bad challenge content
# (fact_records remain, only challenges are regenerated)
bun scripts/seed/generate-challenge-content.ts --recover

# 3. Or run a full cleanup pass
bun scripts/seed/cleanup-content.ts --fix --concurrency 5

Need to Undo an Upload

If bad data was uploaded to the database:

-- Archive (soft-delete) facts from a specific batch
UPDATE fact_records
SET status = 'archived'
WHERE source_type = 'file_seed'
AND created_at > '2026-02-18T00:00:00Z'
AND topic_category_id = (SELECT id FROM topic_categories WHERE slug = 'bad-topic');

Challenge content is automatically excluded when the parent fact is archived.