Seed Pipeline Runbook

Operational procedures for running the seed pipeline.

Before starting: Use SEED.md to configure what gets seeded, and review seeding-best-practices.md for strategies and cost management.

Full Corpus Seeding Procedure

Phase 1: Parse Legacy Files

# Dry run to verify parsing
bun scripts/seed/seed-from-files.ts --parse --dry-run

# Insert entries into seed_entry_queue
bun scripts/seed/seed-from-files.ts --parse

Verify: bun scripts/seed/seed-from-files.ts --stats shows entries across topic categories.

Phase 2: Fix Category Inheritance

After parsing, spinoff entries from prior runs may lack topic_category_id. Fix them:

bun -e "
import { getDrizzleClient } from './packages/db/src/drizzle/client.ts';
import { sql } from 'drizzle-orm';
const db = getDrizzleClient();
const result = await db.execute(sql\`
  UPDATE seed_entry_queue child
  SET topic_category_id = parent.topic_category_id
  FROM seed_entry_queue parent
  WHERE child.parent_entry_id = parent.id
  AND child.topic_category_id IS NULL
  AND parent.topic_category_id IS NOT NULL
  RETURNING child.id
\`);
console.log('Fixed:', result.length);
process.exit(0);
"

Phase 3: Bulk Enqueue

bun scripts/seed/bulk-enqueue.ts

This queries all pending entries with topic_category_id set and enqueues them to Redis in batches of 500 using Redis pipeline calls.
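The batching step can be sketched as follows (a minimal sketch; `enqueueBatch` is a hypothetical stand-in for the script's actual Redis pipeline call):

```typescript
// Split a list of pending entry IDs into fixed-size batches so each batch
// can be sent as a single Redis pipeline call.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Hypothetical usage:
// for (const batch of chunk(pendingIds, 500)) await enqueueBatch(batch);
```

Batching keeps each pipeline call bounded, so a failure mid-run loses at most one batch rather than the whole enqueue.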

Phase 4: Start Workers

Ensure OPENAI_API_KEY is set in .env.local (default model: gpt-5-mini via ModelAdapter).

# Key 1 workers (5 instances)
for port in 4010 4011 4012 4013 4014; do
  WORKER_CONCURRENCY=10 PORT=$port bun run dev:worker-facts &
done

# Key 2 workers (5 instances, different API key)
for port in 4020 4021 4022 4023 4024; do
  OPENAI_API_KEY=second-key WORKER_CONCURRENCY=10 PORT=$port bun run dev:worker-facts &
done

Phase 5: Monitor & Maintain

Every 10-15 minutes during the run:

  1. Check status: bash /tmp/seed-status.sh
  2. Fix new spinoff categories: Run the Phase 2 SQL
  3. Reset failed entries: UPDATE seed_entry_queue SET status = 'pending' WHERE status = 'failed';
  4. Re-enqueue: bun scripts/seed/bulk-enqueue.ts

Phase 6: Validation

After explosion completes:

bun run dev:worker-validate

This processes VALIDATE_FACT messages created during explosion.

Phase 7: Cleanup

# Final stats
bun scripts/seed/seed-from-files.ts --stats

# Verify facts
SELECT source_type, status, COUNT(*) FROM fact_records GROUP BY source_type, status;

# Check for orphaned entries
SELECT COUNT(*) FROM seed_entry_queue WHERE status = 'pending' AND topic_category_id IS NULL;

Content Cleanup (Full Corpus Rewrite)

Rewrites ALL facts across the entire corpus — title, challenge_title, context, notability_score, and notability_reason. Uses a local-first JSONL architecture for crash resilience.

Prerequisites: OPENAI_API_KEY set in .env.local, DATABASE_URL set for direct Postgres access.

Step 1: Audit

bun scripts/seed/cleanup-content.ts --audit

Reports: total facts by source_type and topic, quality issues (null fields, anti-patterns), and estimated cost.

Step 2: Export

bun scripts/seed/cleanup-content.ts --export

Dumps all fact_records to scripts/seed/.cleanup-data/facts-export.jsonl (gitignored). Paginated reads in batches of 1000.
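The paginated read can be sketched like this (a sketch only; `fetchPage` is a hypothetical stand-in for the script's Postgres query):

```typescript
// Read a large table in fixed-size pages, yielding rows until a short page
// signals the end. Keeps memory bounded regardless of table size.
async function* paginate<T>(
  fetchPage: (limit: number, offset: number) => Promise<T[]>,
  batchSize = 1000,
): AsyncGenerator<T> {
  for (let offset = 0; ; offset += batchSize) {
    const rows = await fetchPage(batchSize, offset);
    for (const row of rows) yield row;
    if (rows.length < batchSize) break; // short page: no more rows
  }
}
```

Each yielded row would then be appended to facts-export.jsonl as one JSON line.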

Step 3: Fix (AI Rewrite)

# Dry run: preview facts without writing (here capped at 20)
bun scripts/seed/cleanup-content.ts --fix --dry-run --limit 20

# Full run (5 concurrent batches of 20)
bun scripts/seed/cleanup-content.ts --fix --concurrency 5

Processes facts-export.jsonl → writes results to facts-fixed.jsonl. Resume-safe: skips facts already in the fixed file.
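The resume-safe skip works by collecting IDs already present in the fixed-output file. A minimal sketch of that idea, operating on raw JSONL text (field name `id` is an assumption):

```typescript
// Parse a JSONL string and collect the `id` of every record, so a re-run
// can filter out facts that were already processed.
function processedIds(jsonl: string): Set<string> {
  const ids = new Set<string>();
  for (const line of jsonl.split("\n")) {
    if (!line.trim()) continue; // tolerate blank/trailing lines
    ids.add(JSON.parse(line).id as string);
  }
  return ids;
}

// Hypothetical usage:
// const done = processedIds(await Bun.file("facts-fixed.jsonl").text());
// const remaining = facts.filter((f) => !done.has(f.id));
```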

Step 4: Upload

# Dry run: show what would be uploaded
bun scripts/seed/cleanup-content.ts --upload --dry-run

# Bulk upload (500 rows per UPDATE statement)
bun scripts/seed/cleanup-content.ts --upload

Step 5: Validate

bun scripts/seed/cleanup-content.ts --validate

Re-runs quality audit, samples 20 random facts for manual review, reports cleanup cost.
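Drawing the 20-fact review sample could look like this (a sketch; the script's actual sampling method may differ):

```typescript
// Draw k distinct random items via a partial Fisher-Yates shuffle:
// only the first k positions are shuffled, so it is O(k) swaps.
function sample<T>(items: T[], k: number): T[] {
  const a = items.slice();
  const n = Math.min(k, a.length);
  for (let i = 0; i < n; i++) {
    const j = i + Math.floor(Math.random() * (a.length - i));
    [a[i], a[j]] = [a[j], a[i]];
  }
  return a.slice(0, n);
}
```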

Step 6: Recover Weak Outputs

# Classify weak facts (notability <= 0.5) and re-process recoverable ones
bun scripts/seed/cleanup-content.ts --recover

# Parallel execution across multiple instances
bun scripts/seed/cleanup-content.ts --recover --partition 1/3
bun scripts/seed/cleanup-content.ts --recover --partition 2/3
bun scripts/seed/cleanup-content.ts --recover --partition 3/3

The --recover phase classifies weak outputs (notability <= 0.5) into recoverable (rich source data) vs vague (poor source data) using a classifyFactData() heuristic. Recoverable facts are re-processed with full structured data in the prompt.

Flags:

  • --partition N/M — split work across M instances, process partition N (for parallel runs)
  • --output-suffix — custom suffix for per-instance output files (e.g., facts-fixed-p1.jsonl)
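Partitioning only works if every instance assigns facts the same way. One stable scheme (an assumption — the script's exact assignment may differ) is to take the fact's position modulo M:

```typescript
// Deterministically assign work item `index` to one of `m` partitions.
// Partitions are 1-based to match the CLI: --partition 1/3 means n=1, m=3.
function inPartition(index: number, n: number, m: number): boolean {
  return index % m === n - 1;
}
```

Because the assignment is deterministic, the three `--partition` invocations above cover every fact exactly once between them.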

Challenge Content Generation (Full Corpus Backfill)

Generate pre-built challenge content for all validated facts using scripts/seed/generate-challenge-content.ts. Uses a 6-phase JSONL pipeline matching the cleanup-content.ts pattern.

Prerequisites: OPENAI_API_KEY set in .env.local, DATABASE_URL set for direct Postgres access.

Phase: Audit

bun scripts/seed/generate-challenge-content.ts --audit

Reports challenge content coverage — how many validated facts have 0, 1, 2, ... 6 styles generated.
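The coverage report boils down to a histogram over per-fact style counts. A minimal sketch of that aggregation:

```typescript
// Given the number of generated styles for each fact (0..6), build a
// histogram: result[k] = how many facts have exactly k styles.
function coverage(styleCounts: number[]): number[] {
  const histogram = new Array(7).fill(0); // indices 0..6
  for (const c of styleCounts) histogram[Math.min(c, 6)]++;
  return histogram;
}
```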

Phase: Export

bun scripts/seed/generate-challenge-content.ts --export

Dumps validated facts missing challenge content to .challenge-data/facts-export.jsonl (gitignored).

Phase: Generate

# Dry run: preview without writing
bun scripts/seed/generate-challenge-content.ts --generate --dry-run --limit 10

# Full run with concurrency
bun scripts/seed/generate-challenge-content.ts --generate --concurrency 5 --difficulty 2

# Parallel across instances
bun scripts/seed/generate-challenge-content.ts --generate --partition 1/3 --concurrency 5

AI generates 6 styles per fact (multiple_choice, direct_question, fill_the_gap, statement_blank, reverse_lookup, free_text). Results written to .challenge-data/challenges-generated.jsonl.

Phase: Upload

bun scripts/seed/generate-challenge-content.ts --upload --dry-run
bun scripts/seed/generate-challenge-content.ts --upload

Bulk upserts generated content to the fact_challenge_content table.

Phase: Validate

bun scripts/seed/generate-challenge-content.ts --validate

Post-upload quality check against CC/CQ rules.

Phase: Recover

bun scripts/seed/generate-challenge-content.ts --recover

Re-processes facts with validation issues.

Flags:

  • --dry-run — preview without writing
  • --limit N — cap the number of facts processed
  • --concurrency N — number of concurrent batches
  • --partition N/M — split work across M instances, process partition N
  • --difficulty N — 0-5, default: 1; 0 = balanced spread across all levels
  • --styles LIST — comma-separated: mc,dq,ftg,sb,rl,ft; default: all 6
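The --styles abbreviations map onto the six style names listed under the Generate phase. A sketch of that parsing (the abbreviation-to-name mapping is inferred from the two lists and should be treated as an assumption):

```typescript
// Expand the comma-separated --styles abbreviations into full style names,
// rejecting anything unrecognized so typos fail fast.
const STYLE_ABBREVIATIONS: Record<string, string> = {
  mc: "multiple_choice",
  dq: "direct_question",
  ftg: "fill_the_gap",
  sb: "statement_blank",
  rl: "reverse_lookup",
  ft: "free_text",
};

function parseStyles(flag: string): string[] {
  return flag.split(",").map((abbr) => {
    const style = STYLE_ABBREVIATIONS[abbr.trim()];
    if (!style) throw new Error(`Unknown style abbreviation: ${abbr}`);
    return style;
  });
}
```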

Note: The .challenge-data/ directory is gitignored and serves as intermediary storage between phases.


Backfill NULL Fact Records

Backfill missing metadata on existing fact records using scripts/seed/backfill-fact-nulls.ts.

Modes

# Report NULL counts by source type
bun scripts/seed/backfill-fact-nulls.ts --audit

# Backfill NULL notability_score with source-type defaults
bun scripts/seed/backfill-fact-nulls.ts --notability
# Defaults: espn/geonames/tmdb = 0.8, others = 0.75

# Enqueue GENERATE_CHALLENGE_CONTENT for validated facts with 0 content rows
bun scripts/seed/backfill-fact-nulls.ts --challenge-content

# Run all backfills
bun scripts/seed/backfill-fact-nulls.ts --all

# Preview only
bun scripts/seed/backfill-fact-nulls.ts --all --dry-run
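The --notability defaults from the comment above can be expressed as a small lookup (a sketch of the rule, not the script's actual code):

```typescript
// Default notability_score by source type, per the runbook:
// espn/geonames/tmdb get 0.8, all other source types get 0.75.
function defaultNotability(sourceType: string): number {
  return ["espn", "geonames", "tmdb"].includes(sourceType) ? 0.8 : 0.75;
}
```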

Logging Seed Jobs

Every seed job should be logged for auditability, cost tracking, and debugging.

After Each Run

  1. Create a log file: docs/projects/seeding/logs/YYYY-MM/YYYY-MM-DDThh-mm--<job-type>--<slug>.md
  2. Use the log template — fill in frontmatter, config snapshot, timeline, results, and errors
  3. Update the monthly index.md with a summary row
  4. Set status: to completed, failed, or partial
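The naming convention in step 1 can be generated mechanically, which avoids hand-typed timestamp mistakes. A sketch (helper name is hypothetical; uses UTC — adjust if logs are named in local time):

```typescript
// Build a log file path following the convention:
// docs/projects/seeding/logs/YYYY-MM/YYYY-MM-DDThh-mm--<job-type>--<slug>.md
function logPath(date: Date, jobType: string, slug: string): string {
  const pad = (n: number) => String(n).padStart(2, "0");
  const y = date.getUTCFullYear();
  const mo = pad(date.getUTCMonth() + 1);
  const d = pad(date.getUTCDate());
  const h = pad(date.getUTCHours());
  const mi = pad(date.getUTCMinutes());
  return `docs/projects/seeding/logs/${y}-${mo}/${y}-${mo}-${d}T${h}-${mi}--${jobType}--${slug}.md`;
}
```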

What to Log

Always log:

  • Job type, topics, and mode
  • Start/finish time
  • Facts/challenges generated
  • Total cost
  • Final status

Optional:

  • Config snapshot (SEED.md settings)
  • Execution timeline per phase
  • Error details and resolution
  • Follow-up items

See logs/README.md for full documentation and naming conventions.


Troubleshooting

Workers Not Processing

  1. Check Redis queue: entries may not be enqueued yet
  2. Run bun scripts/seed/bulk-enqueue.ts
  3. Verify workers are consuming: check worker logs for "Processing EXPLODE_CATEGORY_ENTRY"

High Error Rate

  1. Most errors during heavy runs are rate-limit (TPM) exhaustion; check worker logs to confirm before assuming another cause
  2. Reduce workers or concurrency
  3. Wait for rate limit window to reset (1 minute)
  4. Reset failed entries and re-enqueue

Duplicate Facts

Facts are deduplicated via getExistingTitlesForTopic() during explosion. If duplicates appear anyway:

-- Find duplicate titles within same topic
SELECT title, topic_category_id, COUNT(*)
FROM fact_records
WHERE source_type = 'file_seed'
GROUP BY title, topic_category_id
HAVING COUNT(*) > 1;

Stuck Processing Entries

If entries show status = 'processing' for too long (>10 min), they may be abandoned:

UPDATE seed_entry_queue
SET status = 'pending'
WHERE status = 'processing'
AND updated_at < NOW() - INTERVAL '10 minutes';

AI Model API Issues

If the default model (gpt-5-mini) returns errors, you can temporarily switch tiers via the DB:

-- Emergency fallback (e.g., switch default tier to Gemini)
UPDATE ai_model_tier_config SET model = 'gemini-2.5-flash' WHERE tier = 'default';
UPDATE ai_model_tier_config SET model = 'claude-sonnet-4-5' WHERE tier = 'mid';
UPDATE ai_model_tier_config SET model = 'claude-sonnet-4-5' WHERE tier = 'high';

Tier config is cached for 60s in each worker process, so changes take effect within a minute. See ModelAdapter docs for available models.
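The 60-second cache behaves like a simple TTL memo. A sketch of the pattern (not the worker's actual implementation; the clock is injectable here purely for testability):

```typescript
// Cache the result of `load()` for ttlMs milliseconds; after that, the next
// read reloads. This is why DB tier-config changes surface within a minute.
function ttlCache<T>(load: () => T, ttlMs: number, now: () => number = Date.now) {
  let value: T | undefined;
  let loadedAt = -Infinity; // force a load on first read
  return (): T => {
    if (now() - loadedAt >= ttlMs) {
      value = load();
      loadedAt = now();
    }
    return value as T;
  };
}
```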

Cleanup Script Resume

If the cleanup script crashes mid-run, just re-run --fix. It reads facts-fixed.jsonl to find already-processed IDs and skips them. No duplicate AI calls or duplicate writes.


See Also: Prompts

Reusable prompts for seeding and content operations live in Eko Prompts: