Seed Pipeline Runbook
Operational procedures for running the seed pipeline.
Before starting: Use SEED.md to configure what gets seeded, and review seeding-best-practices.md for strategies and cost management.
Full Corpus Seeding Procedure
Phase 1: Parse Legacy Files
# Dry run to verify parsing
bun scripts/seed/seed-from-files.ts --parse --dry-run
# Insert entries into seed_entry_queue
bun scripts/seed/seed-from-files.ts --parse
Verify: bun scripts/seed/seed-from-files.ts --stats shows entries across topic categories.
Phase 2: Fix Category Inheritance
After parsing, spinoff entries from prior runs may lack topic_category_id. Fix them:
bun -e "
import { getDrizzleClient } from './packages/db/src/drizzle/client.ts';
import { sql } from 'drizzle-orm';
const db = getDrizzleClient();
const result = await db.execute(sql\`
UPDATE seed_entry_queue child
SET topic_category_id = parent.topic_category_id
FROM seed_entry_queue parent
WHERE child.parent_entry_id = parent.id
AND child.topic_category_id IS NULL
AND parent.topic_category_id IS NOT NULL
RETURNING child.id
\`);
console.log('Fixed:', result.length);
process.exit(0);
"
Phase 3: Bulk Enqueue
bun scripts/seed/bulk-enqueue.ts
This queries all pending entries with topic_category_id set and enqueues them to Redis in batches of 500 using pipeline calls.
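The batching pattern can be sketched as follows. This is an illustrative helper, not the script's actual code; the real Redis pipeline calls and entry shape may differ.

```typescript
// Split pending entries into batches of 500 so each Redis pipeline
// call stays bounded, as described above (names are illustrative).
const BATCH_SIZE = 500;

function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Usage sketch: for (const batch of chunk(pending, BATCH_SIZE)) enqueuePipeline(batch);
```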
Phase 4: Start Workers
Ensure OPENAI_API_KEY is set in .env.local (default model: gpt-5-mini via ModelAdapter).
# Key 1 workers (5 instances)
for port in 4010 4011 4012 4013 4014; do
WORKER_CONCURRENCY=10 PORT=$port bun run dev:worker-facts &
done
# Key 2 workers (5 instances, different API key)
for port in 4020 4021 4022 4023 4024; do
OPENAI_API_KEY=second-key WORKER_CONCURRENCY=10 PORT=$port bun run dev:worker-facts &
done
Phase 5: Monitor & Maintain
Every 10-15 minutes during the run:
- Check status: bash /tmp/seed-status.sh
- Fix new spinoff categories: run the Phase 2 SQL
- Reset failed entries: UPDATE seed_entry_queue SET status = 'pending' WHERE status = 'failed';
- Re-enqueue: bun scripts/seed/bulk-enqueue.ts
Phase 6: Validation
After explosion completes:
bun run dev:worker-validate
This processes VALIDATE_FACT messages created during explosion.
Phase 7: Cleanup
# Final stats
bun scripts/seed/seed-from-files.ts --stats
-- Verify facts
SELECT source_type, status, COUNT(*) FROM fact_records GROUP BY source_type, status;
-- Check for orphaned entries
SELECT COUNT(*) FROM seed_entry_queue WHERE status = 'pending' AND topic_category_id IS NULL;
Content Cleanup (Full Corpus Rewrite)
Rewrites ALL facts across the entire corpus — title, challenge_title, context, notability_score, and notability_reason. Uses a local-first JSONL architecture for crash resilience.
Prerequisites: OPENAI_API_KEY set in .env.local, DATABASE_URL set for direct Postgres access.
Step 1: Audit
bun scripts/seed/cleanup-content.ts --audit
Reports: total facts by source_type and topic, quality issues (null fields, anti-patterns), and estimated cost.
Step 2: Export
bun scripts/seed/cleanup-content.ts --export
Dumps all fact_records to scripts/seed/.cleanup-data/facts-export.jsonl (gitignored). Paginated reads in batches of 1000.
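The paginated read can be sketched with keyset pagination, which mirrors a `WHERE id > $lastId ORDER BY id LIMIT 1000` query. This is an assumed scheme shown against in-memory rows; the real script may paginate differently.

```typescript
// Keyset pagination sketch: each pass reads at most `pageSize` rows
// with id greater than the last id seen, so reads stay bounded
// regardless of table size.
type Row = { id: number };

function paginate(rows: Row[], pageSize = 1000): Row[][] {
  const sorted = [...rows].sort((a, b) => a.id - b.id);
  const pages: Row[][] = [];
  let lastId = -Infinity;
  for (;;) {
    // Equivalent of: SELECT ... WHERE id > lastId ORDER BY id LIMIT pageSize
    const page = sorted.filter((r) => r.id > lastId).slice(0, pageSize);
    if (page.length === 0) break;
    pages.push(page);
    lastId = page[page.length - 1].id;
  }
  return pages;
}
```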
Step 3: Fix (AI Rewrite)
# Dry run: preview 3 facts without writing
bun scripts/seed/cleanup-content.ts --fix --dry-run --limit 20
# Full run (5 concurrent batches of 20)
bun scripts/seed/cleanup-content.ts --fix --concurrency 5
Processes facts-export.jsonl → writes results to facts-fixed.jsonl. Resume-safe: skips facts already in the fixed file.
Step 4: Upload
# Dry run: show what would be uploaded
bun scripts/seed/cleanup-content.ts --upload --dry-run
# Bulk upload (500 rows per UPDATE statement)
bun scripts/seed/cleanup-content.ts --upload
Step 5: Validate
bun scripts/seed/cleanup-content.ts --validate
Re-runs quality audit, samples 20 random facts for manual review, reports cleanup cost.
Step 6: Recover Weak Outputs
# Classify weak facts (notability <= 0.5) and re-process recoverable ones
bun scripts/seed/cleanup-content.ts --recover
# Parallel execution across multiple instances
bun scripts/seed/cleanup-content.ts --recover --partition 1/3
bun scripts/seed/cleanup-content.ts --recover --partition 2/3
bun scripts/seed/cleanup-content.ts --recover --partition 3/3
The --recover phase classifies weak outputs (notability <= 0.5) into recoverable (rich source data) vs vague (poor source data) using a classifyFactData() heuristic. Recoverable facts are re-processed with full structured data in the prompt.
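A heuristic of this kind can be sketched as below. The actual signals used by classifyFactData() are not documented here; this illustrative version simply counts non-empty structured fields.

```typescript
// Hedged sketch of a classifyFactData()-style heuristic: a weak fact
// is "recoverable" when its source row still carries enough structured
// data to re-prompt with, otherwise "vague" (threshold is illustrative).
type FactData = Record<string, unknown>;

function classifyFactData(
  data: FactData,
  minRichFields = 3
): "recoverable" | "vague" {
  const richFields = Object.values(data).filter(
    (v) => v !== null && v !== undefined && String(v).trim() !== ""
  ).length;
  return richFields >= minRichFields ? "recoverable" : "vague";
}
```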
Flags:
- --partition N/M: split work across M instances, process partition N (for parallel runs)
- --output-suffix: custom suffix for per-instance output files (e.g., facts-fixed-p1.jsonl)
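Partition assignment can be sketched as a stable modulo over the export order. This is an assumption about the scheme; the script may assign partitions by id hash instead.

```typescript
// Sketch of --partition N/M: item at position `index` belongs to
// partition N (1-based) when index % M === N - 1. Running all M
// partitions covers every item exactly once.
function inPartition(index: number, n: number, m: number): boolean {
  return index % m === n - 1;
}

function partitionSlice<T>(items: T[], n: number, m: number): T[] {
  return items.filter((_, i) => inPartition(i, n, m));
}
```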
Challenge Content Generation (Full Corpus Backfill)
Generate pre-built challenge content for all validated facts using scripts/seed/generate-challenge-content.ts. Uses a 6-phase JSONL pipeline matching the cleanup-content.ts pattern.
Prerequisites: OPENAI_API_KEY set in .env.local, DATABASE_URL set for direct Postgres access.
Phase: Audit
bun scripts/seed/generate-challenge-content.ts --audit
Reports challenge content coverage — how many validated facts have 0, 1, 2, ... 6 styles generated.
Phase: Export
bun scripts/seed/generate-challenge-content.ts --export
Dumps validated facts missing challenge content to .challenge-data/facts-export.jsonl (gitignored).
Phase: Generate
# Dry run: preview without writing
bun scripts/seed/generate-challenge-content.ts --generate --dry-run --limit 10
# Full run with concurrency
bun scripts/seed/generate-challenge-content.ts --generate --concurrency 5 --difficulty 2
# Parallel across instances
bun scripts/seed/generate-challenge-content.ts --generate --partition 1/3 --concurrency 5
AI generates 6 styles per fact (multiple_choice, direct_question, fill_the_gap, statement_blank, reverse_lookup, free_text). Results written to .challenge-data/challenges-generated.jsonl.
Phase: Upload
bun scripts/seed/generate-challenge-content.ts --upload --dry-run
bun scripts/seed/generate-challenge-content.ts --upload
Bulk upserts generated content to the fact_challenge_content table.
Phase: Validate
bun scripts/seed/generate-challenge-content.ts --validate
Post-upload quality check against CC/CQ rules.
Phase: Recover
bun scripts/seed/generate-challenge-content.ts --recover
Re-processes facts with validation issues.
Flags:
- --dry-run
- --limit N
- --concurrency N
- --partition N/M
- --difficulty N: 0-5, default 1; 0 = balanced spread across all levels
- --styles LIST: comma-separated (mc,dq,ftg,sb,rl,ft); default: all 6
Note: The .challenge-data/ directory is gitignored and serves as intermediary storage between phases.
Backfill NULL Fact Records
Backfill missing metadata on existing fact records using scripts/seed/backfill-fact-nulls.ts.
Modes
# Report NULL counts by source type
bun scripts/seed/backfill-fact-nulls.ts --audit
# Backfill NULL notability_score with source-type defaults
bun scripts/seed/backfill-fact-nulls.ts --notability
# Defaults: espn/geonames/tmdb = 0.8, others = 0.75
# Enqueue GENERATE_CHALLENGE_CONTENT for validated facts with 0 content rows
bun scripts/seed/backfill-fact-nulls.ts --challenge-content
# Run all backfills
bun scripts/seed/backfill-fact-nulls.ts --all
# Preview only
bun scripts/seed/backfill-fact-nulls.ts --all --dry-run
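The notability default rule stated above can be sketched as a small helper (the function name and set name are illustrative, not part of the repo):

```typescript
// Default notability by source type, per the backfill rule:
// espn/geonames/tmdb get 0.8, everything else gets 0.75.
const HIGH_TRUST_SOURCES = new Set(["espn", "geonames", "tmdb"]);

function defaultNotability(sourceType: string): number {
  return HIGH_TRUST_SOURCES.has(sourceType) ? 0.8 : 0.75;
}
```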
Logging Seed Jobs
Every seed job should be logged for auditability, cost tracking, and debugging.
After Each Run
- Create a log file: docs/projects/seeding/logs/YYYY-MM/YYYY-MM-DDThh-mm--<job-type>--<slug>.md
- Use the log template: fill in frontmatter, config snapshot, timeline, results, and errors
- Update the monthly index.md with a summary row
- Set status: to completed, failed, or partial
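Building the log path from the naming convention can be sketched as follows (the helper is illustrative; log authors typically just follow the pattern by hand):

```typescript
// Build a log path matching
// docs/projects/seeding/logs/YYYY-MM/YYYY-MM-DDThh-mm--<job-type>--<slug>.md
// from a UTC timestamp.
function logPath(date: Date, jobType: string, slug: string): string {
  const pad = (n: number) => String(n).padStart(2, "0");
  const y = date.getUTCFullYear();
  const mo = pad(date.getUTCMonth() + 1);
  const d = pad(date.getUTCDate());
  const h = pad(date.getUTCHours());
  const mi = pad(date.getUTCMinutes());
  return `docs/projects/seeding/logs/${y}-${mo}/${y}-${mo}-${d}T${h}-${mi}--${jobType}--${slug}.md`;
}
```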
What to Log
| Always Log | Optional |
|---|---|
| Job type, topics, and mode | Config snapshot (SEED.md settings) |
| Start/finish time | Execution timeline per phase |
| Facts/challenges generated | Error details and resolution |
| Total cost | Follow-up items |
| Final status | |
See logs/README.md for full documentation and naming conventions.
Troubleshooting
Workers Not Processing
- Check Redis queue: entries may not be enqueued yet
- Run bun scripts/seed/bulk-enqueue.ts
- Verify workers are consuming: check worker logs for "Processing EXPLODE_CATEGORY_ENTRY"
High Error Rate
- Most errors during a full run are rate-limit (TPM) exhaustion; check worker logs to confirm
- Reduce workers or concurrency
- Wait for rate limit window to reset (1 minute)
- Reset failed entries and re-enqueue
Duplicate Facts
Facts are deduplicated via getExistingTitlesForTopic() during explosion. If duplicates still appear:
-- Find duplicate titles within same topic
SELECT title, topic_category_id, COUNT(*)
FROM fact_records
WHERE source_type = 'file_seed'
GROUP BY title, topic_category_id
HAVING COUNT(*) > 1;
Stuck Processing Entries
If entries remain in status = 'processing' too long (>10 min), a worker likely crashed and abandoned them. Reset them:
UPDATE seed_entry_queue
SET status = 'pending'
WHERE status = 'processing'
AND updated_at < NOW() - INTERVAL '10 minutes';
AI Model API Issues
If the default model (gpt-5-mini) returns errors, you can temporarily switch tiers via the DB:
-- Emergency fallback (e.g., switch default tier to Gemini)
UPDATE ai_model_tier_config SET model = 'gemini-2.5-flash' WHERE tier = 'default';
UPDATE ai_model_tier_config SET model = 'claude-sonnet-4-5' WHERE tier = 'mid';
UPDATE ai_model_tier_config SET model = 'claude-sonnet-4-5' WHERE tier = 'high';
Tier config is cached for 60s in each worker process, so changes take effect within a minute. See ModelAdapter docs for available models.
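The propagation behavior follows from a simple TTL cache, sketched below. This is illustrative of why changes land within a minute, not the actual ModelAdapter code; the injectable clock exists only to make the sketch testable.

```typescript
// 60s TTL cache sketch: each worker re-reads tier config only after
// the TTL lapses, so a DB change propagates within about a minute.
function makeTtlCache<T>(
  load: () => T,
  ttlMs = 60_000,
  now: () => number = Date.now
): () => T {
  let value: T | undefined;
  let loadedAt = -Infinity;
  return () => {
    if (now() - loadedAt >= ttlMs) {
      value = load();
      loadedAt = now();
    }
    return value as T;
  };
}
```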
Cleanup Script Resume
If the cleanup script crashes mid-run, just re-run --fix. It reads facts-fixed.jsonl to find already-processed IDs and skips them. No duplicate AI calls or duplicate writes.
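The skip logic can be sketched as below: collect the ids already present in the fixed JSONL file, then filter them out of the export. Names and the `id` field shape are assumptions; the real script reads the file from disk.

```typescript
// Resume sketch: ids already written to facts-fixed.jsonl are skipped,
// so re-running --fix never repeats an AI call or a write.
function remainingFacts<T extends { id: string }>(
  exported: T[],
  fixedJsonl: string
): T[] {
  const done = new Set(
    fixedJsonl
      .split("\n")
      .filter((line) => line.trim() !== "")
      .map((line) => (JSON.parse(line) as { id: string }).id)
  );
  return exported.filter((f) => !done.has(f.id));
}
```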
See Also: Prompts
Reusable prompts for seeding and content operations live in Eko Prompts:
- Seed the Database — full corpus seeding pipeline
- Generate Challenge Content — batch challenge generation
- Rewrite Challenge Defects — fix CQ-rule violations
- Content Cleanup Pass — full corpus rewrite
- Backfill Null Metadata — patch missing fields