Seed Pipeline Runbook

Operational procedures for running the seed pipeline.

Before starting: Use SEED.md to configure what gets seeded, and review seeding-best-practices.md for strategies and cost management.

Full Corpus Seeding Procedure

Phase 1: Parse Legacy Files

# Dry run to verify parsing
bun scripts/seed/seed-from-files.ts --parse --dry-run

# Insert entries into seed_entry_queue
bun scripts/seed/seed-from-files.ts --parse

Verify: bun scripts/seed/seed-from-files.ts --stats shows entries across topic categories.

Phase 2: Fix Category Inheritance

After parsing, spinoff entries from prior runs may lack topic_category_id. Fix them:

bun -e "
import { getDrizzleClient } from './packages/db/src/drizzle/client.ts';
import { sql } from 'drizzle-orm';
const db = getDrizzleClient();
const result = await db.execute(sql\`
  UPDATE seed_entry_queue child
  SET topic_category_id = parent.topic_category_id
  FROM seed_entry_queue parent
  WHERE child.parent_entry_id = parent.id
  AND child.topic_category_id IS NULL
  AND parent.topic_category_id IS NOT NULL
  RETURNING child.id
\`);
console.log('Fixed:', result.length);
process.exit(0);
"

Phase 3: Bulk Enqueue

bun scripts/seed/bulk-enqueue.ts

This queries all pending entries with topic_category_id set and enqueues them to Redis in batches of 500 using Redis pipeline calls.
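The batching step can be sketched as follows (a minimal sketch; `enqueueBatch` is a hypothetical stand-in for the script's actual Redis pipeline call):

```typescript
// Split a list of pending entry IDs into fixed-size batches so each batch
// can be sent as a single Redis pipeline call.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Hypothetical usage:
// for (const batch of chunk(pendingIds, 500)) await enqueueBatch(batch);
```

Batching keeps each pipeline call bounded, so a failure mid-run loses at most one batch rather than the whole enqueue.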

Phase 4: Start Workers

Ensure OPENAI_API_KEY is set in .env.local (default model: gpt-5-mini via ModelAdapter).

# Key 1 workers (5 instances)
for port in 4010 4011 4012 4013 4014; do
  WORKER_CONCURRENCY=10 PORT=$port bun run dev:worker-facts &
done

# Key 2 workers (5 instances, different API key)
for port in 4020 4021 4022 4023 4024; do
  OPENAI_API_KEY=second-key WORKER_CONCURRENCY=10 PORT=$port bun run dev:worker-facts &
done

Phase 5: Monitor & Maintain

Every 10-15 minutes during the run:

  1. Check status: bash /tmp/seed-status.sh
  2. Fix new spinoff categories: Run the Phase 2 SQL
  3. Reset failed entries: UPDATE seed_entry_queue SET status = 'pending' WHERE status = 'failed';
  4. Re-enqueue: bun scripts/seed/bulk-enqueue.ts

Phase 6: Validation

After explosion completes:

bun run dev:worker-validate

This processes VALIDATE_FACT messages created during explosion.

Phase 7: Cleanup

# Final stats
bun scripts/seed/seed-from-files.ts --stats

# Verify facts
SELECT source_type, status, COUNT(*) FROM fact_records GROUP BY source_type, status;

# Check for orphaned entries
SELECT COUNT(*) FROM seed_entry_queue WHERE status = 'pending' AND topic_category_id IS NULL;

Content Cleanup (Full Corpus Rewrite)

Rewrites ALL facts across the entire corpus — title, challenge_title, context, notability_score, and notability_reason. Uses a local-first JSONL architecture for crash resilience.

Prerequisites: OPENAI_API_KEY set in .env.local, DATABASE_URL set for direct Postgres access.

Step 1: Audit

bun scripts/seed/cleanup-content.ts --audit

Reports: total facts by source_type and topic, quality issues (null fields, anti-patterns), and estimated cost.

Step 2: Export

bun scripts/seed/cleanup-content.ts --export

Dumps all fact_records to scripts/seed/.cleanup-data/facts-export.jsonl (gitignored). Paginated reads in batches of 1000.
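The paginated read can be sketched like this (a sketch only; `fetchPage` is a hypothetical stand-in for the script's Postgres query):

```typescript
// Read a large table in fixed-size pages, yielding rows until a short page
// signals the end. Keeps memory bounded regardless of table size.
async function* paginate<T>(
  fetchPage: (limit: number, offset: number) => Promise<T[]>,
  batchSize = 1000,
): AsyncGenerator<T> {
  for (let offset = 0; ; offset += batchSize) {
    const rows = await fetchPage(batchSize, offset);
    for (const row of rows) yield row;
    if (rows.length < batchSize) break; // short page: no more rows
  }
}
```

Each yielded row would then be appended to facts-export.jsonl as one JSON line.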

Step 3: Fix (AI Rewrite)

# Dry run: preview facts without writing (here capped at 20)
bun scripts/seed/cleanup-content.ts --fix --dry-run --limit 20

# Full run (5 concurrent batches of 20)
bun scripts/seed/cleanup-content.ts --fix --concurrency 5

Processes facts-export.jsonl → writes results to facts-fixed.jsonl. Resume-safe: skips facts already in the fixed file.
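The resume-safe skip works by collecting IDs already present in the fixed-output file. A minimal sketch of that idea, operating on raw JSONL text (field name `id` is an assumption):

```typescript
// Parse a JSONL string and collect the `id` of every record, so a re-run
// can filter out facts that were already processed.
function processedIds(jsonl: string): Set<string> {
  const ids = new Set<string>();
  for (const line of jsonl.split("\n")) {
    if (!line.trim()) continue; // tolerate blank/trailing lines
    ids.add(JSON.parse(line).id as string);
  }
  return ids;
}

// Hypothetical usage:
// const done = processedIds(await Bun.file("facts-fixed.jsonl").text());
// const remaining = facts.filter((f) => !done.has(f.id));
```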

Step 4: Upload

# Dry run: show what would be uploaded
bun scripts/seed/cleanup-content.ts --upload --dry-run

# Bulk upload (500 rows per UPDATE statement)
bun scripts/seed/cleanup-content.ts --upload

Step 5: Validate

bun scripts/seed/cleanup-content.ts --validate

Re-runs quality audit, samples 20 random facts for manual review, reports cleanup cost.
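Drawing the 20-fact review sample could look like this (a sketch; the script's actual sampling method may differ):

```typescript
// Draw k distinct random items via a partial Fisher-Yates shuffle:
// only the first k positions are shuffled, so it is O(k) swaps.
function sample<T>(items: T[], k: number): T[] {
  const a = items.slice();
  const n = Math.min(k, a.length);
  for (let i = 0; i < n; i++) {
    const j = i + Math.floor(Math.random() * (a.length - i));
    [a[i], a[j]] = [a[j], a[i]];
  }
  return a.slice(0, n);
}
```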

Step 6: Recover Weak Outputs

# Classify weak facts (notability <= 0.5) and re-process recoverable ones
bun scripts/seed/cleanup-content.ts --recover

# Parallel execution across multiple instances
bun scripts/seed/cleanup-content.ts --recover --partition 1/3
bun scripts/seed/cleanup-content.ts --recover --partition 2/3
bun scripts/seed/cleanup-content.ts --recover --partition 3/3

The --recover phase classifies weak outputs (notability <= 0.5) into recoverable (rich source data) vs vague (poor source data) using a classifyFactData() heuristic. Recoverable facts are re-processed with full structured data in the prompt.

Flags:

  • --partition N/M — split work across M instances, process partition N (for parallel runs)
  • --output-suffix — custom suffix for per-instance output files (e.g., facts-fixed-p1.jsonl)
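Partitioning only works if every instance assigns facts the same way. One stable scheme (an assumption — the script's exact assignment may differ) is to take the fact's position modulo M:

```typescript
// Deterministically assign work item `index` to one of `m` partitions.
// Partitions are 1-based to match the CLI: --partition 1/3 means n=1, m=3.
function inPartition(index: number, n: number, m: number): boolean {
  return index % m === n - 1;
}
```

Because the assignment is deterministic, the three `--partition` invocations above cover every fact exactly once between them.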

Challenge Content Generation (Full Corpus Backfill)

Generate pre-built challenge content for all validated facts using scripts/seed/generate-challenge-content.ts. Uses a 6-phase JSONL pipeline matching the cleanup-content.ts pattern.

Prerequisites: OPENAI_API_KEY set in .env.local, DATABASE_URL set for direct Postgres access.

Phase: Audit

bun scripts/seed/generate-challenge-content.ts --audit

Reports challenge content coverage — how many validated facts have 0, 1, 2, ... 6 styles generated.
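The coverage report boils down to a histogram over per-fact style counts. A minimal sketch of that aggregation:

```typescript
// Given the number of generated styles for each fact (0..6), build a
// histogram: result[k] = how many facts have exactly k styles.
function coverage(styleCounts: number[]): number[] {
  const histogram = new Array(7).fill(0); // indices 0..6
  for (const c of styleCounts) histogram[Math.min(c, 6)]++;
  return histogram;
}
```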

Phase: Export

bun scripts/seed/generate-challenge-content.ts --export

Dumps validated facts missing challenge content to .challenge-data/facts-export.jsonl (gitignored).

Phase: Generate

# Dry run: preview without writing
bun scripts/seed/generate-challenge-content.ts --generate --dry-run --limit 10

# Full run with concurrency
bun scripts/seed/generate-challenge-content.ts --generate --concurrency 5 --difficulty 2

# Parallel across instances
bun scripts/seed/generate-challenge-content.ts --generate --partition 1/3 --concurrency 5

AI generates 6 styles per fact (multiple_choice, direct_question, fill_the_gap, statement_blank, reverse_lookup, free_text). Results written to .challenge-data/challenges-generated.jsonl.

Phase: Upload

bun scripts/seed/generate-challenge-content.ts --upload --dry-run
bun scripts/seed/generate-challenge-content.ts --upload

Bulk upserts generated content to the fact_challenge_content table.

Phase: Validate

bun scripts/seed/generate-challenge-content.ts --validate

Post-upload quality check against CC/CQ rules.

Phase: Recover

bun scripts/seed/generate-challenge-content.ts --recover

Re-processes facts with validation issues.

Flags:

  • --dry-run — preview without writing
  • --limit N — cap the number of facts processed
  • --concurrency N — number of concurrent batches
  • --partition N/M — split work across M instances, process partition N
  • --difficulty N — 0-5, default: 1; 0 = balanced spread across all levels
  • --styles LIST — comma-separated: mc,dq,ftg,sb,rl,ft; default: all 6
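The --styles abbreviations map onto the six style names listed under the Generate phase. A sketch of that parsing (the abbreviation-to-name mapping is inferred from the two lists and should be treated as an assumption):

```typescript
// Expand the comma-separated --styles abbreviations into full style names,
// rejecting anything unrecognized so typos fail fast.
const STYLE_ABBREVIATIONS: Record<string, string> = {
  mc: "multiple_choice",
  dq: "direct_question",
  ftg: "fill_the_gap",
  sb: "statement_blank",
  rl: "reverse_lookup",
  ft: "free_text",
};

function parseStyles(flag: string): string[] {
  return flag.split(",").map((abbr) => {
    const style = STYLE_ABBREVIATIONS[abbr.trim()];
    if (!style) throw new Error(`Unknown style abbreviation: ${abbr}`);
    return style;
  });
}
```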

Note: The .challenge-data/ directory is gitignored and serves as intermediary storage between phases.


Backfill NULL Fact Records

Backfill missing metadata on existing fact records using scripts/seed/backfill-fact-nulls.ts.

Modes

# Report NULL counts by source type
bun scripts/seed/backfill-fact-nulls.ts --audit

# Backfill NULL notability_score with source-type defaults
bun scripts/seed/backfill-fact-nulls.ts --notability
# Defaults: espn/geonames/tmdb = 0.8, others = 0.75

# Enqueue GENERATE_CHALLENGE_CONTENT for validated facts with 0 content rows
bun scripts/seed/backfill-fact-nulls.ts --challenge-content

# Run all backfills
bun scripts/seed/backfill-fact-nulls.ts --all

# Preview only
bun scripts/seed/backfill-fact-nulls.ts --all --dry-run
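The --notability defaults from the comment above can be expressed as a small lookup (a sketch of the rule, not the script's actual code):

```typescript
// Default notability_score by source type, per the runbook:
// espn/geonames/tmdb get 0.8, all other source types get 0.75.
function defaultNotability(sourceType: string): number {
  return ["espn", "geonames", "tmdb"].includes(sourceType) ? 0.8 : 0.75;
}
```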

Logging Seed Jobs

Every seed job should be logged for auditability, cost tracking, and debugging.

After Each Run

  1. Create a log file: docs/projects/seeding/logs/YYYY-MM/YYYY-MM-DDThh-mm--<job-type>--<slug>.md
  2. Use the log template — fill in frontmatter, config snapshot, timeline, results, and errors
  3. Update the monthly index.md with a summary row
  4. Set status: to completed, failed, or partial
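The naming convention in step 1 can be generated mechanically, which avoids hand-typed timestamp mistakes. A sketch (helper name is hypothetical; uses UTC — adjust if logs are named in local time):

```typescript
// Build a log file path following the convention:
// docs/projects/seeding/logs/YYYY-MM/YYYY-MM-DDThh-mm--<job-type>--<slug>.md
function logPath(date: Date, jobType: string, slug: string): string {
  const pad = (n: number) => String(n).padStart(2, "0");
  const y = date.getUTCFullYear();
  const mo = pad(date.getUTCMonth() + 1);
  const d = pad(date.getUTCDate());
  const h = pad(date.getUTCHours());
  const mi = pad(date.getUTCMinutes());
  return `docs/projects/seeding/logs/${y}-${mo}/${y}-${mo}-${d}T${h}-${mi}--${jobType}--${slug}.md`;
}
```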

What to Log

Always log:

  • Job type, topics, and mode
  • Start/finish time
  • Facts/challenges generated
  • Total cost
  • Final status

Optional:

  • Config snapshot (SEED.md settings)
  • Execution timeline per phase
  • Error details and resolution
  • Follow-up items

See logs/README.md for full documentation and naming conventions.


Troubleshooting

Workers Not Processing

  1. Check Redis queue: entries may not be enqueued yet
  2. Run bun scripts/seed/bulk-enqueue.ts
  3. Verify workers are consuming: check worker logs for "Processing EXPLODE_CATEGORY_ENTRY"

High Error Rate

  1. Most errors during heavy runs are rate-limit (TPM) exhaustion; check worker logs to confirm before assuming another cause
  2. Reduce workers or concurrency
  3. Wait for rate limit window to reset (1 minute)
  4. Reset failed entries and re-enqueue

Duplicate Facts

Facts are deduplicated via getExistingTitlesForTopic() during explosion. If duplicates appear anyway:

-- Find duplicate titles within same topic
SELECT title, topic_category_id, COUNT(*)
FROM fact_records
WHERE source_type = 'file_seed'
GROUP BY title, topic_category_id
HAVING COUNT(*) > 1;

Stuck Processing Entries

If entries show status = 'processing' for too long (>10 min), they may be abandoned:

UPDATE seed_entry_queue
SET status = 'pending'
WHERE status = 'processing'
AND updated_at < NOW() - INTERVAL '10 minutes';

AI Model API Issues

If the default model (gpt-5-mini) returns errors, you can temporarily switch tiers via the DB:

-- Emergency fallback (e.g., switch default tier to Gemini)
UPDATE ai_model_tier_config SET model = 'gemini-2.5-flash' WHERE tier = 'default';
UPDATE ai_model_tier_config SET model = 'claude-sonnet-4-5' WHERE tier = 'mid';
UPDATE ai_model_tier_config SET model = 'claude-sonnet-4-5' WHERE tier = 'high';

Tier config is cached for 60s in each worker process, so changes take effect within a minute. See ModelAdapter docs for available models.
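The 60-second cache behaves like a simple TTL memo. A sketch of the pattern (not the worker's actual implementation; the clock is injectable here purely for testability):

```typescript
// Cache the result of `load()` for ttlMs milliseconds; after that, the next
// read reloads. This is why DB tier-config changes surface within a minute.
function ttlCache<T>(load: () => T, ttlMs: number, now: () => number = Date.now) {
  let value: T | undefined;
  let loadedAt = -Infinity; // force a load on first read
  return (): T => {
    if (now() - loadedAt >= ttlMs) {
      value = load();
      loadedAt = now();
    }
    return value as T;
  };
}
```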

Cleanup Script Resume

If the cleanup script crashes mid-run, just re-run --fix. It reads facts-fixed.jsonl to find already-processed IDs and skips them. No duplicate AI calls or duplicate writes.


See Also: Prompts

Reusable prompts for seeding and content operations live in Eko Prompts: