# Seed Pipeline Documentation
The seed pipeline populates the Eko platform with structured facts from legacy content files (XLSX, DOCX, CSV). It uses a multi-stage architecture: file parsing, AI-powered "explosion" into individual facts, validation, and super-fact discovery.
## Quick Links
| Document | Purpose |
|---|---|
| `SEED.md` | Seeding control prompt — edit this to direct what gets seeded |
| `seeding-best-practices.md` | Strategies, examples, cost management, and pitfalls |
| `runbook.md` | Step-by-step operational procedures |
| `TODO.md` | Progress tracker for all seeding workstreams |
| `logs/` | Seed job logs — structured per-job records with costs, errors, and results |
## Architecture
```
Legacy Files / Curated Entries / News APIs / Evergreen AI
            |
      [Scripts / Crons]
            |
    seed_entry_queue (DB)
            |
      [bulk-enqueue.ts]
            |
Redis Queue (EXPLODE_CATEGORY_ENTRY)
            |
[worker-facts] ──> AI (gpt-5-mini via ModelAdapter) ──> fact_records (DB)
      |                                                      |
spin-off entries ──> seed_entry_queue                 [worker-validate]
                                                             |
                                                      validated facts
                                                             |
                                          [generate-challenge-content.ts]
                                                             |
                                              fact_challenge_content (DB)
```
## Key Components
| Component | Path | Purpose |
|---|---|---|
| CLI Orchestrator | `scripts/seed/seed-from-files.ts` | Parse files, dispatch to queues, show stats |
| Bulk Enqueue | `scripts/seed/bulk-enqueue.ts` | Fast batch enqueue using the `enqueueMany` pipeline |
| Explosion Worker | `apps/worker-facts/src/handlers/explode-entry.ts` | AI-powered fact extraction from entries |
| Import Handler | `apps/worker-facts/src/handlers/import-facts.ts` | Batch insert facts into `fact_records` |
| Category Mapper | `scripts/seed/lib/category-mapper.ts` | Map file content to topic categories |
| File Parsers | `scripts/seed/lib/parsers/` | XLSX, DOCX, CSV content parsers |
## Database Tables
| Table | Purpose |
|---|---|
| `seed_entry_queue` | Work queue for entries pending explosion |
| `fact_records` | Generated facts with `source_type='file_seed'` |
| `fact_record_schemas` | Schema definitions per topic category |
| `topic_categories` | 31 active root topic categories (depth 0) |
| `topic_category_aliases` | External provider slug → internal category mapping |
| `fact_challenge_content` | Pre-generated quiz content (6 styles per fact) |
| `super_fact_links` | Cross-entry correlations |
## CLI Commands
```bash
# Parse files into seed_entry_queue
bun scripts/seed/seed-from-files.ts --parse --dry-run   # Preview without DB writes
bun scripts/seed/seed-from-files.ts --parse             # Insert entries

# Dispatch entries to worker queue (slow, one-by-one)
bun scripts/seed/seed-from-files.ts --explode --topic entertainment --batch-size 500

# Fast bulk dispatch using Redis pipeline (recommended for large runs)
bun scripts/seed/bulk-enqueue.ts

# Process spin-off entries
bun scripts/seed/seed-from-files.ts --explode-spinoffs

# View pipeline dashboard
bun scripts/seed/seed-from-files.ts --stats
```
## Running Workers
Workers consume from Upstash Redis queues. Use `WORKER_CONCURRENCY` to control parallel processing per worker instance.
```bash
# Single worker, default concurrency (1)
bun run dev:worker-facts

# High-throughput: multiple workers with concurrency
WORKER_CONCURRENCY=10 PORT=4010 bun run dev:worker-facts
WORKER_CONCURRENCY=10 PORT=4011 bun run dev:worker-facts
# ... up to N workers per API key

# Dual API key setup for 2x rate limit pool
OPENAI_API_KEY=key1 WORKER_CONCURRENCY=10 PORT=4010 bun run dev:worker-facts
OPENAI_API_KEY=key2 WORKER_CONCURRENCY=10 PORT=4020 bun run dev:worker-facts
```
## Throughput & Rate Limits
**Current model:** `gpt-5-mini` (via ModelAdapter)
AI calls route through the `ModelAdapter` abstraction, with `gpt-5-mini` as the default model. Available models: `gpt-5-mini`, `gemini-2.5-flash`, `gemini-3-flash-preview`, `claude-haiku-4-5`. See `SEED.md` for cost estimates and `seeding-best-practices.md` for volume tuning guidance.
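The model routing described above can be sketched as a small resolver: accept a requested model only if it is on the supported list, otherwise fall back to the default. This is a minimal illustration; the real `ModelAdapter` interface is not shown in this document, and the function name here is hypothetical.

```typescript
// Models available through the adapter (from the list above).
const SUPPORTED_MODELS = [
  "gpt-5-mini",
  "gemini-2.5-flash",
  "gemini-3-flash-preview",
  "claude-haiku-4-5",
] as const;

type SupportedModel = (typeof SUPPORTED_MODELS)[number];

const DEFAULT_MODEL: SupportedModel = "gpt-5-mini";

// Resolve a requested model name to a supported one, falling back to the default.
// (Illustrative helper, not the actual ModelAdapter API.)
function resolveModel(requested?: string): SupportedModel {
  if (requested && (SUPPORTED_MODELS as readonly string[]).includes(requested)) {
    return requested as SupportedModel;
  }
  return DEFAULT_MODEL;
}
```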
### Cost Estimates (gpt-5-mini)
| Operation | Per-Unit Cost | Example |
|---|---|---|
| Entity generation | ~$0.002/entity | 500 entities = ~$1 |
| Fact explosion | ~$0.01/entity | 500 entities = ~$5 |
| Challenge content | ~$0.006/fact | 10,000 facts = ~$60 |
| Content cleanup | ~$0.004/fact | 10,000 facts = ~$40 |
| News extraction | ~$0.003/story | 100 stories/day = ~$0.30/day |
See seeding-best-practices.md for budget templates and cost reduction strategies.
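A run's budget can be sanity-checked by multiplying the per-unit rates above by the planned volumes. The sketch below hard-codes the approximate rates from the table; the function and field names are illustrative, not part of the pipeline's code.

```typescript
// Per-unit cost rates from the table above (USD, approximate).
const RATES = {
  entityGeneration: 0.002, // per entity
  factExplosion: 0.01,     // per entity
  challengeContent: 0.006, // per fact
  contentCleanup: 0.004,   // per fact
  newsExtraction: 0.003,   // per story
} as const;

// Rough total spend for a seeding run (illustrative helper).
function estimateCostUsd(counts: {
  entities?: number;
  facts?: number;
  stories?: number;
}): number {
  const { entities = 0, facts = 0, stories = 0 } = counts;
  return (
    entities * (RATES.entityGeneration + RATES.factExplosion) +
    facts * (RATES.challengeContent + RATES.contentCleanup) +
    stories * RATES.newsExtraction
  );
}
```

For example, exploding 500 entities into facts and generating challenge content plus cleanup for 10,000 resulting facts lands around $106.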
## Known Issues & Workarounds
### Spinoff Category Inheritance
**Problem:** The AI explosion generates spin-off entries with `suggestedTopicPath` slugs (e.g., `music/hip-hop-sampling`) but not the canonical `topic_category_id` UUID. Without the UUID, spin-offs can't be processed.

**Fix:** Added `topicCategoryId: topic_category_id` to the `insertSeedEntry` call in `explode-entry.ts` (committed). For entries created before the fix, run:
```sql
-- Inherit topic_category_id from parent entries
UPDATE seed_entry_queue child
SET topic_category_id = parent.topic_category_id
FROM seed_entry_queue parent
WHERE child.parent_entry_id = parent.id
  AND child.topic_category_id IS NULL
  AND parent.topic_category_id IS NOT NULL;
```
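The backfill's logic is equivalent to the following in-memory pass, shown here only to make the join explicit. The entry shape is a simplified assumption; the real `seed_entry_queue` row has more columns.

```typescript
// Simplified view of a queue row (assumed shape, for illustration only).
interface SeedEntry {
  id: string;
  parentEntryId: string | null;
  topicCategoryId: string | null;
}

// Copy the category from the parent entry to any child that is missing one,
// mirroring the UPDATE ... FROM join in the SQL backfill.
function inheritParentCategories(entries: SeedEntry[]): SeedEntry[] {
  const byId = new Map<string, SeedEntry>(
    entries.map((e) => [e.id, e] as [string, SeedEntry])
  );
  return entries.map((e) => {
    if (e.topicCategoryId !== null || e.parentEntryId === null) return e;
    const parent = byId.get(e.parentEntryId);
    return parent?.topicCategoryId
      ? { ...e, topicCategoryId: parent.topicCategoryId }
      : e;
  });
}
```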
### Rate Limit Failures
**Problem:** Workers hit the tokens-per-minute (TPM) ceiling; after 3 failed retries the job lands in the dead-letter queue (DLQ) and the entry is marked `failed`.

**Fix:** Reset failed entries and re-enqueue:

```sql
UPDATE seed_entry_queue SET status = 'pending' WHERE status = 'failed';
```
Then run `bun scripts/seed/bulk-enqueue.ts` to re-dispatch.
### Slow CLI Enqueue
**Problem:** `--explode` dispatches entries one at a time to Redis (one HTTP call per entry), taking minutes for large batches.

**Fix:** Use `scripts/seed/bulk-enqueue.ts`, which batches calls through `enqueueMany()` over a Redis pipeline (~60x faster).
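The speedup comes from amortizing network round trips: one pipelined request per chunk of entries instead of one per entry. A minimal sketch of the chunking step is below; `enqueueMany` and the chunk size of 500 are assumptions based on the description above, not the script's verified internals.

```typescript
// Split a list of items into fixed-size chunks so each chunk can be sent
// as a single pipelined Redis request instead of N individual calls.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Hypothetical usage (queue client and batch size are illustrative):
// for (const batch of chunk(entryIds, 500)) {
//   await queue.enqueueMany(batch);
// }
```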
## Monitoring
### Status Script
A bash monitoring script at `/tmp/seed-status.sh` aggregates progress across all workers:
```bash
bash /tmp/seed-status.sh
```
Shows: total explosions, facts generated, AI spend, per-worker breakdown, throughput rate, and ETA.
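The throughput rate and ETA reduce to simple arithmetic over the counters the script collects. A sketch under assumed inputs (the script's actual calculation is not shown in this document):

```typescript
// Entries processed per minute over an observation window.
function throughputPerMin(processed: number, elapsedSeconds: number): number {
  return elapsedSeconds > 0 ? (processed / elapsedSeconds) * 60 : 0;
}

// Minutes remaining at the current rate; Infinity when nothing is moving.
function etaMinutes(pending: number, ratePerMin: number): number {
  return ratePerMin > 0 ? pending / ratePerMin : Infinity;
}
```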
### DB Queries
```sql
-- Entry status distribution
SELECT status, COUNT(*) FROM seed_entry_queue GROUP BY status;

-- Facts by source type
SELECT source_type, COUNT(*) FROM fact_records GROUP BY source_type;

-- Total AI spend
SELECT SUM(cost_usd::numeric) FROM ai_cost_log WHERE purpose LIKE '%seed%';

-- Entries with missing category
SELECT COUNT(*) FROM seed_entry_queue WHERE topic_category_id IS NULL AND status = 'pending';
```
## Rollback
```sql
-- Remove all seeded facts
DELETE FROM fact_records WHERE source_type IN ('file_seed', 'ai_super_fact');

-- Reset all entries
UPDATE seed_entry_queue SET status = 'pending', facts_generated = 0, spinoffs_discovered = 0;

-- Or nuclear option: clear everything
DELETE FROM seed_entry_queue;
```
## Migration History
| Migration | Purpose |
|---|---|
| 0096 | Initial topic categories + schemas |
| 0101 | Expanded categories |
| 0104 | seed_entry_queue table |
| 0105 | super_fact_links table |
| 0117 | source_type CHECK constraint on fact_records |
| 0120 | Fix schema formats and topic linkages |
| 0121 | fact_challenge_content table |
| 0122 | correct_answer column on fact_challenge_content |
| 0126 | topic_category_aliases + unmapped_category_log tables |
| 0127 | Deactivate 5 redundant root categories + depth column |
| 0128 | Taxonomy indexes + CHECK constraint (depth/parent coherence) |
## Related Documents
- `SEED.md` — Seeding control prompt (edit to direct seeding operations)
- `seeding-best-practices.md` — Strategies, examples, cost management
- `runbook.md` — Step-by-step operational procedures
- `TODO.md` — Progress tracker for all seeding workstreams
- `logs/` — Structured seed job logs with costs, errors, and results
- `04-taxonomy-coherence.md` — Category alias mapping (provider → internal)
- `01-taxonomy-expansion.md` — Subcategory materialization plan
- `../../rules/challenge-content.md` — Quality rules (CC/CQ)
- `APP-CONTROL.md` — App control manifest (crons, workers, queues, APIs)
- Ops Logs — Operational event logging (parallel to seed logs)