# Seed Pipeline Documentation

The seed pipeline populates the Eko platform with structured facts from legacy content files (XLSX, DOCX, CSV). It uses a multi-stage architecture: file parsing, AI-powered "explosion" into individual facts, validation, and super-fact discovery.

| Document | Purpose |
| --- | --- |
| SEED.md | Seeding control prompt — edit this to direct what gets seeded |
| seeding-best-practices.md | Strategies, examples, cost management, and pitfalls |
| runbook.md | Step-by-step operational procedures |
| TODO.md | Progress tracker for all seeding workstreams |
| logs/ | Seed job logs — structured per-job records with costs, errors, and results |

## Architecture

```
Legacy Files / Curated Entries / News APIs / Evergreen AI
        |
    [Scripts / Crons]
        |
  seed_entry_queue (DB)
        |
    [bulk-enqueue.ts]
        |
  Redis Queue (EXPLODE_CATEGORY_ENTRY)
        |
    [worker-facts] ──> AI (gpt-5-mini via ModelAdapter) ──> fact_records (DB)
        |                                                      |
  spin-off entries ──> seed_entry_queue                  [worker-validate]
                                                               |
                                                       validated facts
                                                               |
                                                  [generate-challenge-content.ts]
                                                               |
                                                  fact_challenge_content (DB)
```

## Key Components

| Component | Path | Purpose |
| --- | --- | --- |
| CLI Orchestrator | scripts/seed/seed-from-files.ts | Parse files, dispatch to queues, show stats |
| Bulk Enqueue | scripts/seed/bulk-enqueue.ts | Fast batch enqueue using the enqueueMany pipeline |
| Explosion Worker | apps/worker-facts/src/handlers/explode-entry.ts | AI-powered fact extraction from entries |
| Import Handler | apps/worker-facts/src/handlers/import-facts.ts | Batch insert of facts into fact_records |
| Category Mapper | scripts/seed/lib/category-mapper.ts | Map file content to topic categories |
| File Parsers | scripts/seed/lib/parsers/ | XLSX, DOCX, CSV content parsers |

## Database Tables

| Table | Purpose |
| --- | --- |
| seed_entry_queue | Work queue for entries pending explosion |
| fact_records | Generated facts with source_type='file_seed' |
| fact_record_schemas | Schema definitions per topic category |
| topic_categories | 31 active root topic categories (depth 0) |
| topic_category_aliases | External provider slug → internal category mapping |
| fact_challenge_content | Pre-generated quiz content (6 styles per fact) |
| super_fact_links | Cross-entry correlations |
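For orientation, a seed_entry_queue row can be modeled roughly as below. This is a sketch inferred from the columns the SQL elsewhere in this document touches (status, topic_category_id, parent_entry_id, facts_generated, spinoffs_discovered); the authoritative schema is migration 0104, and the status values shown are an assumption.

```typescript
// Rough shape of a seed_entry_queue row, inferred from the SQL in this doc.
// The real schema (migration 0104) is authoritative and may differ.
type SeedEntryStatus = 'pending' | 'processing' | 'completed' | 'failed';

interface SeedEntryQueueRow {
  id: string;                       // UUID primary key
  status: SeedEntryStatus;
  topic_category_id: string | null; // canonical category UUID; null = unmapped
  parent_entry_id: string | null;   // set for spin-off entries
  facts_generated: number;
  spinoffs_discovered: number;
}

// An entry is dispatchable once it is pending and has a resolved category;
// the monitoring query for "entries with missing category" counts the rest.
function isDispatchable(row: SeedEntryQueueRow): boolean {
  return row.status === 'pending' && row.topic_category_id !== null;
}
```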

## CLI Commands

```bash
# Parse files into seed_entry_queue
bun scripts/seed/seed-from-files.ts --parse --dry-run   # Preview without DB writes
bun scripts/seed/seed-from-files.ts --parse             # Insert entries

# Dispatch entries to worker queue (slow, one-by-one)
bun scripts/seed/seed-from-files.ts --explode --topic entertainment --batch-size 500

# Fast bulk dispatch using Redis pipeline (recommended for large runs)
bun scripts/seed/bulk-enqueue.ts

# Process spin-off entries
bun scripts/seed/seed-from-files.ts --explode-spinoffs

# View pipeline dashboard
bun scripts/seed/seed-from-files.ts --stats
```

## Running Workers

Workers consume from Upstash Redis queues. Use WORKER_CONCURRENCY to control parallel processing per worker instance.

```bash
# Single worker, default concurrency (1)
bun run dev:worker-facts

# High-throughput: multiple workers with concurrency
WORKER_CONCURRENCY=10 PORT=4010 bun run dev:worker-facts
WORKER_CONCURRENCY=10 PORT=4011 bun run dev:worker-facts
# ... up to N workers per API key

# Dual API key setup for 2x rate limit pool
OPENAI_API_KEY=key1 WORKER_CONCURRENCY=10 PORT=4010 bun run dev:worker-facts
OPENAI_API_KEY=key2 WORKER_CONCURRENCY=10 PORT=4020 bun run dev:worker-facts
```
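Conceptually, WORKER_CONCURRENCY bounds how many jobs a single worker process handles in parallel. A minimal sketch of that pattern, where `dequeueJob` and `handle` are hypothetical stand-ins for the actual worker-facts internals:

```typescript
// Minimal concurrency-slot sketch: N slots each pull jobs in a loop, so at
// most N jobs are in flight per process. `dequeueJob` and `handle` are
// hypothetical stand-ins, not the real worker-facts API.
async function runWorker<T>(
  dequeueJob: () => Promise<T | null>,
  handle: (job: T) => Promise<void>,
): Promise<void> {
  const concurrency = Number(process.env.WORKER_CONCURRENCY ?? '1');
  const slot = async (): Promise<void> => {
    for (;;) {
      const job = await dequeueJob();
      if (job === null) return; // queue drained
      await handle(job);
    }
  };
  // N independent slots share the same queue.
  await Promise.all(Array.from({ length: concurrency }, slot));
}
```

The dual-key setup above works because rate limits are tracked per API key, so two processes with distinct keys draw from separate pools.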

## Throughput & Rate Limits

**Current model:** gpt-5-mini (via ModelAdapter)

AI calls route through the ModelAdapter abstraction, with gpt-5-mini as the default model. Available models: gpt-5-mini, gemini-2.5-flash, gemini-3-flash-preview, claude-haiku-4-5. See SEED.md for cost estimates and seeding-best-practices.md for volume-tuning guidance.
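The routing idea can be sketched as below. The names are illustrative, not the actual ModelAdapter API: callers may request any model from the pool, and unknown or unset requests fall back to the documented default.

```typescript
// Illustrative model-routing sketch; the real ModelAdapter interface may differ.
const MODEL_POOL = [
  'gpt-5-mini',
  'gemini-2.5-flash',
  'gemini-3-flash-preview',
  'claude-haiku-4-5',
] as const;

type ModelId = (typeof MODEL_POOL)[number];

// Resolve a requested model ID, falling back to the default (gpt-5-mini)
// when the request is unset or not in the pool.
function resolveModel(requested?: string): ModelId {
  return (MODEL_POOL as readonly string[]).includes(requested ?? '')
    ? (requested as ModelId)
    : 'gpt-5-mini';
}
```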

### Cost Estimates (gpt-5-mini)

| Operation | Per-Unit Cost | Example |
| --- | --- | --- |
| Entity generation | ~$0.002/entity | 500 entities = ~$1 |
| Fact explosion | ~$0.01/entity | 500 entities = ~$5 |
| Challenge content | ~$0.006/fact | 10,000 facts = ~$60 |
| Content cleanup | ~$0.004/fact | 10,000 facts = ~$40 |
| News extraction | ~$0.003/story | 100 stories/day = ~$0.30/day |

See seeding-best-practices.md for budget templates and cost reduction strategies.
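The per-unit figures above compose into a quick budget check. A sketch using only the table's estimates (these are rough gpt-5-mini numbers; re-check SEED.md before committing to a large run):

```typescript
// Back-of-envelope budget calculator using the per-unit estimates from the
// cost table above. Rough figures only; actual spend varies with token counts.
const COST_PER_UNIT = {
  entityGeneration: 0.002, // per entity
  factExplosion: 0.01,     // per entity
  challengeContent: 0.006, // per fact
  contentCleanup: 0.004,   // per fact
} as const;

function estimateSeedBudget(entities: number, facts: number): number {
  return (
    entities * (COST_PER_UNIT.entityGeneration + COST_PER_UNIT.factExplosion) +
    facts * (COST_PER_UNIT.challengeContent + COST_PER_UNIT.contentCleanup)
  );
}
```

For example, 500 entities plus 10,000 facts comes to roughly $106, matching the table rows ($1 + $5 + $60 + $40).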

## Known Issues & Workarounds

### Spinoff Category Inheritance

Problem: The AI explosion generates spin-off entries with suggestedTopicPath slugs (e.g., music/hip-hop-sampling) but not the canonical topic_category_id UUID. Without the UUID, spinoffs can't be processed.

Fix: Added topicCategoryId: topic_category_id to the insertSeedEntry call in explode-entry.ts (committed). For entries created before the fix, run:

```sql
-- Inherit topic_category_id from parent entries
UPDATE seed_entry_queue child
SET topic_category_id = parent.topic_category_id
FROM seed_entry_queue parent
WHERE child.parent_entry_id = parent.id
AND child.topic_category_id IS NULL
AND parent.topic_category_id IS NOT NULL;
```
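The forward fix can also be sketched in code: resolve the AI's suggestedTopicPath slug to a canonical UUID before inserting the spin-off. The lookup order below is an assumption for illustration — topic_category_aliases handles slug mapping per the tables section, and inheriting from the parent mirrors the SQL backfill above.

```typescript
// Illustrative slug-to-UUID resolution for spin-off entries. The fallback
// order (full path, then root slug, then parent's category) is an assumption,
// not the confirmed explode-entry.ts behavior.
function resolveTopicCategoryId(
  suggestedTopicPath: string,          // e.g. 'music/hip-hop-sampling'
  aliases: Map<string, string>,        // slug -> topic_category_id (UUID)
  parentCategoryId: string | null,     // category of the parent entry
): string | null {
  const rootSlug = suggestedTopicPath.split('/')[0];
  return (
    aliases.get(suggestedTopicPath) ?? // exact path alias
    aliases.get(rootSlug) ??           // root-category alias
    parentCategoryId                   // inherit, like the SQL backfill
  );
}
```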

### Rate Limit Failures

Problem: Workers hit the tokens-per-minute (TPM) ceiling; after 3 retries the job lands in the dead-letter queue (DLQ) and the entry is marked failed.

Fix: Reset failed entries and re-enqueue:

```sql
UPDATE seed_entry_queue SET status = 'pending' WHERE status = 'failed';
```

Then run bun scripts/seed/bulk-enqueue.ts to re-dispatch.

### Slow CLI Enqueue

Problem: --explode dispatches entries one-by-one to Redis (1 HTTP call per entry), taking minutes for large batches.

Fix: Use scripts/seed/bulk-enqueue.ts which uses enqueueMany() for batched Redis pipeline calls (~60x faster).
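The difference is round trips, not work: enqueueMany() sends entries in pipelined batches rather than one per HTTP call. A sketch of the batching, where `sendPipeline` is a stand-in for the real Redis pipeline call:

```typescript
// Batched enqueue sketch: one round trip per batch instead of per entry.
// `sendPipeline` is a hypothetical stand-in for the actual enqueueMany() call.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

async function bulkEnqueue(
  entryIds: string[],
  sendPipeline: (batch: string[]) => Promise<void>,
  batchSize = 100,
): Promise<number> {
  let roundTrips = 0;
  for (const batch of chunk(entryIds, batchSize)) {
    await sendPipeline(batch); // one HTTP round trip per batch
    roundTrips += 1;
  }
  return roundTrips; // far fewer than entryIds.length
}
```

At a batch size of 100, for instance, 6,000 entries take 60 round trips instead of 6,000.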

## Monitoring

### Status Script

A bash monitoring script at /tmp/seed-status.sh aggregates progress across all workers:

```bash
bash /tmp/seed-status.sh
```

Shows: total explosions, facts generated, AI spend, per-worker breakdown, throughput rate, and ETA.
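The ETA it reports is simple arithmetic: remaining entries divided by observed throughput. In TypeScript for clarity (the script itself is bash):

```typescript
// ETA calculation as the status script presumably does it: remaining work
// divided by the observed processing rate.
function etaSeconds(
  pendingEntries: number,
  processedEntries: number,
  elapsedSeconds: number,
): number | null {
  if (processedEntries === 0 || elapsedSeconds === 0) return null; // no rate yet
  const entriesPerSecond = processedEntries / elapsedSeconds;
  return pendingEntries / entriesPerSecond;
}
```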

### DB Queries

```sql
-- Entry status distribution
SELECT status, COUNT(*) FROM seed_entry_queue GROUP BY status;

-- Facts by source type
SELECT source_type, COUNT(*) FROM fact_records GROUP BY source_type;

-- Total AI spend
SELECT SUM(cost_usd::numeric) FROM ai_cost_log WHERE purpose LIKE '%seed%';

-- Entries with missing category
SELECT COUNT(*) FROM seed_entry_queue WHERE topic_category_id IS NULL AND status = 'pending';
```

## Rollback

```sql
-- Remove all seeded facts
DELETE FROM fact_records WHERE source_type IN ('file_seed', 'ai_super_fact');

-- Reset all entries
UPDATE seed_entry_queue SET status = 'pending', facts_generated = 0, spinoffs_discovered = 0;

-- Or nuclear option: clear everything
DELETE FROM seed_entry_queue;
```

## Migration History

| Migration | Purpose |
| --- | --- |
| 0096 | Initial topic categories + schemas |
| 0101 | Expanded categories |
| 0104 | seed_entry_queue table |
| 0105 | super_fact_links table |
| 0117 | source_type CHECK constraint on fact_records |
| 0120 | Fix schema formats and topic linkages |
| 0121 | fact_challenge_content table |
| 0122 | correct_answer column on fact_challenge_content |
| 0126 | topic_category_aliases + unmapped_category_log tables |
| 0127 | Deactivate 5 redundant root categories + depth column |
| 0128 | Taxonomy indexes + CHECK constraint (depth/parent coherence) |