# Seed Pipeline Documentation

The seed pipeline populates the Eko platform with structured facts from legacy content files (XLSX, DOCX, CSV). It uses a multi-stage architecture: file parsing, AI-powered "explosion" into individual facts, validation, and super-fact discovery.

| Document | Purpose |
| --- | --- |
| SEED.md | Seeding control prompt — edit this to direct what gets seeded |
| seeding-best-practices.md | Strategies, examples, cost management, and pitfalls |
| runbook.md | Step-by-step operational procedures |
| TODO.md | Progress tracker for all seeding workstreams |
| logs/ | Seed job logs — structured per-job records with costs, errors, and results |

## Architecture

```
Legacy Files / Curated Entries / News APIs / Evergreen AI
        |
    [Scripts / Crons]
        |
  seed_entry_queue (DB)
        |
    [bulk-enqueue.ts]
        |
  Redis Queue (EXPLODE_CATEGORY_ENTRY)
        |
    [worker-facts] ──> AI (gpt-5-mini via ModelAdapter) ──> fact_records (DB)
        |                                                      |
  spin-off entries ──> seed_entry_queue                  [worker-validate]
                                                               |
                                                       validated facts
                                                               |
                                                  [generate-challenge-content.ts]
                                                               |
                                                  fact_challenge_content (DB)
```

## Key Components

| Component | Path | Purpose |
| --- | --- | --- |
| CLI Orchestrator | scripts/seed/seed-from-files.ts | Parse files, dispatch to queues, show stats |
| Bulk Enqueue | scripts/seed/bulk-enqueue.ts | Fast batch enqueue using the enqueueMany pipeline |
| Explosion Worker | apps/worker-facts/src/handlers/explode-entry.ts | AI-powered fact extraction from entries |
| Import Handler | apps/worker-facts/src/handlers/import-facts.ts | Batch insert of facts into fact_records |
| Category Mapper | scripts/seed/lib/category-mapper.ts | Map file content to topic categories |
| File Parsers | scripts/seed/lib/parsers/ | XLSX, DOCX, CSV content parsers |

## Database Tables

| Table | Purpose |
| --- | --- |
| seed_entry_queue | Work queue for entries pending explosion |
| fact_records | Generated facts with source_type='file_seed' |
| fact_record_schemas | Schema definitions per topic category |
| topic_categories | 31 active root topic categories (depth 0) |
| topic_category_aliases | External provider slug → internal category mapping |
| fact_challenge_content | Pre-generated quiz content (6 styles per fact) |
| super_fact_links | Cross-entry correlations |
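For orientation, a seed_entry_queue row can be modeled roughly as below. This is a sketch inferred from the columns the SQL elsewhere in this document touches (status, topic_category_id, parent_entry_id, facts_generated, spinoffs_discovered); the authoritative schema is migration 0104, and the status values shown are an assumption.

```typescript
// Rough shape of a seed_entry_queue row, inferred from the SQL in this doc.
// The real schema (migration 0104) is authoritative and may differ.
type SeedEntryStatus = 'pending' | 'processing' | 'completed' | 'failed';

interface SeedEntryQueueRow {
  id: string;                       // UUID primary key
  status: SeedEntryStatus;
  topic_category_id: string | null; // canonical category UUID; null = unmapped
  parent_entry_id: string | null;   // set for spin-off entries
  facts_generated: number;
  spinoffs_discovered: number;
}

// An entry is dispatchable once it is pending and has a resolved category;
// the monitoring query for "entries with missing category" counts the rest.
function isDispatchable(row: SeedEntryQueueRow): boolean {
  return row.status === 'pending' && row.topic_category_id !== null;
}
```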

## CLI Commands

```bash
# Parse files into seed_entry_queue
bun scripts/seed/seed-from-files.ts --parse --dry-run   # Preview without DB writes
bun scripts/seed/seed-from-files.ts --parse             # Insert entries

# Dispatch entries to worker queue (slow, one-by-one)
bun scripts/seed/seed-from-files.ts --explode --topic entertainment --batch-size 500

# Fast bulk dispatch using Redis pipeline (recommended for large runs)
bun scripts/seed/bulk-enqueue.ts

# Process spin-off entries
bun scripts/seed/seed-from-files.ts --explode-spinoffs

# View pipeline dashboard
bun scripts/seed/seed-from-files.ts --stats
```

## Running Workers

Workers consume from Upstash Redis queues. Use WORKER_CONCURRENCY to control parallel processing per worker instance.

```bash
# Single worker, default concurrency (1)
bun run dev:worker-facts

# High-throughput: multiple workers with concurrency
WORKER_CONCURRENCY=10 PORT=4010 bun run dev:worker-facts
WORKER_CONCURRENCY=10 PORT=4011 bun run dev:worker-facts
# ... up to N workers per API key

# Dual API key setup for 2x rate limit pool
OPENAI_API_KEY=key1 WORKER_CONCURRENCY=10 PORT=4010 bun run dev:worker-facts
OPENAI_API_KEY=key2 WORKER_CONCURRENCY=10 PORT=4020 bun run dev:worker-facts
```
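Conceptually, WORKER_CONCURRENCY bounds how many jobs a single worker process handles in parallel. A minimal sketch of that pattern, where `dequeueJob` and `handle` are hypothetical stand-ins for the actual worker-facts internals:

```typescript
// Minimal concurrency-slot sketch: N slots each pull jobs in a loop, so at
// most N jobs are in flight per process. `dequeueJob` and `handle` are
// hypothetical stand-ins, not the real worker-facts API.
async function runWorker<T>(
  dequeueJob: () => Promise<T | null>,
  handle: (job: T) => Promise<void>,
): Promise<void> {
  const concurrency = Number(process.env.WORKER_CONCURRENCY ?? '1');
  const slot = async (): Promise<void> => {
    for (;;) {
      const job = await dequeueJob();
      if (job === null) return; // queue drained
      await handle(job);
    }
  };
  // N independent slots share the same queue.
  await Promise.all(Array.from({ length: concurrency }, slot));
}
```

The dual-key setup above works because rate limits are tracked per API key, so two processes with distinct keys draw from separate pools.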

## Throughput & Rate Limits

**Current model:** gpt-5-mini (via ModelAdapter)

AI calls route through the ModelAdapter abstraction, with gpt-5-mini as the default model. Available models: gpt-5-mini, gemini-2.5-flash, gemini-3-flash-preview, claude-haiku-4-5. See SEED.md for cost estimates and seeding-best-practices.md for volume-tuning guidance.
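The routing idea can be sketched as below. The names are illustrative, not the actual ModelAdapter API: callers may request any model from the pool, and unknown or unset requests fall back to the documented default.

```typescript
// Illustrative model-routing sketch; the real ModelAdapter interface may differ.
const MODEL_POOL = [
  'gpt-5-mini',
  'gemini-2.5-flash',
  'gemini-3-flash-preview',
  'claude-haiku-4-5',
] as const;

type ModelId = (typeof MODEL_POOL)[number];

// Resolve a requested model ID, falling back to the default (gpt-5-mini)
// when the request is unset or not in the pool.
function resolveModel(requested?: string): ModelId {
  return (MODEL_POOL as readonly string[]).includes(requested ?? '')
    ? (requested as ModelId)
    : 'gpt-5-mini';
}
```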

### Cost Estimates (gpt-5-mini)

| Operation | Per-Unit Cost | Example |
| --- | --- | --- |
| Entity generation | ~$0.002/entity | 500 entities = ~$1 |
| Fact explosion | ~$0.01/entity | 500 entities = ~$5 |
| Challenge content | ~$0.006/fact | 10,000 facts = ~$60 |
| Content cleanup | ~$0.004/fact | 10,000 facts = ~$40 |
| News extraction | ~$0.003/story | 100 stories/day = ~$0.30/day |

See seeding-best-practices.md for budget templates and cost reduction strategies.
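The per-unit figures above compose into a quick budget check. A sketch using only the table's estimates (these are rough gpt-5-mini numbers; re-check SEED.md before committing to a large run):

```typescript
// Back-of-envelope budget calculator using the per-unit estimates from the
// cost table above. Rough figures only; actual spend varies with token counts.
const COST_PER_UNIT = {
  entityGeneration: 0.002, // per entity
  factExplosion: 0.01,     // per entity
  challengeContent: 0.006, // per fact
  contentCleanup: 0.004,   // per fact
} as const;

function estimateSeedBudget(entities: number, facts: number): number {
  return (
    entities * (COST_PER_UNIT.entityGeneration + COST_PER_UNIT.factExplosion) +
    facts * (COST_PER_UNIT.challengeContent + COST_PER_UNIT.contentCleanup)
  );
}
```

For example, 500 entities plus 10,000 facts comes to roughly $106, matching the table rows ($1 + $5 + $60 + $40).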

## Known Issues & Workarounds

### Spinoff Category Inheritance

Problem: The AI explosion generates spin-off entries with suggestedTopicPath slugs (e.g., music/hip-hop-sampling) but not the canonical topic_category_id UUID. Without the UUID, spinoffs can't be processed.

Fix: Added topicCategoryId: topic_category_id to the insertSeedEntry call in explode-entry.ts (committed). For entries created before the fix, run:

```sql
-- Inherit topic_category_id from parent entries
UPDATE seed_entry_queue child
SET topic_category_id = parent.topic_category_id
FROM seed_entry_queue parent
WHERE child.parent_entry_id = parent.id
AND child.topic_category_id IS NULL
AND parent.topic_category_id IS NOT NULL;
```
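The forward fix can also be sketched in code: resolve the AI's suggestedTopicPath slug to a canonical UUID before inserting the spin-off. The lookup order below is an assumption for illustration — topic_category_aliases handles slug mapping per the tables section, and inheriting from the parent mirrors the SQL backfill above.

```typescript
// Illustrative slug-to-UUID resolution for spin-off entries. The fallback
// order (full path, then root slug, then parent's category) is an assumption,
// not the confirmed explode-entry.ts behavior.
function resolveTopicCategoryId(
  suggestedTopicPath: string,          // e.g. 'music/hip-hop-sampling'
  aliases: Map<string, string>,        // slug -> topic_category_id (UUID)
  parentCategoryId: string | null,     // category of the parent entry
): string | null {
  const rootSlug = suggestedTopicPath.split('/')[0];
  return (
    aliases.get(suggestedTopicPath) ?? // exact path alias
    aliases.get(rootSlug) ??           // root-category alias
    parentCategoryId                   // inherit, like the SQL backfill
  );
}
```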

### Rate Limit Failures

Problem: Workers hit the tokens-per-minute (TPM) ceiling; after 3 retries the job lands in the dead-letter queue (DLQ) and the entry is marked failed.

Fix: Reset failed entries and re-enqueue:

```sql
UPDATE seed_entry_queue SET status = 'pending' WHERE status = 'failed';
```

Then run bun scripts/seed/bulk-enqueue.ts to re-dispatch.

### Slow CLI Enqueue

Problem: --explode dispatches entries one-by-one to Redis (1 HTTP call per entry), taking minutes for large batches.

Fix: Use scripts/seed/bulk-enqueue.ts which uses enqueueMany() for batched Redis pipeline calls (~60x faster).
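The difference is round trips, not work: enqueueMany() sends entries in pipelined batches rather than one per HTTP call. A sketch of the batching, where `sendPipeline` is a stand-in for the real Redis pipeline call:

```typescript
// Batched enqueue sketch: one round trip per batch instead of per entry.
// `sendPipeline` is a hypothetical stand-in for the actual enqueueMany() call.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

async function bulkEnqueue(
  entryIds: string[],
  sendPipeline: (batch: string[]) => Promise<void>,
  batchSize = 100,
): Promise<number> {
  let roundTrips = 0;
  for (const batch of chunk(entryIds, batchSize)) {
    await sendPipeline(batch); // one HTTP round trip per batch
    roundTrips += 1;
  }
  return roundTrips; // far fewer than entryIds.length
}
```

At a batch size of 100, for instance, 6,000 entries take 60 round trips instead of 6,000.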

## Monitoring

### Status Script

A bash monitoring script at /tmp/seed-status.sh aggregates progress across all workers:

```bash
bash /tmp/seed-status.sh
```

Shows: total explosions, facts generated, AI spend, per-worker breakdown, throughput rate, and ETA.
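The ETA it reports is simple arithmetic: remaining entries divided by observed throughput. In TypeScript for clarity (the script itself is bash):

```typescript
// ETA calculation as the status script presumably does it: remaining work
// divided by the observed processing rate.
function etaSeconds(
  pendingEntries: number,
  processedEntries: number,
  elapsedSeconds: number,
): number | null {
  if (processedEntries === 0 || elapsedSeconds === 0) return null; // no rate yet
  const entriesPerSecond = processedEntries / elapsedSeconds;
  return pendingEntries / entriesPerSecond;
}
```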

### DB Queries

```sql
-- Entry status distribution
SELECT status, COUNT(*) FROM seed_entry_queue GROUP BY status;

-- Facts by source type
SELECT source_type, COUNT(*) FROM fact_records GROUP BY source_type;

-- Total AI spend
SELECT SUM(cost_usd::numeric) FROM ai_cost_log WHERE purpose LIKE '%seed%';

-- Entries with missing category
SELECT COUNT(*) FROM seed_entry_queue WHERE topic_category_id IS NULL AND status = 'pending';
```

## Rollback

```sql
-- Remove all seeded facts
DELETE FROM fact_records WHERE source_type IN ('file_seed', 'ai_super_fact');

-- Reset all entries
UPDATE seed_entry_queue SET status = 'pending', facts_generated = 0, spinoffs_discovered = 0;

-- Or nuclear option: clear everything
DELETE FROM seed_entry_queue;
```

## Migration History

| Migration | Purpose |
| --- | --- |
| 0096 | Initial topic categories + schemas |
| 0101 | Expanded categories |
| 0104 | seed_entry_queue table |
| 0105 | super_fact_links table |
| 0117 | source_type CHECK constraint on fact_records |
| 0120 | Fix schema formats and topic linkages |
| 0121 | fact_challenge_content table |
| 0122 | correct_answer column on fact_challenge_content |
| 0126 | topic_category_aliases + unmapped_category_log tables |
| 0127 | Deactivate 5 redundant root categories + depth column |
| 0128 | Taxonomy indexes + CHECK constraint (depth/parent coherence) |