Seeding Prompts
Prompts for bootstrapping topic categories with curated seed entries, managing the explosion pipeline, and running batch seeding operations.
Model routing: All seeding scripts use the Eko model router (`selectModelForTask()`) to pick models by task tier; they are not hardcoded to a single provider. The `seed_explosion` task maps to the `default` tier via `TASK_TIER_MAP` in `packages/ai/src/fact-engine.ts`. The tier-to-model mapping is read from the `ai_model_tier_config` DB table (60s cache), falling back to `DEFAULT_TIER_CONFIG_DATA` in code. The API key you need depends on which model the default tier resolves to. See the model testing index for the full list of models and their required env vars.
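The lookup order described above (DB table behind a 60s cache, then in-code defaults) can be sketched roughly as follows. This is an illustration under assumptions, not the actual router: `createResolver`, `TierConfig`, the `fetchTierConfig` loader, and the model names are hypothetical. Only the `seed_explosion` to `default` mapping, the 60s cache, and the fallback behavior come from the description above.

```typescript
// Hypothetical sketch of tier-based model resolution; not the real Eko router.
type Tier = string;
type TierConfig = Record<Tier, string>; // tier -> model id

// Only the seed_explosion -> default mapping is taken from the docs.
const TASK_TIER_MAP: Record<string, Tier> = { seed_explosion: "default" };

// Placeholder standing in for DEFAULT_TIER_CONFIG_DATA; the model id is made up.
const FALLBACK_CONFIG: TierConfig = { default: "placeholder-model" };

const CACHE_TTL_MS = 60_000; // 60s cache, as described above

function createResolver(
  fetchTierConfig: () => TierConfig | null, // stands in for the DB read
  now: () => number = Date.now,
) {
  let cached: { config: TierConfig; at: number } | null = null;

  return function resolveModel(task: string): string {
    const tier = TASK_TIER_MAP[task] ?? "default";
    let entry = cached;
    if (entry === null || now() - entry.at >= CACHE_TTL_MS) {
      // Refresh from the DB; use the in-code defaults when nothing is returned.
      entry = { config: fetchTierConfig() ?? FALLBACK_CONFIG, at: now() };
      cached = entry;
    }
    return entry.config[tier] ?? FALLBACK_CONFIG[tier];
  };
}
```

Injecting `now` and the loader keeps the sketch testable without a database; the real implementation may structure its cache differently.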
Prompts
| # | Prompt | Cost | Duration |
|---|---|---|---|
| 1 | Seed the Database | ~$68 | 1-3 hours |
| 2 | Generate Curated Entries | ~$2-5 | 10-30 min |
| 3 | Bulk Enqueue Entries | $0 | 2-5 min |
| 4 | Generate Challenge Content | $20-100 | 4-12 hours |
| 5 | Improve Titles | ~$0.003/fact | varies |
| 6 | Find Super Facts | ~$0.01/fact | varies |
| 7 | Run Voice Enforcement | ~$0.005/fact | varies |
| 8 | Seed from Files | $0 | 5-15 min |
| 9 | Tune Challenge Allocation | $0 | 5-10 min |
| 10 | Rewrite Challenge Defects | ~$0.003/challenge | varies |
FAQ
Concepts
What is the seeding pipeline and when do I use it?
- Bootstraps new topic categories with high-quality AI-generated content when news/evergreen alone is insufficient.
- Pipeline: `generate-curated-entries.ts` -> DB `seed_entry_queue` -> `EXPLODE_CATEGORY_ENTRY` (worker-facts) -> facts -> `VALIDATE_FACT` -> `GENERATE_CHALLENGE_CONTENT`.
- Use when a topic is empty or needs deeper coverage beyond automated news ingestion.
- 44 CategorySpec definitions in `packages/ai/src/config/categories.ts` cover all 33 active root categories (~49,000 seed entries total across 197 subcategories).
- Controlled via `docs/projects/seeding/SEED.md` directives -- seeding mode, topic targets, volume caps, cost controls.
What is the difference between curated entries, file seeds, and spinoff discoveries?
- Curated entries: AI-generated entity lists for a topic (e.g., "Notable physicists") -- the seed input stored in `seed_entry_queue`.
- `file_seed`: Primary facts produced by exploding a curated entry (10-100 facts per entity depending on richness tier).
- `spinoff_discovery`: Tangential entities discovered during explosion (cross-references, related figures) -- re-inserted into `seed_entry_queue` for their own explosion pass.
- All three are `source_type` values on `fact_records`, enabling provenance tracking through the pipeline.
What are super facts and how does cross-entry correlation work?
- Deprecated: the Super Facts pipeline was removed in March 2026 and the handler no longer exists. The points below describe the former behavior.
- Super facts connected multiple seed entries (e.g., "Both Einstein and Bohr studied at ETH Zurich").
- Queue: `FIND_SUPER_FACTS` -> worker-facts -> AI compared facts across entries within a batch to find meaningful connections.
- Stored with `source_type = 'ai_super_fact'`, linked to 2-3 entities via the `super_fact_links` table.
- Previously triggered via `seed-from-files.ts --super-facts` or automatically after batch explosion.
Running Seeds
How do I seed a single topic category from scratch?
- Step 1: Generate curated entries: `bun scripts/seed/generate-curated-entries.ts --category science --insert`.
- Step 2: Enqueue explosion: `bun scripts/seed/bulk-enqueue.ts` (processes all pending entries in `seed_entry_queue`).
- Step 3: Start workers with higher concurrency: `WORKER_CONCURRENCY=5 bun run dev:worker-facts`.
- Step 4: Monitor via admin dashboard queue page or query `fact_records` count for the topic.
How does the partition flag work for parallel seeding?
- `--partition N/M` splits entries into M equal partitions and processes only partition N.
- Example: `--partition 1/3` processes the first third, `--partition 2/3` the second, `--partition 3/3` the last.
- Allows running multiple terminal sessions in parallel for faster seeding of large batches.
- Supported by: `generate-challenge-content.ts`, `rewrite-challenge-defects.ts`, `regen-voice-pass.ts`, `cleanup-content.ts`, `presplit-defects.ts`.
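One way the `N/M` split could work is sketched below. The helper names (`parsePartition`, `selectPartition`, `PartitionSpec`) are hypothetical, and contiguous slicing is an assumption based on the "first third / second / last" example; the actual scripts may parse the flag or assign entries differently.

```typescript
// Hypothetical helpers; the real scripts may parse and slice differently.
interface PartitionSpec {
  index: number; // 1-based partition number (N)
  total: number; // number of partitions (M)
}

function parsePartition(flag: string): PartitionSpec {
  const match = /^(\d+)\/(\d+)$/.exec(flag);
  if (!match) throw new Error(`Expected N/M, got "${flag}"`);
  const index = Number(match[1]);
  const total = Number(match[2]);
  if (total < 1 || index < 1 || index > total) {
    throw new Error(`Partition ${flag} is out of range`);
  }
  return { index, total };
}

// Contiguous slicing: partition 1 gets the first chunk, partition M the last.
// Every item lands in exactly one partition.
function selectPartition<T>(items: T[], p: PartitionSpec): T[] {
  const start = Math.floor(((p.index - 1) * items.length) / p.total);
  const end = Math.floor((p.index * items.length) / p.total);
  return items.slice(start, end);
}
```

When the entry count is not divisible by M, partition sizes in this sketch differ by at most one, and the partitions together still cover every entry exactly once.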
What is the JSONL intermediary pattern and why does every script use it?
- Scripts write intermediate results to `.jsonl` files before database insertion, enabling dry-run preview, cost estimation, interrupted-run resumption, and audit trail.
- Pattern: generation phase writes to `.jsonl` in a local data directory (e.g., `scripts/seed/.challenge-data/`, `scripts/seed/.llm-test-data/`), then a separate upsert phase reads the `.jsonl` and inserts to DB.
- Example: `rewrite-challenge-defects.ts` writes to `.challenge-data/challenges-rewritten.jsonl`, then `upsert-rewritten-challenges.ts` reads it and applies to DB.
- Decouples AI generation cost from DB insertion, so you can review results before committing them.
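The two phases can be illustrated with a pair of pure functions. This is a minimal sketch: the `GeneratedRecord` shape and both function names are hypothetical, file I/O is left out, and skipping a truncated final line is an assumption about how an interrupted run might be tolerated, not documented behavior of the real scripts.

```typescript
// Hypothetical record shape; real scripts define their own fields.
interface GeneratedRecord {
  id: string;
  payload: unknown;
}

// Generation phase: one JSON document per line, newline-terminated.
function toJsonl(records: GeneratedRecord[]): string {
  return records.map((r) => JSON.stringify(r) + "\n").join("");
}

// Upsert phase: parse line by line, skipping blank lines and any line that
// does not parse (an interrupted run can leave a partially written record).
function fromJsonl(text: string): GeneratedRecord[] {
  const records: GeneratedRecord[] = [];
  for (const line of text.split("\n")) {
    if (line.trim() === "") continue;
    try {
      records.push(JSON.parse(line) as GeneratedRecord);
    } catch {
      // Skipped here; such a record would need to be regenerated.
    }
  }
  return records;
}
```

Because each record is a complete line, the intermediate file can be inspected or diffed with ordinary text tools before the upsert step runs.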
How do I estimate the cost of a seeding run before starting?
- Use `--dry-run` flag (on `seed-from-files.ts`) to preview what would be generated without making API calls.
- Cost table from SEED.md: entity generation ~$0.002/entity, fact explosion ~$0.01/entity, challenge content ~$0.006/fact, content cleanup ~$0.004/fact.
- Typical curated-seed run (500 entities, medium richness): ~$68 total ($1 generation + $5 explosion + $2 validation + $60 challenges).
- Volume controls in SEED.md: `max_entities`, `max_facts`, `richness_tier`, `challenge_styles` (fewer styles = lower cost).
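The arithmetic behind the ~$68 figure can be reproduced with a small estimator. The per-unit rates are the ones quoted above; `RunPlan`, `estimateCost`, and the flat `validationCost` input are hypothetical, since no per-unit validation rate is given.

```typescript
// Per-unit rates quoted from the SEED.md cost table above (USD).
const RATES = {
  entityGeneration: 0.002, // per entity
  factExplosion: 0.01, // per entity
  challengeContent: 0.006, // per fact
};

interface RunPlan {
  entities: number;
  factsPerEntity: number; // assumption: depends on the richness tier
  validationCost: number; // assumption: flat figure, no per-unit rate is documented
}

function estimateCost(plan: RunPlan) {
  const facts = plan.entities * plan.factsPerEntity;
  const generation = plan.entities * RATES.entityGeneration;
  const explosion = plan.entities * RATES.factExplosion;
  const challenges = facts * RATES.challengeContent;
  const total = generation + explosion + plan.validationCost + challenges;
  return { facts, generation, explosion, challenges, total };
}

// 500 entities at 20 facts each, plus the $2 validation figure from the example.
const typicalRun = estimateCost({ entities: 500, factsPerEntity: 20, validationCost: 2 });
```

With 500 entities, the $60 challenge figure corresponds to 10,000 facts, or 20 facts per entity (the low end of the medium tier), so `typicalRun.total` comes out at about $68. Challenge content dominates the total, which is why reducing `challenge_styles` or `max_facts` has the largest effect.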
Scripts
What does each script in scripts/seed/ do?
- `generate-curated-entries.ts` -- Generate entity lists for topics from `packages/ai/src/config/categories.ts` (44 specs across 33 roots).
- `bulk-enqueue.ts` -- Batch-enqueue `EXPLODE_CATEGORY_ENTRY` messages for all pending entries in `seed_entry_queue`.
- `seed-from-files.ts` -- Full orchestrator: parse CSV/XLSX/DOCX -> enqueue explosion -> spinoffs -> super-facts.
- `generate-challenge-content.ts` -- Generate/rewrite challenge content for existing facts (supports `--partition`).
- `rewrite-challenge-defects.ts` -- Fix CQ-rule violations in existing challenge content (supports `--partition`).
- `improve-titles.ts` -- Rewrite weak titles and challenge_titles.
- `llm-fact-quality-testing.ts` -- 5-phase pipeline to compare AI models across quality dimensions. Supports `--commit` for local Supabase writes and `--models` for any registered model (see model testing).
- `cleanup-content.ts`, `backfill-fact-nulls.ts`, `materialize-entity-categories.ts`, `regen-voice-pass.ts` -- Maintenance and backfill utilities.
- Full index: `scripts/script-index.md`.
How do I generate curated entries and what controls their volume?
- Command: `bun scripts/seed/generate-curated-entries.ts --category <slug> --insert`.
- Volume controlled by `packages/ai/src/config/categories.ts` (44 CategorySpec definitions): each subcategory has a `count` (entities to generate, typically 25-200) and a `prompt`.
- Richness tier (set in SEED.md `volume.richness_tier`): `high` (50-100 facts/entity), `medium` (20-50), `low` (10-20).
- Topic presets available in SEED.md: `all-active`, `news-popular`, `trivia-deep`.
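To see how `count` and the richness tier combine into output volume, the tier ranges above can be turned into a quick bounds calculation. The helper `expectedFactRange` and the `FACTS_PER_ENTITY` table are hypothetical names; only the per-tier ranges come from the list above.

```typescript
type RichnessTier = "high" | "medium" | "low";

// Facts-per-entity ranges as listed above for each richness tier.
const FACTS_PER_ENTITY: Record<RichnessTier, [number, number]> = {
  high: [50, 100],
  medium: [20, 50],
  low: [10, 20],
};

// Hypothetical helper: lower and upper bounds on total facts for an entity count.
function expectedFactRange(tier: RichnessTier, entityCount: number) {
  const [min, max] = FACTS_PER_ENTITY[tier];
  return { min: min * entityCount, max: max * entityCount };
}
```

For example, a subcategory with a `count` of 200 at `high` richness would be expected to yield between 10,000 and 20,000 facts under these ranges, before any `max_facts` cap is applied.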
How does seed-from-files.ts work for importing from CSV/XLSX/DOCX?
- Full pipeline orchestrator with stages: `--parse` (file -> `seed_entry_queue`), `--explode` (process pending entries), `--explode-spinoffs`, `--super-facts`, `--all` (run everything).
- Command: `bun scripts/seed/seed-from-files.ts --parse --file data.csv --topic history --insert`.
- Supports CSV, XLSX, DOCX via dedicated parsers in `scripts/seed/lib/parsers/`.
- Additional flags: `--dry-run` (preview), `--budget <dollars>` (AI spend cap), `--resume` (skip completed), `--batch-size <n>`.
Troubleshooting
How do I resume a seeding run that was interrupted?
- JSONL intermediary files in local data directories (e.g., `.challenge-data/`) preserve generation progress -- re-running the script skips already-written output.
- For `seed-from-files.ts`: use `--resume` flag to skip entries with `status = 'exploded'` in `seed_entry_queue`.
- For queue-based explosion: un-exploded entries remain `status = 'pending'` and will be picked up by the next `bulk-enqueue.ts` run.
- Check DLQ (`queue:explode_category_entry:dlq`) for entries that failed 3 times and need manual intervention.
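Both resumption mechanisms above reduce to filtering out work that is already done. The sketch below shows the idea; `SeedEntry`, `entriesToProcess`, and `remainingIds` are hypothetical names, and only the `pending` and `exploded` statuses are taken from the text (the real table may have more).

```typescript
// Only these two statuses are mentioned above; the real column may allow others.
type EntryStatus = "pending" | "exploded";

interface SeedEntry {
  id: string;
  status: EntryStatus;
}

// --resume style filtering: keep only entries that still need explosion.
function entriesToProcess(entries: SeedEntry[]): SeedEntry[] {
  return entries.filter((e) => e.status !== "exploded");
}

// JSONL-style resumption: skip ids that already have output written.
function remainingIds(allIds: string[], writtenIds: string[]): string[] {
  const done: Record<string, true> = {};
  for (const id of writtenIds) done[id] = true;
  return allIds.filter((id) => done[id] !== true);
}
```

In both cases a re-run does no duplicate AI work, so an interruption costs only the item that was in flight.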
What controls are available in SEED.md?
- SEED.md (`docs/projects/seeding/SEED.md`): seeding mode (`news-only`, `curated-seed`, `evergreen-boost`, `full-pipeline`), topic directives, volume controls, quality controls, execution concurrency.
- Seed controls are now documented in SEED.md rather than in a runtime config file.
- Volume knobs: `richness_tier`, `max_entities`, `max_facts`, `challenge_styles`, `challenge_difficulty`.
See Also
- SEED.md -- Seeding control directives
- Scripts: `scripts/seed/`