Seeding Prompts

Prompts for bootstrapping topic categories with curated seed entries, managing the explosion pipeline, and running batch seeding operations.

Model routing: All seeding scripts use the Eko model router (selectModelForTask()) to pick models by task tier — they are not hardcoded to a single provider. The seed_explosion task maps to the default tier via TASK_TIER_MAP in packages/ai/src/fact-engine.ts. The tier-to-model mapping is read from the ai_model_tier_config DB table (60s cache), falling back to DEFAULT_TIER_CONFIG_DATA in code. The API key you need depends on which model the default tier resolves to. See the model testing index for the full list of models and their required env vars.
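
The lookup order described above (task -> tier -> DB config -> in-code fallback) can be sketched roughly as follows. This is an illustrative sketch only: the names TASK_TIER_SKETCH, FALLBACK_TIER_CONFIG, isCacheFresh, and resolveModel are hypothetical, the model ids are placeholders, and the real selectModelForTask() in packages/ai may be structured quite differently.

```typescript
// Tier name -> model id. The real config shape is an assumption.
type TierConfig = Record<string, string>;

// Stand-in for TASK_TIER_MAP: seed_explosion maps to the default tier.
const TASK_TIER_SKETCH: Record<string, string> = { seed_explosion: "default" };

// Stand-in for DEFAULT_TIER_CONFIG_DATA; the model id is a placeholder.
const FALLBACK_MODEL = "placeholder-model-id";
const FALLBACK_TIER_CONFIG: TierConfig = { default: FALLBACK_MODEL };

// The DB-backed tier config is cached for 60 seconds.
const CACHE_TTL_MS = 60_000;

// True while a cached DB read is still within the 60s window.
function isCacheFresh(loadedAtMs: number, nowMs: number): boolean {
  return nowMs - loadedAtMs < CACHE_TTL_MS;
}

// dbConfig is the (possibly cached) ai_model_tier_config contents,
// or null when the table could not be read.
function resolveModel(task: string, dbConfig: TierConfig | null): string {
  const tier = TASK_TIER_SKETCH[task] ?? "default";
  const fromDb = dbConfig?.[tier];
  if (fromDb !== undefined) return fromDb;
  return FALLBACK_TIER_CONFIG[tier] ?? FALLBACK_MODEL;
}
```

Under these assumptions, changing the row for the default tier in the DB changes which provider key the seeding scripts need within about a minute, without a code deploy.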

Prompts

#    Prompt                        Cost                 Duration
1    Seed the Database             ~$68                 1-3 hours
2    Generate Curated Entries      ~$2-5                10-30 min
3    Bulk Enqueue Entries          $0                   2-5 min
4    Generate Challenge Content    $20-100              4-12 hours
5    Improve Titles                ~$0.003/fact         varies
6    Find Super Facts              ~$0.01/fact          varies
7    Run Voice Enforcement         ~$0.005/fact         varies
8    Seed from Files               $0                   5-15 min
9    Tune Challenge Allocation     $0                   5-10 min
10   Rewrite Challenge Defects     ~$0.003/challenge    varies

FAQ

Concepts

What is the seeding pipeline and when do I use it?

  • Bootstraps new topic categories with high-quality AI-generated content when news/evergreen alone is insufficient.
  • Pipeline: generate-curated-entries.ts -> DB seed_entry_queue -> EXPLODE_CATEGORY_ENTRY (worker-facts) -> facts -> VALIDATE_FACT -> GENERATE_CHALLENGE_CONTENT.
  • Use when a topic is empty or needs deeper coverage beyond automated news ingestion.
  • 44 CategorySpec definitions in packages/ai/src/config/categories.ts cover all 33 active root categories (~49,000 seed entries total across 197 subcategories).
  • Controlled via docs/projects/seeding/SEED.md directives -- seeding mode, topic targets, volume caps, cost controls.

What is the difference between curated entries, file seeds, and spinoff discoveries?

  • Curated entries: AI-generated entity lists for a topic (e.g., "Notable physicists") -- the seed input stored in seed_entry_queue.
  • file_seed: Primary facts produced by exploding a curated entry (10-100 facts per entity depending on richness tier).
  • spinoff_discovery: Tangential entities discovered during explosion (cross-references, related figures) -- re-inserted into seed_entry_queue for their own explosion pass.
  • All three are source_type values on fact_records, enabling provenance tracking through the pipeline.

What are super facts and how does cross-entry correlation work?

  • Deprecated: the Super Facts pipeline was removed in March 2026 and its handler no longer exists. The points below describe the former behavior.
  • Super facts connected multiple seed entries (e.g., "Both Einstein and Bohr studied at ETH Zurich").
  • Queue: FIND_SUPER_FACTS -> worker-facts -> AI compared facts across entries within a batch to find meaningful connections.
  • Results were stored with source_type = 'ai_super_fact' and linked to 2-3 entities via the super_fact_links table.
  • Previously triggered via seed-from-files.ts --super-facts or automatically after batch explosion.

Running Seeds

How do I seed a single topic category from scratch?

  • Step 1: Generate curated entries: bun scripts/seed/generate-curated-entries.ts --category science --insert.
  • Step 2: Enqueue explosion: bun scripts/seed/bulk-enqueue.ts (processes all pending entries in seed_entry_queue).
  • Step 3: Start workers with higher concurrency: WORKER_CONCURRENCY=5 bun run dev:worker-facts.
  • Step 4: Monitor via admin dashboard queue page or query fact_records count for the topic.

How does the partition flag work for parallel seeding?

  • --partition N/M splits entries into M roughly equal partitions and processes only partition N (N is 1-based).
  • Example: --partition 1/3 processes the first third, --partition 2/3 the second, --partition 3/3 the last.
  • Allows running multiple terminal sessions in parallel for faster seeding of large batches.
  • Supported by: generate-challenge-content.ts, rewrite-challenge-defects.ts, regen-voice-pass.ts, cleanup-content.ts, presplit-defects.ts.
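
A minimal sketch of how an N/M partition flag could be parsed and applied, assuming contiguous slices (which matches the "first third / second / last" example above). The function names parsePartition and selectPartition are hypothetical; the actual scripts may split entries differently, for example by id modulo M.

```typescript
interface Partition {
  index: number; // 1-based partition number (N)
  total: number; // number of partitions (M)
}

// Parse a flag value such as "2/3" into { index: 2, total: 3 }.
function parsePartition(flag: string): Partition {
  const match = /^(\d+)\/(\d+)$/.exec(flag);
  if (!match) throw new Error(`Expected N/M, got "${flag}"`);
  const index = Number(match[1]);
  const total = Number(match[2]);
  if (total < 1 || index < 1 || index > total) {
    throw new Error(`Partition ${flag} is out of range`);
  }
  return { index, total };
}

// Return the contiguous slice of items belonging to the given partition.
// Boundaries use floor division, so every item lands in exactly one slice
// even when the item count is not divisible by M.
function selectPartition<T>(items: readonly T[], p: Partition): T[] {
  const start = Math.floor(((p.index - 1) * items.length) / p.total);
  const end = Math.floor((p.index * items.length) / p.total);
  return items.slice(start, end);
}
```

With ten entries and M = 3, this yields slices of 3, 3, and 4 entries, so three terminal sessions running 1/3, 2/3, and 3/3 cover every entry once with no overlap.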

What is the JSONL intermediary pattern and why does every script use it?

  • Scripts write intermediate results to .jsonl files before database insertion, enabling dry-run preview, cost estimation, interrupted-run resumption, and audit trail.
  • Pattern: generation phase writes to .jsonl in a local data directory (e.g., scripts/seed/.challenge-data/, scripts/seed/.llm-test-data/), then a separate upsert phase reads the .jsonl and inserts to DB.
  • Example: rewrite-challenge-defects.ts writes to .challenge-data/challenges-rewritten.jsonl, then upsert-rewritten-challenges.ts reads it and applies to DB.
  • Decouples AI generation cost from DB insertion, so you can review results before committing them.
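
The core of the pattern can be sketched with a few pure helpers: one record per line on the generation side, and a parser on the upsert side that also tells a re-run which ids are already done. The RewrittenChallenge shape and the helper names are assumptions for illustration; file reads and writes (for example to .challenge-data/challenges-rewritten.jsonl) are left out so the sketch stays self-contained.

```typescript
// Hypothetical record shape; the real scripts store more fields.
interface RewrittenChallenge {
  id: string;
  text: string;
}

// Generation phase: serialize one record as one line.
// JSON.stringify escapes embedded newlines, so a record never spans lines.
function toJsonlLine(record: RewrittenChallenge): string {
  return JSON.stringify(record) + "\n";
}

// Upsert phase: parse the file contents back into records,
// ignoring blank lines such as the trailing newline.
function parseJsonl(contents: string): RewrittenChallenge[] {
  return contents
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as RewrittenChallenge);
}

// Resumption: ids that still need generation after an interrupted run.
function remainingIds(allIds: readonly string[], contents: string): string[] {
  const done = new Set(parseJsonl(contents).map((r) => r.id));
  return allIds.filter((id) => !done.has(id));
}
```

Because each line is appended independently, an interrupted run leaves a valid file up to the last complete record, which is what makes review and resumption cheap.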

How do I estimate the cost of a seeding run before starting?

  • Use the --dry-run flag on seed-from-files.ts to preview what would be generated without making API calls.
  • Cost table from SEED.md: entity generation ~$0.002/entity, fact explosion ~$0.01/entity, challenge content ~$0.006/fact, content cleanup ~$0.004/fact.
  • Typical curated-seed run (500 entities, medium richness): ~$68 total ($1 generation + $5 explosion + $2 validation + $60 challenges).
  • Volume controls in SEED.md: max_entities, max_facts, richness_tier, challenge_styles (fewer styles = lower cost).
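
The ~$68 figure can be reproduced from the per-unit rates with a small calculation. This is a sketch under stated assumptions: 20 facts per entity (the low end of the medium tier, which is what the $60 challenge figure implies at $0.006/fact) and validation passed in as a flat $2, since the cost table lists no per-unit validation rate. The function name estimateSeedCost is hypothetical.

```typescript
// Per-unit rates quoted from the SEED.md cost table (USD).
const RATES = {
  entityGeneration: 0.002, // per entity
  factExplosion: 0.01, // per entity
  challengeContent: 0.006, // per fact
};

interface CostEstimate {
  generation: number;
  explosion: number;
  validation: number;
  challenges: number;
  total: number;
}

// Round to whole cents to hide floating-point noise.
function roundToCents(usd: number): number {
  return Math.round(usd * 100) / 100;
}

// validationUsd is a flat input because no per-unit rate is documented.
function estimateSeedCost(
  entities: number,
  factsPerEntity: number,
  validationUsd: number,
): CostEstimate {
  const generation = entities * RATES.entityGeneration;
  const explosion = entities * RATES.factExplosion;
  const challenges = entities * factsPerEntity * RATES.challengeContent;
  const total = generation + explosion + validationUsd + challenges;
  return {
    generation: roundToCents(generation),
    explosion: roundToCents(explosion),
    validation: roundToCents(validationUsd),
    challenges: roundToCents(challenges),
    total: roundToCents(total),
  };
}

// 500 entities, 20 facts each, $2 validation.
const typicalRun = estimateSeedCost(500, 20, 2);
```

Challenge content dominates the total, so lowering factsPerEntity (via richness_tier or max_facts) or trimming challenge_styles moves the estimate far more than reducing entity count alone.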

Scripts

What does each script in scripts/seed/ do?

  • generate-curated-entries.ts -- Generate entity lists for topics from packages/ai/src/config/categories.ts (44 specs across 33 roots).
  • bulk-enqueue.ts -- Batch-enqueue EXPLODE_CATEGORY_ENTRY messages for all pending entries in seed_entry_queue.
  • seed-from-files.ts -- Full orchestrator: parse CSV/XLSX/DOCX -> enqueue explosion -> spinoffs -> super-facts (deprecated; the Super Facts pipeline was removed in March 2026).
  • generate-challenge-content.ts -- Generate/rewrite challenge content for existing facts (supports --partition).
  • rewrite-challenge-defects.ts -- Fix CQ-rule violations in existing challenge content (supports --partition).
  • improve-titles.ts -- Rewrite weak titles and challenge_titles.
  • llm-fact-quality-testing.ts -- 5-phase pipeline to compare AI models across quality dimensions. Supports --commit for local Supabase writes and --models for any registered model (see model testing).
  • cleanup-content.ts, backfill-fact-nulls.ts, materialize-entity-categories.ts, regen-voice-pass.ts -- Maintenance and backfill utilities.
  • Full index: scripts/script-index.md.

How do I generate curated entries and what controls their volume?

  • Command: bun scripts/seed/generate-curated-entries.ts --category <slug> --insert.
  • Volume controlled by packages/ai/src/config/categories.ts (44 CategorySpec definitions): each subcategory has a count (entities to generate, typically 25-200) and a prompt.
  • Richness tier (set in SEED.md volume.richness_tier): high (50-100 facts/entity), medium (20-50), low (10-20).
  • Topic presets available in SEED.md: all-active, news-popular, trivia-deep.

How does seed-from-files.ts work for importing from CSV/XLSX/DOCX?

  • Full pipeline orchestrator with stages: --parse (file -> seed_entry_queue), --explode (process pending entries), --explode-spinoffs, --super-facts (deprecated; the Super Facts pipeline was removed in March 2026), --all (run everything).
  • Command: bun scripts/seed/seed-from-files.ts --parse --file data.csv --topic history --insert.
  • Supports CSV, XLSX, DOCX via dedicated parsers in scripts/seed/lib/parsers/.
  • Additional flags: --dry-run (preview), --budget <dollars> (AI spend cap), --resume (skip completed), --batch-size <n>.

Troubleshooting

How do I resume a seeding run that was interrupted?

  • JSONL intermediary files in local data directories (e.g., .challenge-data/) preserve generation progress -- re-running the script skips already-written output.
  • For seed-from-files.ts: use --resume flag to skip entries with status = 'exploded' in seed_entry_queue.
  • For queue-based explosion: un-exploded entries remain status = 'pending' and will be picked up by the next bulk-enqueue.ts run.
  • Check DLQ (queue:explode_category_entry:dlq) for entries that failed 3 times and need manual intervention.
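
The resume rules above amount to a three-way triage of queue entries. The sketch below assumes a hypothetical row shape with a failedAttempts counter; the real seed_entry_queue columns and the queue's retry bookkeeping are not shown in this document and may differ.

```typescript
// Hypothetical row shape; the real seed_entry_queue columns may differ.
interface SeedEntry {
  id: string;
  status: "pending" | "exploded";
  failedAttempts: number; // assumed counter, not confirmed by the source
}

// Entries land in the DLQ after this many failures.
const MAX_ATTEMPTS = 3;

interface ResumePlan {
  skip: string[]; // already exploded, nothing to do
  retry: string[]; // still pending, picked up by the next enqueue
  manualReview: string[]; // exhausted retries, inspect the DLQ
}

// Sort entries into the three buckets described in the list above.
function planResume(entries: readonly SeedEntry[]): ResumePlan {
  const plan: ResumePlan = { skip: [], retry: [], manualReview: [] };
  for (const entry of entries) {
    if (entry.status === "exploded") {
      plan.skip.push(entry.id);
    } else if (entry.failedAttempts >= MAX_ATTEMPTS) {
      plan.manualReview.push(entry.id);
    } else {
      plan.retry.push(entry.id);
    }
  }
  return plan;
}
```

Running a triage like this before re-enqueueing makes it clear how much of an interrupted run is left and which entries will not recover on their own.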

What controls are available in SEED.md?

  • SEED.md (docs/projects/seeding/SEED.md): seeding mode (news-only, curated-seed, evergreen-boost, full-pipeline), topic directives, volume controls, quality controls, execution concurrency.
  • Seed controls are now documented in SEED.md rather than in a runtime config file.
  • Volume knobs: richness_tier, max_entities, max_facts, challenge_styles, challenge_difficulty.

See Also

  • SEED.md -- Seeding control directives
  • Scripts: scripts/seed/