Seeding Control Prompt

Drop this file into a Claude Code session to direct seeding operations. Edit the sections below to control what gets seeded, how much, and at what quality level.

How it works: A Claude session reads this file, interprets your directives, and executes the appropriate scripts and commands. You control the what and how much — the system handles the how.


Quick Start

# In a Claude Code session, say:
# "Read docs/projects/seeding/SEED.md and execute my seeding directives"

Seeding Mode

Pick one. Uncomment the mode you want.

# mode: news-only
#   Only automated news ingestion runs. No manual seeding.
#   Good for: steady-state operations, low cost.

mode: curated-seed
#   Generate curated entity entries for specified topics, explode into facts,
#   validate, and generate challenge content.
#   Good for: building deep topic coverage.

# mode: evergreen-boost
#   Enable daily AI-generated evergreen facts in addition to news.
#   Good for: supplementing news with timeless content.

# mode: full-pipeline
#   News + curated seeding + evergreen + cleanup. Everything runs.
#   Good for: initial corpus build or major expansion.

Topic Directives

Specify which topics to seed and how deeply. Only topics listed here will be included in manual seeding runs. News ingestion always covers all active root topics.

topics:
  # Format: slug: count (entities to generate per subcategory)
  # Higher count = more entities = more facts = higher cost
  # Typical: 50-150 per subcategory

  science:
    enabled: true
    priority: high          # high | medium | low (controls processing order)
    subcategories:
      - name: "Physics & Space"
        count: 100
      - name: "Biology & Medicine"
        count: 100
      - name: "Chemistry & Materials"
        count: 75
      - name: "Scientific Instruments & Methods"
        count: 50
      - name: "Nobel Prize Winners & Discoveries"
        count: 75

  history:
    enabled: true
    priority: high
    subcategories:
      - name: "Ancient Civilizations"
        count: 150
      - name: "Medieval & Renaissance"
        count: 100
      - name: "Modern History"
        count: 150
      - name: "Post-War & Contemporary"
        count: 100
      - name: "Historic Figures"
        count: 200

  # geography:
  #   enabled: true
  #   priority: medium
  #   subcategories:
  #     - name: "Countries & Capitals"
  #       count: 200
  #     - name: "Natural Wonders"
  #       count: 80

  # culture:
  #   enabled: false    # Skip this topic entirely

  # --- Depth-2 targeting (future) ---
  # When depth-2 pipeline support is wired, you can target leaf subcategories:
  # science:
  #   subcategories:
  #     - name: "Physics & Space"
  #       depth2:
  #         - name: "Quantum Mechanics"
  #           count: 30
  #         - name: "Astrophysics"
  #           count: 30

44 root topics are available. The full list of topic slugs and their subcategories is defined in packages/ai/src/config/categories.ts (source of truth).

4-Level Category Hierarchy

The topic taxonomy supports 4 levels of depth (0–3):

| Depth | Role | Count | Example |
| --- | --- | --- | --- |
| 0 | root | 44 | science |
| 1 | subcategory | 162 | Physics & Space |
| 2 | leaf | DB-derived | Quantum Mechanics |
| 3 | sub-leaf | DB-derived | (materialized from entities) |

Current behavior: Seeding targets depth 0→1. AI extraction then classifies facts to the deepest matching level (depth 0–3), as instructed in the extraction prompt. The descendant query (getDescendantCategoriesForParent) walks 3 levels deep (children → children → children). Depth-2 and depth-3 categories are materialized in the database by materialize-entity-categories.ts rather than defined in categories.ts.
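The 3-level walk can be pictured with a small in-memory sketch. This is an illustrative stand-in only; the real getDescendantCategoriesForParent queries the category table, and the tree shape here is invented for the example:

```typescript
// Illustrative 3-level descendant walk: collects every category up to
// 3 levels below the given root, mirroring children → children → children.
type Category = { slug: string; children?: Category[] };

function descendants(root: Category, maxDepth = 3): string[] {
  const out: string[] = [];
  const walk = (nodes: Category[] | undefined, depth: number): void => {
    if (!nodes || depth > maxDepth) return;
    for (const node of nodes) {
      out.push(node.slug);
      walk(node.children, depth + 1);
    }
  };
  walk(root.children, 1);
  return out;
}
```

A node 4 levels below the root is skipped, matching the documented 3-level limit.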

Full taxonomy: See docs/spreadsheets/category-taxonomy.md for all categories (note: the taxonomy doc may lag behind categories.ts — run bun run taxonomy:completeness-check to verify).

Topic Presets

Instead of listing individual topics, you can use a preset:

# preset: all-active
#   Seeds all active root topics using CATEGORIES defaults.
#   Equivalent to: bun scripts/seed/generate-curated-entries.ts --insert

# preset: news-popular
#   Seeds only topics that appear frequently in news APIs:
#   current-events, technology, business, sports, science, entertainment

# preset: trivia-deep
#   Deep seeding of knowledge/trivia-heavy topics:
#   history, science, geography, culture, records, sports, animals, space

Volume Controls

volume:
  # Richness tier overrides (controls facts generated per entity)
  #   high:   50-100 facts per entity (entertainment, sports, people)
  #   medium: 20-50 facts per entity (geography, science, animals)
  #   low:    10-20 facts per entity (business, design, fashion)
  richness_tier: medium         # default tier for all topics

  # Maximum entities to seed in this run
  max_entities: 500             # null = no limit (use CATEGORIES counts)

  # Maximum total facts to generate (safety cap)
  max_facts: 25000              # null = no limit

  # Difficulty level for challenge content (0-5)
  #   0 = balanced spread (generates all 5 levels evenly)
  #   1 = easy (default), 2 = moderate, 3 = hard, 4 = very hard, 5 = expert
  challenge_difficulty: 0       # 0=balanced spread across all levels

  # Which challenge styles to generate per fact (default: all 5)
  #   mc  = multiple_choice    (4 options, most engaging)
  #   dq  = direct_question    (classic quiz format)
  #   ftg = fill_the_gap       (cloze-style, good for learning)
  #   rl  = reverse_lookup     ("what am I describing?")
  #   ft  = free_text          (open-ended, hardest to auto-grade)
  # Fewer styles = lower cost. 3 core styles (mc,dq,ftg) cover most use cases.
  challenge_styles: mc,dq,ftg,rl,ft      # all 5 (default)
  # challenge_styles: mc,dq,ftg          # budget-friendly: 3 core styles (~40% less cost)

Quality Controls

quality:
  # Minimum notability score for keeping a fact (0.0 - 1.0)
  notability_threshold: 0.6     # default: 0.6

  # Validation strategy for AI-generated facts
  # Uses 4-phase pipeline: structural → consistency → cross-model → evidence
  validation_strategy: multi_phase      # multi_phase (default) | legacy_ai_cross_check

  # Run content cleanup after seeding?
  cleanup_after_seed: false     # true = rewrite titles/context for consistency

  # Generate challenge content automatically?
  generate_challenges: true     # true = generate 5 quiz styles per fact

  # CQ-002 enforcement (second-person address in challenges)
  patch_cq002: true             # true = auto-patch "you/your" into challenges

Evergreen Controls

Only applies when mode is evergreen-boost or full-pipeline.

evergreen:
  enabled: false                # Set true to enable daily AI fact generation
  daily_quota: 20               # Facts per day across all topics
  distribution:                 # Optional: override per-topic share
    science: 25%                # 5 facts/day
    history: 25%                # 5 facts/day
    geography: 15%              # 3 facts/day
    # Remaining 35% split equally among other active topics
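The distribution above works out as follows. This is a minimal sketch; the helper name and the round-to-nearest behavior are assumptions, not the pipeline's actual code:

```typescript
// Turn the percentage distribution into per-topic daily quotas.
// Shares are fractions of dailyQuota; whatever is left over is the pool
// split equally among the other active topics.
function evergreenQuotas(
  dailyQuota: number,
  shares: Record<string, number>,
): { perTopic: Record<string, number>; remaining: number } {
  const perTopic: Record<string, number> = {};
  let assigned = 0;
  for (const [topic, share] of Object.entries(shares)) {
    perTopic[topic] = Math.round(dailyQuota * share);
    assigned += perTopic[topic];
  }
  return { perTopic, remaining: dailyQuota - assigned };
}
```

With daily_quota: 20 and the shares above, this yields 5 science, 5 history, 3 geography, and a remainder of 7 facts/day (the 35%) for the other active topics.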

Execution Controls

execution:
  # Concurrency for AI calls (per script instance)
  concurrency: 5                # 1-10, higher = faster but more API pressure

  # Number of parallel partitions for large runs
  partitions: 4                 # 1-8, each partition runs independently

  # Dry run first?
  dry_run_first: true           # true = preview before executing

  # Auto-upload to DB after generation?
  auto_upload: false            # true = automatically upsert to DB
                                # false = generate to JSONL, wait for manual upload

  # AI model preference (DB-driven via ai_model_tier_config)
  # Eligible models (97% threshold passed): run llm-fact-quality-testing.ts to verify
  # Default: gemini-3-flash-preview (promoted Feb 25, 2026; regressed Mar 4 — retest before use)
  # Note: gpt-5-mini fails on voice/style dimensions — not eligible for production seeding
  # Model routing is DB-driven — change via SQL UPDATE, no restart needed
  # Results: scripts/seed/.llm-test-data/eligibility.jsonl
  model: gemini-3-flash-preview          # or any model in ai_model_tier_config
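Since routing is read from the database, switching models is a single UPDATE. A hypothetical parameterized statement is sketched below; only the table name (ai_model_tier_config) comes from this doc, and the column names (model_id, tier) are assumptions, so check the real schema before running anything:

```typescript
// Build a parameterized UPDATE for the DB-driven model switch.
// Safe to hand to any Postgres client that supports $n placeholders.
type ParamQuery = { text: string; values: string[] };

function buildModelSwitch(tier: string, modelId: string): ParamQuery {
  return {
    text: "UPDATE ai_model_tier_config SET model_id = $1 WHERE tier = $2",
    values: [modelId, tier],
  };
}
```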

News Ingestion Controls

Controls for the automated news pipeline. Changes here affect environment config.

news:
  # Which providers to use (requires API keys in .env.local)
  providers:
    - newsapi                   # NEWS_API_KEY
    - gnews                     # GOOGLE_NEWS_API_KEY
    - thenewsapi                # THENEWS_API_KEY
    - newsdata                  # NEWSDATA_API_KEY
    - event_registry            # EVENT_REGISTRY_API_KEY

  # Max articles per provider per category per run
  max_results: 20               # default: 20

  # Ingestion interval (cron frequency in minutes)
  interval_minutes: 15          # default: 15

Enrichment Orchestrator

In addition to the primary news providers, the enrichment orchestrator (packages/ai/src/enrichment.ts) injects context from 8 free API sources during fact extraction. This is context injection, not primary article fetching.

| Source | Routing | Purpose |
| --- | --- | --- |
| Knowledge Graph | Always | Entity identification, notability signals |
| Wikidata | Always | Structured facts, identifiers |
| Wikipedia | Always | Summary context, descriptions |
| GDELT 2.0 | Always | Global event context, media mentions |
| TheSportsDB | sports/* topics | Team/player data |
| MusicBrainz | music/* topics | Artist/album metadata |
| Nominatim | geography/* topics | Location geocoding |
| Open Library | books/* topics | Book/author data |

All enrichment calls use Promise.allSettled() — a failing API never blocks extraction.
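A minimal sketch of that pattern (the fetcher shape is illustrative; the real implementation lives in packages/ai/src/enrichment.ts):

```typescript
// Fan out enrichment calls with Promise.allSettled so one failing source
// never blocks fact extraction: rejected sources are simply dropped.
type EnrichmentResult = { source: string; context: string };

async function enrich(
  fetchers: Record<string, () => Promise<string>>,
): Promise<EnrichmentResult[]> {
  const names = Object.keys(fetchers);
  const settled = await Promise.allSettled(names.map((name) => fetchers[name]()));
  return settled.flatMap((result, i) =>
    result.status === "fulfilled"
      ? [{ source: names[i], context: result.value }]
      : [], // a timeout or API error just removes that source's context
  );
}
```

Unlike Promise.all, allSettled never rejects, so extraction proceeds with whatever context survived.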


Cost Estimates

Approximate costs per operation (using GPT-5 Mini / Gemini 3 Flash Preview):

| Operation | Per-Unit Cost | Example |
| --- | --- | --- |
| Entity generation | ~$0.002/entity | 500 entities = ~$1.00 |
| Fact explosion | ~$0.01/entity | 500 entities = ~$5.00 |
| Challenge content | ~$0.006/fact | 10,000 facts = ~$60.00 |
| Content cleanup | ~$0.004/fact | 10,000 facts = ~$40.00 |
| News extraction | ~$0.003/story | 100 stories/day = ~$0.30/day |
| Evergreen generation | ~$0.005/fact | 20 facts/day = ~$0.10/day |
| Entity category materialization | ~$0.002/entity | 500 entities = ~$1.00 |
| Voice-pass regeneration | ~$0.003/record | 10,000 records = ~$30.00 |
| Model eligibility testing | ~$0.50/model | Per full 5-phase run |

Typical curated-seed run (500 entities, medium richness):

  • Entity generation: ~$1
  • Explosion: ~$5
  • Validation: ~$2
  • Challenge content: ~$60
  • Total: ~$68
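A back-of-envelope check of that total. Rates come from the cost table above; the flat ~$2 validation figure and the 20 facts/entity assumption (the low end of medium richness, giving 10,000 facts from 500 entities) are taken from this example, not from the pipeline:

```typescript
// Rough cost model for a curated-seed run.
const RATES = { entityGen: 0.002, explosion: 0.01, challenge: 0.006 };

function curatedSeedCost(
  entities: number,
  factsPerEntity: number,
  validationFlat = 2, // flat estimate; validation has no per-unit row
): number {
  const facts = entities * factsPerEntity;
  return (
    entities * RATES.entityGen + // entity generation
    entities * RATES.explosion + // fact explosion
    validationFlat +             // validation
    facts * RATES.challenge      // challenge content, all 5 styles
  );
}
```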

Daily Production Budget ($20/day) — Split Routing

The pipeline uses split model routing to maximize throughput within a $20/day AI budget. GPT-5.4 Nano handles cost-efficient bulk generation; topics where it produces poor quality are routed to Gemini 3 Flash Preview instead.

Per-model costs:

| Model | Input $/MTok | Output $/MTok | Cost/Fact | Cost/Challenge |
| --- | --- | --- | --- | --- |
| gpt-5.4-nano | $0.20 | $1.25 | $0.006 | $0.0048 |
| gemini-3-flash-preview | $0.50 | $3.00 | $0.015 | $0.0106 |

Topic routing (as of March 30, 2026):

| Topics | % of entities | Model | Reason |
| --- | --- | --- | --- |
| Sports, Music, Science | ~60% | gemini-3-flash-preview | Nano fabricates sports stats, struggles with music nuance, fails science cross-challenge isolation |
| All other topics | ~40% | gpt-5.4-nano | Good quality at 2.2x lower cost |

Throughput at $20/day:

| Scenario | Weighted cost/challenge | Challenges/day | Facts/day |
| --- | --- | --- | --- |
| All Gemini | $0.0106 | ~1,887 | ~86 |
| Split (sports+music→Gemini) | $0.0076 | ~2,635 | ~120 |
| Split (sports+music+science→Gemini) | $0.0083 | ~2,415 | ~110 |

Adding science to the Gemini handoff reduces throughput by ~8% (~10 fewer facts/day) but eliminates wasted budget on science content that fails validation (46.7% → 85% validation rate improvement).
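The weighted-cost arithmetic behind those rows can be sketched as follows (the ~60% Gemini share is approximate; the helper name is illustrative):

```typescript
// Weighted cost per challenge and resulting daily throughput for a given
// Gemini traffic share, under the $20/day budget.
const COST_PER_CHALLENGE = {
  "gpt-5.4-nano": 0.0048,
  "gemini-3-flash-preview": 0.0106,
};

function splitThroughput(geminiShare: number, dailyBudget = 20) {
  const weighted =
    geminiShare * COST_PER_CHALLENGE["gemini-3-flash-preview"] +
    (1 - geminiShare) * COST_PER_CHALLENGE["gpt-5.4-nano"];
  return { weighted, challengesPerDay: dailyBudget / weighted };
}
```

At a 60% Gemini share this gives a weighted cost of ~$0.0083/challenge and ~2,415 challenges/day, matching the third row of the throughput table.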


Model Eligibility

Before a model can be used for production seeding, it must pass the eligibility gate. The test pipeline (scripts/seed/llm-fact-quality-testing.ts) evaluates models across 7 quality dimensions and writes results to scripts/seed/.llm-test-data/eligibility.jsonl.

Threshold: All 7 dimensions must score 97% or higher.

Currently eligible: Verify before each run — gemini-3-flash-preview passed Feb 25 but regressed Mar 4; gemini-2.5-flash has mixed results. Run llm-fact-quality-testing.ts --all --models <model> to confirm. Models like gpt-5-mini fail on voice/style adherence dimensions.

| Dimension | What it measures |
| --- | --- |
| validation | Facts pass the multi-phase validation pipeline |
| evidence | Facts corroborated by external evidence |
| challenges | Challenge content passes CQ rules |
| schema_adherence | Output conforms to topic-specific schemas |
| voice_adherence | Matches Eko voice constitution and taxonomy voice |
| style_adherence | Follows per-style rules |
| token_efficiency | Output stays within token budget |
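A sketch of the gate itself. The JSONL record shape here ({ model, scores }) is an assumption; check eligibility.jsonl for the actual schema:

```typescript
// All 7 dimensions must score at or above the 97% threshold.
const DIMENSIONS = [
  "validation", "evidence", "challenges", "schema_adherence",
  "voice_adherence", "style_adherence", "token_efficiency",
] as const;

const THRESHOLD = 0.97;

function isEligible(record: { model: string; scores: Record<string, number> }): boolean {
  // A missing dimension counts as a failure, not a pass.
  return DIMENSIONS.every((dim) => (record.scores[dim] ?? 0) >= THRESHOLD);
}
```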

Each model has a dedicated ModelAdapter (packages/ai/src/models/adapters/) that provides per-model prompt optimizations and signoff guidance for the quality reviewer. See Model Code Isolation for details.


Execution Checklist

When a Claude session reads this file, it should:

  1. Parse the directives above
  2. Verify prerequisites (OPENAI_API_KEY or GOOGLE_API_KEY, DATABASE_URL in .env.local)
  3. Verify model eligibility — confirm the configured model has eligible: true in scripts/seed/.llm-test-data/eligibility.jsonl. If not, run llm-fact-quality-testing.ts --all --models <model> first.
  4. If dry_run_first: true, run --dry-run and report estimates
  5. Wait for user confirmation before proceeding
  6. Execute in order:
     a. Generate curated entries (if mode includes curated seeding)
     b. Bulk enqueue to workers
     c. Monitor explosion progress
     d. Run validation workers
     e. Generate challenge content (if generate_challenges: true)
     f. Run cleanup (if cleanup_after_seed: true)
     g. Upload to DB (if auto_upload: true)
     h. (Optional) Materialize entity categories (materialize-entity-categories.ts --audit, then --classify --insert --link)
  7. Report final stats: facts generated, challenges created, cost, errors

Reference: Available Scripts

| Script | Purpose | Key Flags |
| --- | --- | --- |
| generate-curated-entries.ts | AI-generate entity names | --insert, --category <slug> |
| bulk-enqueue.ts | Dispatch entries to explosion workers | (no flags) |
| generate-challenge-content.ts | Batch challenge generation | --audit, --export, --generate, --upload, --validate, --recover, --dry-run, --limit N, --concurrency N, --partition N/M, --output-suffix NAME, --difficulty N, --styles mc,dq,ftg, --format <slug>, --drift-check, --export-all |
| cleanup-content.ts | Rewrite titles/context | --audit, --export, --fix, --upload, --validate, --dry-run, --limit N, --concurrency N, --partition N/M, --output-suffix NAME |
| backfill-fact-nulls.ts | Fill NULL metadata | --audit, --notability, --challenge-content, --all, --dry-run |
| seed-from-files.ts | Parse XLSX/DOCX/CSV and seed | --parse, --explode, --explode-spinoffs, --super-facts, --stats, --all, --dry-run, --topic <slug>, --budget <dollars>, --resume, --batch-size <n> |
| llm-fact-quality-testing.ts | Model eligibility testing | --all, --generate, --validate, --challenge, --signoff, --report, --models <csv>, --limit N, --concurrency N, --output-dir <name>, --merge-dirs <csv>, --signoff-model <id> |
| materialize-entity-categories.ts | Create leaf categories from entities | --audit, --classify, --insert, --link, --concurrency N, --dry-run |
| regen-voice-pass.ts | Regenerate voice-pass content | --dry-run, --limit N, --partition M/N, --concurrency N, --presplit |
| rewrite-challenge-defects.ts | Fix challenge quality defects | --dry-run, --limit N, --partition M/N, --concurrency N, --presplit |
| presplit-defects.ts | Pre-split defective records into partitions | --partitions N |
| upsert-rewritten-challenges.ts | Upload rewritten challenges to DB | --dry-run |
| cleanup-seed-queue.ts | Remove garbage entries from seed queue | (one-time) |
| deepen-topic-paths.ts | AI-reclassify entries to deeper categories | --dry-run, --limit N, --concurrency N |
| remap-topic-paths.ts | Batch fix malformed topic paths | --dry-run, --limit N |
| improve-titles.ts | Rewrite weak titles | Deprecated — use cleanup-content.ts |

See runbook.md for detailed operational procedures.



Recent Seed Logs

All seed jobs are logged in logs/ with structured frontmatter for tracking.

After each seeding run, create a log file using the template and update the monthly index.