Seeding Control Prompt

Drop this file into a Claude Code session to direct seeding operations. Edit the sections below to control what gets seeded, how much, and at what quality level.

How it works: A Claude session reads this file, interprets your directives, and executes the appropriate scripts and commands. You control the what and how much — the system handles the how.


Quick Start

# In a Claude Code session, say:
# "Read docs/projects/seeding/SEED.md and execute my seeding directives"

Seeding Mode

Pick one. Uncomment the mode you want.

# mode: news-only
#   Only automated news ingestion runs. No manual seeding.
#   Good for: steady-state operations, low cost.

mode: curated-seed
#   Generate curated entity entries for specified topics, explode into facts,
#   validate, and generate challenge content.
#   Good for: building deep topic coverage.

# mode: evergreen-boost
#   Enable daily AI-generated evergreen facts in addition to news.
#   Good for: supplementing news with timeless content.

# mode: full-pipeline
#   News + curated seeding + evergreen + cleanup. Everything runs.
#   Good for: initial corpus build or major expansion.

Topic Directives

Specify which topics to seed and how deeply. Only topics listed here will be included in manual seeding runs. News ingestion always covers all active root topics.

topics:
  # Format: slug: count (entities to generate per subcategory)
  # Higher count = more entities = more facts = higher cost
  # Typical: 50-150 per subcategory

  science:
    enabled: true
    priority: high          # high | medium | low (controls processing order)
    subcategories:
      - name: "Physics & Space"
        count: 100
      - name: "Biology & Medicine"
        count: 100
      - name: "Chemistry & Materials"
        count: 75
      - name: "Scientific Instruments & Methods"
        count: 50
      - name: "Nobel Prize Winners & Discoveries"
        count: 75

  history:
    enabled: true
    priority: high
    subcategories:
      - name: "Ancient Civilizations"
        count: 150
      - name: "Medieval & Renaissance"
        count: 100
      - name: "Modern History"
        count: 150
      - name: "Post-War & Contemporary"
        count: 100
      - name: "Historic Figures"
        count: 200

  # geography:
  #   enabled: true
  #   priority: medium
  #   subcategories:
  #     - name: "Countries & Capitals"
  #       count: 200
  #     - name: "Natural Wonders"
  #       count: 80

  # culture:
  #   enabled: false    # Skip this topic entirely

  # --- Depth-2 targeting (future) ---
  # When depth-2 pipeline support is wired, you can target leaf subcategories:
  # science:
  #   subcategories:
  #     - name: "Physics & Space"
  #       depth2:
  #         - name: "Quantum Mechanics"
  #           count: 30
  #         - name: "Astrophysics"
  #           count: 30

44 root topics are available. The full list of topic slugs and their subcategories is defined in packages/ai/src/config/categories.ts (source of truth).

4-Level Category Hierarchy

The topic taxonomy supports 4 levels of depth (0–3):

| Depth | Role | Count | Example |
| --- | --- | --- | --- |
| 0 | root | 44 | science |
| 1 | subcategory | 162 | Physics & Space |
| 2 | leaf | DB-derived | Quantum Mechanics |
| 3 | sub-leaf | DB-derived | (materialized from entities) |

Current behavior: Seeding targets depth 0→1. AI extraction then classifies facts to the deepest matching level (depth 0–3), as instructed in the extraction prompt. The descendant query (getDescendantCategoriesForParent) walks 3 levels deep (children → children → children). Depth-2 and depth-3 categories are materialized in the database by materialize-entity-categories.ts rather than defined in categories.ts.
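The 3-level walk can be pictured with a small in-memory sketch. This is an illustrative stand-in only; the real getDescendantCategoriesForParent queries the category table, and the tree shape here is invented for the example:

```typescript
// Illustrative 3-level descendant walk: collects every category up to
// 3 levels below the given root, mirroring children → children → children.
type Category = { slug: string; children?: Category[] };

function descendants(root: Category, maxDepth = 3): string[] {
  const out: string[] = [];
  const walk = (nodes: Category[] | undefined, depth: number): void => {
    if (!nodes || depth > maxDepth) return;
    for (const node of nodes) {
      out.push(node.slug);
      walk(node.children, depth + 1);
    }
  };
  walk(root.children, 1);
  return out;
}
```

A node 4 levels below the root is skipped, matching the documented 3-level limit.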

Full taxonomy: See docs/spreadsheets/category-taxonomy.md for all categories (note: the taxonomy doc may lag behind categories.ts — run bun run taxonomy:completeness-check to verify).

Topic Presets

Instead of listing individual topics, you can use a preset:

# preset: all-active
#   Seeds all active root topics using CATEGORIES defaults.
#   Equivalent to: bun scripts/seed/generate-curated-entries.ts --insert

# preset: news-popular
#   Seeds only topics that appear frequently in news APIs:
#   current-events, technology, business, sports, science, entertainment

# preset: trivia-deep
#   Deep seeding of knowledge/trivia-heavy topics:
#   history, science, geography, culture, records, sports, animals, space

Volume Controls

volume:
  # Richness tier overrides (controls facts generated per entity)
  #   high:   50-100 facts per entity (entertainment, sports, people)
  #   medium: 20-50 facts per entity (geography, science, animals)
  #   low:    10-20 facts per entity (business, design, fashion)
  richness_tier: medium         # default tier for all topics

  # Maximum entities to seed in this run
  max_entities: 500             # null = no limit (use CATEGORIES counts)

  # Maximum total facts to generate (safety cap)
  max_facts: 25000              # null = no limit

  # Difficulty level for challenge content (0-5)
  #   0 = balanced spread (generates all 5 levels evenly)
  #   1 = easy (default), 2 = moderate, 3 = hard, 4 = very hard, 5 = expert
  challenge_difficulty: 0       # 0=balanced spread across all levels

  # Which challenge styles to generate per fact (default: all 5)
  #   mc  = multiple_choice    (4 options, most engaging)
  #   dq  = direct_question    (classic quiz format)
  #   ftg = fill_the_gap       (cloze-style, good for learning)
  #   rl  = reverse_lookup     ("what am I describing?")
  #   ft  = free_text          (open-ended, hardest to auto-grade)
  # Fewer styles = lower cost. 3 core styles (mc,dq,ftg) cover most use cases.
  challenge_styles: mc,dq,ftg,rl,ft      # all 5 (default)
  # challenge_styles: mc,dq,ftg          # budget-friendly: 3 core styles (~40% less cost)

Quality Controls

quality:
  # Minimum notability score for keeping a fact (0.0 - 1.0)
  notability_threshold: 0.6     # default: 0.6

  # Validation strategy for AI-generated facts
  # Uses 4-phase pipeline: structural → consistency → cross-model → evidence
  validation_strategy: multi_phase      # multi_phase (default) | legacy_ai_cross_check

  # Run content cleanup after seeding?
  cleanup_after_seed: false     # true = rewrite titles/context for consistency

  # Generate challenge content automatically?
  generate_challenges: true     # true = generate 5 quiz styles per fact

  # CQ-002 enforcement (second-person address in challenges)
  patch_cq002: true             # true = auto-patch "you/your" into challenges

Evergreen Controls

Only applies when mode is evergreen-boost or full-pipeline.

evergreen:
  enabled: false                # Set true to enable daily AI fact generation
  daily_quota: 20               # Facts per day across all topics
  distribution:                 # Optional: override per-topic share
    science: 25%                # 5 facts/day
    history: 25%                # 5 facts/day
    geography: 15%              # 3 facts/day
    # Remaining 35% split equally among other active topics
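The distribution above works out as follows. This is a minimal sketch; the helper name and the round-to-nearest behavior are assumptions, not the pipeline's actual code:

```typescript
// Turn the percentage distribution into per-topic daily quotas.
// Shares are fractions of dailyQuota; whatever is left over is the pool
// split equally among the other active topics.
function evergreenQuotas(
  dailyQuota: number,
  shares: Record<string, number>,
): { perTopic: Record<string, number>; remaining: number } {
  const perTopic: Record<string, number> = {};
  let assigned = 0;
  for (const [topic, share] of Object.entries(shares)) {
    perTopic[topic] = Math.round(dailyQuota * share);
    assigned += perTopic[topic];
  }
  return { perTopic, remaining: dailyQuota - assigned };
}
```

With daily_quota: 20 and the shares above, this yields 5 science, 5 history, 3 geography, and a remainder of 7 facts/day (the 35%) for the other active topics.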

Execution Controls

execution:
  # Concurrency for AI calls (per script instance)
  concurrency: 5                # 1-10, higher = faster but more API pressure

  # Number of parallel partitions for large runs
  partitions: 4                 # 1-8, each partition runs independently

  # Dry run first?
  dry_run_first: true           # true = preview before executing

  # Auto-upload to DB after generation?
  auto_upload: false            # true = automatically upsert to DB
                                # false = generate to JSONL, wait for manual upload

  # AI model preference (DB-driven via ai_model_tier_config)
  # Eligible models (97% threshold passed): run llm-fact-quality-testing.ts to verify
  # Default: gemini-3-flash-preview (promoted Feb 25, 2026; regressed Mar 4 — retest before use)
  # Note: gpt-5-mini fails on voice/style dimensions — not eligible for production seeding
  # Model routing is DB-driven — change via SQL UPDATE, no restart needed
  # Results: scripts/seed/.llm-test-data/eligibility.jsonl
  model: gemini-3-flash-preview          # or any model in ai_model_tier_config
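Since routing is read from the database, switching models is a single UPDATE. A hypothetical parameterized statement is sketched below; only the table name (ai_model_tier_config) comes from this doc, and the column names (model_id, tier) are assumptions, so check the real schema before running anything:

```typescript
// Build a parameterized UPDATE for the DB-driven model switch.
// Safe to hand to any Postgres client that supports $n placeholders.
type ParamQuery = { text: string; values: string[] };

function buildModelSwitch(tier: string, modelId: string): ParamQuery {
  return {
    text: "UPDATE ai_model_tier_config SET model_id = $1 WHERE tier = $2",
    values: [modelId, tier],
  };
}
```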

News Ingestion Controls

Controls for the automated news pipeline. Changes here affect environment config.

news:
  # Which providers to use (requires API keys in .env.local)
  providers:
    - newsapi                   # NEWS_API_KEY
    - gnews                     # GOOGLE_NEWS_API_KEY
    - thenewsapi                # THENEWS_API_KEY
    - newsdata                  # NEWSDATA_API_KEY
    - event_registry            # EVENT_REGISTRY_API_KEY

  # Max articles per provider per category per run
  max_results: 20               # default: 20

  # Ingestion interval (cron frequency in minutes)
  interval_minutes: 15          # default: 15

Enrichment Orchestrator

In addition to the primary news providers, the enrichment orchestrator (packages/ai/src/enrichment.ts) injects context from 8 free API sources during fact extraction. This is context injection, not primary article fetching.

| Source | Routing | Purpose |
| --- | --- | --- |
| Knowledge Graph | Always | Entity identification, notability signals |
| Wikidata | Always | Structured facts, identifiers |
| Wikipedia | Always | Summary context, descriptions |
| GDELT 2.0 | Always | Global event context, media mentions |
| TheSportsDB | sports/* topics | Team/player data |
| MusicBrainz | music/* topics | Artist/album metadata |
| Nominatim | geography/* topics | Location geocoding |
| Open Library | books/* topics | Book/author data |

All enrichment calls use Promise.allSettled() — a failing API never blocks extraction.
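A minimal sketch of that pattern (the fetcher shape is illustrative; the real implementation lives in packages/ai/src/enrichment.ts):

```typescript
// Fan out enrichment calls with Promise.allSettled so one failing source
// never blocks fact extraction: rejected sources are simply dropped.
type EnrichmentResult = { source: string; context: string };

async function enrich(
  fetchers: Record<string, () => Promise<string>>,
): Promise<EnrichmentResult[]> {
  const names = Object.keys(fetchers);
  const settled = await Promise.allSettled(names.map((name) => fetchers[name]()));
  return settled.flatMap((result, i) =>
    result.status === "fulfilled"
      ? [{ source: names[i], context: result.value }]
      : [], // a timeout or API error just removes that source's context
  );
}
```

Unlike Promise.all, allSettled never rejects, so extraction proceeds with whatever context survived.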


Cost Estimates

Approximate costs per operation (using GPT-5 Mini / Gemini 3 Flash Preview):

| Operation | Per-Unit Cost | Example |
| --- | --- | --- |
| Entity generation | ~$0.002/entity | 500 entities = ~$1.00 |
| Fact explosion | ~$0.01/entity | 500 entities = ~$5.00 |
| Challenge content | ~$0.006/fact | 10,000 facts = ~$60.00 |
| Content cleanup | ~$0.004/fact | 10,000 facts = ~$40.00 |
| News extraction | ~$0.003/story | 100 stories/day = ~$0.30/day |
| Evergreen generation | ~$0.005/fact | 20 facts/day = ~$0.10/day |
| Entity category materialization | ~$0.002/entity | 500 entities = ~$1.00 |
| Voice-pass regeneration | ~$0.003/record | 10,000 records = ~$30.00 |
| Model eligibility testing | ~$0.50/model | Per full 5-phase run |

Typical curated-seed run (500 entities, medium richness):

  • Entity generation: ~$1
  • Explosion: ~$5
  • Validation: ~$2
  • Challenge content: ~$60
  • Total: ~$68
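A back-of-envelope check of that total. Rates come from the cost table above; the flat ~$2 validation figure and the 20 facts/entity assumption (the low end of medium richness, giving 10,000 facts from 500 entities) are taken from this example, not from the pipeline:

```typescript
// Rough cost model for a curated-seed run.
const RATES = { entityGen: 0.002, explosion: 0.01, challenge: 0.006 };

function curatedSeedCost(
  entities: number,
  factsPerEntity: number,
  validationFlat = 2, // flat estimate; validation has no per-unit row
): number {
  const facts = entities * factsPerEntity;
  return (
    entities * RATES.entityGen + // entity generation
    entities * RATES.explosion + // fact explosion
    validationFlat +             // validation
    facts * RATES.challenge      // challenge content, all 5 styles
  );
}
```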

Daily Production Budget ($20/day) — Split Routing

The pipeline uses split model routing to maximize throughput within a $20/day AI budget. GPT-5.4 Nano handles cost-efficient bulk generation; topics where it produces poor quality are routed to Gemini 3 Flash Preview instead.

Per-model costs:

| Model | Input $/MTok | Output $/MTok | Cost/Fact | Cost/Challenge |
| --- | --- | --- | --- | --- |
| gpt-5.4-nano | $0.20 | $1.25 | $0.006 | $0.0048 |
| gemini-3-flash-preview | $0.50 | $3.00 | $0.015 | $0.0106 |

Topic routing (as of March 30, 2026):

| Topics | % of entities | Model | Reason |
| --- | --- | --- | --- |
| Sports, Music, Science | ~60% | gemini-3-flash-preview | Nano fabricates sports stats, struggles with music nuance, fails science cross-challenge isolation |
| All other topics | ~40% | gpt-5.4-nano | Good quality at 2.2x lower cost |

Throughput at $20/day:

| Scenario | Weighted cost/challenge | Challenges/day | Facts/day |
| --- | --- | --- | --- |
| All Gemini | $0.0106 | ~1,887 | ~86 |
| Split (sports+music→Gemini) | $0.0076 | ~2,635 | ~120 |
| Split (sports+music+science→Gemini) | $0.0083 | ~2,415 | ~110 |

Adding science to the Gemini handoff reduces throughput by ~8% (~10 fewer facts/day) but eliminates wasted budget on science content that fails validation (46.7% → 85% validation rate improvement).
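The weighted-cost arithmetic behind those rows can be sketched as follows (the ~60% Gemini share is approximate; the helper name is illustrative):

```typescript
// Weighted cost per challenge and resulting daily throughput for a given
// Gemini traffic share, under the $20/day budget.
const COST_PER_CHALLENGE = {
  "gpt-5.4-nano": 0.0048,
  "gemini-3-flash-preview": 0.0106,
};

function splitThroughput(geminiShare: number, dailyBudget = 20) {
  const weighted =
    geminiShare * COST_PER_CHALLENGE["gemini-3-flash-preview"] +
    (1 - geminiShare) * COST_PER_CHALLENGE["gpt-5.4-nano"];
  return { weighted, challengesPerDay: dailyBudget / weighted };
}
```

At a 60% Gemini share this gives a weighted cost of ~$0.0083/challenge and ~2,415 challenges/day, matching the third row of the throughput table.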


Model Eligibility

Before a model can be used for production seeding, it must pass the eligibility gate. The test pipeline (scripts/seed/llm-fact-quality-testing.ts) evaluates models across 7 quality dimensions and writes results to scripts/seed/.llm-test-data/eligibility.jsonl.

Threshold: All 7 dimensions must score 97% or higher.

Currently eligible: Verify before each run — gemini-3-flash-preview passed Feb 25 but regressed Mar 4; gemini-2.5-flash has mixed results. Run llm-fact-quality-testing.ts --all --models <model> to confirm. Models like gpt-5-mini fail on voice/style adherence dimensions.

| Dimension | What it measures |
| --- | --- |
| validation | Facts pass the multi-phase validation pipeline |
| evidence | Facts corroborated by external evidence |
| challenges | Challenge content passes CQ rules |
| schema_adherence | Output conforms to topic-specific schemas |
| voice_adherence | Matches Eko voice constitution and taxonomy voice |
| style_adherence | Follows per-style rules |
| token_efficiency | Output stays within token budget |
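A sketch of the gate itself. The JSONL record shape here ({ model, scores }) is an assumption; check eligibility.jsonl for the actual schema:

```typescript
// All 7 dimensions must score at or above the 97% threshold.
const DIMENSIONS = [
  "validation", "evidence", "challenges", "schema_adherence",
  "voice_adherence", "style_adherence", "token_efficiency",
] as const;

const THRESHOLD = 0.97;

function isEligible(record: { model: string; scores: Record<string, number> }): boolean {
  // A missing dimension counts as a failure, not a pass.
  return DIMENSIONS.every((dim) => (record.scores[dim] ?? 0) >= THRESHOLD);
}
```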

Each model has a dedicated ModelAdapter (packages/ai/src/models/adapters/) that provides per-model prompt optimizations and signoff guidance for the quality reviewer. See Model Code Isolation for details.


Execution Checklist

When a Claude session reads this file, it should:

  1. Parse the directives above
  2. Verify prerequisites (OPENAI_API_KEY or GOOGLE_API_KEY, DATABASE_URL in .env.local)
  3. Verify model eligibility — confirm the configured model has eligible: true in scripts/seed/.llm-test-data/eligibility.jsonl. If not, run llm-fact-quality-testing.ts --all --models <model> first.
  4. If dry_run_first: true, run --dry-run and report estimates
  5. Wait for user confirmation before proceeding
  6. Execute in order:
     a. Generate curated entries (if mode includes curated seeding)
     b. Bulk enqueue to workers
     c. Monitor explosion progress
     d. Run validation workers
     e. Generate challenge content (if generate_challenges: true)
     f. Run cleanup (if cleanup_after_seed: true)
     g. Upload to DB (if auto_upload: true)
     h. (Optional) Materialize entity categories (materialize-entity-categories.ts --audit, then --classify --insert --link)
  7. Report final stats: facts generated, challenges created, cost, errors

Reference: Available Scripts

| Script | Purpose | Key Flags |
| --- | --- | --- |
| generate-curated-entries.ts | AI-generate entity names | --insert, --category <slug> |
| bulk-enqueue.ts | Dispatch entries to explosion workers | (no flags) |
| generate-challenge-content.ts | Batch challenge generation | --audit, --export, --generate, --upload, --validate, --recover, --dry-run, --limit N, --concurrency N, --partition N/M, --output-suffix NAME, --difficulty N, --styles mc,dq,ftg, --format <slug>, --drift-check, --export-all |
| cleanup-content.ts | Rewrite titles/context | --audit, --export, --fix, --upload, --validate, --dry-run, --limit N, --concurrency N, --partition N/M, --output-suffix NAME |
| backfill-fact-nulls.ts | Fill NULL metadata | --audit, --notability, --challenge-content, --all, --dry-run |
| seed-from-files.ts | Parse XLSX/DOCX/CSV and seed | --parse, --explode, --explode-spinoffs, --super-facts, --stats, --all, --dry-run, --topic <slug>, --budget <dollars>, --resume, --batch-size <n> |
| llm-fact-quality-testing.ts | Model eligibility testing | --all, --generate, --validate, --challenge, --signoff, --report, --models <csv>, --limit N, --concurrency N, --output-dir <name>, --merge-dirs <csv>, --signoff-model <id> |
| materialize-entity-categories.ts | Create leaf categories from entities | --audit, --classify, --insert, --link, --concurrency N, --dry-run |
| regen-voice-pass.ts | Regenerate voice-pass content | --dry-run, --limit N, --partition M/N, --concurrency N, --presplit |
| rewrite-challenge-defects.ts | Fix challenge quality defects | --dry-run, --limit N, --partition M/N, --concurrency N, --presplit |
| presplit-defects.ts | Pre-split defective records into partitions | --partitions N |
| upsert-rewritten-challenges.ts | Upload rewritten challenges to DB | --dry-run |
| cleanup-seed-queue.ts | Remove garbage entries from seed queue | (one-time) |
| deepen-topic-paths.ts | AI-reclassify entries to deeper categories | --dry-run, --limit N, --concurrency N |
| remap-topic-paths.ts | Batch fix malformed topic paths | --dry-run, --limit N |
| improve-titles.ts | Rewrite weak titles | Deprecated — use cleanup-content.ts |

See runbook.md for detailed operational procedures.



Recent Seed Logs

All seed jobs are logged in logs/ with structured frontmatter for tracking.

After each seeding run, create a log file using the template and update the monthly index.