Fact Engine Seeding — TODO

Tracks progress across fact engine seeding work: challenge content generation, taxonomy expansion, frontend integration, content cleanup, and pipeline maintenance.

Context

  • Seeding control prompt: SEED.md — edit this to direct seeding operations
  • Best practices: seeding-best-practices.md — strategies, examples, cost management
  • Runbook: runbook.md — step-by-step operational procedures
  • Challenge content table: fact_challenge_content (migration 0121, correct_answer added in 0122)
  • Generation script: scripts/seed/generate-challenge-content.ts
  • Cleanup script: scripts/seed/cleanup-content.ts
  • AI model: gpt-5-mini (via ModelAdapter; see model-code-isolation.md)
  • Rules: docs/rules/challenge-content.md (CC-001 through CC-009, CQ-001 through CQ-008)

Current State

| Metric | Value |
| --- | --- |
| Total validated facts | 144,310 |
| Facts with challenge content (pre-regen) | 93,901 (65.1%) |
| Challenge rows in DB (pre-regen) | 378,816 |
| Full regeneration | In progress — targeting 100% coverage |
| correct_answer column | Backfilling (all pre-regen rows are NULL) |
| CQ-002 compliance (pre-regen) | ~60% on original 373K rows |
| Total AI spend to date | ~$95 (+ ~$85 est. for in-progress regeneration) |

Completed

  • Step 0: Challenge content rules + AI constitution (challenge-content-rules.ts)
  • Step 1: DB migration (0121_add_fact_challenge_content.sql)
  • Step 2: Drizzle schema + queries (factChallengeContent table, relations)
  • Step 3: AI generation function (packages/ai/src/challenge-content.ts)
  • Step 4: Register challenge_content_generation task type in fact-engine
  • Step 5a: Initial bulk generation — 373,937 rows across 8 partitions (~$93)
  • Step 5b: Recovery generation — 4,905 rows for 1,979 previously uncovered facts (~$2)
  • Step 5c: Upload all content to DB (378,816 rows)
  • CQ-002 prompt strengthening (explicit formatting rules + examples in AI prompt)
  • CQ-002 regex fix (/\byou\b/i -> /\byou(r|rs|rself)?\b/i)
  • CQ-002 generation-time filter (drop non-compliant challenges before JSONL write)
  • Add correct_answer column to fact_challenge_content (migration 0122)
  • CQ-008 rule + AI prompt includes correct_answer (3-6 sentence narrative for streaming display)
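
The CQ-002 regex fix and generation-time filter above can be sketched as follows (a minimal TypeScript sketch; the real filter lives in scripts/seed/generate-challenge-content.ts and its field names may differ):

```typescript
// CQ-002 bans second-person phrasing in challenge text. The fixed regex also
// catches "your", "yours", and "yourself", not just bare "you".
const CQ002_SECOND_PERSON = /\byou(r|rs|rself)?\b/i;

// Hypothetical row shape for generated challenges (field names assumed).
interface GeneratedChallenge {
  factRecordId: string;
  prompt: string;
}

// Generation-time filter: drop non-compliant challenges before the JSONL write.
function filterCq002(rows: GeneratedChallenge[]): GeneratedChallenge[] {
  return rows.filter((r) => !CQ002_SECOND_PERSON.test(r.prompt));
}
```

Note that \b keeps words like "youth" from matching: the optional suffix group fails and the trailing word-boundary check rejects the bare "you" prefix.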

In Progress

  • Full regeneration (CQ-002 fix + gap fill + correct_answer backfill) — regenerate ALL 144,310 validated facts with the improved prompt
    • Fixes CQ-002 compliance on the 93,901 facts that had ~60% pass rate
    • Fills the 50,409-fact coverage gap (facts that never had challenge content)
    • Backfills correct_answer column (migration 0122) — all existing rows have NULL from pre-column generation
    • Target: 100% fact coverage + >= 95% CQ-002 pass rate + correct_answer populated on all rows
    • Estimated cost: ~$85 (144,310 facts vs. previous estimate of ~$55 for 93,901)
    • Estimated time: ~11-12 hours with 8 partitions at concurrency 5
    • Approach: --export-all exports all 144,310 validated facts; --generate runs with 8 partitions; --upload upserts to overwrite existing rows and insert new ones
    • Generation (8 partitions running since Feb 18 14:22)
    • Upload (--upload upsert after generation completes)
    • Validate (--validate sample for CQ-002 pass rate)
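
The --partition N/M split used above can be sketched as a pure function (hypothetical helper; the actual script's slicing may differ):

```typescript
// Split an exported fact-ID list into M near-equal chunks and return chunk N
// (1-indexed), mirroring the --partition N/M flag. Helper name is assumed.
function partitionSlice<T>(items: T[], n: number, m: number): T[] {
  if (n < 1 || n > m) throw new Error(`partition ${n}/${m} out of range`);
  const size = Math.ceil(items.length / m);
  return items.slice((n - 1) * size, n * size);
}
```

With 144,310 facts and 8 partitions, each slice holds at most ceil(144310 / 8) = 18,039 facts, and the slices are disjoint and cover the full export.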

Taxonomy Expansion

  • 01-taxonomy-expansion.md <- NEXT
    • Create challenge document
    • Evaluate challenges
    • Implement Wave 1 (migration + safeguards)
    • Implement Wave 2 (entity materialization script)
    • Implement Wave 3 (curated entries integration)
    • Re-evaluate until PASS

Taxonomy Coherence

  • 04-taxonomy-coherence.md <- NEXT (parallel with taxonomy expansion)
    • Create challenge document
    • Implement Wave 1: category alias table + resolveTopicCategory() + extract-facts update
    • Evaluate Wave 1 challenges
    • Implement Wave 2 (unmapped category audit script)
    • Implement Wave 3 (depth-aware cron dispatch + subcategory routing)
    • Re-evaluate until PASS

Challenge Content Frontend Integration

  • 03-frontend-challenge-content.md — GTD challenge doc (M complexity)
    • Create challenge document
    • Evaluate challenges
    • Implement Wave 1 (API endpoint + data loading)
    • Implement Wave 2 (UI component updates)
    • Implement Wave 3 (queue integration for new facts)
    • Re-evaluate until PASS

Content Cleanup

  • Content cleanup pass — Rewrite titles/context across 144K facts for quality
    • Script: scripts/seed/cleanup-content.ts (5 phases: audit, export, fix, upload, validate)
    • Run --audit to assess corpus quality baseline
    • Run --export + --fix with partitioned generation
    • Run --upload + --validate
    • Wire maintenance cron for periodic re-runs on new facts
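
The five-phase CLI shape above can be sketched as follows (hypothetical; cleanup-content.ts's actual flag parsing may differ):

```typescript
// The five cleanup phases, in pipeline order, matching the checklist above.
const PHASES = ["audit", "export", "fix", "upload", "validate"] as const;
type Phase = (typeof PHASES)[number];

// Map CLI flags like --audit or --export to the phases to run, preserving
// pipeline order regardless of flag order on the command line.
function phasesFromArgs(argv: string[]): Phase[] {
  return PHASES.filter((p) => argv.includes(`--${p}`));
}
```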

Pipeline Maintenance Crons

  • Wire promoteHighEngagementFacts() — Add to daily maintenance cron; query exists at fact-engine-queries.ts:983
  • Wire abandonStaleSessions() — Add to hourly maintenance cron; query exists at fact-engine-queries.ts:1549
  • Wire milestone claim processing — Add cron + API route calling claimMilestone() at fact-engine-queries.ts:1207
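
A hedged sketch of how the three cron wirings might dispatch by cadence (the query helpers exist per the notes above; the dispatcher shape, string task names, and firing daily tasks at 00:00 UTC are all assumptions):

```typescript
type Cadence = "hourly" | "daily";

// Maintenance tasks to wire, with cadences taken from the checklist above;
// the milestone-claim cadence is assumed hourly.
const MAINTENANCE_TASKS: { name: string; cadence: Cadence }[] = [
  { name: "promoteHighEngagementFacts", cadence: "daily" },
  { name: "abandonStaleSessions", cadence: "hourly" },
  { name: "processMilestoneClaims", cadence: "hourly" },
];

// Return the task names due at a given UTC hour; daily tasks fire at 00:00.
function dueTasks(hourUtc: number): string[] {
  return MAINTENANCE_TASKS.filter(
    (t) => t.cadence === "hourly" || hourUtc === 0,
  ).map((t) => t.name);
}
```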

Backfill Fact Nulls

  • 02-backfill-fact-nulls.md
    • Wave 1: Schema & type expansion
    • Wave 2: Pipeline wiring fixes
    • Wave 3: Backfill existing data (run script against production)

TODO

  • Difficulty levels 2-5 — Generate harder challenge content (currently only level 1)
  • Changelog entry — Update docs/changelog/02-2026.md with challenge content feature

Regeneration Runbook

Full Regeneration Procedure

# 1. Export ALL validated facts (both covered and uncovered)
bun scripts/seed/generate-challenge-content.ts --export-all

# 2. Clear old generated files (optional — resume logic handles dedup)
rm scripts/seed/.challenge-data/challenges-generated*.jsonl

# 3. Generate with 8 parallel partitions
for i in 1 2 3 4 5 6 7 8; do
  bun scripts/seed/generate-challenge-content.ts \
    --generate --partition $i/8 --output-suffix regen-p$i --concurrency 5 &
done
wait

# 4. Upload (upsert overwrites existing rows)
bun scripts/seed/generate-challenge-content.ts --upload

# 5. Validate
bun scripts/seed/generate-challenge-content.ts --validate

Cost Tracking

Historical cost data is now tracked in per-job seed logs. See logs/2026-02/index.md for the February 2026 summary.

Legacy table (pre-logging system):

| Run | Rows | Cost | Date |
| --- | --- | --- | --- |
| Initial bulk (8 partitions) | 373,937 | ~$93 | 2026-02-17 |
| Recovery (4 partitions) | 4,905 | ~$2 | 2026-02-18 |
| Full regeneration (CQ-002 + gap fill) | ~577K est. | ~$85 est. | 2026-02-18 (in progress) |
| Total | ~956K est. | ~$180 est. | |

Note: Future runs should create a seed log file instead of adding rows here. See logs/README.md.

Architecture Notes

  • JSONL pipeline: All AI output writes to local .jsonl files first, then bulk-uploads to DB. This pattern provides crash-resilience and lets us inspect/filter output before committing.
  • Partition-based parallelism: --partition N/M splits the export into M chunks. Each partition writes to its own file via --output-suffix.
  • Resume-safe: The --generate phase scans ALL challenges-generated*.jsonl files to find already-processed fact IDs. Interrupted runs can be restarted without re-processing.
  • Upsert upload: onConflictDoUpdate on (fact_record_id, challenge_style, target_fact_key, difficulty) means regenerated content overwrites old content automatically.
  • Three-layer CQ-002 enforcement: (1) Prompt-level instruction with examples, (2) Generation-time regex filter drops non-compliant output, (3) Post-upload validation sampling.
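
The resume-safe scan described above can be sketched as follows (the factRecordId field name is an assumption; the real JSONL schema may differ):

```typescript
// Collect fact IDs already present in any challenges-generated*.jsonl file so
// an interrupted --generate run can skip them on restart. Takes file contents
// as strings to stay self-contained; the real script reads from disk.
function alreadyProcessedIds(jsonlFiles: string[]): Set<string> {
  const seen = new Set<string>();
  for (const file of jsonlFiles) {
    for (const line of file.split("\n")) {
      if (!line.trim()) continue; // skip blank lines between records
      const row = JSON.parse(line) as { factRecordId?: string };
      if (row.factRecordId) seen.add(row.factRecordId);
    }
  }
  return seen;
}
```

Because IDs from every partition's output file land in one set, restarting any partition re-processes only the facts with no row in any file.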