Fact Engine Seeding — TODO

Tracks progress across fact engine seeding work: challenge content generation, taxonomy expansion, frontend integration, content cleanup, and pipeline maintenance.

Context

  • Seeding control prompt: SEED.md — edit this to direct seeding operations
  • Best practices: seeding-best-practices.md — strategies, examples, cost management
  • Runbook: runbook.md — step-by-step operational procedures
  • Challenge content table: fact_challenge_content (migration 0121, correct_answer added in 0122)
  • Generation script: scripts/seed/generate-challenge-content.ts
  • Cleanup script: scripts/seed/cleanup-content.ts
  • AI model: gpt-5-mini (via ModelAdapter; see model-code-isolation.md)
  • Rules: docs/rules/challenge-content.md (CC-001 through CC-009, CQ-001 through CQ-008)

Current State

| Metric | Value |
| --- | --- |
| Total validated facts | 144,310 |
| Facts with challenge content (pre-regen) | 93,901 (65.1%) |
| Challenge rows in DB (pre-regen) | 378,816 |
| Full regeneration | In progress — targeting 100% coverage |
| correct_answer column | Backfilling (all pre-regen rows are NULL) |
| CQ-002 compliance (pre-regen) | ~60% on original 373K rows |
| Total AI spend to date | ~$95 (+ ~$85 est. for in-progress regeneration) |

Completed

  • Step 0: Challenge content rules + AI constitution (challenge-content-rules.ts)
  • Step 1: DB migration (0121_add_fact_challenge_content.sql)
  • Step 2: Drizzle schema + queries (factChallengeContent table, relations)
  • Step 3: AI generation function (packages/ai/src/challenge-content.ts)
  • Step 4: Register challenge_content_generation task type in fact-engine
  • Step 5a: Initial bulk generation — 373,937 rows across 8 partitions (~$93)
  • Step 5b: Recovery generation — 4,905 rows for 1,979 previously uncovered facts (~$2)
  • Step 5c: Upload all content to DB (378,816 rows)
  • CQ-002 prompt strengthening (explicit formatting rules + examples in AI prompt)
  • CQ-002 regex fix (/\byou\b/i -> /\byou(r|rs|rself)?\b/i)
  • CQ-002 generation-time filter (drop non-compliant challenges before JSONL write)
  • Add correct_answer column to fact_challenge_content (migration 0122)
  • CQ-008 rule + AI prompt includes correct_answer (3-6 sentence narrative for streaming display)
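
The CQ-002 regex fix and generation-time filter above can be sketched as follows (a minimal TypeScript sketch; the real filter lives in scripts/seed/generate-challenge-content.ts and its field names may differ):

```typescript
// CQ-002 bans second-person phrasing in challenge text. The fixed regex also
// catches "your", "yours", and "yourself", not just bare "you".
const CQ002_SECOND_PERSON = /\byou(r|rs|rself)?\b/i;

// Hypothetical row shape for generated challenges (field names assumed).
interface GeneratedChallenge {
  factRecordId: string;
  prompt: string;
}

// Generation-time filter: drop non-compliant challenges before the JSONL write.
function filterCq002(rows: GeneratedChallenge[]): GeneratedChallenge[] {
  return rows.filter((r) => !CQ002_SECOND_PERSON.test(r.prompt));
}
```

Note that \b keeps words like "youth" from matching: the optional suffix group fails and the trailing word-boundary check rejects the bare "you" prefix.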

In Progress

  • Full regeneration (CQ-002 fix + gap fill + correct_answer backfill) — regenerate ALL 144,310 validated facts with the improved prompt
    • Fixes CQ-002 compliance on the 93,901 facts that had ~60% pass rate
    • Fills the 50,409-fact coverage gap (facts that never had challenge content)
    • Backfills correct_answer column (migration 0122) — all existing rows have NULL from pre-column generation
    • Target: 100% fact coverage + >= 95% CQ-002 pass rate + correct_answer populated on all rows
    • Estimated cost: ~$85 (144,310 facts vs. previous estimate of ~$55 for 93,901)
    • Estimated time: ~11-12 hours with 8 partitions at concurrency 5
    • Approach: --export-all exports all 144,310 validated facts; --generate runs with 8 partitions; --upload upserts to overwrite existing rows and insert new ones
    • Generation (8 partitions running since Feb 18 14:22)
    • Upload (--upload upsert after generation completes)
    • Validate (--validate sample for CQ-002 pass rate)
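
The --partition N/M split used above can be sketched as a pure function (hypothetical helper; the actual script's slicing may differ):

```typescript
// Split an exported fact-ID list into M near-equal chunks and return chunk N
// (1-indexed), mirroring the --partition N/M flag. Helper name is assumed.
function partitionSlice<T>(items: T[], n: number, m: number): T[] {
  if (n < 1 || n > m) throw new Error(`partition ${n}/${m} out of range`);
  const size = Math.ceil(items.length / m);
  return items.slice((n - 1) * size, n * size);
}
```

With 144,310 facts and 8 partitions, each slice holds at most ceil(144310 / 8) = 18,039 facts, and the slices are disjoint and cover the full export.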

Taxonomy Expansion

  • 01-taxonomy-expansion.md <- NEXT
    • Create challenge document
    • Evaluate challenges
    • Implement Wave 1 (migration + safeguards)
    • Implement Wave 2 (entity materialization script)
    • Implement Wave 3 (curated entries integration)
    • Re-evaluate until PASS

Taxonomy Coherence

  • 04-taxonomy-coherence.md <- NEXT (parallel with taxonomy expansion)
    • Create challenge document
    • Implement Wave 1: category alias table + resolveTopicCategory() + extract-facts update
    • Evaluate Wave 1 challenges
    • Implement Wave 2 (unmapped category audit script)
    • Implement Wave 3 (depth-aware cron dispatch + subcategory routing)
    • Re-evaluate until PASS

Challenge Content Frontend Integration

  • 03-frontend-challenge-content.md — GTD challenge doc (M complexity)
    • Create challenge document
    • Evaluate challenges
    • Implement Wave 1 (API endpoint + data loading)
    • Implement Wave 2 (UI component updates)
    • Implement Wave 3 (queue integration for new facts)
    • Re-evaluate until PASS

Content Cleanup

  • Content cleanup pass — Rewrite titles/context across 144K facts for quality
    • Script: scripts/seed/cleanup-content.ts (5 phases: audit, export, fix, upload, validate)
    • Run --audit to assess corpus quality baseline
    • Run --export + --fix with partitioned generation
    • Run --upload + --validate
    • Wire maintenance cron for periodic re-runs on new facts
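
The five-phase CLI shape above can be sketched as follows (hypothetical; cleanup-content.ts's actual flag parsing may differ):

```typescript
// The five cleanup phases, in pipeline order, matching the checklist above.
const PHASES = ["audit", "export", "fix", "upload", "validate"] as const;
type Phase = (typeof PHASES)[number];

// Map CLI flags like --audit or --export to the phases to run, preserving
// pipeline order regardless of flag order on the command line.
function phasesFromArgs(argv: string[]): Phase[] {
  return PHASES.filter((p) => argv.includes(`--${p}`));
}
```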

Pipeline Maintenance Crons

  • Wire promoteHighEngagementFacts() — Add to daily maintenance cron; query exists at fact-engine-queries.ts:983
  • Wire abandonStaleSessions() — Add to hourly maintenance cron; query exists at fact-engine-queries.ts:1549
  • Wire milestone claim processing — Add cron + API route calling claimMilestone() at fact-engine-queries.ts:1207
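
A hedged sketch of how the three cron wirings might dispatch by cadence (the query helpers exist per the notes above; the dispatcher shape, string task names, and firing daily tasks at 00:00 UTC are all assumptions):

```typescript
type Cadence = "hourly" | "daily";

// Maintenance tasks to wire, with cadences taken from the checklist above;
// the milestone-claim cadence is assumed hourly.
const MAINTENANCE_TASKS: { name: string; cadence: Cadence }[] = [
  { name: "promoteHighEngagementFacts", cadence: "daily" },
  { name: "abandonStaleSessions", cadence: "hourly" },
  { name: "processMilestoneClaims", cadence: "hourly" },
];

// Return the task names due at a given UTC hour; daily tasks fire at 00:00.
function dueTasks(hourUtc: number): string[] {
  return MAINTENANCE_TASKS.filter(
    (t) => t.cadence === "hourly" || hourUtc === 0,
  ).map((t) => t.name);
}
```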

Backfill Fact Nulls

  • 02-backfill-fact-nulls.md
    • Wave 1: Schema & type expansion
    • Wave 2: Pipeline wiring fixes
    • Wave 3: Backfill existing data (run script against production)

TODO

  • Difficulty levels 2-5 — Generate harder challenge content (currently only level 1)
  • Changelog entry — Update docs/changelog/02-2026.md with challenge content feature

Regeneration Runbook

Full Regeneration Procedure

# 1. Export ALL validated facts (both covered and uncovered)
bun scripts/seed/generate-challenge-content.ts --export-all

# 2. Clear old generated files (optional — resume logic handles dedup)
rm scripts/seed/.challenge-data/challenges-generated*.jsonl

# 3. Generate with 8 parallel partitions
for i in 1 2 3 4 5 6 7 8; do
  bun scripts/seed/generate-challenge-content.ts \
    --generate --partition $i/8 --output-suffix regen-p$i --concurrency 5 &
done
wait

# 4. Upload (upsert overwrites existing rows)
bun scripts/seed/generate-challenge-content.ts --upload

# 5. Validate
bun scripts/seed/generate-challenge-content.ts --validate

Cost Tracking

Historical cost data is now tracked in per-job seed logs. See logs/2026-02/index.md for the February 2026 summary.

Legacy table (pre-logging system):

| Run | Rows | Cost | Date |
| --- | --- | --- | --- |
| Initial bulk (8 partitions) | 373,937 | ~$93 | 2026-02-17 |
| Recovery (4 partitions) | 4,905 | ~$2 | 2026-02-18 |
| Full regeneration (CQ-002 + gap fill) | ~577K est. | ~$85 est. | 2026-02-18 (in progress) |
| Total | ~956K est. | ~$180 est. | |

Note: Future runs should create a seed log file instead of adding rows here. See logs/README.md.

Architecture Notes

  • JSONL pipeline: All AI output writes to local .jsonl files first, then bulk-uploads to DB. This pattern provides crash-resilience and lets us inspect/filter output before committing.
  • Partition-based parallelism: --partition N/M splits the export into M chunks. Each partition writes to its own file via --output-suffix.
  • Resume-safe: The --generate phase scans ALL challenges-generated*.jsonl files to find already-processed fact IDs. Interrupted runs can be restarted without re-processing.
  • Upsert upload: onConflictDoUpdate on (fact_record_id, challenge_style, target_fact_key, difficulty) means regenerated content overwrites old content automatically.
  • Three-layer CQ-002 enforcement: (1) Prompt-level instruction with examples, (2) Generation-time regex filter drops non-compliant output, (3) Post-upload validation sampling.
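
The resume-safe scan described above can be sketched as follows (the factRecordId field name is an assumption; the real JSONL schema may differ):

```typescript
// Collect fact IDs already present in any challenges-generated*.jsonl file so
// an interrupted --generate run can skip them on restart. Takes file contents
// as strings to stay self-contained; the real script reads from disk.
function alreadyProcessedIds(jsonlFiles: string[]): Set<string> {
  const seen = new Set<string>();
  for (const file of jsonlFiles) {
    for (const line of file.split("\n")) {
      if (!line.trim()) continue; // skip blank lines between records
      const row = JSON.parse(line) as { factRecordId?: string };
      if (row.factRecordId) seen.add(row.factRecordId);
    }
  }
  return seen;
}
```

Because IDs from every partition's output file land in one set, restarting any partition re-processes only the facts with no row in any file.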