Fact Engine Seeding — TODO
Tracks progress across fact engine seeding work: challenge content generation, taxonomy expansion, frontend integration, content cleanup, and pipeline maintenance.
Context
- Seeding control prompt: SEED.md — edit this to direct seeding operations
- Best practices: seeding-best-practices.md — strategies, examples, cost management
- Runbook: runbook.md — step-by-step operational procedures
- Challenge content table: `fact_challenge_content` (migration 0121, `correct_answer` added in 0122)
- Generation script: `scripts/seed/generate-challenge-content.ts`
- Cleanup script: `scripts/seed/cleanup-content.ts`
- AI model: gpt-5-mini (via ModelAdapter; see model-code-isolation.md)
- Rules: `docs/rules/challenge-content.md` (CC-001 through CC-009, CQ-001 through CQ-008)
Current State
| Metric | Value |
|---|---|
| Total validated facts | 144,310 |
| Facts with challenge content (pre-regen) | 93,901 (65.1%) |
| Challenge rows in DB (pre-regen) | 378,816 |
| Full regeneration | In progress — targeting 100% coverage |
| correct_answer column | Backfilling (all pre-regen rows are NULL) |
| CQ-002 compliance (pre-regen) | ~60% on original 373K rows |
| Total AI spend to date | ~$95 (+ ~$85 est. for in-progress regeneration) |
Completed
- Step 0: Challenge content rules + AI constitution (`challenge-content-rules.ts`)
- Step 1: DB migration (`0121_add_fact_challenge_content.sql`)
- Step 2: Drizzle schema + queries (`factChallengeContent` table, relations)
- Step 3: AI generation function (`packages/ai/src/challenge-content.ts`)
- Step 4: Register `challenge_content_generation` task type in fact-engine
- Step 5a: Initial bulk generation — 373,937 rows across 8 partitions (~$93)
- Step 5b: Recovery generation — 4,905 rows for 1,979 previously uncovered facts (~$2)
- Step 5c: Upload all content to DB (378,816 rows)
- CQ-002 prompt strengthening (explicit formatting rules + examples in AI prompt)
- CQ-002 regex fix (`/\byou\b/i` → `/\byou(r|rs|rself)?\b/i`)
- CQ-002 generation-time filter (drop non-compliant challenges before JSONL write)
- Add `correct_answer` column to `fact_challenge_content` (migration 0122)
- CQ-008 rule + AI prompt includes `correct_answer` (3-6 sentence narrative for streaming display)
In Progress
- Full regeneration (CQ-002 fix + gap fill + correct_answer backfill) — Regenerate ALL 144,310 validated facts with improved prompt
  - Fixes CQ-002 compliance on the 93,901 facts that had ~60% pass rate
  - Fills the 50,409-fact coverage gap (facts that never had challenge content)
  - Backfills `correct_answer` column (migration 0122) — all existing rows have NULL from pre-column generation
  - Target: 100% fact coverage + >= 95% CQ-002 pass rate + `correct_answer` populated on all rows
  - Estimated cost: ~$85 (144,310 facts vs. previous estimate of ~$55 for 93,901)
  - Estimated time: ~11-12 hours with 8 partitions at concurrency 5
  - Approach: `--export-all` exports all 144,310 validated facts, `--generate` with 8 partitions, `--upload` upserts to overwrite existing + insert new rows
- Generation (8 partitions running since Feb 18 14:22)
- Upload (`--upload` upsert after generation completes)
- Validate (`--validate` sample for CQ-002 pass rate)
Taxonomy Expansion
- 01-taxonomy-expansion.md <- NEXT
- Create challenge document
- Evaluate challenges
- Implement Wave 1 (migration + safeguards)
- Implement Wave 2 (entity materialization script)
- Implement Wave 3 (curated entries integration)
- Re-evaluate until PASS
Taxonomy Coherence
- 04-taxonomy-coherence.md <- NEXT (parallel with taxonomy expansion)
- Create challenge document
- Implement Wave 1: category alias table + resolveTopicCategory() + extract-facts update
- Evaluate Wave 1 challenges
- Implement Wave 2 (unmapped category audit script)
- Implement Wave 3 (depth-aware cron dispatch + subcategory routing)
- Re-evaluate until PASS
Challenge Content Frontend Integration
- 03-frontend-challenge-content.md — GTD challenge doc (M complexity)
- Create challenge document
- Evaluate challenges
- Implement Wave 1 (API endpoint + data loading)
- Implement Wave 2 (UI component updates)
- Implement Wave 3 (queue integration for new facts)
- Re-evaluate until PASS
Content Cleanup
- Content cleanup pass — Rewrite titles/context across 144K facts for quality
- Script: `scripts/seed/cleanup-content.ts` (5 phases: audit, export, fix, upload, validate)
- Run `--audit` to assess corpus quality baseline
- Run `--export` + `--fix` with partitioned generation
- Run `--upload` + `--validate`
- Wire maintenance cron for periodic re-runs on new facts
Pipeline Maintenance Crons
- Wire `promoteHighEngagementFacts()` — Add to daily maintenance cron; query exists at `fact-engine-queries.ts:983`
- Wire `abandonStaleSessions()` — Add to hourly maintenance cron; query exists at `fact-engine-queries.ts:1549`
- Wire milestone claim processing — Add cron + API route calling `claimMilestone()` at `fact-engine-queries.ts:1207`
Backfill Fact Nulls
- 02-backfill-fact-nulls.md
- Wave 1: Schema & type expansion
- Wave 2: Pipeline wiring fixes
- Wave 3: Backfill existing data (run script against production)
TODO
- Difficulty levels 2-5 — Generate harder challenge content (currently only level 1)
- Changelog entry — Update `docs/changelog/02-2026.md` with challenge content feature
Regeneration Runbook
Full Regeneration Procedure
```bash
# 1. Export ALL validated facts (both covered and uncovered)
bun scripts/seed/generate-challenge-content.ts --export-all

# 2. Clear old generated files (optional — resume logic handles dedup)
rm scripts/seed/.challenge-data/challenges-generated*.jsonl

# 3. Generate with 8 parallel partitions
for i in 1 2 3 4 5 6 7 8; do
  bun scripts/seed/generate-challenge-content.ts \
    --generate --partition $i/8 --output-suffix regen-p$i --concurrency 5 &
done
wait

# 4. Upload (upsert overwrites existing rows)
bun scripts/seed/generate-challenge-content.ts --upload

# 5. Validate
bun scripts/seed/generate-challenge-content.ts --validate
```
Cost Tracking
Historical cost data is now tracked in per-job seed logs. See logs/2026-02/index.md for the February 2026 summary.
Legacy table (pre-logging system):
| Run | Rows | Cost | Date |
|---|---|---|---|
| Initial bulk (8 partitions) | 373,937 | ~$93 | 2026-02-17 |
| Recovery (4 partitions) | 4,905 | ~$2 | 2026-02-18 |
| Full regeneration (CQ-002 + gap fill) | ~577K est. | ~$85 est. | 2026-02-18 (in progress) |
| Total | ~956K est. | ~$180 est. | |
Note: Future runs should create a seed log file instead of adding rows here. See logs/README.md.
Architecture Notes
- JSONL pipeline: All AI output writes to local `.jsonl` files first, then bulk-uploads to DB. This pattern provides crash-resilience and lets us inspect/filter output before committing.
- Partition-based parallelism: `--partition N/M` splits the export into M chunks. Each partition writes to its own file via `--output-suffix`.
- Resume-safe: The `--generate` phase scans ALL `challenges-generated*.jsonl` files to find already-processed fact IDs. Interrupted runs can be restarted without re-processing.
- Upsert upload: `onConflictDoUpdate` on `(fact_record_id, challenge_style, target_fact_key, difficulty)` means regenerated content overwrites old content automatically.
- Three-layer CQ-002 enforcement: (1) Prompt-level instruction with examples, (2) Generation-time regex filter drops non-compliant output, (3) Post-upload validation sampling.