Manual Seeding Guide — Non-News Challenge Pipeline
Comprehensive reference for how Eko creates interactive challenge content from non-news sources. Covers the full pipeline: file parsing, AI-powered fact explosion, challenge content generation, content cleanup, and quality enforcement.
Overview
The manual seeding pipeline transforms legacy content files (XLSX, DOCX, CSV) and AI-curated entity lists into structured, quiz-ready facts. Unlike the news pipeline (which ingests current articles via cron), manual seeding is operator-driven and batch-oriented.
┌──────────────────────────────────────────────────────────────────┐
│ Manual Seeding Pipeline │
│ │
│ Source Files seed_entry_queue fact_records │
│ (XLSX/DOCX/CSV) ──▶ (DB work queue) ──▶ (structured facts) │
│ OR │ │ │
│ AI-Curated Entries Redis fact_challenge_ │
│ (generate-curated) (EXPLODE_CATEGORY content │
│ _ENTRY messages) (6 quiz styles/fact) │
└──────────────────────────────────────────────────────────────────┘
Pipeline Stages
| Stage | Script / Component | Input | Output |
|---|---|---|---|
| 1. Parse | seed-from-files.ts --parse | Legacy files | seed_entry_queue rows |
| 1b. Curate | generate-curated-entries.ts | AI + category specs | seed_entry_queue rows |
| 2. Dispatch | bulk-enqueue.ts | Pending queue entries | Redis messages |
| 3. Explode | worker-facts (explode-entry handler) | Redis messages | fact_records + spinoffs |
| 4. Validate | worker-validate | Pending facts | Validated facts |
| 5. Challenge Gen | generate-challenge-content.ts | Validated facts | fact_challenge_content |
| 6. Cleanup | cleanup-content.ts | All facts | Rewritten titles/context |
Taxonomy Resolution
Category mapping is a cross-cutting concern that affects file parsing (Stage 1), curated entry generation (Stage 1b), and news ingestion. The system uses a three-tier resolution strategy to map external category slugs to internal topic_categories rows.
Resolution Order
The resolveTopicCategory() function (packages/db/src/drizzle/fact-engine-queries.ts) resolves slugs in priority order:
- Exact slug match — Direct lookup in topic_categories (fastest path)
- Provider-specific alias — Matches (external_slug, provider) in topic_category_aliases
- Universal alias — Matches (external_slug, NULL provider) in topic_category_aliases
- Unresolved — Logs to unmapped_category_log for audit (fire-and-forget, non-blocking)
After migration 0101 expanded the taxonomy to 36+ root categories, most provider slugs now match directly in step 1. Aliases handle the remainder (e.g., general → current-events, health → science).
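The resolution order can be sketched in TypeScript. This is illustrative only — the real resolveTopicCategory() queries Postgres via Drizzle, and the in-memory types here are assumptions:

```typescript
// Illustrative sketch of the three-tier resolution order. Not the actual
// resolveTopicCategory() implementation, which runs against the database.
type Category = { id: string; slug: string };
type Alias = { externalSlug: string; provider: string | null; topicCategoryId: string };

function resolveCategory(
  slug: string,
  provider: string,
  categories: Category[],
  aliases: Alias[],
): string | null {
  // Tier 1: exact slug match in topic_categories (fastest path)
  const exact = categories.find((c) => c.slug === slug);
  if (exact) return exact.id;

  // Tier 2: provider-specific alias
  const scoped = aliases.find((a) => a.externalSlug === slug && a.provider === provider);
  if (scoped) return scoped.topicCategoryId;

  // Tier 3: universal alias (NULL provider)
  const universal = aliases.find((a) => a.externalSlug === slug && a.provider === null);
  if (universal) return universal.topicCategoryId;

  // Unresolved: caller logs to unmapped_category_log (fire-and-forget)
  return null;
}
```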
Alias Table: topic_category_aliases
| Column | Type | Purpose |
|---|---|---|
| id | UUID | Primary key |
| external_slug | text | The provider's category name (e.g., "general", "nation") |
| provider | text | Provider name ("gnews", "newsapi") or NULL for universal aliases |
| topic_category_id | UUID | FK to topic_categories — the resolved internal category |
| created_at | timestamptz | When the alias was created |
Unique constraint: The unique index uses COALESCE(provider, '__universal__') to collapse NULL providers onto a sentinel value. This enforces at most one universal alias per slug (plain unique constraints treat NULLs as distinct, which would permit duplicates), while still allowing a provider-specific alias and a universal alias to coexist for the same slug.
Seeded Aliases
Universal (any provider): general → current-events, health → science, tech → technology, politics → governments, world → current-events, food → food-beverage, lifestyle → culture.
GNews-specific: breaking-news → current-events, nation → current-events.
Static Category Mapper (Seeding Pipeline)
For file-based and curated seeding, scripts/seed/lib/category-mapper.ts provides a complementary static mapping layer:
- mapRecordToCategory() — Path-based pattern matching (e.g., brainsie/entries/entertainment → entertainment)
- normalizeCategorySlug() — Handles common aliases (automotive → auto, events → history)
- classifyRichness() — Topic-aware richness tier heuristics (entertainment → high, design → low)
The static mapper handles ~80% of seeded records. Remaining unmapped records fall back to AI batch classification via batchMapCategories().
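A minimal sketch of the static-alias step, assuming a plain lookup table (the alias pairs are taken from the list above; the normalization details are hypothetical, not the actual scripts/seed/lib/category-mapper.ts implementation):

```typescript
// Sketch of normalizeCategorySlug-style alias handling (illustrative only).
const SLUG_ALIASES: Record<string, string> = {
  automotive: "auto",
  events: "history",
};

function normalizeCategorySlug(raw: string): string {
  // Lowercase and kebab-case the incoming slug, then apply known aliases
  const slug = raw.trim().toLowerCase().replace(/\s+/g, "-");
  return SLUG_ALIASES[slug] ?? slug;
}
```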
Taxonomy Reconciliation (Migration 0127)
Migration 0127 resolved conflicts between the DB state and CATEGORY_SPECS:
- Deactivated 4 orphan roots: accounting, marketing, spelling-grammar, things (no seed entries, poor fit)
- Merged statistical-records → records: Reassigned all fact_records, then deactivated the source category
- Propagated challenge formats: Cross-joined format IDs from the original 7 roots to all active expansion roots lacking format links
Depth-Bounded Queries
getActiveTopicCategories() and getActiveTopicCategoriesWithSchemas() accept an optional maxDepth parameter. Cron routes use maxDepth: 0 to dispatch only root-level categories, preventing quota explosion as subcategories are added.
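The depth bound amounts to a filter like the following sketch (types assumed; the real queries push the predicate into SQL rather than filtering in memory):

```typescript
type TopicCategory = { slug: string; depth: number; isActive: boolean };

// maxDepth undefined → all active categories; maxDepth: 0 → root categories only
function filterByDepth(cats: TopicCategory[], maxDepth?: number): TopicCategory[] {
  return cats.filter((c) => c.isActive && (maxDepth === undefined || c.depth <= maxDepth));
}
```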
Audit: Unmapped Categories
The unmapped_category_log table captures every slug that fails resolution:
-- Check for unmapped categories (helps identify needed aliases)
SELECT external_slug, provider, COUNT(*) AS occurrences
FROM unmapped_category_log
GROUP BY external_slug, provider
ORDER BY occurrences DESC;
When a slug appears frequently, add it to topic_category_aliases (via migration or direct insert with service_role).
Stage 1: Source Content
Option A: File-Based Seeding
Place legacy content files in .notes/seeding-folder/ (gitignored). The parser supports XLSX, DOCX, and CSV.
Directory structure:
.notes/seeding-folder/
├── brainsie/
│ └── entries/
│ ├── entertainment/ # XLSX with "Card Name" column
│ ├── sports/
│ ├── animals/
│ └── ...
├── jon@sportsformat.com/
│ ├── sf entries/
│ └── events/
├── events/
└── [any custom files]
Content profiles (scripts/seed/lib/content-profiles.ts) map file paths to parsing rules:
| Profile Field | Purpose | Example |
|---|---|---|
| filePattern | Glob matching the file path | brainsie/entries/entertainment |
| titleColumn | Column containing the entry name | Card Name |
| categoryOverride | Force a topic category | entertainment |
| richnessTierHint | Controls fact count per entry | high (50-100 facts) |
| descriptionColumns | Additional context columns | ['Card Description'] |
| tagColumns | Tag/label columns | ['Labels'] |
First matching profile wins. Unknown files fall back to generic extraction.
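First-match-wins can be sketched like this (substring matching stands in for the real glob logic, which is an assumption here):

```typescript
type ContentProfile = {
  filePattern: string;
  titleColumn: string;
  categoryOverride?: string;
  richnessTierHint?: "high" | "medium" | "low";
};

// Returns the first profile whose pattern matches the file path;
// undefined means the file falls back to generic extraction.
function matchProfile(path: string, profiles: ContentProfile[]): ContentProfile | undefined {
  return profiles.find((p) => path.includes(p.filePattern));
}
```

Because the first match wins, more specific patterns must be listed before broader ones.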
Parse command:
# Preview without DB writes
bun scripts/seed/seed-from-files.ts --parse --dry-run
# Insert entries into seed_entry_queue
bun scripts/seed/seed-from-files.ts --parse
Option B: AI-Curated Entry Generation
Skip files entirely. The curated entries script uses CATEGORY_SPECS (defined in scripts/seed/generate-curated-entries.ts, lines 40-728) to AI-generate notable entity names across 40+ topic categories.
Category coverage:
| Domain | Example Subcategories | Entries per Subcategory |
|---|---|---|
| Entertainment | History, Genres, Albums, Films, TV | 50-200 |
| Sports | Soccer Legends, NFL History, NBA, MLB | 50-200 |
| Science | Physics & Space, Biology, Chemistry | 50-150 |
| Geography | Natural Wonders, Famous Cities, Islands | 50-150 |
| History | Ancient Civilizations, Medieval, Modern | 50-200 |
| Culture | Religions, Festivals, Languages | 50-150 |
| People | World Leaders, Scientists, Activists | 50-200 |
Plus: animals, art, design, fashion, food-beverage, cooking, nature, space, games, travel, finance, math, publishing, places, home-living, geology, events, governments, human-achievement, how-things-work, countries.
Commands:
# Generate entry names (preview)
bun scripts/seed/generate-curated-entries.ts
# Generate and insert into seed_entry_queue
bun scripts/seed/generate-curated-entries.ts --insert
Stage 2: Dispatch to Workers
The seed_entry_queue table holds entries waiting to be "exploded" into individual facts.
Schema:
| Column | Type | Purpose |
|---|---|---|
| id | UUID | Primary key |
| name | text | Entry name (e.g., "Julius Caesar") |
| topic_category_id | UUID | FK to topic_categories |
| richness_tier | enum | high / medium / low — controls fact output volume |
| source_type | text | file_parse / ai_super_fact / manual |
| status | enum | pending / processing / completed / failed |
| batch_id | UUID | Groups entries from the same parse run |
| facts_generated | int | Counter updated after explosion |
| spinoffs_discovered | int | Counter for discovered related entities |
| parent_entry_id | UUID | For spinoff entries — links to parent |
| relationship | text | How a spinoff relates to its parent |
Dispatch command:
# Recommended: fast batch dispatch using Redis pipeline (~60x faster)
bun scripts/seed/bulk-enqueue.ts
# Alternative: one-by-one dispatch (slower, use for debugging)
bun scripts/seed/seed-from-files.ts --explode --batch-size 500
bulk-enqueue.ts queries pending entries in pages and creates EXPLODE_CATEGORY_ENTRY messages in chunks of 500 using enqueueMany().
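The chunking itself is a simple slice loop, sketched below (enqueueMany()'s actual signature isn't shown in this guide, so the surrounding call is assumed):

```typescript
// Split pending entries into fixed-size chunks (500 in bulk-enqueue.ts)
// so each enqueueMany() call carries one Redis pipeline batch.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}
```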
Stage 3: AI Fact Explosion
Workers consume EXPLODE_CATEGORY_ENTRY messages from the Upstash Redis queue.
Handler: apps/worker-facts/src/handlers/explode-entry.ts
What happens per entry:
- Loads the entry name + topic category schema from DB
- Calls AI (currently gpt-5-mini via ModelAdapter) with structured output
- AI generates 10-100 individual facts, controlled by richness tier:
- high: 50-100 facts (entertainment, sports, famous people)
- medium: 20-50 facts (geography, science, animals)
- low: 10-20 facts (business, design, fashion)
- Each fact includes: title, challenge_title, context, notability_score, and structured key-value pairs
- Facts are batch-inserted into fact_records with source_type = 'file_seed'
- AI may discover spinoffs — related entities (e.g., exploding "Ancient Egypt" discovers "Cleopatra")
- Spinoffs are inserted back into seed_entry_queue as new pending entries
Deduplication: Before inserting, the handler calls getExistingTitlesForTopic() to avoid duplicate titles within the same topic category.
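A sketch of that dedup step (case-insensitive matching is an assumption — the actual comparison rule lives in the handler):

```typescript
// Drop generated facts whose titles already exist in the topic category,
// and also dedupe within the generated batch itself.
function dedupeFacts<T extends { title: string }>(facts: T[], existingTitles: string[]): T[] {
  const seen = new Set(existingTitles.map((t) => t.toLowerCase()));
  return facts.filter((f) => {
    const key = f.title.toLowerCase();
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```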
Running workers:
# Single worker
bun run dev:worker-facts
# High-throughput: multiple workers with concurrency
WORKER_CONCURRENCY=10 PORT=4010 bun run dev:worker-facts
WORKER_CONCURRENCY=10 PORT=4011 bun run dev:worker-facts
# Dual API key setup for 2x rate limits
OPENAI_API_KEY=key1 WORKER_CONCURRENCY=10 PORT=4010 bun run dev:worker-facts
OPENAI_API_KEY=key2 WORKER_CONCURRENCY=10 PORT=4020 bun run dev:worker-facts
Throughput:
| Configuration | Entries/hour | ETA (20K entries) |
|---|---|---|
| 1 worker, sequential | ~70 | ~285 hours |
| 3 workers x 3 concurrency | ~720 | ~28 hours |
| 5 workers x 10 concurrency | ~1,440 | ~14 hours |
| 10 workers x 10 (2 keys) | ~2,400 | ~8 hours |
Cost: ~$0.002 per entry with gpt-5-mini. Full 20K corpus: ~$40.
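The table's figures follow from simple division; a sketch of the estimator (the throughput rates above are rough benchmarks, not guarantees):

```typescript
// ETA and cost from observed throughput and per-entry AI cost
function estimate(entries: number, entriesPerHour: number, costPerEntry: number) {
  return {
    hours: Math.round(entries / entriesPerHour),
    usd: entries * costPerEntry,
  };
}
```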
Spinoff Processing
Spinoffs create a recursive expansion loop:
Entry: "Ancient Egypt"
→ Explodes into 80 facts
→ Discovers spinoffs: "Cleopatra", "Tutankhamun", "Rosetta Stone"
→ Each spinoff re-enters seed_entry_queue
→ Each gets exploded into its own 30-80 facts
→ May discover further spinoffs
Process spinoffs with:
bun scripts/seed/seed-from-files.ts --explode-spinoffs
Spinoff category inheritance: Spinoffs inherit topic_category_id from their parent entry. If missing (from an older pipeline version), fix with:
UPDATE seed_entry_queue child
SET topic_category_id = parent.topic_category_id
FROM seed_entry_queue parent
WHERE child.parent_entry_id = parent.id
AND child.topic_category_id IS NULL
AND parent.topic_category_id IS NOT NULL;
Stage 4: Fact Validation
After explosion, facts go through the validation pipeline.
Handler: apps/worker-validate/src/handlers/validate-fact.ts
Facts are validated against multiple tiers (authoritative APIs, AI cross-check). Upon successful validation, the handler enqueues:
- RESOLVE_IMAGE — find a representative image
- GENERATE_CHALLENGE_CONTENT — create interactive quiz content
bun run dev:worker-validate
Stage 5: Challenge Content Generation
The challenge content script pre-generates interactive quiz content for every validated fact.
Script: scripts/seed/generate-challenge-content.ts
Six Challenge Styles
| Style | Description | Key style_data Fields |
|---|---|---|
| multiple_choice | Traditional 4-option quiz | options (array of 4 choices) |
| fill_the_gap | Sentence with blank(s) | blank_answer |
| direct_question | Free-text answer question | — |
| statement_blank | Fill-in-the-blank variant | blank_answer |
| reverse_lookup | Given answer, find the question | distractors |
| free_text | Open-ended response | — |
Two additional styles exist but are runtime-only (not pre-generated):
- conversational — multi-turn dialogue
- progressive_image_reveal — image-based reveals
Four-Layer Structure (Mandatory per CC-002)
Every pre-generated challenge contains:
- setup_text — 2+ sentences of context (shared before asking the question)
- challenge_text — The invitation to engage (must address "you"/"your" per CQ-002)
- reveal_correct — Celebration text (shown when user answers correctly)
- reveal_wrong — Teaching moment (shown when user answers incorrectly)
- correct_answer — Rich 3-6 sentence narrative for animated streaming display (per CQ-008)
Database Table: fact_challenge_content
| Column | Type | Purpose |
|---|---|---|
| id | UUID | Primary key |
| fact_record_id | UUID | FK to fact_records |
| challenge_style | enum | One of the 6 styles above |
| setup_text | text | Context layer |
| challenge_text | text | Question/prompt layer |
| reveal_correct | text | Correct answer feedback |
| reveal_wrong | text | Incorrect answer feedback |
| correct_answer | text | Narrative answer for streaming display |
| style_data | JSONB | Style-specific data (options, blank_answer, etc.) |
| target_fact_key | text | Which fact key this challenge targets |
| difficulty | int | 1-5 (currently only level 1 generated) |
| ai_model | text | Model used for generation |
| generation_cost_usd | numeric | Cost tracking |
Generation Workflow
# 1. Audit current coverage
bun scripts/seed/generate-challenge-content.ts --audit
# 2. Export facts needing content
bun scripts/seed/generate-challenge-content.ts --export # Facts with no content
bun scripts/seed/generate-challenge-content.ts --export-all # ALL validated facts (for regeneration)
# 3. Generate (supports partitioned parallelism)
bun scripts/seed/generate-challenge-content.ts --generate
# Parallel generation with 8 partitions:
for i in 1 2 3 4 5 6 7 8; do
bun scripts/seed/generate-challenge-content.ts \
--generate --partition $i/8 --output-suffix regen-p$i --concurrency 5 &
done
wait
# 4. Upload to database (upsert — overwrites existing)
bun scripts/seed/generate-challenge-content.ts --upload
# 5. Validate quality
bun scripts/seed/generate-challenge-content.ts --validate
# 6. Recover weak content (optional)
bun scripts/seed/generate-challenge-content.ts --recover
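One plausible reading of --partition i/n is an index-based slice over the exported facts, sketched below (the flag's real partitioning scheme isn't documented here, so this is an assumption):

```typescript
// Deterministic 1-based partitioning: worker i of n takes every n-th item.
// Partitions are disjoint and together cover the full export.
function partitionSlice<T>(items: T[], part: number, total: number): T[] {
  return items.filter((_, idx) => idx % total === part - 1);
}
```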
JSONL Pipeline Architecture
All AI output writes to local .jsonl files first, then bulk-uploads to DB:
scripts/seed/.challenge-data/
├── facts-export.jsonl # Exported facts needing content
├── challenges-generated.jsonl # Generated content (single partition)
├── challenges-generated-regen-p1.jsonl # Partition 1 output
├── challenges-generated-regen-p2.jsonl # Partition 2 output
└── ...
Resume-safe: The --generate phase scans all challenges-generated*.jsonl files to find already-processed fact IDs. Interrupted runs restart without re-processing.
Upsert upload: onConflictDoUpdate on (fact_record_id, challenge_style, target_fact_key, difficulty) means regenerated content overwrites old content automatically.
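The resume logic reduces to a set difference, sketched here (the JSONL scanning and field names are per the description above; the helper itself is hypothetical):

```typescript
// Skip fact IDs already present in any challenges-generated*.jsonl file,
// so an interrupted --generate run restarts without re-processing.
function remainingFactIds(exported: string[], processed: Iterable<string>): string[] {
  const done = new Set(processed);
  return exported.filter((id) => !done.has(id));
}
```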
Stage 6: Content Cleanup
Full-corpus rewrite of titles, challenge_title, context, notability_score, and notability_reason.
Script: scripts/seed/cleanup-content.ts
# 1. Audit corpus quality baseline
bun scripts/seed/cleanup-content.ts --audit
# 2. Export all facts to local JSONL
bun scripts/seed/cleanup-content.ts --export
# 3. AI rewrite (5 concurrent batches of 20 facts each)
bun scripts/seed/cleanup-content.ts --fix --concurrency 5
# Preview first:
bun scripts/seed/cleanup-content.ts --fix --dry-run --limit 20
# 4. Bulk upload rewrites
bun scripts/seed/cleanup-content.ts --upload
# 5. Validate improvements
bun scripts/seed/cleanup-content.ts --validate
Like challenge generation, cleanup uses local JSONL files for crash-resilience and resume-safety.
Quality Rules
Challenge content quality is governed by rules in docs/rules/challenge-content.md and enforced at multiple layers.
Key Rules
| Rule | Requirement | Enforcement |
|---|---|---|
| CC-001 | Every published fact has content for >= 3 of 6 styles | Audit script |
| CC-002 | Four-layer structure (setup, challenge, reveal_correct, reveal_wrong) | Zod schema at generation time |
| CC-004 | Algorithmic fallback when pre-generated content absent | Frontend code path |
| CQ-002 | challenge_text must contain "you" or "your" | Three-layer: prompt instruction, generation-time regex filter, post-upload sampling |
| CQ-008 | correct_answer is a 3-6 sentence narrative | Prompt instruction + schema validation |
Three-Layer CQ-002 Enforcement
- Prompt-level: AI prompt explicitly instructs second-person address with examples
- Generation-time filter: Regex /\byou(r|rs|rself)?\b/i drops non-compliant output before writing to JSONL
- Post-upload validation: --validate phase samples rows and reports CQ-002 pass rate
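The generation-time filter reduces to a single regex test, using the pattern quoted above:

```typescript
// CQ-002: challenge_text must address the reader in second person.
// Matches "you", "your", "yours", "yourself" as whole words, case-insensitive.
const CQ002 = /\byou(r|rs|rself)?\b/i;

function passesCq002(challengeText: string): boolean {
  return CQ002.test(challengeText);
}
```

The word boundaries matter: incidental substrings like "youth" or "bayou" do not pass.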
Monitoring & Diagnostics
Pipeline Dashboard
bun scripts/seed/seed-from-files.ts --stats
Shows: entries per topic, completion status, facts generated, spinoffs discovered.
Key SQL Queries
-- Entry status distribution
SELECT status, COUNT(*) FROM seed_entry_queue GROUP BY status;
-- Facts by source type
SELECT source_type, COUNT(*) FROM fact_records GROUP BY source_type;
-- Challenge content coverage
SELECT
(SELECT COUNT(DISTINCT fact_record_id) FROM fact_challenge_content) AS covered_facts,
(SELECT COUNT(*) FROM fact_records WHERE status = 'validated') AS total_validated;
-- Challenge content by style
SELECT challenge_style, COUNT(*) FROM fact_challenge_content GROUP BY challenge_style;
-- Total AI spend
SELECT SUM(cost_usd::numeric) FROM ai_cost_log WHERE purpose LIKE '%seed%';
-- Unmapped category audit (find slugs needing aliases)
SELECT external_slug, provider, COUNT(*) AS drops
FROM unmapped_category_log
GROUP BY external_slug, provider
ORDER BY drops DESC
LIMIT 20;
-- Active topic categories by depth
SELECT depth, COUNT(*) AS active
FROM topic_categories
WHERE is_active = true
GROUP BY depth
ORDER BY depth;
Cost Summary
| Operation | Model | Cost per Unit | Corpus Estimate |
|---|---|---|---|
| Fact explosion | gpt-5-mini | ~$0.002/entry | ~$40 (20K entries) |
| Challenge content gen | gpt-5-mini | ~$0.0006/fact | ~$85 (144K facts) |
| Content cleanup | gpt-5-mini | ~$0.0004/fact | ~$55 (144K facts) |
| Total pipeline | | | ~$180 |
File Reference
| File | Purpose |
|---|---|
| scripts/seed/seed-from-files.ts | Main CLI orchestrator (parse, explode, stats) |
| scripts/seed/bulk-enqueue.ts | Fast batch dispatch to Redis |
| scripts/seed/generate-curated-entries.ts | AI-generated entry names from CATEGORY_SPECS |
| scripts/seed/generate-challenge-content.ts | Challenge content generation (6 styles per fact) |
| scripts/seed/cleanup-content.ts | Full-corpus title/context rewrite |
| scripts/seed/backfill-fact-nulls.ts | Fill missing columns in fact_records |
| scripts/seed/lib/content-profiles.ts | File path → parsing rules mapping |
| scripts/seed/lib/category-mapper.ts | Static path-based category mapping + richness classification |
| scripts/seed/lib/parsers/ | XLSX, DOCX, CSV file parsers |
| apps/worker-facts/src/handlers/explode-entry.ts | Worker: AI fact explosion |
| apps/worker-facts/src/handlers/import-facts.ts | Worker: batch fact insertion |
| apps/worker-facts/src/handlers/generate-challenge-content.ts | Worker: queue-triggered challenge gen |
| apps/worker-validate/src/handlers/validate-fact.ts | Worker: multi-tier fact validation |
| packages/ai/src/challenge-content.ts | AI challenge generation function |
| packages/ai/src/challenge-content-rules.ts | Validation rules, style constants |
| packages/ai/src/seed-explosion.ts | AI fact explosion function |
| packages/db/src/drizzle/fact-engine-queries.ts | resolveTopicCategory(), depth-bounded category queries |
Database Tables
| Table | Purpose |
|---|---|
| seed_entry_queue | Work queue for entries pending explosion |
| fact_records | Generated facts (source_type = 'file_seed' or 'ai_super_fact') |
| fact_challenge_content | Pre-generated challenge content (6 styles per fact) |
| fact_record_schemas | Schema definitions per topic category |
| topic_categories | Canonical topic categories (depth 0-2); 5 deactivated after reconciliation |
| topic_category_aliases | Maps external provider slugs to internal categories (three-tier resolution) |
| unmapped_category_log | Audit trail for provider slugs that failed resolution |
| super_fact_links | Cross-entry correlations |
| ai_cost_log | AI spend tracking |
Related Documents
- Seed Pipeline README — Architecture overview and component map
- Seed Pipeline Runbook — Step-by-step operational procedures
- Model Evaluation — LLM comparison for seeding tasks
- Taxonomy Expansion — GTD project: expanding topic categories
- Backfill Fact Nulls — GTD project: fixing NULL columns
- Frontend Challenge Content — GTD project: UI integration
- Taxonomy Coherence — GTD project: alias resolution and audit logging
- Challenge Content Rules — Quality rules (CC-001 through CC-009, CQ-001 through CQ-008)
- Seeding TODO — Active work tracker