Manual Seeding Guide — Non-News Challenge Pipeline

Comprehensive reference for how Eko creates interactive challenge content from non-news sources. Covers the full pipeline: file parsing, AI-powered fact explosion, challenge content generation, content cleanup, and quality enforcement.

Overview

The manual seeding pipeline transforms legacy content files (XLSX, DOCX, CSV) and AI-curated entity lists into structured, quiz-ready facts. Unlike the news pipeline (which ingests current articles via cron), manual seeding is operator-driven and batch-oriented.

┌────────────────────────────────────────────────────────────────────┐
│                      Manual Seeding Pipeline                       │
│                                                                    │
│  Source Files          seed_entry_queue          fact_records      │
│  (XLSX/DOCX/CSV)  ──▶  (DB work queue)    ──▶   (structured facts) │
│        OR                     │                        │           │
│  AI-Curated Entries         Redis              fact_challenge_     │
│  (generate-curated)   (EXPLODE_CATEGORY        content             │
│                        _ENTRY messages)     (6 quiz styles/fact)   │
└────────────────────────────────────────────────────────────────────┘

Pipeline Stages

| Stage | Script / Component | Input | Output |
| --- | --- | --- | --- |
| 1. Parse | seed-from-files.ts --parse | Legacy files | seed_entry_queue rows |
| 1b. Curate | generate-curated-entries.ts | AI + category specs | seed_entry_queue rows |
| 2. Dispatch | bulk-enqueue.ts | Pending queue entries | Redis messages |
| 3. Explode | worker-facts (explode-entry handler) | Redis messages | fact_records + spinoffs |
| 4. Validate | worker-validate | Pending facts | Validated facts |
| 5. Challenge Gen | generate-challenge-content.ts | Validated facts | fact_challenge_content |
| 6. Cleanup | cleanup-content.ts | All facts | Rewritten titles/context |

Taxonomy Resolution

Category mapping is a cross-cutting concern that affects file parsing (Stage 1), curated entry generation (Stage 1b), and news ingestion. The system uses a three-tier resolution strategy to map external category slugs to internal topic_categories rows.

Resolution Order

The resolveTopicCategory() function (packages/db/src/drizzle/fact-engine-queries.ts) resolves slugs in priority order:

  1. Exact slug match — Direct lookup in topic_categories (fastest path)
  2. Provider-specific alias — Matches (external_slug, provider) in topic_category_aliases
  3. Universal alias — Matches (external_slug, NULL provider) in topic_category_aliases
  4. Unresolved — Logs to unmapped_category_log for audit (fire-and-forget, non-blocking)
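The resolution order above can be sketched as a chain of lookups. The in-memory maps below stand in for the real topic_categories and topic_category_aliases queries; this is an illustration of the priority order, not the actual Drizzle code:

```typescript
// Illustrative sketch of the three-tier resolution. The lookup maps stand in
// for the real topic_categories / topic_category_aliases tables.
type CategoryId = string;

interface TaxonomyIndex {
  bySlug: Map<string, CategoryId>;   // topic_categories.slug -> id
  aliases: Map<string, CategoryId>;  // keyed "slug|provider" or "slug|__universal__"
}

function resolveTopicCategory(
  idx: TaxonomyIndex,
  externalSlug: string,
  provider: string | null,
): CategoryId | null {
  // 1. Exact slug match (fastest path)
  const exact = idx.bySlug.get(externalSlug);
  if (exact) return exact;

  // 2. Provider-specific alias
  if (provider) {
    const scoped = idx.aliases.get(`${externalSlug}|${provider}`);
    if (scoped) return scoped;
  }

  // 3. Universal alias (NULL provider, stored under the sentinel key)
  const universal = idx.aliases.get(`${externalSlug}|__universal__`);
  if (universal) return universal;

  // 4. Unresolved: the real code also fire-and-forgets a row into unmapped_category_log
  return null;
}
```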

After migration 0101 expanded the taxonomy to 36+ root categories, most provider slugs now match directly in step 1. Aliases handle the remainder (e.g., general → current-events, health → science).

Alias Table: topic_category_aliases

| Column | Type | Purpose |
| --- | --- | --- |
| id | UUID | Primary key |
| external_slug | text | The provider's category name (e.g., "general", "nation") |
| provider | text | Provider name ("gnews", "newsapi") or NULL for universal aliases |
| topic_category_id | UUID | FK to topic_categories — the resolved internal category |
| created_at | timestamptz | When the alias was created |

Unique constraint: Uses COALESCE(provider, '__universal__') so NULL provider is treated as a distinct value, allowing both provider-specific and universal aliases for the same slug.

Seeded Aliases

Universal (any provider): general → current-events, health → science, tech → technology, politics → governments, world → current-events, food → food-beverage, lifestyle → culture.

GNews-specific: breaking-news → current-events, nation → current-events.

Static Category Mapper (Seeding Pipeline)

For file-based and curated seeding, scripts/seed/lib/category-mapper.ts provides a complementary static mapping layer:

  • mapRecordToCategory() — Path-based pattern matching (e.g., brainsie/entries/entertainment → entertainment)
  • normalizeCategorySlug() — Handles common aliases (automotive → auto, events → history)
  • classifyRichness() — Topic-aware richness tier heuristics (entertainment → high, design → low)

The static mapper handles ~80% of seeded records. Remaining unmapped records fall back to AI batch classification via batchMapCategories().
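The alias-normalization step can be sketched as a static lookup. Only the two alias pairs named above come from the real mapper; the whitespace-to-hyphen normalization is an assumption about its behavior:

```typescript
// Sketch of a static alias table like the one behind normalizeCategorySlug().
// Only automotive -> auto and events -> history are from the documented mapper;
// the normalization rules are illustrative.
const SLUG_ALIASES: Record<string, string> = {
  automotive: "auto",
  events: "history",
};

function normalizeCategorySlug(raw: string): string {
  // Lowercase and hyphenate so "Food Beverage" and "food-beverage" collide.
  const slug = raw.trim().toLowerCase().replace(/\s+/g, "-");
  return SLUG_ALIASES[slug] ?? slug;
}
```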

Taxonomy Reconciliation (Migration 0127)

Migration 0127 resolved conflicts between the DB state and CATEGORY_SPECS:

  • Deactivated 4 orphan roots: accounting, marketing, spelling-grammar, things (no seed entries, poor fit)
  • Merged statistical-records → records: Reassigned all fact_records, then deactivated the source category
  • Propagated challenge formats: Cross-joined format IDs from the original 7 roots to all active expansion roots lacking format links

Depth-Bounded Queries

getActiveTopicCategories() and getActiveTopicCategoriesWithSchemas() accept an optional maxDepth parameter. Cron routes use maxDepth: 0 to dispatch only root-level categories, preventing quota explosion as subcategories are added.
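A minimal sketch of what the maxDepth bound does, using a simplified row shape (the real functions run this filter in SQL):

```typescript
// Simplified row shape; the real topic_categories table has more columns.
interface TopicCategory {
  slug: string;
  depth: number;     // 0 = root, 1-2 = subcategories
  isActive: boolean;
}

// Illustrative equivalent of getActiveTopicCategories({ maxDepth }):
// omitting maxDepth returns all active rows, maxDepth: 0 returns roots only.
function activeCategories(rows: TopicCategory[], maxDepth?: number): TopicCategory[] {
  return rows.filter(
    (r) => r.isActive && (maxDepth === undefined || r.depth <= maxDepth),
  );
}
```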

Audit: Unmapped Categories

The unmapped_category_log table captures every slug that fails resolution:

-- Check for unmapped categories (helps identify needed aliases)
SELECT external_slug, provider, COUNT(*) AS occurrences
FROM unmapped_category_log
GROUP BY external_slug, provider
ORDER BY occurrences DESC;

When a slug appears frequently, add it to topic_category_aliases (via migration or direct insert with service_role).


Stage 1: Source Content

Option A: File-Based Seeding

Place legacy content files in .notes/seeding-folder/ (gitignored). The parser supports XLSX, DOCX, and CSV.

Directory structure:

.notes/seeding-folder/
├── brainsie/
│   └── entries/
│       ├── entertainment/   # XLSX with "Card Name" column
│       ├── sports/
│       ├── animals/
│       └── ...
├── jon@sportsformat.com/
│   ├── sf entries/
│   └── events/
├── events/
└── [any custom files]

Content profiles (scripts/seed/lib/content-profiles.ts) map file paths to parsing rules:

| Profile Field | Purpose | Example |
| --- | --- | --- |
| filePattern | Glob matching the file path | brainsie/entries/entertainment |
| titleColumn | Column containing the entry name | Card Name |
| categoryOverride | Force a topic category | entertainment |
| richnessTierHint | Controls fact count per entry | high (50-100 facts) |
| descriptionColumns | Additional context columns | ['Card Description'] |
| tagColumns | Tag/label columns | ['Labels'] |

First matching profile wins. Unknown files fall back to generic extraction.
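The "first matching profile wins" rule can be sketched as an ordered scan. The first profile below mirrors the example row in the table; the second profile and the substring-based matching are hypothetical stand-ins for the real glob logic:

```typescript
// Minimal profile lookup. Field names mirror content-profiles.ts; the
// substring match stands in for real glob matching and is an assumption.
interface ContentProfile {
  filePattern: string;       // fragment matched against the file path
  titleColumn: string;
  categoryOverride?: string;
}

const PROFILES: ContentProfile[] = [
  { filePattern: "brainsie/entries/entertainment", titleColumn: "Card Name", categoryOverride: "entertainment" },
  { filePattern: "events", titleColumn: "Title" }, // hypothetical second profile
];

// First matching profile wins; undefined means "fall back to generic extraction".
function profileForPath(path: string): ContentProfile | undefined {
  return PROFILES.find((p) => path.includes(p.filePattern));
}
```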

Parse command:

# Preview without DB writes
bun scripts/seed/seed-from-files.ts --parse --dry-run

# Insert entries into seed_entry_queue
bun scripts/seed/seed-from-files.ts --parse

Option B: AI-Curated Entry Generation

Skip files entirely. The curated entries script uses CATEGORY_SPECS (defined in scripts/seed/generate-curated-entries.ts, lines 40-728) to AI-generate notable entity names across 40+ topic categories.

Category coverage:

| Domain | Example Subcategories | Entries per Subcategory |
| --- | --- | --- |
| Entertainment | History, Genres, Albums, Films, TV | 50-200 |
| Sports | Soccer Legends, NFL History, NBA, MLB | 50-200 |
| Science | Physics & Space, Biology, Chemistry | 50-150 |
| Geography | Natural Wonders, Famous Cities, Islands | 50-150 |
| History | Ancient Civilizations, Medieval, Modern | 50-200 |
| Culture | Religions, Festivals, Languages | 50-150 |
| People | World Leaders, Scientists, Activists | 50-200 |

Plus: animals, art, design, fashion, food-beverage, cooking, nature, space, games, travel, finance, math, publishing, places, home-living, geology, events, governments, human-achievement, how-things-work, countries.

Commands:

# Generate entry names (preview)
bun scripts/seed/generate-curated-entries.ts

# Generate and insert into seed_entry_queue
bun scripts/seed/generate-curated-entries.ts --insert

Stage 2: Dispatch to Workers

The seed_entry_queue table holds entries waiting to be "exploded" into individual facts.

Schema:

| Column | Type | Purpose |
| --- | --- | --- |
| id | UUID | Primary key |
| name | text | Entry name (e.g., "Julius Caesar") |
| topic_category_id | UUID | FK to topic_categories |
| richness_tier | enum | high / medium / low — controls fact output volume |
| source_type | text | file_parse / ai_super_fact / manual |
| status | enum | pending / processing / completed / failed |
| batch_id | UUID | Groups entries from the same parse run |
| facts_generated | int | Counter updated after explosion |
| spinoffs_discovered | int | Counter for discovered related entities |
| parent_entry_id | UUID | For spinoff entries — links to parent |
| relationship | text | How a spinoff relates to its parent |

Dispatch command:

# Recommended: fast batch dispatch using Redis pipeline (~60x faster)
bun scripts/seed/bulk-enqueue.ts

# Alternative: one-by-one dispatch (slower, use for debugging)
bun scripts/seed/seed-from-files.ts --explode --batch-size 500

bulk-enqueue.ts queries pending entries in pages and creates EXPLODE_CATEGORY_ENTRY messages in chunks of 500 using enqueueMany().
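The chunked dispatch can be sketched as follows. enqueueMany() and the EXPLODE_CATEGORY_ENTRY message type come from the doc; the chunking helper and message shape are illustrative:

```typescript
// Split a page of pending entry IDs into chunks of 500 and hand each chunk
// to enqueueMany(). The message payload shape is an assumption.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

async function dispatchPending(
  entryIds: string[],
  enqueueMany: (msgs: { type: "EXPLODE_CATEGORY_ENTRY"; entryId: string }[]) => Promise<void>,
): Promise<void> {
  for (const batch of chunk(entryIds, 500)) {
    // One Redis pipeline call per 500 messages, instead of 500 round trips.
    await enqueueMany(batch.map((entryId) => ({ type: "EXPLODE_CATEGORY_ENTRY" as const, entryId })));
  }
}
```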


Stage 3: AI Fact Explosion

Workers consume EXPLODE_CATEGORY_ENTRY messages from the Upstash Redis queue.

Handler: apps/worker-facts/src/handlers/explode-entry.ts

What happens per entry:

  1. Loads the entry name + topic category schema from DB
  2. Calls AI (currently gpt-5-mini via ModelAdapter) with structured output
  3. AI generates 10-100 individual facts, controlled by richness tier:
    • high: 50-100 facts (entertainment, sports, famous people)
    • medium: 20-50 facts (geography, science, animals)
    • low: 10-20 facts (business, design, fashion)
  4. Each fact includes: title, challenge_title, context, notability_score, and structured key-value pairs
  5. Facts are batch-inserted into fact_records with source_type = 'file_seed'
  6. AI may discover spinoffs — related entities (e.g., exploding "Ancient Egypt" discovers "Cleopatra")
  7. Spinoffs are inserted back into seed_entry_queue as new pending entries

Deduplication: Before inserting, the handler calls getExistingTitlesForTopic() to avoid duplicate titles within the same topic category.
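The dedup step can be sketched as a set-membership filter over the titles returned by getExistingTitlesForTopic(). Normalizing to trimmed lowercase before comparing is an assumption about how "duplicate" is defined:

```typescript
// Drop generated facts whose title already exists in this topic category,
// plus duplicates within the same AI batch. Case/whitespace normalization
// is an assumption, not confirmed behavior.
function dedupeNewFacts<T extends { title: string }>(
  generated: T[],
  existingTitles: string[],
): T[] {
  const seen = new Set(existingTitles.map((t) => t.trim().toLowerCase()));
  return generated.filter((fact) => {
    const key = fact.title.trim().toLowerCase();
    if (seen.has(key)) return false;
    seen.add(key); // also catches repeats inside the generated batch
    return true;
  });
}
```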

Running workers:

# Single worker
bun run dev:worker-facts

# High-throughput: multiple workers with concurrency
WORKER_CONCURRENCY=10 PORT=4010 bun run dev:worker-facts
WORKER_CONCURRENCY=10 PORT=4011 bun run dev:worker-facts

# Dual API key setup for 2x rate limits
OPENAI_API_KEY=key1 WORKER_CONCURRENCY=10 PORT=4010 bun run dev:worker-facts
OPENAI_API_KEY=key2 WORKER_CONCURRENCY=10 PORT=4020 bun run dev:worker-facts

Throughput:

| Configuration | Entries/hour | ETA (20K entries) |
| --- | --- | --- |
| 1 worker, sequential | ~70 | ~285 hours |
| 3 workers x 3 concurrency | ~720 | ~28 hours |
| 5 workers x 10 concurrency | ~1,440 | ~14 hours |
| 10 workers x 10 (2 keys) | ~2,400 | ~8 hours |

Cost: ~$0.002 per entry with gpt-5-mini. Full 20K corpus: ~$40.

Spinoff Processing

Spinoffs create a recursive expansion loop:

Entry: "Ancient Egypt"
  → Explodes into 80 facts
  → Discovers spinoffs: "Cleopatra", "Tutankhamun", "Rosetta Stone"
    → Each spinoff re-enters seed_entry_queue
      → Each gets exploded into its own 30-80 facts
        → May discover further spinoffs

Process spinoffs with:

bun scripts/seed/seed-from-files.ts --explode-spinoffs

Spinoff category inheritance: Spinoffs inherit topic_category_id from their parent entry. If missing (from an older pipeline version), fix with:

UPDATE seed_entry_queue child
SET topic_category_id = parent.topic_category_id
FROM seed_entry_queue parent
WHERE child.parent_entry_id = parent.id
AND child.topic_category_id IS NULL
AND parent.topic_category_id IS NOT NULL;

Stage 4: Fact Validation

After explosion, facts go through the validation pipeline.

Handler: apps/worker-validate/src/handlers/validate-fact.ts

Facts are validated against multiple tiers (authoritative APIs, AI cross-check). Upon successful validation, the handler enqueues:

  • RESOLVE_IMAGE — find a representative image
  • GENERATE_CHALLENGE_CONTENT — create interactive quiz content
Run the validation worker:

bun run dev:worker-validate

Stage 5: Challenge Content Generation

The challenge content script pre-generates interactive quiz content for every validated fact.

Script: scripts/seed/generate-challenge-content.ts

Six Challenge Styles

| Style | Description | Key style_data Fields |
| --- | --- | --- |
| multiple_choice | Traditional 4-option quiz | options (array of 4 choices) |
| fill_the_gap | Sentence with blank(s) | blank_answer |
| direct_question | Free-text answer question | |
| statement_blank | Fill-in-the-blank variant | blank_answer |
| reverse_lookup | Given answer, find the question | distractors |
| free_text | Open-ended response | |

Two additional styles exist but are runtime-only (not pre-generated):

  • conversational — multi-turn dialogue
  • progressive_image_reveal — image-based reveals

Four-Layer Structure (Mandatory per CC-002)

Every pre-generated challenge contains four presentation layers plus a narrative answer field:

  1. setup_text — 2+ sentences of context (shared before asking the question)
  2. challenge_text — The invitation to engage (must address "you"/"your" per CQ-002)
  3. reveal_correct — Celebration text (shown when user answers correctly)
  4. reveal_wrong — Teaching moment (shown when user answers incorrectly)
  5. correct_answer — Rich 3-6 sentence narrative for animated streaming display (per CQ-008)
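These requirements can be sketched as a plain validator. The real pipeline enforces the structure with a Zod schema at generation time; the hand-rolled checks and the sentence-counting heuristic below are assumptions made to keep the sketch self-contained:

```typescript
// Stand-in for the generation-time schema check. Field names come from
// CC-002 / CQ-002 / CQ-008; thresholds and sentence counting are heuristics.
interface ChallengeContent {
  setup_text: string;
  challenge_text: string;
  reveal_correct: string;
  reveal_wrong: string;
  correct_answer: string;
}

function sentenceCount(text: string): number {
  // Count terminators followed by whitespace or end-of-string.
  return (text.match(/[.!?](\s|$)/g) ?? []).length;
}

function validateChallenge(c: ChallengeContent): string[] {
  const errors: string[] = [];
  if (sentenceCount(c.setup_text) < 2) errors.push("CC-002: setup_text needs 2+ sentences");
  if (!/\byou(r|rs|rself)?\b/i.test(c.challenge_text)) errors.push("CQ-002: challenge_text must address 'you'");
  if (!c.reveal_correct || !c.reveal_wrong) errors.push("CC-002: both reveal layers required");
  const n = sentenceCount(c.correct_answer);
  if (n < 3 || n > 6) errors.push("CQ-008: correct_answer must be 3-6 sentences");
  return errors;
}
```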

Database Table: fact_challenge_content

| Column | Type | Purpose |
| --- | --- | --- |
| id | UUID | Primary key |
| fact_record_id | UUID | FK to fact_records |
| challenge_style | enum | One of the 6 styles above |
| setup_text | text | Context layer |
| challenge_text | text | Question/prompt layer |
| reveal_correct | text | Correct answer feedback |
| reveal_wrong | text | Incorrect answer feedback |
| correct_answer | text | Narrative answer for streaming display |
| style_data | JSONB | Style-specific data (options, blank_answer, etc.) |
| target_fact_key | text | Which fact key this challenge targets |
| difficulty | int | 1-5 (currently only level 1 generated) |
| ai_model | text | Model used for generation |
| generation_cost_usd | numeric | Cost tracking |

Generation Workflow

# 1. Audit current coverage
bun scripts/seed/generate-challenge-content.ts --audit

# 2. Export facts needing content
bun scripts/seed/generate-challenge-content.ts --export       # Facts with no content
bun scripts/seed/generate-challenge-content.ts --export-all   # ALL validated facts (for regeneration)

# 3. Generate (supports partitioned parallelism)
bun scripts/seed/generate-challenge-content.ts --generate

# Parallel generation with 8 partitions:
for i in 1 2 3 4 5 6 7 8; do
  bun scripts/seed/generate-challenge-content.ts \
    --generate --partition $i/8 --output-suffix regen-p$i --concurrency 5 &
done
wait

# 4. Upload to database (upsert — overwrites existing)
bun scripts/seed/generate-challenge-content.ts --upload

# 5. Validate quality
bun scripts/seed/generate-challenge-content.ts --validate

# 6. Recover weak content (optional)
bun scripts/seed/generate-challenge-content.ts --recover

JSONL Pipeline Architecture

All AI output writes to local .jsonl files first, then bulk-uploads to DB:

scripts/seed/.challenge-data/
├── facts-export.jsonl               # Exported facts needing content
├── challenges-generated.jsonl       # Generated content (single partition)
├── challenges-generated-regen-p1.jsonl  # Partition 1 output
├── challenges-generated-regen-p2.jsonl  # Partition 2 output
└── ...

Resume-safe: The --generate phase scans all challenges-generated*.jsonl files to find already-processed fact IDs. Interrupted runs restart without re-processing.

Upsert upload: onConflictDoUpdate on (fact_record_id, challenge_style, target_fact_key, difficulty) means regenerated content overwrites old content automatically.
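The resume scan described above can be sketched as follows. The filename pattern matches the directory tree shown earlier; the per-line shape ({ fact_record_id: ... }) is an assumption about the JSONL record format:

```typescript
// Collect fact IDs already present in any challenges-generated*.jsonl file
// so --generate can skip them on restart. In the real script the contents
// come from disk; here they are passed in for a self-contained sketch.
function alreadyProcessedIds(files: Record<string, string>): Set<string> {
  const done = new Set<string>();
  for (const [name, contents] of Object.entries(files)) {
    if (!/^challenges-generated.*\.jsonl$/.test(name)) continue;
    for (const line of contents.split("\n")) {
      if (!line.trim()) continue;
      done.add(JSON.parse(line).fact_record_id); // record shape is assumed
    }
  }
  return done;
}
```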


Stage 6: Content Cleanup

Full-corpus rewrite of titles, challenge_title, context, notability_score, and notability_reason.

Script: scripts/seed/cleanup-content.ts

# 1. Audit corpus quality baseline
bun scripts/seed/cleanup-content.ts --audit

# 2. Export all facts to local JSONL
bun scripts/seed/cleanup-content.ts --export

# 3. AI rewrite (5 concurrent batches of 20 facts each)
bun scripts/seed/cleanup-content.ts --fix --concurrency 5

# Preview first:
bun scripts/seed/cleanup-content.ts --fix --dry-run --limit 20

# 4. Bulk upload rewrites
bun scripts/seed/cleanup-content.ts --upload

# 5. Validate improvements
bun scripts/seed/cleanup-content.ts --validate

Like challenge generation, cleanup uses local JSONL files for crash-resilience and resume-safety.


Quality Rules

Challenge content quality is governed by rules in docs/rules/challenge-content.md and enforced at multiple layers.

Key Rules

| Rule | Requirement | Enforcement |
| --- | --- | --- |
| CC-001 | Every published fact has content for >= 3 of 6 styles | Audit script |
| CC-002 | Four-layer structure (setup, challenge, reveal_correct, reveal_wrong) | Zod schema at generation time |
| CC-004 | Algorithmic fallback when pre-generated content absent | Frontend code path |
| CQ-002 | challenge_text must contain "you" or "your" | Three-layer: prompt instruction, generation-time regex filter, post-upload sampling |
| CQ-008 | correct_answer is a 3-6 sentence narrative | Prompt instruction + schema validation |

Three-Layer CQ-002 Enforcement

  1. Prompt-level: AI prompt explicitly instructs second-person address with examples
  2. Generation-time filter: Regex /\byou(r|rs|rself)?\b/i drops non-compliant output before writing to JSONL
  3. Post-upload validation: --validate phase samples rows and reports CQ-002 pass rate

Monitoring & Diagnostics

Pipeline Dashboard

bun scripts/seed/seed-from-files.ts --stats

Shows: entries per topic, completion status, facts generated, spinoffs discovered.

Key SQL Queries

-- Entry status distribution
SELECT status, COUNT(*) FROM seed_entry_queue GROUP BY status;

-- Facts by source type
SELECT source_type, COUNT(*) FROM fact_records GROUP BY source_type;

-- Challenge content coverage
SELECT
  (SELECT COUNT(DISTINCT fact_record_id) FROM fact_challenge_content) AS covered_facts,
  (SELECT COUNT(*) FROM fact_records WHERE status = 'validated') AS total_validated;

-- Challenge content by style
SELECT challenge_style, COUNT(*) FROM fact_challenge_content GROUP BY challenge_style;

-- Total AI spend
SELECT SUM(cost_usd::numeric) FROM ai_cost_log WHERE purpose LIKE '%seed%';

-- Unmapped category audit (find slugs needing aliases)
SELECT external_slug, provider, COUNT(*) AS drops
FROM unmapped_category_log
GROUP BY external_slug, provider
ORDER BY drops DESC
LIMIT 20;

-- Active topic categories by depth
SELECT depth, COUNT(*) AS active
FROM topic_categories
WHERE is_active = true
GROUP BY depth
ORDER BY depth;

Cost Summary

| Operation | Model | Cost per Unit | Corpus Estimate |
| --- | --- | --- | --- |
| Fact explosion | gpt-5-mini | ~$0.002/entry | ~$40 (20K entries) |
| Challenge content gen | gpt-5-mini | ~$0.0006/fact | ~$85 (144K facts) |
| Content cleanup | gpt-5-mini | ~$0.001/fact | ~$55 (144K facts) |
| Total pipeline | | | ~$180 |

File Reference

| File | Purpose |
| --- | --- |
| scripts/seed/seed-from-files.ts | Main CLI orchestrator (parse, explode, stats) |
| scripts/seed/bulk-enqueue.ts | Fast batch dispatch to Redis |
| scripts/seed/generate-curated-entries.ts | AI-generated entry names from CATEGORY_SPECS |
| scripts/seed/generate-challenge-content.ts | Challenge content generation (6 styles per fact) |
| scripts/seed/cleanup-content.ts | Full-corpus title/context rewrite |
| scripts/seed/backfill-fact-nulls.ts | Fill missing columns in fact_records |
| scripts/seed/lib/content-profiles.ts | File path → parsing rules mapping |
| scripts/seed/lib/category-mapper.ts | Static path-based category mapping + richness classification |
| scripts/seed/lib/parsers/ | XLSX, DOCX, CSV file parsers |
| apps/worker-facts/src/handlers/explode-entry.ts | Worker: AI fact explosion |
| apps/worker-facts/src/handlers/import-facts.ts | Worker: batch fact insertion |
| apps/worker-facts/src/handlers/generate-challenge-content.ts | Worker: queue-triggered challenge gen |
| apps/worker-validate/src/handlers/validate-fact.ts | Worker: multi-tier fact validation |
| packages/ai/src/challenge-content.ts | AI challenge generation function |
| packages/ai/src/challenge-content-rules.ts | Validation rules, style constants |
| packages/ai/src/seed-explosion.ts | AI fact explosion function |
| packages/db/src/drizzle/fact-engine-queries.ts | resolveTopicCategory(), depth-bounded category queries |

Database Tables

| Table | Purpose |
| --- | --- |
| seed_entry_queue | Work queue for entries pending explosion |
| fact_records | Generated facts (source_type = 'file_seed' or 'ai_super_fact') |
| fact_challenge_content | Pre-generated challenge content (6 styles per fact) |
| fact_record_schemas | Schema definitions per topic category |
| topic_categories | Canonical topic categories (depth 0-2); 5 deactivated after reconciliation |
| topic_category_aliases | Maps external provider slugs to internal categories (three-tier resolution) |
| unmapped_category_log | Audit trail for provider slugs that failed resolution |
| super_fact_links | Cross-entry correlations |
| ai_cost_log | AI spend tracking |