Manual Seeding Guide — Non-News Challenge Pipeline
Comprehensive reference for how Eko creates interactive challenge content from non-news sources. Covers the full pipeline: file parsing, AI-powered fact explosion, challenge content generation, content cleanup, and quality enforcement.
Overview
The manual seeding pipeline transforms legacy content files (XLSX, DOCX, CSV) and AI-curated entity lists into structured, quiz-ready facts. Unlike the news pipeline (which ingests current articles via cron), manual seeding is operator-driven and batch-oriented.
┌──────────────────────────────────────────────────────────────────┐
│ Manual Seeding Pipeline │
│ │
│ Source Files seed_entry_queue fact_records │
│ (XLSX/DOCX/CSV) ──▶ (DB work queue) ──▶ (structured facts) │
│ OR │ │ │
│ AI-Curated Entries Redis fact_challenge_ │
│ (generate-curated) (EXPLODE_CATEGORY content │
│ _ENTRY messages) (6 quiz styles/fact) │
└──────────────────────────────────────────────────────────────────┘
Pipeline Stages
| Stage | Script / Component | Input | Output |
|---|---|---|---|
| 1. Parse | seed-from-files.ts --parse | Legacy files | seed_entry_queue rows |
| 1b. Curate | generate-curated-entries.ts | AI + category specs | seed_entry_queue rows |
| 2. Dispatch | bulk-enqueue.ts | Pending queue entries | Redis messages |
| 3. Explode | worker-facts (explode-entry handler) | Redis messages | fact_records + spinoffs |
| 4. Validate | worker-validate | Pending facts | Validated facts |
| 5. Challenge Gen | generate-challenge-content.ts | Validated facts | fact_challenge_content |
| 6. Cleanup | cleanup-content.ts | All facts | Rewritten titles/context |
Taxonomy Resolution
Category mapping is a cross-cutting concern that affects file parsing (Stage 1), curated entry generation (Stage 1b), and news ingestion. The system uses a three-tier resolution strategy to map external category slugs to internal topic_categories rows.
Resolution Order
The resolveTopicCategory() function (packages/db/src/drizzle/fact-engine-queries.ts) resolves slugs in priority order:
- Exact slug match — Direct lookup in topic_categories (fastest path)
- Provider-specific alias — Matches (external_slug, provider) in topic_category_aliases
- Universal alias — Matches (external_slug, NULL provider) in topic_category_aliases
- Unresolved — Logs to unmapped_category_log for audit (fire-and-forget, non-blocking)
After migration 0101 expanded the taxonomy to 36+ root categories, most provider slugs now match directly in step 1. Aliases handle the remainder (e.g., general → current-events, health → science).
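The resolution order can be sketched in TypeScript. This is illustrative only — the real resolveTopicCategory() queries Postgres via Drizzle, and the in-memory types here are assumptions:

```typescript
// Illustrative sketch of the three-tier resolution order. Not the actual
// resolveTopicCategory() implementation, which runs against the database.
type Category = { id: string; slug: string };
type Alias = { externalSlug: string; provider: string | null; topicCategoryId: string };

function resolveCategory(
  slug: string,
  provider: string,
  categories: Category[],
  aliases: Alias[],
): string | null {
  // Tier 1: exact slug match in topic_categories (fastest path)
  const exact = categories.find((c) => c.slug === slug);
  if (exact) return exact.id;

  // Tier 2: provider-specific alias
  const scoped = aliases.find((a) => a.externalSlug === slug && a.provider === provider);
  if (scoped) return scoped.topicCategoryId;

  // Tier 3: universal alias (NULL provider)
  const universal = aliases.find((a) => a.externalSlug === slug && a.provider === null);
  if (universal) return universal.topicCategoryId;

  // Unresolved: caller logs to unmapped_category_log (fire-and-forget)
  return null;
}
```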
Alias Table: topic_category_aliases
| Column | Type | Purpose |
|---|---|---|
| id | UUID | Primary key |
| external_slug | text | The provider's category name (e.g., "general", "nation") |
| provider | text | Provider name ("gnews", "newsapi") or NULL for universal aliases |
| topic_category_id | UUID | FK to topic_categories — the resolved internal category |
| created_at | timestamptz | When the alias was created |
Unique constraint: The unique index uses COALESCE(provider, '__universal__') to collapse NULL providers onto a sentinel value. This enforces at most one universal alias per slug (plain unique constraints treat NULLs as distinct, which would permit duplicates), while still allowing a provider-specific alias and a universal alias to coexist for the same slug.
Seeded Aliases
Universal (any provider): general → current-events, health → science, tech → technology, politics → governments, world → current-events, food → food-beverage, lifestyle → culture.
GNews-specific: breaking-news → current-events, nation → current-events.
Static Category Mapper (Seeding Pipeline)
For file-based and curated seeding, scripts/seed/lib/category-mapper.ts provides a complementary static mapping layer:
- mapRecordToCategory() — Path-based pattern matching (e.g., brainsie/entries/entertainment → entertainment)
- normalizeCategorySlug() — Handles common aliases (automotive → auto, events → history)
- classifyRichness() — Topic-aware richness tier heuristics (entertainment → high, design → low)
The static mapper handles ~80% of seeded records. Remaining unmapped records fall back to AI batch classification via batchMapCategories().
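A minimal sketch of the static-alias step, assuming a plain lookup table (the alias pairs are taken from the list above; the normalization details are hypothetical, not the actual scripts/seed/lib/category-mapper.ts implementation):

```typescript
// Sketch of normalizeCategorySlug-style alias handling (illustrative only).
const SLUG_ALIASES: Record<string, string> = {
  automotive: "auto",
  events: "history",
};

function normalizeCategorySlug(raw: string): string {
  // Lowercase and kebab-case the incoming slug, then apply known aliases
  const slug = raw.trim().toLowerCase().replace(/\s+/g, "-");
  return SLUG_ALIASES[slug] ?? slug;
}
```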
Taxonomy Reconciliation (Migration 0127)
Migration 0127 resolved conflicts between the DB state and CATEGORY_SPECS:
- Deactivated 4 orphan roots: accounting, marketing, spelling-grammar, things (no seed entries, poor fit)
- Merged statistical-records → records: Reassigned all fact_records, then deactivated the source category
- Propagated challenge formats: Cross-joined format IDs from the original 7 roots to all active expansion roots lacking format links
Depth-Bounded Queries
getActiveTopicCategories() and getActiveTopicCategoriesWithSchemas() accept an optional maxDepth parameter. Cron routes use maxDepth: 0 to dispatch only root-level categories, preventing quota explosion as subcategories are added.
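The depth bound amounts to a filter like the following sketch (types assumed; the real queries push the predicate into SQL rather than filtering in memory):

```typescript
type TopicCategory = { slug: string; depth: number; isActive: boolean };

// maxDepth undefined → all active categories; maxDepth: 0 → root categories only
function filterByDepth(cats: TopicCategory[], maxDepth?: number): TopicCategory[] {
  return cats.filter((c) => c.isActive && (maxDepth === undefined || c.depth <= maxDepth));
}
```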
Audit: Unmapped Categories
The unmapped_category_log table captures every slug that fails resolution:
-- Check for unmapped categories (helps identify needed aliases)
SELECT external_slug, provider, COUNT(*) AS occurrences
FROM unmapped_category_log
GROUP BY external_slug, provider
ORDER BY occurrences DESC;
When a slug appears frequently, add it to topic_category_aliases (via migration or direct insert with service_role).
Stage 1: Source Content
Option A: File-Based Seeding
Place legacy content files in .notes/seeding-folder/ (gitignored). The parser supports XLSX, DOCX, and CSV.
Directory structure:
.notes/seeding-folder/
├── brainsie/
│ └── entries/
│ ├── entertainment/ # XLSX with "Card Name" column
│ ├── sports/
│ ├── animals/
│ └── ...
├── jon@sportsformat.com/
│ ├── sf entries/
│ └── events/
├── events/
└── [any custom files]
Content profiles (scripts/seed/lib/content-profiles.ts) map file paths to parsing rules:
| Profile Field | Purpose | Example |
|---|---|---|
| filePattern | Glob matching the file path | brainsie/entries/entertainment |
| titleColumn | Column containing the entry name | Card Name |
| categoryOverride | Force a topic category | entertainment |
| richnessTierHint | Controls fact count per entry | high (50-100 facts) |
| descriptionColumns | Additional context columns | ['Card Description'] |
| tagColumns | Tag/label columns | ['Labels'] |
First matching profile wins. Unknown files fall back to generic extraction.
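First-match-wins can be sketched like this (substring matching stands in for the real glob logic, which is an assumption here):

```typescript
type ContentProfile = {
  filePattern: string;
  titleColumn: string;
  categoryOverride?: string;
  richnessTierHint?: "high" | "medium" | "low";
};

// Returns the first profile whose pattern matches the file path;
// undefined means the file falls back to generic extraction.
function matchProfile(path: string, profiles: ContentProfile[]): ContentProfile | undefined {
  return profiles.find((p) => path.includes(p.filePattern));
}
```

Because the first match wins, more specific patterns must be listed before broader ones.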
Parse command:
# Preview without DB writes
bun scripts/seed/seed-from-files.ts --parse --dry-run
# Insert entries into seed_entry_queue
bun scripts/seed/seed-from-files.ts --parse
Option B: AI-Curated Entry Generation
Skip files entirely. The curated entries script uses CATEGORY_SPECS (defined in scripts/seed/generate-curated-entries.ts, lines 40-728) to AI-generate notable entity names across 40+ topic categories.
Category coverage:
| Domain | Example Subcategories | Entries per Subcategory |
|---|---|---|
| Entertainment | History, Genres, Albums, Films, TV | 50-200 |
| Sports | Soccer Legends, NFL History, NBA, MLB | 50-200 |
| Science | Physics & Space, Biology, Chemistry | 50-150 |
| Geography | Natural Wonders, Famous Cities, Islands | 50-150 |
| History | Ancient Civilizations, Medieval, Modern | 50-200 |
| Culture | Religions, Festivals, Languages | 50-150 |
| People | World Leaders, Scientists, Activists | 50-200 |
Plus: animals, art, design, fashion, food-beverage, cooking, nature, space, games, travel, finance, math, publishing, places, home-living, geology, events, governments, human-achievement, how-things-work, countries.
Commands:
# Generate entry names (preview)
bun scripts/seed/generate-curated-entries.ts
# Generate and insert into seed_entry_queue
bun scripts/seed/generate-curated-entries.ts --insert
Stage 2: Dispatch to Workers
The seed_entry_queue table holds entries waiting to be "exploded" into individual facts.
Schema:
| Column | Type | Purpose |
|---|---|---|
| id | UUID | Primary key |
| name | text | Entry name (e.g., "Julius Caesar") |
| topic_category_id | UUID | FK to topic_categories |
| richness_tier | enum | high / medium / low — controls fact output volume |
| source_type | text | file_parse / ai_super_fact / manual |
| status | enum | pending / processing / completed / failed |
| batch_id | UUID | Groups entries from the same parse run |
| facts_generated | int | Counter updated after explosion |
| spinoffs_discovered | int | Counter for discovered related entities |
| parent_entry_id | UUID | For spinoff entries — links to parent |
| relationship | text | How a spinoff relates to its parent |
Dispatch command:
# Recommended: fast batch dispatch using Redis pipeline (~60x faster)
bun scripts/seed/bulk-enqueue.ts
# Alternative: one-by-one dispatch (slower, use for debugging)
bun scripts/seed/seed-from-files.ts --explode --batch-size 500
bulk-enqueue.ts queries pending entries in pages and creates EXPLODE_CATEGORY_ENTRY messages in chunks of 500 using enqueueMany().
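The chunking itself is a simple slice loop, sketched below (enqueueMany()'s actual signature isn't shown in this guide, so the surrounding call is assumed):

```typescript
// Split pending entries into fixed-size chunks (500 in bulk-enqueue.ts)
// so each enqueueMany() call carries one Redis pipeline batch.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}
```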
Stage 3: AI Fact Explosion
Workers consume EXPLODE_CATEGORY_ENTRY messages from the Upstash Redis queue.
Handler: apps/worker-facts/src/handlers/explode-entry.ts
What happens per entry:
- Loads the entry name + topic category schema from DB
- Calls AI (currently gpt-5-mini via ModelAdapter) with structured output
- AI generates 10-100 individual facts, controlled by richness tier:
- high: 50-100 facts (entertainment, sports, famous people)
- medium: 20-50 facts (geography, science, animals)
- low: 10-20 facts (business, design, fashion)
- Each fact includes: title, challenge_title, context, notability_score, and structured key-value pairs
- Facts are batch-inserted into fact_records with source_type = 'file_seed'
- AI may discover spinoffs — related entities (e.g., exploding "Ancient Egypt" discovers "Cleopatra")
- Spinoffs are inserted back into seed_entry_queue as new pending entries
Deduplication: Before inserting, the handler calls getExistingTitlesForTopic() to avoid duplicate titles within the same topic category.
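A sketch of that dedup step (case-insensitive matching is an assumption — the actual comparison rule lives in the handler):

```typescript
// Drop generated facts whose titles already exist in the topic category,
// and also dedupe within the generated batch itself.
function dedupeFacts<T extends { title: string }>(facts: T[], existingTitles: string[]): T[] {
  const seen = new Set(existingTitles.map((t) => t.toLowerCase()));
  return facts.filter((f) => {
    const key = f.title.toLowerCase();
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```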
Running workers:
# Single worker
bun run dev:worker-facts
# High-throughput: multiple workers with concurrency
WORKER_CONCURRENCY=10 PORT=4010 bun run dev:worker-facts
WORKER_CONCURRENCY=10 PORT=4011 bun run dev:worker-facts
# Dual API key setup for 2x rate limits
OPENAI_API_KEY=key1 WORKER_CONCURRENCY=10 PORT=4010 bun run dev:worker-facts
OPENAI_API_KEY=key2 WORKER_CONCURRENCY=10 PORT=4020 bun run dev:worker-facts
Throughput:
| Configuration | Entries/hour | ETA (20K entries) |
|---|---|---|
| 1 worker, sequential | ~70 | ~285 hours |
| 3 workers x 3 concurrency | ~720 | ~28 hours |
| 5 workers x 10 concurrency | ~1,440 | ~14 hours |
| 10 workers x 10 (2 keys) | ~2,400 | ~8 hours |
Cost: ~$0.002 per entry with gpt-5-mini. Full 20K corpus: ~$40.
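The table's figures follow from simple division; a sketch of the estimator (the throughput rates above are rough benchmarks, not guarantees):

```typescript
// ETA and cost from observed throughput and per-entry AI cost
function estimate(entries: number, entriesPerHour: number, costPerEntry: number) {
  return {
    hours: Math.round(entries / entriesPerHour),
    usd: entries * costPerEntry,
  };
}
```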
Spinoff Processing
Spinoffs create a recursive expansion loop:
Entry: "Ancient Egypt"
→ Explodes into 80 facts
→ Discovers spinoffs: "Cleopatra", "Tutankhamun", "Rosetta Stone"
→ Each spinoff re-enters seed_entry_queue
→ Each gets exploded into its own 30-80 facts
→ May discover further spinoffs
Process spinoffs with:
bun scripts/seed/seed-from-files.ts --explode-spinoffs
Spinoff category inheritance: Spinoffs inherit topic_category_id from their parent entry. If missing (from an older pipeline version), fix with:
UPDATE seed_entry_queue child
SET topic_category_id = parent.topic_category_id
FROM seed_entry_queue parent
WHERE child.parent_entry_id = parent.id
AND child.topic_category_id IS NULL
AND parent.topic_category_id IS NOT NULL;
Stage 4: Fact Validation
After explosion, facts go through the validation pipeline.
Handler: apps/worker-validate/src/handlers/validate-fact.ts
Facts are validated against multiple tiers (authoritative APIs, AI cross-check). Upon successful validation, the handler enqueues:
- RESOLVE_IMAGE — find a representative image
- GENERATE_CHALLENGE_CONTENT — create interactive quiz content
bun run dev:worker-validate
Stage 5: Challenge Content Generation
The challenge content script pre-generates interactive quiz content for every validated fact.
Script: scripts/seed/generate-challenge-content.ts
Six Challenge Styles
| Style | Description | Key style_data Fields |
|---|---|---|
| multiple_choice | Traditional 4-option quiz | options (array of 4 choices) |
| fill_the_gap | Sentence with blank(s) | blank_answer |
| direct_question | Free-text answer question | — |
| statement_blank | Fill-in-the-blank variant | blank_answer |
| reverse_lookup | Given answer, find the question | distractors |
| free_text | Open-ended response | — |
Two additional styles exist but are runtime-only (not pre-generated):
- conversational — multi-turn dialogue
- progressive_image_reveal — image-based reveals
Four-Layer Structure (Mandatory per CC-002)
Every pre-generated challenge contains:
- setup_text — 2+ sentences of context (shared before asking the question)
- challenge_text — The invitation to engage (must address "you"/"your" per CQ-002)
- reveal_correct — Celebration text (shown when user answers correctly)
- reveal_wrong — Teaching moment (shown when user answers incorrectly)
- correct_answer — Rich 3-6 sentence narrative for animated streaming display (per CQ-008)
Database Table: fact_challenge_content
| Column | Type | Purpose |
|---|---|---|
| id | UUID | Primary key |
| fact_record_id | UUID | FK to fact_records |
| challenge_style | enum | One of the 6 styles above |
| setup_text | text | Context layer |
| challenge_text | text | Question/prompt layer |
| reveal_correct | text | Correct answer feedback |
| reveal_wrong | text | Incorrect answer feedback |
| correct_answer | text | Narrative answer for streaming display |
| style_data | JSONB | Style-specific data (options, blank_answer, etc.) |
| target_fact_key | text | Which fact key this challenge targets |
| difficulty | int | 1-5 (currently only level 1 generated) |
| ai_model | text | Model used for generation |
| generation_cost_usd | numeric | Cost tracking |
Generation Workflow
# 1. Audit current coverage
bun scripts/seed/generate-challenge-content.ts --audit
# 2. Export facts needing content
bun scripts/seed/generate-challenge-content.ts --export # Facts with no content
bun scripts/seed/generate-challenge-content.ts --export-all # ALL validated facts (for regeneration)
# 3. Generate (supports partitioned parallelism)
bun scripts/seed/generate-challenge-content.ts --generate
# Parallel generation with 8 partitions:
for i in 1 2 3 4 5 6 7 8; do
bun scripts/seed/generate-challenge-content.ts \
--generate --partition $i/8 --output-suffix regen-p$i --concurrency 5 &
done
wait
# 4. Upload to database (upsert — overwrites existing)
bun scripts/seed/generate-challenge-content.ts --upload
# 5. Validate quality
bun scripts/seed/generate-challenge-content.ts --validate
# 6. Recover weak content (optional)
bun scripts/seed/generate-challenge-content.ts --recover
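One plausible reading of --partition i/n is an index-based slice over the exported facts, sketched below (the flag's real partitioning scheme isn't documented here, so this is an assumption):

```typescript
// Deterministic 1-based partitioning: worker i of n takes every n-th item.
// Partitions are disjoint and together cover the full export.
function partitionSlice<T>(items: T[], part: number, total: number): T[] {
  return items.filter((_, idx) => idx % total === part - 1);
}
```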
JSONL Pipeline Architecture
All AI output writes to local .jsonl files first, then bulk-uploads to DB:
scripts/seed/.challenge-data/
├── facts-export.jsonl # Exported facts needing content
├── challenges-generated.jsonl # Generated content (single partition)
├── challenges-generated-regen-p1.jsonl # Partition 1 output
├── challenges-generated-regen-p2.jsonl # Partition 2 output
└── ...
Resume-safe: The --generate phase scans all challenges-generated*.jsonl files to find already-processed fact IDs. Interrupted runs restart without re-processing.
Upsert upload: onConflictDoUpdate on (fact_record_id, challenge_style, target_fact_key, difficulty) means regenerated content overwrites old content automatically.
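The resume logic reduces to a set difference, sketched here (the JSONL scanning and field names are per the description above; the helper itself is hypothetical):

```typescript
// Skip fact IDs already present in any challenges-generated*.jsonl file,
// so an interrupted --generate run restarts without re-processing.
function remainingFactIds(exported: string[], processed: Iterable<string>): string[] {
  const done = new Set(processed);
  return exported.filter((id) => !done.has(id));
}
```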
Stage 6: Content Cleanup
Full-corpus rewrite of titles, challenge_title, context, notability_score, and notability_reason.
Script: scripts/seed/cleanup-content.ts
# 1. Audit corpus quality baseline
bun scripts/seed/cleanup-content.ts --audit
# 2. Export all facts to local JSONL
bun scripts/seed/cleanup-content.ts --export
# 3. AI rewrite (5 concurrent batches of 20 facts each)
bun scripts/seed/cleanup-content.ts --fix --concurrency 5
# Preview first:
bun scripts/seed/cleanup-content.ts --fix --dry-run --limit 20
# 4. Bulk upload rewrites
bun scripts/seed/cleanup-content.ts --upload
# 5. Validate improvements
bun scripts/seed/cleanup-content.ts --validate
Like challenge generation, cleanup uses local JSONL files for crash-resilience and resume-safety.
Quality Rules
Challenge content quality is governed by rules in docs/rules/challenge-content.md and enforced at multiple layers.
Key Rules
| Rule | Requirement | Enforcement |
|---|---|---|
| CC-001 | Every published fact has content for >= 3 of 6 styles | Audit script |
| CC-002 | Four-layer structure (setup, challenge, reveal_correct, reveal_wrong) | Zod schema at generation time |
| CC-004 | Algorithmic fallback when pre-generated content absent | Frontend code path |
| CQ-002 | challenge_text must contain "you" or "your" | Three-layer: prompt instruction, generation-time regex filter, post-upload sampling |
| CQ-008 | correct_answer is a 3-6 sentence narrative | Prompt instruction + schema validation |
Three-Layer CQ-002 Enforcement
- Prompt-level: AI prompt explicitly instructs second-person address with examples
- Generation-time filter: Regex /\byou(r|rs|rself)?\b/i drops non-compliant output before writing to JSONL
- Post-upload validation: --validate phase samples rows and reports CQ-002 pass rate
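The generation-time filter reduces to a single regex test, using the pattern quoted above:

```typescript
// CQ-002: challenge_text must address the reader in second person.
// Matches "you", "your", "yours", "yourself" as whole words, case-insensitive.
const CQ002 = /\byou(r|rs|rself)?\b/i;

function passesCq002(challengeText: string): boolean {
  return CQ002.test(challengeText);
}
```

The word boundaries matter: incidental substrings like "youth" or "bayou" do not pass.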
Monitoring & Diagnostics
Pipeline Dashboard
bun scripts/seed/seed-from-files.ts --stats
Shows: entries per topic, completion status, facts generated, spinoffs discovered.
Key SQL Queries
-- Entry status distribution
SELECT status, COUNT(*) FROM seed_entry_queue GROUP BY status;
-- Facts by source type
SELECT source_type, COUNT(*) FROM fact_records GROUP BY source_type;
-- Challenge content coverage
SELECT
(SELECT COUNT(DISTINCT fact_record_id) FROM fact_challenge_content) AS covered_facts,
(SELECT COUNT(*) FROM fact_records WHERE status = 'validated') AS total_validated;
-- Challenge content by style
SELECT challenge_style, COUNT(*) FROM fact_challenge_content GROUP BY challenge_style;
-- Total AI spend
SELECT SUM(cost_usd::numeric) FROM ai_cost_log WHERE purpose LIKE '%seed%';
-- Unmapped category audit (find slugs needing aliases)
SELECT external_slug, provider, COUNT(*) AS drops
FROM unmapped_category_log
GROUP BY external_slug, provider
ORDER BY drops DESC
LIMIT 20;
-- Active topic categories by depth
SELECT depth, COUNT(*) AS active
FROM topic_categories
WHERE is_active = true
GROUP BY depth
ORDER BY depth;
Cost Summary
| Operation | Model | Cost per Unit | Corpus Estimate |
|---|---|---|---|
| Fact explosion | gpt-5-mini | ~$0.002/entry | ~$40 (20K entries) |
| Challenge content gen | gpt-5-mini | ~$0.0006/fact | ~$85 (144K facts) |
| Content cleanup | gpt-5-mini | ~$0.0004/fact | ~$55 (144K facts) |
| Total pipeline | | | ~$180 |
File Reference
| File | Purpose |
|---|---|
| scripts/seed/seed-from-files.ts | Main CLI orchestrator (parse, explode, stats) |
| scripts/seed/bulk-enqueue.ts | Fast batch dispatch to Redis |
| scripts/seed/generate-curated-entries.ts | AI-generated entry names from CATEGORY_SPECS |
| scripts/seed/generate-challenge-content.ts | Challenge content generation (6 styles per fact) |
| scripts/seed/cleanup-content.ts | Full-corpus title/context rewrite |
| scripts/seed/backfill-fact-nulls.ts | Fill missing columns in fact_records |
| scripts/seed/lib/content-profiles.ts | File path → parsing rules mapping |
| scripts/seed/lib/category-mapper.ts | Static path-based category mapping + richness classification |
| scripts/seed/lib/parsers/ | XLSX, DOCX, CSV file parsers |
| apps/worker-facts/src/handlers/explode-entry.ts | Worker: AI fact explosion |
| apps/worker-facts/src/handlers/import-facts.ts | Worker: batch fact insertion |
| apps/worker-facts/src/handlers/generate-challenge-content.ts | Worker: queue-triggered challenge gen |
| apps/worker-validate/src/handlers/validate-fact.ts | Worker: multi-tier fact validation |
| packages/ai/src/challenge-content.ts | AI challenge generation function |
| packages/ai/src/challenge-content-rules.ts | Validation rules, style constants |
| packages/ai/src/seed-explosion.ts | AI fact explosion function |
| packages/db/src/drizzle/fact-engine-queries.ts | resolveTopicCategory(), depth-bounded category queries |
Database Tables
| Table | Purpose |
|---|---|
| seed_entry_queue | Work queue for entries pending explosion |
| fact_records | Generated facts (source_type = 'file_seed' or 'ai_super_fact') |
| fact_challenge_content | Pre-generated challenge content (6 styles per fact) |
| fact_record_schemas | Schema definitions per topic category |
| topic_categories | Canonical topic categories (depth 0-2); 5 deactivated after reconciliation |
| topic_category_aliases | Maps external provider slugs to internal categories (three-tier resolution) |
| unmapped_category_log | Audit trail for provider slugs that failed resolution |
| super_fact_links | Cross-entry correlations |
| ai_cost_log | AI spend tracking |
Related Documents
- Seed Pipeline README — Architecture overview and component map
- Seed Pipeline Runbook — Step-by-step operational procedures
- Model Evaluation — LLM comparison for seeding tasks
- Taxonomy Expansion — GTD project: expanding topic categories
- Backfill Fact Nulls — GTD project: fixing NULL columns
- Frontend Challenge Content — GTD project: UI integration
- Taxonomy Coherence — GTD project: alias resolution and audit logging
- Challenge Content Rules — Quality rules (CC-001 through CC-009, CQ-001 through CQ-008)
- Seeding TODO — Active work tracker