Seeding Pipeline — Detailed Flow
How Eko bootstraps new topic areas by exploding named entities into structured facts, discovering related entities, and finding cross-entity connections.
Overview
The seeding pipeline is the fastest way to populate a new topic category with high-quality content. Instead of waiting for news articles to trickle in, a single seed entry like "Michael Jordan" can produce 50-100 verified fact cards in one pass — plus automatically discover related entities (Scottie Pippen, Phil Jackson) that feed the next round.
Three source types flow through this pipeline:
| Source Type | What It Produces | Trigger |
|---|---|---|
| file_seed | Structured facts exploded from a named entity | Manual seed entry creation |
| spinoff_discovery | New seed entries discovered during explosion | Automatic (side effect of explosion) |
| ai_super_fact | Cross-entity correlation facts | Batch completion trigger |
All three converge into the same validation → image → challenge content path as news and evergreen facts.
End-to-End Flow
Content team creates seed entry
(name: "Prince", topic: "Music > Artists > Pop", richness: "high")
│
▼
┌─ STEP 1: ENTITY GENERATION ──────────────────────────────────────┐
│ Script: generate-curated-entries.ts │
│ AI generates entity names for a topic category │
│ Inserts into seed_entry_queue (status: pending) │
│ Optional: manual creation via admin dashboard │
└──────────────────────────────────────────────────┬────────────────┘
│
▼
┌─ STEP 2: QUEUE DISPATCH ──────────────────────────────────────────┐
│ Script: bulk-enqueue.ts │
│ Reads pending seed_entry_queue rows │
│ Enqueues EXPLODE_CATEGORY_ENTRY per entry │
└──────────────────────────────────────────────────┬────────────────┘
│
▼
┌─ STEP 3: EXPLOSION ──────────────────────────────────────────────┐
│ Worker: worker-facts │
│ Handler: explode-entry.ts → explodeCategoryEntry() │
│ │
│ Inputs: │
│ - Entity name + aliases │
│ - Topic path + schema keys │
│ - Richness tier → fact count range │
│ - Existing titles (for deduplication) │
│ - Entity context from enrichment APIs (optional) │
│ │
│ Outputs per entity: │
│ - 10-100 structured facts (ExplodedFact[]) │
│ - 3-10 spinoff candidates (SpinoffCandidate[]) │
│ - 0-5 super fact candidates (SuperFactCandidate[]) │
│ - Discovered aliases │
└────────────┬──────────────┬──────────────────────┬────────────────┘
│ │ │
▼ ▼ ▼
┌─ STEP 4a ──────┐ ┌─ STEP 4b ──────┐ ┌─ STEP 4c ──────────────┐
│ IMPORT_FACTS │ │ Spinoff insert │ │ Super fact candidates │
│ Batch import │ │ New seed_entry │ │ stored for batch │
│ → fact_records │ │ rows queued │ │ FIND_SUPER_FACTS later │
│ (file_seed) │ │ for future │ │ │
│ │ │ explosion │ │ │
└───────┬────────┘ └────────────────┘ └────────────────────────┘
│
▼
┌─ STEP 5: VALIDATION ─────────────────────────────────────────────┐
│ VALIDATE_FACT enqueued per fact (strategy: multi_phase) │
│ 4-phase: structural → consistency → cross-model → evidence │
│ Stricter than news — no independent sources to corroborate │
└──────────────────────────────────────────────────┬────────────────┘
│
▼
┌─ STEP 6: POST-VALIDATION FAN-OUT ───────────────────────────────┐
│ RESOLVE_IMAGE (parallel) — Wikipedia → SportsDB → Unsplash │
│ GENERATE_CHALLENGE_CONTENT (parallel) — 6 quiz styles │
└──────────────────────────────────────────────────┬────────────────┘
│
▼
Fact appears in feed
Step 1: Entity Generation
How Entities Are Created
Seed entries can be created through multiple paths:
| Method | Script/UI | When to Use |
|---|---|---|
| AI generation | scripts/seed/generate-curated-entries.ts | Bootstrapping a new topic category |
| Manual entry | Admin dashboard → Pipeline → Seed | Adding specific entities |
| File import | scripts/seed/seed-from-files.ts | Bulk import from XLSX/DOCX/CSV |
| Spinoff discovery | Automatic (during explosion) | Expanding from existing entities |
Seed Entry Record
Each seed entry in seed_entry_queue contains:
| Field | Example | Purpose |
|---|---|---|
| name | "Michael Jordan" | Entity to explode |
| topic_path | "Sports > Basketball > NBA" | Topic classification |
| richness_tier | "high" | Controls fact count range |
| aliases | ["MJ", "Air Jordan", "His Airness"] | Alternative names for dedup |
| status | "pending" | Processing state |
| source_type | "manual" or "spinoff_discovery" | How it entered the system |
| parent_entry_id | UUID (nullable) | Links spinoffs to parent |
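As a rough TypeScript sketch, a row in this queue might be typed as follows (field names come from the table above; the status values beyond "pending" are assumptions):

```typescript
// Hypothetical shape of a seed_entry_queue row. Field names follow the
// table above; the non-"pending" status values are assumed, not confirmed.
interface SeedEntry {
  name: string;
  topic_path: string;
  richness_tier: 'high' | 'medium' | 'low';
  aliases: string[];
  status: 'pending' | 'processing' | 'done' | 'failed';
  source_type: 'manual' | 'spinoff_discovery';
  parent_entry_id: string | null; // links spinoffs to their parent entity
}

const mj: SeedEntry = {
  name: 'Michael Jordan',
  topic_path: 'Sports > Basketball > NBA',
  richness_tier: 'high',
  aliases: ['MJ', 'Air Jordan', 'His Airness'],
  status: 'pending',
  source_type: 'manual',
  parent_entry_id: null,
};
```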
Step 3: Explosion — The Core AI Function
Richness Tiers
The richness tier determines how many facts the AI generates per entity:
| Tier | Fact Range | Typical Topics |
|---|---|---|
| high | 50-100 | Entertainment, sports, well-known people |
| medium | 20-50 | Geography, science, animals |
| low | 10-20 | Business, design, fashion |
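The mapping can be expressed as a tiny helper (the function name is illustrative; the ranges come straight from the table):

```typescript
type RichnessTier = 'high' | 'medium' | 'low';

// Hypothetical helper: richness tier → how many facts the AI should generate.
function factRangeFor(tier: RichnessTier): { min: number; max: number } {
  switch (tier) {
    case 'high':
      return { min: 50, max: 100 };
    case 'medium':
      return { min: 20, max: 50 };
    case 'low':
      return { min: 10, max: 20 };
  }
}
```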
Enrichment Context
Before calling the AI, the handler optionally resolves enrichment context from external APIs. This gives the AI grounded data to work with rather than relying purely on its training data.
Always queried (parallel):
| Source | What It Provides |
|---|---|
| Google Knowledge Graph | Entity type, description, official URLs, resultScore |
| Wikidata | Structured properties, sitelink count, identifiers |
| Wikipedia | Summary paragraph, key facts |
Topic-routed (conditional):
| Topic Path | Source | What It Provides |
|---|---|---|
| sports/* | TheSportsDB | Team badges, player photos, league data |
| music/* | MusicBrainz | Discography, genre, active years |
| geography/* | Nominatim | Coordinates, administrative boundaries |
| books/* | Open Library | Publication data, author info, ISBNs |
All enrichment calls use Promise.allSettled() — a failing API never blocks the explosion. The merged context string is injected into the AI prompt as grounding data.
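A minimal sketch of that fail-soft fan-out, assuming fetchers are injected (the real orchestrator in packages/ai/src/enrichment.ts will differ in names and shape):

```typescript
type Fetcher = (entity: string) => Promise<string>;

// Query every source in parallel; Promise.allSettled never rejects, so
// one failing API cannot block or abort the explosion.
async function gatherEnrichment(
  entity: string,
  fetchers: Record<string, Fetcher>,
): Promise<string> {
  const names = Object.keys(fetchers);
  const settled = await Promise.allSettled(names.map((n) => fetchers[n](entity)));
  return settled
    .map((r, i) => (r.status === 'fulfilled' ? `[${names[i]}] ${r.value}` : null))
    .filter((line): line is string => line !== null)
    .join('\n');
}
```

A source that rejects simply drops out of the merged context string; the explosion proceeds with whatever survived.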
Deterministic Notability
When both Knowledge Graph and Wikidata return results, the system can compute a deterministic notability score that overrides the AI's assessment:
- KG resultScore > 500 + Wikidata sitelinkCount > 20 → notability 0.9
- KG resultScore > 100 + Wikidata found → notability 0.8
- Otherwise → the AI's own notability score is used
This prevents the AI from under-scoring well-known entities or over-scoring obscure ones.
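A sketch of the override logic, with thresholds taken from the bullets above (type and function names are hypothetical):

```typescript
// Hypothetical signals merged from Knowledge Graph and Wikidata lookups.
interface EnrichmentSignals {
  kgResultScore?: number;
  wikidataFound: boolean;
  wikidataSitelinks?: number;
}

// Deterministic notability: strong external signals override the AI's
// own score; otherwise fall back to the AI's assessment.
function notabilityScore(signals: EnrichmentSignals, aiScore: number): number {
  const { kgResultScore = 0, wikidataFound, wikidataSitelinks = 0 } = signals;
  if (kgResultScore > 500 && wikidataSitelinks > 20) return 0.9;
  if (kgResultScore > 100 && wikidataFound) return 0.8;
  return aiScore;
}
```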
AI Prompt Composition
The explosion prompt includes:
- CHALLENGE_TONE_PREFIX — theatrical, cinematic title requirements
- Taxonomy voice — domain-specific register and energy (via resolveVoice())
- Domain vocabulary — expert language patterns (via formatVocabularyForPrompt())
- Taxonomy content rules — formatting conventions (via resolveContentRules())
- Schema keys — the JSONB fields the AI must produce
- Existing titles — for deduplication (up to 50 titles)
- Enrichment context — grounding data from external APIs
Model Selection
| Aspect | Detail |
|---|---|
| Task name | seed_explosion |
| Default tier | default (cost-optimized for bulk seeding) |
| Can escalate | Yes, if escalation signals present |
| Model override | Supported via modelOverride input |
Spinoff Discovery
During explosion, the AI also discovers related entities. Each spinoff includes:
| Field | Example | Purpose |
|---|---|---|
| name | "Scottie Pippen" | Related entity |
| suggestedTopicPath | "Sports > Basketball > NBA" | Where to classify |
| relationship | "Teammate, Chicago Bulls dynasty" | Why it's related |
| priorityScore | 0.85 | Processing priority (0-1) |
Spinoffs are inserted as new seed_entry_queue rows with:
- source_type: 'spinoff_discovery'
- parent_entry_id linking to the source entity
- Entity links created via insertEntityLink() with connection type 'spinoff'
Spinoffs are not immediately processed — they queue up for the next bulk-enqueue run, allowing human review of what the AI discovered.
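A sketch of turning a spinoff candidate into a queued seed row (helper and field names are illustrative; the real insert also creates an entity link via insertEntityLink()):

```typescript
// Candidate shape mirrors the spinoff table above.
interface SpinoffCandidate {
  name: string;
  suggestedTopicPath: string;
  relationship: string;
  priorityScore: number; // 0-1
}

// Hypothetical mapper: spinoffs land back in the queue as "pending",
// so they wait for the next bulk-enqueue run (and human review).
function toSeedRow(spinoff: SpinoffCandidate, parentId: string) {
  return {
    name: spinoff.name,
    topic_path: spinoff.suggestedTopicPath,
    status: 'pending' as const,
    source_type: 'spinoff_discovery' as const,
    parent_entry_id: parentId,
  };
}
```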
Super Facts
What They Are
Super facts are surprising connections between entities that wouldn't surface from individual explosions; discovering them requires context that spans multiple entities.
Examples:
- "Michael Jordan and Magic Johnson both won their first NBA championship at age 27"
- "Prince and David Bowie both released landmark albums in 1984"
- "Three Nobel Prize winners in Physics were born in the same small German town"
How They're Found
After a batch of seed entries completes, the FIND_SUPER_FACTS message triggers findSuperFacts():
- Loads 10-50 recently exploded entries from the same topic area
- AI analyzes them for cross-entity connections
- Returns candidates with connection metadata
Connection Types
| Type | Example |
|---|---|
| shared_event | Two people at the same historical event |
| rivalry | Competitive relationship |
| collaboration | Worked together on something notable |
| temporal | Same date/year coincidence |
| geographic | Same origin or location connection |
| causal | One entity influenced another |
Super Fact Record
Each super fact is inserted as a fact_record with:
- source_type: 'ai_super_fact'
- status: 'pending_validation'
- Entries in the super_fact_links table linking 2-3 entities
- Entity links created between all pairs of linked entities
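The "all pairs" entity linking for a 2-3 entity super fact can be sketched as (names are illustrative):

```typescript
// Hypothetical candidate shape; connection types come from the table above.
interface SuperFactCandidate {
  title: string;
  connectionType: 'shared_event' | 'rivalry' | 'collaboration' | 'temporal' | 'geographic' | 'causal';
  linkedEntityIds: string[]; // 2-3 entities
}

// Every unordered pair of linked entities gets its own entity link row.
function entityPairs(ids: string[]): Array<[string, string]> {
  const pairs: Array<[string, string]> = [];
  for (let i = 0; i < ids.length; i++) {
    for (let j = i + 1; j < ids.length; j++) {
      pairs.push([ids[i], ids[j]]);
    }
  }
  return pairs;
}
```

A 3-entity super fact therefore produces three entity link rows.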
Validation Strategy
Seed-originated facts use the multi_phase validation strategy — the strictest available:
| Phase | Name | What It Checks | Cost |
|---|---|---|---|
| 1 | Structural | Schema conformance, type validation, injection detection | $0 |
| 2 | Consistency | Internal contradictions, taxonomy rule violations | $0 |
| 3 | Cross-Model | AI adversarial verification (different model than generator) | ~$0.001 |
| 4 | Evidence | External API corroboration (Wikipedia, Wikidata) + AI reasoner | ~$0.002-0.005 |
This is stricter than news validation (multi_source) because seed facts have no independent news sources to corroborate. The enrichment data from Step 3 is not reused in validation — the validator independently queries external APIs to avoid circular reasoning.
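The cheapest-phases-first ordering can be sketched as a short-circuiting runner (phase names come from the table; the signatures are assumptions, not the real handler's API):

```typescript
// A phase is a named async check; cheap ($0) phases are listed first.
type Phase = { name: string; check: (fact: unknown) => Promise<boolean> };

// Run phases in order; the first failure stops the pipeline before the
// paid cross-model and evidence phases ever run.
async function validateMultiPhase(
  fact: unknown,
  phases: Phase[],
): Promise<{ passed: boolean; failedAt?: string }> {
  for (const phase of phases) {
    if (!(await phase.check(fact))) {
      return { passed: false, failedAt: phase.name };
    }
  }
  return { passed: true };
}
```

The short-circuit is what keeps the free structural and consistency phases doing most of the filtering work.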
Utility Scripts
| Script | Purpose | When to Run |
|---|---|---|
| generate-curated-entries.ts | AI-generate entity names for a topic | Starting a new category |
| bulk-enqueue.ts | Dispatch pending entries to workers | After entity generation |
| generate-challenge-content.ts | Batch challenge generation | After validation completes |
| seed-from-files.ts | Parse XLSX/DOCX/CSV and seed | Bulk import from spreadsheets |
| cleanup-content.ts | Rewrite titles/context | Fixing quality issues |
| backfill-fact-nulls.ts | Fill NULL metadata fields | After schema changes |
| cleanup-seed-queue.ts | Remove garbage entries | Periodic maintenance |
| deepen-topic-paths.ts | AI-reclassify to deeper categories | After adding subcategories |
| remap-topic-paths.ts | Batch fix malformed topic paths | After taxonomy changes |
| materialize-entity-categories.ts | Create depth-2/3 categories from entities | Expanding taxonomy |
| regen-voice-pass.ts | Regenerate voice-pass content | After voice rule changes |
| rewrite-challenge-defects.ts | Fix challenge quality defects | QA remediation |
Cost Model
| Component | Per Entity (high tier) | Per Entity (medium tier) |
|---|---|---|
| Explosion AI call | ~$0.05-0.15 | ~$0.02-0.05 |
| Validation (per fact) | ~$0.003 | ~$0.003 |
| Challenge content (per fact) | ~$0.006 | ~$0.006 |
| Enrichment APIs | $0 (all free) | $0 |
| Total (50-100 facts) | ~$0.50-1.00 | ~$0.20-0.50 |
Super fact discovery adds ~$0.02 per batch of 10-50 entries.
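As a back-of-envelope check, the per-entity total decomposes into one explosion call plus per-fact downstream costs (figures from the table above; the constant and function names are illustrative):

```typescript
// Per-fact downstream cost: validation (~$0.003) + challenge content (~$0.006).
const PER_FACT_COST = 0.003 + 0.006;

// Enrichment APIs are free, so they contribute nothing to the total.
function estimateEntityCost(factCount: number, explosionCost: number): number {
  return explosionCost + factCount * PER_FACT_COST;
}
```

For a high-tier entity producing 75 facts with a $0.10 explosion call, this lands near $0.78, inside the quoted $0.50-1.00 band.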
Real-World Example: Seeding "Michael Jordan"
Input
name: "Michael Jordan"
topic_path: "Sports > Basketball > NBA"
richness_tier: "high"
aliases: ["MJ", "Air Jordan", "His Airness"]
Enrichment (automatic)
- Knowledge Graph: NBA player, born 1963, 6x champion, resultScore: 2847
- Wikidata: 200+ sitelinks, career stats, team history, awards
- Wikipedia: 4-paragraph summary with career highlights
- TheSportsDB: Team badge URLs, player photo, league metadata
Deterministic notability
- KG resultScore (2847) > 500 ✓
- Wikidata sitelinks (200+) > 20 ✓
- Override: notability = 0.9, method = kg_wikidata_bypass
Explosion output (abbreviated)
Facts (75 generated):
- "Michael Jordan scored 63 points against the Celtics in the 1986 playoffs — a record that still stands"
- "Jordan was cut from his high school varsity team as a sophomore at Laney High School in Wilmington, NC"
- "His six NBA Finals appearances resulted in six championships — a perfect 6-0 Finals record"
- ... (72 more)
Spinoffs (7 discovered):
- Scottie Pippen (teammate, priority: 0.92)
- Phil Jackson (coach, priority: 0.88)
- Larry Bird (rival, priority: 0.85)
- Space Jam (cultural crossover, priority: 0.72)
- Nike Air Jordan (business impact, priority: 0.70)
- Dean Smith (college coach, priority: 0.65)
- Kobe Bryant (spiritual successor, priority: 0.60)
Super fact candidates (3):
- Jordan and Magic Johnson: championship age coincidence
- Jordan and Larry Bird: playoff rivalry record
- Jordan and Scottie Pippen: combined scoring stats
What happens next
- 75 facts → IMPORT_FACTS → fact_records (source_type: file_seed)
- 75 × VALIDATE_FACT messages enqueued (multi_phase strategy)
- 7 spinoff entries → seed_entry_queue (status: pending, awaiting next bulk-enqueue)
- After validation: RESOLVE_IMAGE + GENERATE_CHALLENGE_CONTENT per fact
- ~60-70 facts survive validation → appear in feed within hours
Key Files
| File | Purpose |
|---|---|
| packages/ai/src/seed-explosion.ts | explodeCategoryEntry() + findSuperFacts() |
| packages/ai/src/enrichment.ts | Enrichment orchestrator (8 free APIs) |
| apps/worker-facts/src/handlers/explode-entry.ts | EXPLODE_CATEGORY_ENTRY handler |
| apps/worker-facts/src/handlers/import-facts.ts | IMPORT_FACTS handler + validation strategy selection |
| apps/worker-facts/src/handlers/find-super-facts.ts | FIND_SUPER_FACTS handler |
| scripts/seed/generate-curated-entries.ts | Entity name generation |
| scripts/seed/bulk-enqueue.ts | Queue dispatch for pending entries |
| packages/shared/src/schemas.ts | Zod schemas for queue messages |
| packages/config/src/index.ts | Config and thresholds |
| docs/projects/seeding/SEED.md | Operational directives for seeding |
Related
- Fact Ingestion — Source of Truth Map — SOT references for all three pipelines
- News & Fact Engine — System reference
- Evergreen Pipeline — AI-generated timeless facts
- News Pipeline — Current events ingestion
- Fact-Challenge Anatomy — How facts become challenges