Seeding Pipeline — Detailed Flow

How Eko bootstraps new topic areas by exploding named entities into structured facts, discovering related entities, and finding cross-entity connections.

Overview

The seeding pipeline is the fastest way to populate a new topic category with high-quality content. Instead of waiting for news articles to trickle in, a single high-richness seed entry like "Michael Jordan" can generate 50-100 candidate facts in one pass, which then go through validation before becoming fact cards. The same pass also discovers related entities (Scottie Pippen, Phil Jackson) that feed the next round.

Three source types flow through this pipeline:

| Source Type | What It Produces | Trigger |
| --- | --- | --- |
| file_seed | Structured facts exploded from a named entity | Manual seed entry creation |
| spinoff_discovery | New seed entries discovered during explosion | Automatic (side effect of explosion) |
| ai_super_fact | Cross-entity correlation facts | Batch completion trigger |

All three converge into the same validation → image → challenge content path as news and evergreen facts.


End-to-End Flow

  Content team creates seed entry
  (name: "Prince", topic: "Music > Artists > Pop", richness: "high")
         │
         ▼
  ┌─ STEP 1: ENTITY GENERATION ──────────────────────────────────────┐
  │  Script: generate-curated-entries.ts                              │
  │  AI generates entity names for a topic category                   │
  │  Inserts into seed_entry_queue (status: pending)                  │
  │  Optional: manual creation via admin dashboard                    │
  └──────────────────────────────────────────────────┬────────────────┘
         │
         ▼
  ┌─ STEP 2: QUEUE DISPATCH ──────────────────────────────────────────┐
  │  Script: bulk-enqueue.ts                                          │
  │  Reads pending seed_entry_queue rows                              │
  │  Enqueues EXPLODE_CATEGORY_ENTRY per entry                        │
  └──────────────────────────────────────────────────┬────────────────┘
         │
         ▼
  ┌─ STEP 3: EXPLOSION ──────────────────────────────────────────────┐
  │  Worker: worker-facts                                             │
  │  Handler: explode-entry.ts → explodeCategoryEntry()               │
  │                                                                   │
  │  Inputs:                                                          │
  │  - Entity name + aliases                                          │
  │  - Topic path + schema keys                                       │
  │  - Richness tier → fact count range                               │
  │  - Existing titles (for deduplication)                             │
  │  - Entity context from enrichment APIs (optional)                 │
  │                                                                   │
  │  Outputs per entity:                                              │
  │  - 10-100 structured facts (ExplodedFact[])                       │
  │  - 3-10 spinoff candidates (SpinoffCandidate[])                   │
  │  - 0-5 super fact candidates (SuperFactCandidate[])               │
  │  - Discovered aliases                                             │
  └────────────┬──────────────┬──────────────────────┬────────────────┘
               │              │                      │
               ▼              ▼                      ▼
  ┌─ STEP 4a ──────┐  ┌─ STEP 4b ──────┐  ┌─ STEP 4c ──────────────┐
  │ IMPORT_FACTS    │  │ Spinoff insert │  │ Super fact candidates   │
  │ Batch import    │  │ New seed_entry │  │ stored for batch        │
  │ → fact_records  │  │ rows queued    │  │ FIND_SUPER_FACTS later  │
  │ (file_seed)     │  │ for future     │  │                        │
  │                 │  │ explosion      │  │                        │
  └───────┬────────┘  └────────────────┘  └────────────────────────┘
          │
          ▼
  ┌─ STEP 5: VALIDATION ─────────────────────────────────────────────┐
  │  VALIDATE_FACT enqueued per fact (strategy: multi_phase)          │
  │  4-phase: structural → consistency → cross-model → evidence      │
  │  Stricter than news — no independent sources to corroborate      │
  └──────────────────────────────────────────────────┬────────────────┘
          │
          ▼
  ┌─ STEP 6: POST-VALIDATION FAN-OUT ───────────────────────────────┐
  │  RESOLVE_IMAGE (parallel) — Wikipedia → SportsDB → Unsplash     │
  │  GENERATE_CHALLENGE_CONTENT (parallel) — 6 quiz styles          │
  └──────────────────────────────────────────────────┬────────────────┘
          │
          ▼
     Fact appears in feed

Step 1: Entity Generation

How Entities Are Created

Seed entries can be created through multiple paths:

| Method | Script/UI | When to Use |
| --- | --- | --- |
| AI generation | scripts/seed/generate-curated-entries.ts | Bootstrapping a new topic category |
| Manual entry | Admin dashboard → Pipeline → Seed | Adding specific entities |
| File import | scripts/seed/seed-from-files.ts | Bulk import from XLSX/DOCX/CSV |
| Spinoff discovery | Automatic (during explosion) | Expanding from existing entities |

Seed Entry Record

Each seed entry in seed_entry_queue contains:

| Field | Example | Purpose |
| --- | --- | --- |
| name | "Michael Jordan" | Entity to explode |
| topic_path | "Sports > Basketball > NBA" | Topic classification |
| richness_tier | "high" | Controls fact count range |
| aliases | ["MJ", "Air Jordan", "His Airness"] | Alternative names for dedup |
| status | "pending" | Processing state |
| source_type | "manual" or "spinoff_discovery" | How it entered the system |
| parent_entry_id | UUID (nullable) | Links spinoffs to parent |
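
As a rough illustration, the record above could be modeled as the following TypeScript type. This is a sketch inferred from the field table only: the names `SeedEntry` and `newManualEntry` are hypothetical, and the real schema may carry more fields and more `status` values than the single "pending" state documented here.

```typescript
// Hypothetical model of a seed_entry_queue row, inferred from the field table.
type RichnessTier = "high" | "medium" | "low";
type SeedSourceType = "manual" | "spinoff_discovery";

interface SeedEntry {
  name: string;
  topic_path: string;
  richness_tier: RichnessTier;
  aliases: string[];
  status: string; // only "pending" is documented; other states are not listed
  source_type: SeedSourceType;
  parent_entry_id: string | null; // UUID of the parent entry, set only for spinoffs
}

// Hypothetical helper: build a manually created entry in its initial state.
function newManualEntry(
  name: string,
  topicPath: string,
  tier: RichnessTier,
  aliases: string[] = [],
): SeedEntry {
  return {
    name,
    topic_path: topicPath,
    richness_tier: tier,
    aliases,
    status: "pending",
    source_type: "manual",
    parent_entry_id: null,
  };
}

const jordan = newManualEntry("Michael Jordan", "Sports > Basketball > NBA", "high", [
  "MJ",
  "Air Jordan",
  "His Airness",
]);
```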

Step 3: Explosion — The Core AI Function

Richness Tiers

The richness tier determines how many facts the AI generates per entity:

| Tier | Fact Range | Typical Topics |
| --- | --- | --- |
| high | 50-100 | Entertainment, sports, well-known people |
| medium | 20-50 | Geography, science, animals |
| low | 10-20 | Business, design, fashion |
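
The tier table can be read as a simple lookup. The sketch below encodes the three documented ranges; `FACT_RANGES` and `isWithinTier` are hypothetical names, and where the real ranges live (for example in the config package) is not stated here.

```typescript
type RichnessTier = "high" | "medium" | "low";

// Fact count ranges copied from the tier table; the constant name is hypothetical.
const FACT_RANGES: Record<RichnessTier, { min: number; max: number }> = {
  high: { min: 50, max: 100 },
  medium: { min: 20, max: 50 },
  low: { min: 10, max: 20 },
};

// Check whether an explosion returned a fact count inside the tier's range.
function isWithinTier(tier: RichnessTier, factCount: number): boolean {
  const { min, max } = FACT_RANGES[tier];
  return factCount >= min && factCount <= max;
}
```

A check like this could flag an explosion that returned far fewer facts than its tier calls for, although the source does not say whether the handler enforces the range or only passes it to the prompt.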

Enrichment Context

Before calling the AI, the handler optionally resolves enrichment context from external APIs. This gives the AI grounded data to work with rather than relying purely on its training data.

Always queried (parallel):

| Source | What It Provides |
| --- | --- |
| Google Knowledge Graph | Entity type, description, official URLs, resultScore |
| Wikidata | Structured properties, sitelink count, identifiers |
| Wikipedia | Summary paragraph, key facts |

Topic-routed (conditional):

| Topic Path | Source | What It Provides |
| --- | --- | --- |
| sports/* | TheSportsDB | Team badges, player photos, league data |
| music/* | MusicBrainz | Discography, genre, active years |
| geography/* | Nominatim | Coordinates, administrative boundaries |
| books/* | Open Library | Publication data, author info, ISBNs |
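
The routing in the two tables above can be sketched as a lookup on the root topic segment. The function name `sourcesForTopic` is hypothetical, and the normalization is an assumption: the table writes paths as `sports/*` while seed entries use "Sports > Basketball > NBA", so this sketch accepts both separators and lowercases the root.

```typescript
// Sources queried for every entity, per the "always queried" table.
const ALWAYS_QUERIED: string[] = ["Google Knowledge Graph", "Wikidata", "Wikipedia"];

// Root topic segment → extra source, per the topic-routed table.
const TOPIC_SOURCES = new Map<string, string>([
  ["sports", "TheSportsDB"],
  ["music", "MusicBrainz"],
  ["geography", "Nominatim"],
  ["books", "Open Library"],
]);

// Assumption: the root is the first segment before ">" or "/", compared case-insensitively.
function sourcesForTopic(topicPath: string): string[] {
  const root = (topicPath.split(/[>\/]/)[0] ?? "").trim().toLowerCase();
  const routed = TOPIC_SOURCES.get(root);
  return routed ? [...ALWAYS_QUERIED, routed] : [...ALWAYS_QUERIED];
}
```

For "Sports > Basketball > NBA" this returns the three general sources plus TheSportsDB; a topic with no routed source, such as one under science, gets only the general three.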

All enrichment calls use Promise.allSettled() — a failing API never blocks the explosion. The merged context string is injected into the AI prompt as grounding data.
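
A minimal sketch of that failure-tolerant pattern follows. `Promise.allSettled()` is the standard JavaScript API; everything else (`EnrichmentSource`, `gatherContext`, the `[name] text` line format) is a hypothetical stand-in for the real orchestrator, which is not shown here.

```typescript
// Hypothetical source descriptor: a label plus an async fetcher returning context text.
interface EnrichmentSource {
  name: string;
  fetch: () => Promise<string>;
}

// Query all sources in parallel and merge whatever succeeded into one context string.
// Rejected or throwing sources are skipped, so one failing API cannot block the rest.
async function gatherContext(sources: EnrichmentSource[]): Promise<string> {
  const settled = await Promise.allSettled(
    sources.map(async (s) => `[${s.name}] ${await s.fetch()}`),
  );
  const parts: string[] = [];
  for (const result of settled) {
    if (result.status === "fulfilled") parts.push(result.value);
  }
  return parts.join("\n");
}
```

Wrapping each fetch in an `async` callback means a synchronous throw inside a fetcher also becomes a rejection rather than aborting the whole batch. With no successful sources the function returns an empty string, matching the description of enrichment as optional.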

Deterministic Notability

When both Knowledge Graph and Wikidata return results, the system can compute a deterministic notability score that overrides the AI's assessment:

  • KG resultScore > 500 and Wikidata sitelinkCount > 20 → notability 0.9
  • KG resultScore > 100 and a Wikidata match found → notability 0.8
  • Otherwise → the AI's own notability score is used

This prevents the AI from underscoring well-known entities or overscoring obscure ones.
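
The override rules above can be sketched as a small pure function. The thresholds come from the rules as written; the type and function names are hypothetical, and a `null` argument stands for "this source returned no result".

```typescript
// Hypothetical minimal views of the two enrichment results.
interface KgResult {
  resultScore: number;
}
interface WikidataResult {
  sitelinkCount: number;
}

// Apply the deterministic override only when both sources returned a result;
// otherwise fall back to the AI's own notability score.
function resolveNotability(
  kg: KgResult | null,
  wikidata: WikidataResult | null,
  aiScore: number,
): number {
  if (kg !== null && wikidata !== null) {
    if (kg.resultScore > 500 && wikidata.sitelinkCount > 20) return 0.9;
    if (kg.resultScore > 100) return 0.8;
  }
  return aiScore;
}
```

The rules are checked from strictest to loosest, so an entity that clears both thresholds gets 0.9 rather than 0.8.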

AI Prompt Composition

The explosion prompt includes:

  1. CHALLENGE_TONE_PREFIX — theatrical, cinematic title requirements
  2. Taxonomy voice — domain-specific register and energy (via resolveVoice())
  3. Domain vocabulary — expert language patterns (via formatVocabularyForPrompt())
  4. Taxonomy content rules — formatting conventions (via resolveContentRules())
  5. Schema keys — the JSONB fields the AI must produce
  6. Existing titles — for deduplication (up to 50 titles)
  7. Enrichment context — grounding data from external APIs
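
To make the ordering concrete, here is a sketch of how the seven parts might be concatenated. It assumes the voice, vocabulary, and content-rule strings have already been resolved by the helpers named above (their signatures are not documented here), and the section labels, `PromptParts`, and `composeExplosionPrompt` are hypothetical.

```typescript
// Hypothetical bundle of already-resolved prompt inputs, in the documented order.
interface PromptParts {
  tonePrefix: string; // value of CHALLENGE_TONE_PREFIX
  voice: string; // result of resolveVoice()
  vocabulary: string; // result of formatVocabularyForPrompt()
  contentRules: string; // result of resolveContentRules()
  schemaKeys: string[];
  existingTitles: string[];
  enrichmentContext?: string; // omitted when enrichment returned nothing
}

// The documented cap on titles passed in for deduplication.
const MAX_EXISTING_TITLES = 50;

function composeExplosionPrompt(parts: PromptParts): string {
  const titles = parts.existingTitles.slice(0, MAX_EXISTING_TITLES);
  const sections: string[] = [
    parts.tonePrefix,
    parts.voice,
    parts.vocabulary,
    parts.contentRules,
    `Schema keys: ${parts.schemaKeys.join(", ")}`,
    `Existing titles:\n${titles.join("\n")}`,
  ];
  if (parts.enrichmentContext) {
    sections.push(`Grounding data:\n${parts.enrichmentContext}`);
  }
  return sections.join("\n\n");
}
```

This sketch simply takes the first 50 titles; the source states only the cap, not how the real handler chooses which titles to include.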

Model Selection

| Aspect | Detail |
| --- | --- |
| Task name | seed_explosion |
| Default tier | default (cost-optimized for bulk seeding) |
| Can escalate | Yes, if escalation signals present |
| Model override | Supported via modelOverride input |

Spinoff Discovery

During explosion, the AI also discovers related entities. Each spinoff includes:

| Field | Example | Purpose |
| --- | --- | --- |
| name | "Scottie Pippen" | Related entity |
| suggestedTopicPath | "Sports > Basketball > NBA" | Where to classify |
| relationship | "Teammate, Chicago Bulls dynasty" | Why it's related |
| priorityScore | 0.85 | Processing priority (0-1) |

Spinoffs are inserted as new seed_entry_queue rows with:

  • source_type: 'spinoff_discovery'
  • parent_entry_id linking to the source entity
  • Entity links created via insertEntityLink() with connection type 'spinoff'

Spinoffs are not immediately processed — they queue up for the next bulk-enqueue run, allowing human review of what the AI discovered.
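
The mapping from spinoff candidates to queue rows might look like the sketch below. The field names follow the tables in this document; `toQueueRows` and `SpinoffQueueRow` are hypothetical, ordering by `priorityScore` is an assumption about how "processing priority" is used, and the richness tier assigned to spinoffs is left out because it is not documented here.

```typescript
interface SpinoffCandidate {
  name: string;
  suggestedTopicPath: string;
  relationship: string;
  priorityScore: number; // 0-1
}

// Hypothetical subset of a seed_entry_queue row for a spinoff.
interface SpinoffQueueRow {
  name: string;
  topic_path: string;
  status: "pending";
  source_type: "spinoff_discovery";
  parent_entry_id: string;
}

// Turn candidates into pending rows linked to the parent, highest priority first.
function toQueueRows(parentEntryId: string, candidates: SpinoffCandidate[]): SpinoffQueueRow[] {
  return [...candidates]
    .sort((a, b) => b.priorityScore - a.priorityScore)
    .map(
      (c): SpinoffQueueRow => ({
        name: c.name,
        topic_path: c.suggestedTopicPath,
        status: "pending",
        source_type: "spinoff_discovery",
        parent_entry_id: parentEntryId,
      }),
    );
}
```

The rows stay in "pending" and nothing is enqueued here, which mirrors the point that spinoffs wait for the next bulk-enqueue run.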


Super Facts

What They Are

Super facts are surprising connections between entities that wouldn't surface from individual explosions. They require context across multiple entities to discover.

Examples (illustrative of the shape of a super fact, not verified claims):

  • "Michael Jordan and Magic Johnson both won their first NBA championship at age 27"
  • "Prince and David Bowie both released landmark albums in 1984"
  • "Three Nobel Prize winners in Physics were born in the same small German town"

How They're Found

After a batch of seed entries completes, the FIND_SUPER_FACTS message triggers findSuperFacts():

  1. Loads 10-50 recently exploded entries from the same topic area
  2. AI analyzes them for cross-entity connections
  3. Returns candidates with connection metadata

Connection Types

| Type | Example |
| --- | --- |
| shared_event | Two people at the same historical event |
| rivalry | Competitive relationship |
| collaboration | Worked together on something notable |
| temporal | Same date/year coincidence |
| geographic | Same origin or location connection |
| causal | One entity influenced another |

Super Fact Record

Each super fact is inserted as a fact_record with:

  • source_type: 'ai_super_fact'
  • status: 'pending_validation'
  • Entries in super_fact_links table linking 2-3 entities
  • Entity links created between all pairs of linked entities
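
"All pairs" in the last bullet means every unordered combination of the linked entities: one link for 2 entities, three links for 3. A small sketch of that enumeration follows; `entityPairs` is a hypothetical helper, and how each pair is passed on to `insertEntityLink()` is not shown because that signature is not documented here.

```typescript
// Enumerate every unordered pair of entity IDs exactly once.
function entityPairs(entityIds: string[]): Array<[string, string]> {
  const pairs: Array<[string, string]> = [];
  entityIds.forEach((a, i) => {
    for (const b of entityIds.slice(i + 1)) {
      pairs.push([a, b]);
    }
  });
  return pairs;
}
```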

Validation Strategy

Seed-originated facts use the multi_phase validation strategy — the strictest available:

| Phase | Name | What It Checks | Cost |
| --- | --- | --- | --- |
| 1 | Structural | Schema conformance, type validation, injection detection | $0 |
| 2 | Consistency | Internal contradictions, taxonomy rule violations | $0 |
| 3 | Cross-Model | AI adversarial verification (different model than generator) | ~$0.001 |
| 4 | Evidence | External API corroboration (Wikipedia, Wikidata) + AI reasoner | ~$0.002-0.005 |

This is stricter than news validation (multi_source) because seed facts have no independent news sources to corroborate. The enrichment data from Step 3 is not reused in validation — the validator independently queries external APIs to avoid circular reasoning.
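
One plausible reading of the phase ordering is that the free checks run first and a failure stops the fact before any paid phase runs. The sketch below implements that reading. The early exit is an assumption, since the source lists the phases and their costs but not the control flow, and `Phase`, `runMultiPhase`, and the sample check are hypothetical.

```typescript
// Hypothetical phase descriptor: a name plus an async pass/fail check.
interface Phase<F> {
  name: string;
  check: (fact: F) => Promise<boolean>;
}

interface ValidationOutcome {
  passed: boolean;
  failedPhase: string | null;
  phasesRun: string[];
}

// Run phases in order and stop at the first failure (assumed behavior),
// so a fact that fails a $0 phase never reaches the paid ones.
async function runMultiPhase<F>(fact: F, phases: Phase<F>[]): Promise<ValidationOutcome> {
  const phasesRun: string[] = [];
  for (const phase of phases) {
    phasesRun.push(phase.name);
    if (!(await phase.check(fact))) {
      return { passed: false, failedPhase: phase.name, phasesRun };
    }
  }
  return { passed: true, failedPhase: null, phasesRun };
}

// Example structural check on a hypothetical fact shape: the title must be non-empty.
const structuralPhase: Phase<{ title: string }> = {
  name: "structural",
  check: async (fact) => fact.title.trim().length > 0,
};
```

In this arrangement the cross-model and evidence phases would be appended after the structural and consistency phases, each wrapping its own model or API call.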


Utility Scripts

| Script | Purpose | When to Run |
| --- | --- | --- |
| generate-curated-entries.ts | AI-generate entity names for a topic | Starting a new category |
| bulk-enqueue.ts | Dispatch pending entries to workers | After entity generation |
| generate-challenge-content.ts | Batch challenge generation | After validation completes |
| seed-from-files.ts | Parse XLSX/DOCX/CSV and seed | Bulk import from spreadsheets |
| cleanup-content.ts | Rewrite titles/context | Fixing quality issues |
| backfill-fact-nulls.ts | Fill NULL metadata fields | After schema changes |
| cleanup-seed-queue.ts | Remove garbage entries | Periodic maintenance |
| deepen-topic-paths.ts | AI-reclassify to deeper categories | After adding subcategories |
| remap-topic-paths.ts | Batch fix malformed topic paths | After taxonomy changes |
| materialize-entity-categories.ts | Create depth-2/3 categories from entities | Expanding taxonomy |
| regen-voice-pass.ts | Regenerate voice-pass content | After voice rule changes |
| rewrite-challenge-defects.ts | Fix challenge quality defects | QA remediation |

Cost Model

| Component | Per Entity (high tier) | Per Entity (medium tier) |
| --- | --- | --- |
| Explosion AI call | ~$0.05-0.15 | ~$0.02-0.05 |
| Validation (per fact) | ~$0.003 | ~$0.003 |
| Challenge content (per fact) | ~$0.006 | ~$0.006 |
| Enrichment APIs | $0 (all free) | $0 |
| Total per entity | ~$0.50-1.00 (50-100 facts) | ~$0.20-0.50 (20-50 facts) |

Super fact discovery adds ~$0.02 per batch of 10-50 entries.
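
The totals in the table follow from one explosion call plus per-fact validation and challenge costs. The sketch below reproduces that arithmetic using the approximate figures above; the function and constant names are hypothetical, and it assumes every generated fact is both validated and given challenge content, which slightly overstates cost when some facts fail validation.

```typescript
// Approximate per-fact costs in USD, taken from the cost table.
const VALIDATION_COST_PER_FACT = 0.003;
const CHALLENGE_COST_PER_FACT = 0.006;

// Estimated cost for one entity: one explosion call plus per-fact processing.
function estimateEntityCost(factCount: number, explosionCost: number): number {
  return explosionCost + factCount * (VALIDATION_COST_PER_FACT + CHALLENGE_COST_PER_FACT);
}

const highLow = estimateEntityCost(50, 0.05); // cheapest high-tier case
const highHigh = estimateEntityCost(100, 0.15); // most expensive high-tier case
const mediumLow = estimateEntityCost(20, 0.02);
const mediumHigh = estimateEntityCost(50, 0.05);
```

These come out to roughly $0.50 and $1.05 for the high tier and $0.20 and $0.50 for the medium tier, in line with the rounded totals in the table.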


Real-World Example: Seeding "Michael Jordan"

Input

name: "Michael Jordan"
topic_path: "Sports > Basketball > NBA"
richness_tier: "high"
aliases: ["MJ", "Air Jordan", "His Airness"]

Enrichment (automatic)

  • Knowledge Graph: NBA player, born 1963, 6x champion, resultScore: 2847
  • Wikidata: 200+ sitelinks, career stats, team history, awards
  • Wikipedia: 4-paragraph summary with career highlights
  • TheSportsDB: Team badge URLs, player photo, league metadata

Deterministic notability

  • KG resultScore (2847) > 500 ✓
  • Wikidata sitelinks (200+) > 20 ✓
  • Override: notability = 0.9, method = kg_wikidata_bypass

Explosion output (abbreviated)

Facts (75 generated):

  • "Michael Jordan scored 63 points against the Celtics in the 1986 playoffs — a record that still stands"
  • "Jordan was cut from his high school varsity team as a sophomore at Laney High School in Wilmington, NC"
  • "His six NBA Finals appearances resulted in six championships — a perfect 6-0 Finals record"
  • ... (72 more)

Spinoffs (7 discovered):

  • Scottie Pippen (teammate, priority: 0.92)
  • Phil Jackson (coach, priority: 0.88)
  • Larry Bird (rival, priority: 0.85)
  • Space Jam (cultural crossover, priority: 0.72)
  • Nike Air Jordan (business impact, priority: 0.70)
  • Dean Smith (college coach, priority: 0.65)
  • Kobe Bryant (spiritual successor, priority: 0.60)

Super fact candidates (3):

  • Jordan and Magic Johnson: championship age coincidence
  • Jordan and Larry Bird: playoff rivalry record
  • Jordan and Scottie Pippen: combined scoring stats

What happens next

  1. 75 facts → IMPORT_FACTS → fact_records (source_type: file_seed)
  2. 75 × VALIDATE_FACT messages enqueued (multi_phase strategy)
  3. 7 spinoff entries → seed_entry_queue (status: pending, awaiting next bulk-enqueue)
  4. After validation: RESOLVE_IMAGE + GENERATE_CHALLENGE_CONTENT per fact
  5. ~60-70 facts survive validation → appear in feed within hours

Key Files

| File | Purpose |
| --- | --- |
| packages/ai/src/seed-explosion.ts | explodeCategoryEntry() + findSuperFacts() |
| packages/ai/src/enrichment.ts | Enrichment orchestrator (8 free APIs) |
| apps/worker-facts/src/handlers/explode-entry.ts | EXPLODE_CATEGORY_ENTRY handler |
| apps/worker-facts/src/handlers/import-facts.ts | IMPORT_FACTS handler + validation strategy selection |
| apps/worker-facts/src/handlers/find-super-facts.ts | FIND_SUPER_FACTS handler |
| scripts/seed/generate-curated-entries.ts | Entity name generation |
| scripts/seed/bulk-enqueue.ts | Queue dispatch for pending entries |
| packages/shared/src/schemas.ts | Zod schemas for queue messages |
| packages/config/src/index.ts | Config and thresholds |
| docs/projects/seeding/SEED.md | Operational directives for seeding |