Seeding Pipeline — Detailed Flow
How Eko bootstraps new topic areas by exploding named entities into structured facts, discovering related entities, and finding cross-entity connections.
Overview
The seeding pipeline is the fastest way to populate a new topic category with high-quality content. Instead of waiting for news articles to trickle in, a single seed entry like "Michael Jordan" can produce 50-100 verified fact cards in one pass — plus automatically discover related entities (Scottie Pippen, Phil Jackson) that feed the next round.
Three source types flow through this pipeline:
| Source Type | What It Produces | Trigger |
|---|---|---|
| file_seed | Structured facts exploded from a named entity | Manual seed entry creation |
| spinoff_discovery | New seed entries discovered during explosion | Automatic (side effect of explosion) |
| ai_super_fact | Cross-entity correlation facts | Batch completion trigger |
All three converge into the same validation → image → challenge content path as news and evergreen facts.
End-to-End Flow
Content team creates seed entry
(name: "Prince", topic: "Music > Artists > Pop", richness: "high")
│
▼
┌─ STEP 1: ENTITY GENERATION ──────────────────────────────────────┐
│ Script: generate-curated-entries.ts │
│ AI generates entity names for a topic category │
│ Inserts into seed_entry_queue (status: pending) │
│ Optional: manual creation via admin dashboard │
└──────────────────────────────────────────────────┬────────────────┘
│
▼
┌─ STEP 2: QUEUE DISPATCH ──────────────────────────────────────────┐
│ Script: bulk-enqueue.ts │
│ Reads pending seed_entry_queue rows │
│ Enqueues EXPLODE_CATEGORY_ENTRY per entry │
└──────────────────────────────────────────────────┬────────────────┘
│
▼
┌─ STEP 3: EXPLOSION ──────────────────────────────────────────────┐
│ Worker: worker-facts │
│ Handler: explode-entry.ts → explodeCategoryEntry() │
│ │
│ Inputs: │
│ - Entity name + aliases │
│ - Topic path + schema keys │
│ - Richness tier → fact count range │
│ - Existing titles (for deduplication) │
│ - Entity context from enrichment APIs (optional) │
│ │
│ Outputs per entity: │
│ - 10-100 structured facts (ExplodedFact[]) │
│ - 3-10 spinoff candidates (SpinoffCandidate[]) │
│ - 0-5 super fact candidates (SuperFactCandidate[]) │
│ - Discovered aliases │
└────────────┬──────────────┬──────────────────────┬────────────────┘
│ │ │
▼ ▼ ▼
┌─ STEP 4a ──────┐ ┌─ STEP 4b ──────┐ ┌─ STEP 4c ──────────────┐
│ IMPORT_FACTS │ │ Spinoff insert │ │ Super fact candidates │
│ Batch import │ │ New seed_entry │ │ stored for batch │
│ → fact_records │ │ rows queued │ │ FIND_SUPER_FACTS later │
│ (file_seed) │ │ for future │ │ │
│ │ │ explosion │ │ │
└───────┬────────┘ └────────────────┘ └────────────────────────┘
│
▼
┌─ STEP 5: VALIDATION ─────────────────────────────────────────────┐
│ VALIDATE_FACT enqueued per fact (strategy: multi_phase) │
│ 4-phase: structural → consistency → cross-model → evidence │
│ Stricter than news — no independent sources to corroborate │
└──────────────────────────────────────────────────┬────────────────┘
│
▼
┌─ STEP 6: POST-VALIDATION FAN-OUT ───────────────────────────────┐
│ RESOLVE_IMAGE (parallel) — Wikipedia → SportsDB → Unsplash │
│ GENERATE_CHALLENGE_CONTENT (parallel) — 6 quiz styles │
└──────────────────────────────────────────────────┬────────────────┘
│
▼
Fact appears in feed
Step 1: Entity Generation
How Entities Are Created
Seed entries can be created through multiple paths:
| Method | Script/UI | When to Use |
|---|---|---|
| AI generation | scripts/seed/generate-curated-entries.ts | Bootstrapping a new topic category |
| Manual entry | Admin dashboard → Pipeline → Seed | Adding specific entities |
| File import | scripts/seed/seed-from-files.ts | Bulk import from XLSX/DOCX/CSV |
| Spinoff discovery | Automatic (during explosion) | Expanding from existing entities |
Seed Entry Record
Each seed entry in seed_entry_queue contains:
| Field | Example | Purpose |
|---|---|---|
| name | "Michael Jordan" | Entity to explode |
| topic_path | "Sports > Basketball > NBA" | Topic classification |
| richness_tier | "high" | Controls fact count range |
| aliases | ["MJ", "Air Jordan", "His Airness"] | Alternative names for dedup |
| status | "pending" | Processing state |
| source_type | "manual" or "spinoff_discovery" | How it entered the system |
| parent_entry_id | UUID (nullable) | Links spinoffs to parent |
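As a rough TypeScript sketch, a row in this queue might be typed as follows (field names come from the table above; the status values beyond "pending" are assumptions):

```typescript
// Hypothetical shape of a seed_entry_queue row. Field names follow the
// table above; the non-"pending" status values are assumed, not confirmed.
interface SeedEntry {
  name: string;
  topic_path: string;
  richness_tier: 'high' | 'medium' | 'low';
  aliases: string[];
  status: 'pending' | 'processing' | 'done' | 'failed';
  source_type: 'manual' | 'spinoff_discovery';
  parent_entry_id: string | null; // links spinoffs to their parent entity
}

const mj: SeedEntry = {
  name: 'Michael Jordan',
  topic_path: 'Sports > Basketball > NBA',
  richness_tier: 'high',
  aliases: ['MJ', 'Air Jordan', 'His Airness'],
  status: 'pending',
  source_type: 'manual',
  parent_entry_id: null,
};
```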
Step 3: Explosion — The Core AI Function
Richness Tiers
The richness tier determines how many facts the AI generates per entity:
| Tier | Fact Range | Typical Topics |
|---|---|---|
| high | 50-100 | Entertainment, sports, well-known people |
| medium | 20-50 | Geography, science, animals |
| low | 10-20 | Business, design, fashion |
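The mapping can be expressed as a tiny helper (the function name is illustrative; the ranges come straight from the table):

```typescript
type RichnessTier = 'high' | 'medium' | 'low';

// Hypothetical helper: richness tier → how many facts the AI should generate.
function factRangeFor(tier: RichnessTier): { min: number; max: number } {
  switch (tier) {
    case 'high':
      return { min: 50, max: 100 };
    case 'medium':
      return { min: 20, max: 50 };
    case 'low':
      return { min: 10, max: 20 };
  }
}
```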
Enrichment Context
Before calling the AI, the handler optionally resolves enrichment context from external APIs. This gives the AI grounded data to work with rather than relying purely on its training data.
Always queried (parallel):
| Source | What It Provides |
|---|---|
| Google Knowledge Graph | Entity type, description, official URLs, resultScore |
| Wikidata | Structured properties, sitelink count, identifiers |
| Wikipedia | Summary paragraph, key facts |
Topic-routed (conditional):
| Topic Path | Source | What It Provides |
|---|---|---|
| sports/* | TheSportsDB | Team badges, player photos, league data |
| music/* | MusicBrainz | Discography, genre, active years |
| geography/* | Nominatim | Coordinates, administrative boundaries |
| books/* | Open Library | Publication data, author info, ISBNs |
All enrichment calls use Promise.allSettled() — a failing API never blocks the explosion. The merged context string is injected into the AI prompt as grounding data.
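A minimal sketch of that fail-soft fan-out, assuming fetchers are injected (the real orchestrator in packages/ai/src/enrichment.ts will differ in names and shape):

```typescript
type Fetcher = (entity: string) => Promise<string>;

// Query every source in parallel; Promise.allSettled never rejects, so
// one failing API cannot block or abort the explosion.
async function gatherEnrichment(
  entity: string,
  fetchers: Record<string, Fetcher>,
): Promise<string> {
  const names = Object.keys(fetchers);
  const settled = await Promise.allSettled(names.map((n) => fetchers[n](entity)));
  return settled
    .map((r, i) => (r.status === 'fulfilled' ? `[${names[i]}] ${r.value}` : null))
    .filter((line): line is string => line !== null)
    .join('\n');
}
```

A source that rejects simply drops out of the merged context string; the explosion proceeds with whatever survived.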
Deterministic Notability
When both Knowledge Graph and Wikidata return results, the system can compute a deterministic notability score that overrides the AI's assessment:
- KG resultScore > 500 + Wikidata sitelinkCount > 20 → notability 0.9
- KG resultScore > 100 + Wikidata found → notability 0.8
- Otherwise → the AI's own notability score is used
This prevents the AI from under-scoring well-known entities or over-scoring obscure ones.
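A sketch of the override logic, with thresholds taken from the bullets above (type and function names are hypothetical):

```typescript
// Hypothetical signals merged from Knowledge Graph and Wikidata lookups.
interface EnrichmentSignals {
  kgResultScore?: number;
  wikidataFound: boolean;
  wikidataSitelinks?: number;
}

// Deterministic notability: strong external signals override the AI's
// own score; otherwise fall back to the AI's assessment.
function notabilityScore(signals: EnrichmentSignals, aiScore: number): number {
  const { kgResultScore = 0, wikidataFound, wikidataSitelinks = 0 } = signals;
  if (kgResultScore > 500 && wikidataSitelinks > 20) return 0.9;
  if (kgResultScore > 100 && wikidataFound) return 0.8;
  return aiScore;
}
```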
AI Prompt Composition
The explosion prompt includes:
- CHALLENGE_TONE_PREFIX — theatrical, cinematic title requirements
- Taxonomy voice — domain-specific register and energy (via resolveVoice())
- Domain vocabulary — expert language patterns (via formatVocabularyForPrompt())
- Taxonomy content rules — formatting conventions (via resolveContentRules())
- Schema keys — the JSONB fields the AI must produce
- Existing titles — for deduplication (up to 50 titles)
- Enrichment context — grounding data from external APIs
Model Selection
| Aspect | Detail |
|---|---|
| Task name | seed_explosion |
| Default tier | default (cost-optimized for bulk seeding) |
| Can escalate | Yes, if escalation signals present |
| Model override | Supported via modelOverride input |
Spinoff Discovery
During explosion, the AI also discovers related entities. Each spinoff includes:
| Field | Example | Purpose |
|---|---|---|
| name | "Scottie Pippen" | Related entity |
| suggestedTopicPath | "Sports > Basketball > NBA" | Where to classify |
| relationship | "Teammate, Chicago Bulls dynasty" | Why it's related |
| priorityScore | 0.85 | Processing priority (0-1) |
Spinoffs are inserted as new seed_entry_queue rows with:
- source_type: 'spinoff_discovery'
- parent_entry_id linking to the source entity
- Entity links created via insertEntityLink() with connection type 'spinoff'
Spinoffs are not immediately processed — they queue up for the next bulk-enqueue run, allowing human review of what the AI discovered.
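A sketch of turning a spinoff candidate into a queued seed row (helper and field names are illustrative; the real insert also creates an entity link via insertEntityLink()):

```typescript
// Candidate shape mirrors the spinoff table above.
interface SpinoffCandidate {
  name: string;
  suggestedTopicPath: string;
  relationship: string;
  priorityScore: number; // 0-1
}

// Hypothetical mapper: spinoffs land back in the queue as "pending",
// so they wait for the next bulk-enqueue run (and human review).
function toSeedRow(spinoff: SpinoffCandidate, parentId: string) {
  return {
    name: spinoff.name,
    topic_path: spinoff.suggestedTopicPath,
    status: 'pending' as const,
    source_type: 'spinoff_discovery' as const,
    parent_entry_id: parentId,
  };
}
```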
Super Facts
What They Are
Super facts are surprising connections between entities that wouldn't surface from individual explosions; discovering them requires context that spans multiple entities.
Examples:
- "Michael Jordan and Magic Johnson both won their first NBA championship at age 27"
- "Prince and David Bowie both released landmark albums in 1984"
- "Three Nobel Prize winners in Physics were born in the same small German town"
How They're Found
After a batch of seed entries completes, the FIND_SUPER_FACTS message triggers findSuperFacts():
- Loads 10-50 recently exploded entries from the same topic area
- AI analyzes them for cross-entity connections
- Returns candidates with connection metadata
Connection Types
| Type | Example |
|---|---|
| shared_event | Two people at the same historical event |
| rivalry | Competitive relationship |
| collaboration | Worked together on something notable |
| temporal | Same date/year coincidence |
| geographic | Same origin or location connection |
| causal | One entity influenced another |
Super Fact Record
Each super fact is inserted as a fact_record with:
- source_type: 'ai_super_fact'
- status: 'pending_validation'
- Entries in the super_fact_links table linking 2-3 entities
- Entity links created between all pairs of linked entities
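The "all pairs" entity linking for a 2-3 entity super fact can be sketched as (names are illustrative):

```typescript
// Hypothetical candidate shape; connection types come from the table above.
interface SuperFactCandidate {
  title: string;
  connectionType: 'shared_event' | 'rivalry' | 'collaboration' | 'temporal' | 'geographic' | 'causal';
  linkedEntityIds: string[]; // 2-3 entities
}

// Every unordered pair of linked entities gets its own entity link row.
function entityPairs(ids: string[]): Array<[string, string]> {
  const pairs: Array<[string, string]> = [];
  for (let i = 0; i < ids.length; i++) {
    for (let j = i + 1; j < ids.length; j++) {
      pairs.push([ids[i], ids[j]]);
    }
  }
  return pairs;
}
```

A 3-entity super fact therefore produces three entity link rows.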
Validation Strategy
Seed-originated facts use the multi_phase validation strategy — the strictest available:
| Phase | Name | What It Checks | Cost |
|---|---|---|---|
| 1 | Structural | Schema conformance, type validation, injection detection | $0 |
| 2 | Consistency | Internal contradictions, taxonomy rule violations | $0 |
| 3 | Cross-Model | AI adversarial verification (different model than generator) | ~$0.001 |
| 4 | Evidence | External API corroboration (Wikipedia, Wikidata) + AI reasoner | ~$0.002-0.005 |
This is stricter than news validation (multi_source) because seed facts have no independent news sources to corroborate. The enrichment data from Step 3 is not reused in validation — the validator independently queries external APIs to avoid circular reasoning.
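The cheapest-phases-first ordering can be sketched as a short-circuiting runner (phase names come from the table; the signatures are assumptions, not the real handler's API):

```typescript
// A phase is a named async check; cheap ($0) phases are listed first.
type Phase = { name: string; check: (fact: unknown) => Promise<boolean> };

// Run phases in order; the first failure stops the pipeline before the
// paid cross-model and evidence phases ever run.
async function validateMultiPhase(
  fact: unknown,
  phases: Phase[],
): Promise<{ passed: boolean; failedAt?: string }> {
  for (const phase of phases) {
    if (!(await phase.check(fact))) {
      return { passed: false, failedAt: phase.name };
    }
  }
  return { passed: true };
}
```

The short-circuit is what keeps the free structural and consistency phases doing most of the filtering work.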
Utility Scripts
| Script | Purpose | When to Run |
|---|---|---|
| generate-curated-entries.ts | AI-generate entity names for a topic | Starting a new category |
| bulk-enqueue.ts | Dispatch pending entries to workers | After entity generation |
| generate-challenge-content.ts | Batch challenge generation | After validation completes |
| seed-from-files.ts | Parse XLSX/DOCX/CSV and seed | Bulk import from spreadsheets |
| cleanup-content.ts | Rewrite titles/context | Fixing quality issues |
| backfill-fact-nulls.ts | Fill NULL metadata fields | After schema changes |
| cleanup-seed-queue.ts | Remove garbage entries | Periodic maintenance |
| deepen-topic-paths.ts | AI-reclassify to deeper categories | After adding subcategories |
| remap-topic-paths.ts | Batch fix malformed topic paths | After taxonomy changes |
| materialize-entity-categories.ts | Create depth-2/3 categories from entities | Expanding taxonomy |
| regen-voice-pass.ts | Regenerate voice-pass content | After voice rule changes |
| rewrite-challenge-defects.ts | Fix challenge quality defects | QA remediation |
Cost Model
| Component | Per Entity (high tier) | Per Entity (medium tier) |
|---|---|---|
| Explosion AI call | ~$0.05-0.15 | ~$0.02-0.05 |
| Validation (per fact) | ~$0.003 | ~$0.003 |
| Challenge content (per fact) | ~$0.006 | ~$0.006 |
| Enrichment APIs | $0 (all free) | $0 |
| Total (50-100 facts) | ~$0.50-1.00 | ~$0.20-0.50 |
Super fact discovery adds ~$0.02 per batch of 10-50 entries.
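As a back-of-envelope check, the per-entity total decomposes into one explosion call plus per-fact downstream costs (figures from the table above; the constant and function names are illustrative):

```typescript
// Per-fact downstream cost: validation (~$0.003) + challenge content (~$0.006).
const PER_FACT_COST = 0.003 + 0.006;

// Enrichment APIs are free, so they contribute nothing to the total.
function estimateEntityCost(factCount: number, explosionCost: number): number {
  return explosionCost + factCount * PER_FACT_COST;
}
```

For a high-tier entity producing 75 facts with a $0.10 explosion call, this lands near $0.78, inside the quoted $0.50-1.00 band.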
Real-World Example: Seeding "Michael Jordan"
Input
name: "Michael Jordan"
topic_path: "Sports > Basketball > NBA"
richness_tier: "high"
aliases: ["MJ", "Air Jordan", "His Airness"]
Enrichment (automatic)
- Knowledge Graph: NBA player, born 1963, 6x champion, resultScore: 2847
- Wikidata: 200+ sitelinks, career stats, team history, awards
- Wikipedia: 4-paragraph summary with career highlights
- TheSportsDB: Team badge URLs, player photo, league metadata
Deterministic notability
- KG resultScore (2847) > 500 ✓
- Wikidata sitelinks (200+) > 20 ✓
- Override: notability = 0.9, method = kg_wikidata_bypass
Explosion output (abbreviated)
Facts (75 generated):
- "Michael Jordan scored 63 points against the Celtics in the 1986 playoffs — a record that still stands"
- "Jordan was cut from his high school varsity team as a sophomore at Laney High School in Wilmington, NC"
- "His six NBA Finals appearances resulted in six championships — a perfect 6-0 Finals record"
- ... (72 more)
Spinoffs (7 discovered):
- Scottie Pippen (teammate, priority: 0.92)
- Phil Jackson (coach, priority: 0.88)
- Larry Bird (rival, priority: 0.85)
- Space Jam (cultural crossover, priority: 0.72)
- Nike Air Jordan (business impact, priority: 0.70)
- Dean Smith (college coach, priority: 0.65)
- Kobe Bryant (spiritual successor, priority: 0.60)
Super fact candidates (3):
- Jordan and Magic Johnson: championship age coincidence
- Jordan and Larry Bird: playoff rivalry record
- Jordan and Scottie Pippen: combined scoring stats
What happens next
- 75 facts → IMPORT_FACTS → fact_records (source_type: file_seed)
- 75 × VALIDATE_FACT messages enqueued (multi_phase strategy)
- 7 spinoff entries → seed_entry_queue (status: pending, awaiting next bulk-enqueue)
- After validation: RESOLVE_IMAGE + GENERATE_CHALLENGE_CONTENT per fact
- ~60-70 facts survive validation → appear in feed within hours
Key Files
| File | Purpose |
|---|---|
| packages/ai/src/seed-explosion.ts | explodeCategoryEntry() + findSuperFacts() |
| packages/ai/src/enrichment.ts | Enrichment orchestrator (8 free APIs) |
| apps/worker-facts/src/handlers/explode-entry.ts | EXPLODE_CATEGORY_ENTRY handler |
| apps/worker-facts/src/handlers/import-facts.ts | IMPORT_FACTS handler + validation strategy selection |
| apps/worker-facts/src/handlers/find-super-facts.ts | FIND_SUPER_FACTS handler |
| scripts/seed/generate-curated-entries.ts | Entity name generation |
| scripts/seed/bulk-enqueue.ts | Queue dispatch for pending entries |
| packages/shared/src/schemas.ts | Zod schemas for queue messages |
| packages/config/src/index.ts | Config and thresholds |
| docs/projects/seeding/SEED.md | Operational directives for seeding |
Related
- Fact Ingestion — Source of Truth Map — SOT references for all three pipelines
- News & Fact Engine — System reference
- Evergreen Pipeline — AI-generated timeless facts
- News Pipeline — Current events ingestion
- Fact-Challenge Anatomy — How facts become challenges