Seed Pipeline Model Evaluation

Archived (Feb 21 2026): This evaluation led to a temporary xAI Grok integration that has since been removed. The pipeline now uses gpt-5-mini (default) with per-model prompt optimization via the ModelAdapter pattern. Available models: gpt-5-mini, gemini-2.5-flash, gemini-3-flash-preview, claude-haiku-4-5. The content below is preserved as a historical record of the Feb 17 evaluation.

Comparison of LLM options for fact explosion and challenge title generation. The seed pipeline requires structured output (Zod schema via Vercel AI SDK) with creative, specific, theatrical titles that match Eko's voice.
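The structured-output requirement can be sketched with the Vercel AI SDK's `generateObject` and a Zod schema. This is an illustrative shape only: the schema fields (`title`, `fact`) and the `explodeEntry` helper are assumptions, not the pipeline's actual code.

```typescript
// Sketch of the structured-output call the pipeline relies on.
// Schema fields are illustrative, not the pipeline's real Zod schema.
import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

const explosionSchema = z.object({
  facts: z.array(
    z.object({
      // Titles should be specific and cinematic, matching Eko's voice.
      title: z.string().describe("Theatrical, specific challenge title"),
      fact: z.string(),
    }),
  ),
});

export async function explodeEntry(entryText: string) {
  const { object } = await generateObject({
    model: openai("gpt-5-mini"),
    schema: explosionSchema,
    prompt: `Explode this entry into specific, cinematic facts:\n${entryText}`,
  });
  return object.facts; // SDK guarantees this matches explosionSchema
}
```

Any candidate model must support this call pattern (JSON-schema-constrained output through an AI SDK provider), which is why the comparison below tracks structured-output and provider support per model.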

Current State (Feb 17 2026)

  • 144K facts generated across 10 topics from ~6K completed entries
  • gpt-5-nano used for initial bulk generation (cheap but generic titles)
  • gpt-5-mini used for title improvement pass and gen-2 explosions
  • Title quality is still inconsistent: many titles are vague ("Musical Fusion", "Cultural Preservation") rather than specific and cinematic
  • OpenAI monthly quota hit after 32.7M tokens (~$33 of the $120 budget)

Requirements

| Requirement | Weight |
| --- | --- |
| Structured output (JSON schema) | Must-have |
| Vercel AI SDK provider support | Must-have |
| Theatrical, specific challenge titles | High |
| Low cost per million output tokens | High |
| Reasoning capability (for specificity) | Medium |
| Large context window | Low |

Model Comparison

| Model | Provider | Input $/1M | Output $/1M | Blended* | Structured Output | SDK Provider |
| --- | --- | --- | --- | --- | --- | --- |
| gpt-5-nano | OpenAI | $0.05 | $0.40 | ~$0.30 | Yes | @ai-sdk/openai (native) |
| gpt-5-mini | OpenAI | $0.25 | $2.00 | ~$1.50 | Yes | @ai-sdk/openai (native) |
| Grok 4.1 Fast | xAI | $0.20 | $0.50 | ~$0.40 | Yes | @ai-sdk/xai (first-party) |
| Grok 4.1 Fast Reasoning | xAI | $0.20 | $0.50 | ~$0.40 | Yes | @ai-sdk/xai (first-party) |
| GLM-4.7 | Zhipu/Z.AI | $0.40 | $1.50 | ~$1.10 | Yes | zhipu-ai-provider (community) |
| Grok 4 | xAI | $3.00 | $15.00 | ~$12.00 | Yes | @ai-sdk/xai (first-party) |
| Claude Sonnet 4.5 | Anthropic | $3.00 | $15.00 | ~$12.00 | Yes | @ai-sdk/anthropic (native) |

*Blended cost assumes a ~3:1 output:input token ratio, typical for fact explosion.

Analysis

Grok 4.1 Fast Reasoning

Best value for Eko's use case.

  • 4x cheaper on output than gpt-5-mini ($0.50/M vs $2.00/M); output is ~75% of token spend at the assumed 3:1 ratio
  • Reasoning variant "thinks before generating" which should produce more specific, theatrical titles
  • Scores 64/65 quality benchmark (near Grok 4 / o3 level) at 1/15th the Grok 4 price
  • 2M context window (not critical for explosion but useful for super-facts)
  • @ai-sdk/xai is a first-party Vercel AI SDK provider
  • xAI offers $25 free credits on signup + $150/month via data sharing program

Projected cost for remaining 52K entries: $5-8 (vs $20-25 with gpt-5-mini)

GLM-4.7

Solid alternative, but pricier than Grok.

  • Strong at structured output and agent workflows
  • 203K context window
  • Output pricing ($1.50/M) is 3x Grok 4.1 Fast ($0.50/M)
  • Community SDK provider (less battle-tested than first-party)
  • Best suited for coding/agent tasks rather than creative fact generation

gpt-5-mini (Current)

Acceptable quality but expensive for bulk.

  • Proven to work with the pipeline
  • Output at $2.00/M is the most expensive option in the "cheap" tier
  • Title quality improved over nano but still produces generic titles
  • Shares quota with other OpenAI usage (monthly quota already hit)

Integration Path

Adding xAI Grok requires:

  1. bun add @ai-sdk/xai in packages/ai
  2. Add XAI_API_KEY to environment config
  3. Add grok-4-1-fast-reasoning and grok-4-1-fast-non-reasoning to model registry
  4. Update the ai_model_tier_config DB table: UPDATE ai_model_tier_config SET model = 'grok-4-1-fast-reasoning' WHERE tier = 'default'
  5. Test batch of 10 entries to validate structured output quality
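Steps 2-5 above can be sketched end to end. The `@ai-sdk/xai` factory (`createXai`) and `generateObject` are real Vercel AI SDK APIs; the smoke-test schema and `smokeTest` helper are illustrative assumptions, not existing pipeline code.

```typescript
// Sketch of wiring the first-party xAI provider and smoke-testing
// structured output. Schema and helper names are illustrative.
import { generateObject } from "ai";
import { createXai } from "@ai-sdk/xai";
import { z } from "zod";

// Step 2: read XAI_API_KEY from environment config.
const xai = createXai({ apiKey: process.env.XAI_API_KEY });

// Step 4 is run against the DB, not in code:
//   UPDATE ai_model_tier_config
//   SET model = 'grok-4-1-fast-reasoning'
//   WHERE tier = 'default';

// Step 5: validate structured output on a batch of 10 entries.
const titleSchema = z.object({ title: z.string() }); // illustrative schema

export async function smokeTest(entries: string[]) {
  for (const entry of entries.slice(0, 10)) {
    const { object } = await generateObject({
      model: xai("grok-4-1-fast-reasoning"),
      schema: titleSchema,
      prompt: `Write one theatrical, specific challenge title for: ${entry}`,
    });
    console.log(object.title); // eyeball specificity and voice
  }
}
```

Keeping the smoke test to 10 entries keeps validation cost negligible at Grok 4.1 Fast pricing before committing the full 52K-entry run.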

Cost Projections

For the remaining ~52K pending entries (at ~434 tokens/fact, ~15 facts/entry):

| Model | Est. Total Cost | Quality |
| --- | --- | --- |
| gpt-5-nano | $3-5 | Poor titles, fast |
| Grok 4.1 Fast Reasoning | $5-8 | Strong (reasoning step) |
| GLM-4.7 | $10-15 | Good |
| gpt-5-mini | $20-25 | Acceptable |

Decision Log

| Date | Decision | Rationale |
| --- | --- | --- |
| 2026-02-17 | Start with gpt-5-nano for bulk | Minimize cost for initial corpus |
| 2026-02-17 | Switch to gpt-5-mini for quality | Nano titles too generic for platform voice |
| 2026-02-17 | Evaluate Grok 4.1 Fast | OpenAI quota hit; need cheaper + better quality |
| 2026-02-17 | Integrate xAI Grok as provider | 4x cheaper output, reasoning for specificity, first-party SDK |
| 2026-02-17 | Set ALL tiers to grok-4-1-fast-reasoning | Single provider simplifies routing, eliminates cross-provider inconsistencies |
| 2026-02-17 | Full corpus cleanup (not just titles) | Context field too sparse; notability scores missing; holistic rewrite needed |
