Gemini Batch API Cost Optimization Strategy

How Eko can leverage Google's Gemini Batch API to reduce AI costs by 50% on eligible pipeline stages, without changing quality or validation guarantees.

Executive Summary

Google's Gemini Batch API offers a 50% cost reduction on all token pricing for asynchronous workloads. Eko's fact engine pipeline is predominantly queue-driven and offline, making it an ideal candidate. At current volumes ($14-15/day), batch processing could save roughly $800-$2,400/year depending on coverage, while maintaining identical output quality.

Current Architecture

How Eko Calls Gemini Today

All AI calls use the Vercel AI SDK (@ai-sdk/google) via generateObject():

Queue Message → Worker Handler → selectModelForTask(task)
    → resolveModelTier(tier) → createLanguageModel(resolved)
    → generateObject({ model, schema, system, prompt })
    → recordCost() → DB update

Every call is synchronous and one-at-a-time. The worker processes a single queue message, makes one API call to Gemini, waits for the response, then moves to the next message.

Current Model Routing

All three tiers currently resolve to Gemini 2.5 Flash:

| Tier | Model | Token Pricing (per MTok) |
| --- | --- | --- |
| default | gemini-2.5-flash | $0.15 input / $0.60 output |
| mid | gemini-2.5-flash | $0.15 input / $0.60 output |
| high | gemini-2.5-flash | $0.15 input / $0.60 output |

Task-to-Tier Mapping

| Task | Tier | Volume | Latency Tolerance |
| --- | --- | --- | --- |
| notability_scoring | default | High | Hours |
| story_summary | default | Medium | Hours |
| fact_extraction | mid | Medium | Hours |
| fact_validation | mid | Medium | Hours (gates publication) |
| evergreen_generation | mid | Low-Medium | Hours (daily cron) |
| challenge_content_generation | default | High | Hours |
| seed_explosion | default | High (bursts) | Days |
| super_fact_discovery | default | Medium (bursts) | Days |
| content_cleanup | default | Medium (bursts) | Days |
| entity_classification | default | High | Hours |
| conversational_turn | default | High | Seconds (user-facing) |
| text_answer_moderation | default | High | Seconds (user-facing) |
| text_answer_scoring | default | High | Seconds (user-facing) |
| dispute_evaluation | mid | Low | Seconds (user-facing) |

Gemini Batch API Overview

How It Works

Instead of individual synchronous API calls, the Batch API accepts a collection of requests and processes them asynchronously:

  1. Submit a batch of GenerateContentRequest objects (inline or via JSONL file upload)
  2. Poll for job completion (typically minutes, SLO is 24 hours)
  3. Retrieve all results at once

Pricing

All batch requests are priced at 50% of standard rates:

| Model | Standard (per MTok) | Batch (per MTok) |
| --- | --- | --- |
| gemini-2.5-flash input | $0.15 | $0.075 |
| gemini-2.5-flash output | $0.60 | $0.30 |

Submission Methods

| Method | Max Size | Best For |
| --- | --- | --- |
| Inline requests | < 20 MB total | 10-50 requests per batch |
| JSONL file upload | Up to 2 GB | Hundreds to thousands of requests |
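For file uploads, each JSONL line pairs a caller-chosen key with a full request object, and the key is how results are matched back to queue messages after completion. A minimal serializer sketch, where the `BatchRequest` interface is a simplified stand-in for the full `GenerateContentRequest` shape:

```typescript
// Simplified request shape; the real GenerateContentRequest carries more fields
// (config, systemInstruction, etc.).
interface BatchRequest {
  contents: { role: 'user'; parts: { text: string }[] }[];
}

// One {"key": ..., "request": ...} JSON object per line, as the Batch API
// expects for JSONL file uploads.
function toJsonl(requests: Map<string, BatchRequest>): string {
  return [...requests.entries()]
    .map(([key, request]) => JSON.stringify({ key, request }))
    .join('\n');
}
```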

Key Features

  • Structured output supported via response_mime_type: 'application/json' + response_schema
  • Context caching enabled for batch requests (shared system prompts get cached token pricing)
  • Per-request configuration (temperature, system instructions, tools) can vary within a batch
  • System instructions can be set per-request, enabling different adapter prefixes per task type

SDK Requirement

The Batch API uses the @google/genai SDK, not the Vercel AI SDK (@ai-sdk/google). Both can coexist in the same project using the same GOOGLE_API_KEY.

import { GoogleGenAI } from '@google/genai'
const ai = new GoogleGenAI({})  // Uses GOOGLE_API_KEY from env

const batchJob = await ai.batches.create({
  model: 'gemini-2.5-flash',
  src: inlineRequests,
  config: { displayName: 'fact-extraction-batch-42' },
})
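The job returned by batches.create then has to be polled to a terminal state. A minimal sketch of the loop, with the state fetcher injected so the retry logic can be exercised without network access; the helper name and the interval/timeout defaults are illustrative, while the state names follow the Batch API's JOB_STATE_* values:

```typescript
type JobState =
  | 'JOB_STATE_PENDING'
  | 'JOB_STATE_RUNNING'
  | 'JOB_STATE_SUCCEEDED'
  | 'JOB_STATE_FAILED';

async function pollBatchJob(
  getState: () => Promise<JobState>,
  { intervalMs = 10_000, timeoutMs = 30 * 60_000 } = {},
): Promise<JobState> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const state = await getState();
    // Terminal states end the loop; anything else waits and re-polls.
    if (state === 'JOB_STATE_SUCCEEDED' || state === 'JOB_STATE_FAILED') return state;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  // Time-boxed: on timeout, callers can fall back to synchronous calls.
  throw new Error('batch poll timed out');
}
```

In production the injected fetcher would wrap a status lookup on the created job (e.g. refetching it by name via the @google/genai client) rather than a local stub.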

Eligibility Analysis

Batch-Eligible (No User-Facing Latency)

These tasks are fully offline, queue-driven, and tolerate minutes-to-hours of delay:

| Task | Current Call Pattern | Batch Strategy | Savings Potential |
| --- | --- | --- | --- |
| seed_explosion | 1 entry per call | JSONL file (hundreds per seeding run) | 50% |
| super_fact_discovery | 1 call per batch | Inline (5-20 per run) | 50% |
| evergreen_generation | 1 topic per call | Inline (20 topics per daily cron) | 50% |
| content_cleanup | 1 fact per call | JSONL file (bulk rewrite batches) | 50% |
| notability_scoring | 1 fact per call | Inline (20-50 per batch) | 50% |
| entity_classification | 1 entry per call | Inline (20-50 per batch) | 50% |
| challenge_content_generation | 1 fact per call (supports array) | Inline (10-30 per batch) | 50% |
| fact_extraction | 1 story per call | Inline (5-15 stories per ingestion run) | 50% |

Conditionally Eligible

| Task | Concern | Recommendation |
| --- | --- | --- |
| fact_validation | Gates publication; batch adds hours of delay | Batch only for seeding runs (not live ingestion) |

Not Eligible

| Task | Reason |
| --- | --- |
| conversational_turn | User-facing, requires sub-second response |
| text_answer_moderation | Real-time safety gate during user interaction |
| text_answer_scoring | Real-time feedback during user interaction |
| dispute_evaluation | User-initiated, expects prompt resolution |

Projected Cost Savings

Daily Cost Model (Current vs. Batch)

Based on the $14.47 test day (Feb 23, 2026):

| Scenario | Batch Coverage | Daily Cost | Monthly Cost | Annual Cost |
| --- | --- | --- | --- | --- |
| Current (all synchronous) | 0% | $14.47 | ~$434 | ~$5,280 |
| Conservative (seeding + evergreen only) | ~30% | $12.30 | ~$369 | ~$4,490 |
| Moderate (all offline tasks) | ~70% | $9.41 | ~$282 | ~$3,430 |
| Aggressive (everything except real-time) | ~90% | $7.96 | ~$239 | ~$2,900 |

Conservative annual savings: ~$790. Aggressive: ~$2,380.

These savings scale linearly with volume increases (e.g., doubling ingestion doubles savings).

Context Caching Bonus

Eko's pipeline uses large, shared system prompts per task type (the GEMINI_FACT_PREFIX is 63 lines, ~2,500 tokens). When multiple requests in a batch share the same system instruction, Google applies context caching automatically:

  • Cached input tokens are priced at 75% off standard rates
  • Combined with batch discount: effectively 87.5% off standard input pricing for cached tokens

This amplifies savings on high-volume tasks like notability_scoring and challenge_content_generation where every request shares the same system prompt.
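The stacked discounts work out as follows; the multipliers simply restate this section's assumptions (batch pays 50% of standard, cached input tokens pay 25% of standard):

```typescript
const STANDARD_INPUT_PER_MTOK = 0.15; // gemini-2.5-flash input, $ per MTok
const BATCH_MULTIPLIER = 0.5;         // 50% batch discount
const CACHE_MULTIPLIER = 0.25;        // 75% context-caching discount

// Batch alone: $0.075 per MTok of input.
const batchInput = STANDARD_INPUT_PER_MTOK * BATCH_MULTIPLIER;

// Batch + cached tokens: $0.01875 per MTok, i.e. 87.5% off standard.
const cachedBatchInput =
  STANDARD_INPUT_PER_MTOK * CACHE_MULTIPLIER * BATCH_MULTIPLIER;
```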

Implementation Architecture

Approach: Batch Accumulator Layer

Add a thin layer between queue consumption and AI calls. The existing worker architecture stays intact.

                    ┌─────────────────────────────────────────┐
                    │         Existing Pipeline (unchanged)    │
                    │                                         │
Queue Messages ────►│  Worker Handler                         │
                    │       │                                 │
                    │       ▼                                 │
                    │  selectModelForTask(task)               │
                    │       │                                 │
                    │       ├── Real-time task? ──► generateObject() (sync, as today)
                    │       │                                 │
                    │       └── Batch-eligible? ──► batchAccumulator.add(request)
                    │                                    │    │
                    └────────────────────────────────────│────┘
                                                        │
                                              ┌─────────▼──────────┐
                                              │  Batch Accumulator  │
                                              │                     │
                                              │  Collects requests   │
                                              │  until:              │
                                              │  - count >= threshold │
                                              │  - age >= maxWaitMs   │
                                              │                     │
                                              │  Then submits batch  │
                                              │  via @google/genai   │
                                              └─────────┬───────────┘
                                                        │
                                              ┌─────────▼───────────┐
                                              │  Batch Job Poller    │
                                              │                     │
                                              │  Polls job status    │
                                              │  On completion:      │
                                              │  - Parse results     │
                                              │  - Route to handlers │
                                              │  - Record costs      │
                                              │  - Update DB         │
                                              └─────────────────────┘

Configuration

| Parameter | Default | Purpose |
| --- | --- | --- |
| BATCH_ENABLED | false | Feature flag to enable batch processing |
| BATCH_MIN_SIZE | 5 | Minimum requests before submitting a batch |
| BATCH_MAX_WAIT_MS | 60_000 | Maximum time to accumulate before flushing |
| BATCH_MAX_SIZE | 100 | Maximum requests per batch (inline limit) |
| BATCH_FILE_THRESHOLD | 50 | Switch from inline to JSONL file above this count |
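The flush rule implied by these parameters can be sketched as a pure predicate (the type and function names are illustrative, not the actual accumulator API):

```typescript
interface AccumulatorState {
  count: number;            // requests currently buffered
  oldestEnqueuedAt: number; // epoch ms when the first buffered request arrived
}

// Flush when the buffer is big enough, or when the oldest request has waited
// too long — mirroring BATCH_MIN_SIZE and BATCH_MAX_WAIT_MS above.
function shouldFlush(
  state: AccumulatorState,
  now: number,
  cfg = { minSize: 5, maxWaitMs: 60_000 },
): boolean {
  if (state.count === 0) return false;
  if (state.count >= cfg.minSize) return true;
  return now - state.oldestEnqueuedAt >= cfg.maxWaitMs;
}
```

The age-based branch keeps low-volume periods from stranding requests in the buffer indefinitely.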

Schema Mapping

Eko uses Zod schemas via Vercel AI SDK's generateObject(). The Batch API requires JSON Schema format. The mapping is straightforward:

import { zodToJsonSchema } from 'zod-to-json-schema'

// Current: Vercel AI SDK
generateObject({ model, schema: factExtractionSchema, ... })

// Batch equivalent: @google/genai
{
  contents: [{ parts: [{ text: prompt }], role: 'user' }],
  config: {
    systemInstruction: { parts: [{ text: systemPrompt }] },
    responseMimeType: 'application/json',
    responseSchema: zodToJsonSchema(factExtractionSchema),
    temperature: 0.5,
  }
}

Cost Tracking Integration

The existing recordCost() function works unchanged. Costs are recorded when batch results arrive:

// Each inlined response carries its own usage metadata
for (const item of batchJob.dest.inlinedResponses) {
  const usage = item.response.usageMetadata
  await recordCost({
    model: 'gemini-2.5-flash',
    feature: taskType,
    inputTokens: usage.promptTokenCount,
    outputTokens: usage.candidatesTokenCount,
  })
}

The estimateCost() function in cost-tracker.ts should be updated to apply the 50% batch discount when the call source is a batch job, so the ai_cost_log table accurately reflects actual spend.
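A sketch of what that update could look like; the RATES table and the function signature are illustrative, not the actual cost-tracker.ts API:

```typescript
// $ per MTok, matching the pricing table earlier in this document.
const RATES = { 'gemini-2.5-flash': { input: 0.15, output: 0.6 } };

function estimateCost(
  model: keyof typeof RATES,
  inputTokens: number,
  outputTokens: number,
  opts: { batch?: boolean } = {},
): number {
  const rate = RATES[model];
  // Batch requests pay 50% of standard rates.
  const multiplier = opts.batch ? 0.5 : 1;
  return (
    ((inputTokens * rate.input + outputTokens * rate.output) / 1_000_000) *
    multiplier
  );
}
```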

Implementation Phases

Phase 1: Seeding Pipeline (Lowest Risk, Immediate Value)

Scope: seed_explosion, super_fact_discovery, content_cleanup

These are fully offline, run in manual or scheduled bursts, and have no user-facing latency requirements. Perfect for proving out the batch integration.

Changes:

  1. Add @google/genai SDK dependency to packages/ai
  2. Create packages/ai/src/batch-client.ts — thin wrapper around GoogleGenAI.batches
  3. Create packages/ai/src/batch-accumulator.ts — request collection + flush logic
  4. Modify seed-explosion.ts to optionally route through batch accumulator
  5. Add BATCH_ENABLED feature flag to @eko/config
  6. Update cost-tracker.ts to support batch discount in estimates

Risk: Zero impact on live pipeline. Seeding is developer-initiated.

Phase 2: Daily Cron Tasks

Scope: evergreen_generation, notability_scoring, entity_classification

These run on the daily 3 AM UTC cron. The cron route already dispatches all topics as separate queue messages — instead, it could collect them into a single batch.

Changes:

  1. Modify generate-evergreen cron route to support batch dispatch mode
  2. Add batch result polling to worker-facts
  3. Extend batch accumulator with JSONL file support for larger batches

Phase 3: Ingestion Pipeline

Scope: fact_extraction, challenge_content_generation

These are the highest-volume tasks. Fact extraction runs every 15 minutes (news cron), and challenge generation fans out after every validated fact.

Changes:

  1. Add batch accumulation to worker-facts extraction handler
  2. Add batch accumulation to challenge content handler
  3. Handle partial failures (some requests in a batch may fail while others succeed)

Phase 4: Validation (Selective)

Scope: fact_validation (seeding runs only)

Live ingestion validation stays synchronous for fastest time-to-feed. Seeding runs, which generate hundreds of facts at once, route validation through batch.

Changes:

  1. Add batchMode flag to VALIDATE_FACT queue message schema
  2. Worker-validate checks flag and routes accordingly
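The routing check in worker-validate could be as small as the following; the message shape is an assumption about the queue schema, with only batchMode taken from this phase's design:

```typescript
interface ValidateFactMessage {
  factId: string;
  batchMode?: boolean; // set true only by seeding runs
}

// Seeding runs tolerate batch latency; live ingestion stays synchronous.
function chooseValidationPath(msg: ValidateFactMessage): 'batch' | 'sync' {
  return msg.batchMode ? 'batch' : 'sync';
}
```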

Risks and Mitigations

| Risk | Impact | Mitigation |
| --- | --- | --- |
| Batch job failure | Facts stuck in pending state | Automatic fallback to synchronous on batch failure; dead-letter queue for failed items |
| Latency variance | Batch SLO is 24h (typically minutes) | Time-box polling; escalate to sync after 30 min |
| SDK compatibility | @google/genai may conflict with @ai-sdk/google | Both use REST API with same key; tested coexistence in Node |
| Schema translation | Zod → JSON Schema may lose edge cases | Validate round-trip: generate via batch, validate output against Zod schema |
| Partial batch failure | Some requests succeed, others fail | Process successful results; re-enqueue failed items as individual sync calls |
| Cost tracking accuracy | Batch discount not reflected in estimates | Update estimateCost() with batchDiscount parameter |
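The partial-failure mitigation reduces to partitioning batch results by status before processing; a sketch assuming a simplified result shape (the real per-request response type has more fields):

```typescript
interface BatchResult<T> {
  key: string;                  // matches the request key submitted in the batch
  response?: T;                 // present on success
  error?: { message: string };  // present on failure
}

// Split results: successes flow to the normal handlers, failures get
// re-enqueued as individual synchronous calls.
function partitionResults<T>(results: BatchResult<T>[]) {
  const ok: BatchResult<T>[] = [];
  const failed: BatchResult<T>[] = [];
  for (const r of results) (r.error ? failed : ok).push(r);
  return { ok, failed };
}
```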

Dependencies

| Dependency | Version | Purpose |
| --- | --- | --- |
| @google/genai | latest | Batch API client (submit, poll, retrieve) |
| zod-to-json-schema | latest | Convert Zod schemas to JSON Schema for batch requests |

Decision Log

| Decision | Rationale |
| --- | --- |
| Use @google/genai SDK (not REST directly) | Official SDK handles auth, polling, file upload; matches Batch API docs |
| Start with seeding pipeline | Zero risk to live users; highest batch density; proves integration |
| Keep synchronous path as fallback | Feature flag + automatic fallback ensures no regression |
| Inline requests before JSONL | Simpler; sufficient for batches under 50 requests |
| Don't batch conversational_turn | User-facing latency requirement incompatible with async processing |

Key Files

| File | Role |
| --- | --- |
| packages/ai/src/model-router.ts | Model selection and instantiation (add batch routing) |
| packages/ai/src/fact-engine.ts | AI functions (add batch dispatch option) |
| packages/ai/src/cost-tracker.ts | Cost recording (add batch discount) |
| packages/ai/src/challenge-content.ts | Challenge generation (batch candidate) |
| packages/ai/src/seed-explosion.ts | Seeding pipeline (Phase 1 target) |
| packages/ai/src/models/adapters/gemini-2.5-flash.ts | Prompt prefixes (reused in batch requests) |
| packages/config/src/index.ts | Feature flags and env config |
| apps/worker-facts/src/index.ts | Queue consumer (add batch accumulation) |
| apps/worker-validate/src/index.ts | Validation queue (Phase 4 selective batching) |