Gemini Batch API Cost Optimization Strategy
How Eko can leverage Google's Gemini Batch API to reduce AI costs by 50% on eligible pipeline stages, without changing quality or validation guarantees.
Executive Summary
Google's Gemini Batch API offers a 50% cost reduction on all token pricing for asynchronous workloads. Eko's fact engine pipeline is predominantly queue-driven and offline, making it an ideal candidate. At current volumes ($14-15/day), batch processing could save up to roughly $2,400/year while maintaining identical output quality.
Current Architecture
How Eko Calls Gemini Today
All AI calls use the Vercel AI SDK (@ai-sdk/google) via generateObject():
Queue Message → Worker Handler → selectModelForTask(task)
→ resolveModelTier(tier) → createLanguageModel(resolved)
→ generateObject({ model, schema, system, prompt })
→ recordCost() → DB update
Every call is synchronous and one-at-a-time. The worker processes a single queue message, makes one API call to Gemini, waits for the response, then moves to the next message.
Current Model Routing
All three tiers currently resolve to Gemini 2.5 Flash:
| Tier | Model | Token Pricing (per MTok) |
|---|---|---|
| default | gemini-2.5-flash | $0.15 input / $0.60 output |
| mid | gemini-2.5-flash | $0.15 input / $0.60 output |
| high | gemini-2.5-flash | $0.15 input / $0.60 output |
Task-to-Tier Mapping
| Task | Tier | Volume | Latency Tolerance |
|---|---|---|---|
| notability_scoring | default | High | Hours |
| story_summary | default | Medium | Hours |
| fact_extraction | mid | Medium | Hours |
| fact_validation | mid | Medium | Hours (gates publication) |
| evergreen_generation | mid | Low-Medium | Hours (daily cron) |
| challenge_content_generation | default | High | Hours |
| seed_explosion | default | High (bursts) | Days |
| super_fact_discovery | default | Medium (bursts) | Days |
| content_cleanup | default | Medium (bursts) | Days |
| entity_classification | default | High | Hours |
| conversational_turn | default | High | Seconds (user-facing) |
| text_answer_moderation | default | High | Seconds (user-facing) |
| text_answer_scoring | default | High | Seconds (user-facing) |
| dispute_evaluation | mid | Low | Seconds (user-facing) |
Gemini Batch API Overview
How It Works
Instead of individual synchronous API calls, the Batch API accepts a collection of requests and processes them asynchronously:
- Submit a batch of `GenerateContentRequest` objects (inline or via JSONL file upload)
- Poll for job completion (typically minutes; the SLO is 24 hours)
- Retrieve all results at once
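The submit/poll/retrieve lifecycle reduces to a time-boxed polling loop. A minimal sketch: the job-state names follow the Batch API's `JOB_STATE_*` convention, and the status lookup is injected (in production it would wrap `ai.batches.get`), so the control flow here is an illustrative assumption, not a verified integration.

```typescript
// Poll until a batch job reaches a terminal state, with a time box.
// `getState` abstracts the SDK call so the loop has no network dependency.
type JobState =
  | 'JOB_STATE_PENDING'
  | 'JOB_STATE_RUNNING'
  | 'JOB_STATE_SUCCEEDED'
  | 'JOB_STATE_FAILED'

async function pollUntilDone(
  getState: () => Promise<JobState>,
  { intervalMs = 5_000, timeoutMs = 30 * 60_000 } = {},
): Promise<JobState> {
  const deadline = Date.now() + timeoutMs
  while (Date.now() < deadline) {
    const state = await getState()
    if (state === 'JOB_STATE_SUCCEEDED' || state === 'JOB_STATE_FAILED') {
      return state
    }
    // Still pending or running: wait before the next status check.
    await new Promise((resolve) => setTimeout(resolve, intervalMs))
  }
  throw new Error('batch poll timed out; fall back to synchronous calls')
}
```

The timeout maps onto the mitigation described later: if a job is still pending after the time box, the caller escalates to the synchronous path.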
Pricing
All batch requests are priced at 50% of standard rates:
| Model | Standard (per MTok) | Batch (per MTok) |
|---|---|---|
| gemini-2.5-flash input | $0.15 | $0.075 |
| gemini-2.5-flash output | $0.60 | $0.30 |
Submission Methods
| Method | Max Size | Best For |
|---|---|---|
| Inline requests | < 20 MB total | 10-50 requests per batch |
| JSONL file upload | Up to 2 GB | Hundreds to thousands of requests |
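For the file-upload path, each JSONL line pairs a caller-chosen key with a request body; the key is echoed back in the results so responses can be matched to their originating queue messages. A minimal sketch, assuming the documented `key`/`request` line shape (verify field names against the current Batch API docs):

```typescript
// Build a JSONL payload for file-based batch submission.
// One line per request; the key lets results be joined back to inputs.
interface BatchLine {
  key: string
  request: {
    contents: { role: string; parts: { text: string }[] }[]
  }
}

function toJsonl(prompts: { id: string; text: string }[]): string {
  return prompts
    .map(
      (p): BatchLine => ({
        key: p.id,
        request: {
          contents: [{ role: 'user', parts: [{ text: p.text }] }],
        },
      }),
    )
    .map((line) => JSON.stringify(line))
    .join('\n')
}
```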
Key Features
- Structured output supported via `response_mime_type: 'application/json'` + `response_schema`
- Context caching enabled for batch requests (shared system prompts get cached-token pricing)
- Per-request configuration (temperature, system instructions, tools) can vary within a batch
- System instructions can be set per-request, enabling different adapter prefixes per task type
SDK Requirement
The Batch API uses the @google/genai SDK, not the Vercel AI SDK (@ai-sdk/google). Both can coexist in the same project using the same GOOGLE_API_KEY.
```typescript
import { GoogleGenAI } from '@google/genai'

const ai = new GoogleGenAI({}) // Uses GOOGLE_API_KEY from env

// inlineRequests: an array of GenerateContentRequest objects built upstream
const batchJob = await ai.batches.create({
  model: 'gemini-2.5-flash',
  src: inlineRequests,
  config: { displayName: 'fact-extraction-batch-42' },
})
```
Eligibility Analysis
Batch-Eligible (No User-Facing Latency)
These tasks are fully offline, queue-driven, and tolerate minutes-to-hours of delay:
| Task | Current Call Pattern | Batch Strategy | Savings Potential |
|---|---|---|---|
| seed_explosion | 1 entry per call | JSONL file (hundreds per seeding run) | 50% |
| super_fact_discovery | 1 call per batch | Inline (5-20 per run) | 50% |
| evergreen_generation | 1 topic per call | Inline (20 topics per daily cron) | 50% |
| content_cleanup | 1 fact per call | JSONL file (bulk rewrite batches) | 50% |
| notability_scoring | 1 fact per call | Inline (20-50 per batch) | 50% |
| entity_classification | 1 entry per call | Inline (20-50 per batch) | 50% |
| challenge_content_generation | 1 fact per call (supports array) | Inline (10-30 per batch) | 50% |
| fact_extraction | 1 story per call | Inline (5-15 stories per ingestion run) | 50% |
Conditionally Eligible
| Task | Concern | Recommendation |
|---|---|---|
| fact_validation | Gates publication; batch adds hours of delay | Batch only for seeding runs (not live ingestion) |
Not Eligible
| Task | Reason |
|---|---|
| conversational_turn | User-facing, requires sub-second response |
| text_answer_moderation | Real-time safety gate during user interaction |
| text_answer_scoring | Real-time feedback during user interaction |
| dispute_evaluation | User-initiated, expects prompt resolution |
Projected Cost Savings
Daily Cost Model (Current vs. Batch)
Based on the $14.47 test day (Feb 23, 2026):
| Scenario | Batch Coverage | Daily Cost | Monthly Cost | Annual Cost |
|---|---|---|---|---|
| Current (all synchronous) | 0% | $14.47 | ~$434 | ~$5,280 |
| Conservative (seeding + evergreen only) | ~30% | $12.30 | ~$369 | ~$4,490 |
| Moderate (all offline tasks) | ~70% | $9.41 | ~$282 | ~$3,430 |
| Aggressive (everything except real-time) | ~90% | $7.96 | ~$239 | ~$2,900 |
Conservative annual savings: ~$790. Aggressive: ~$2,380.
These savings scale linearly with volume increases (e.g., doubling ingestion doubles savings).
Context Caching Bonus
Eko's pipeline uses large, shared system prompts per task type (the GEMINI_FACT_PREFIX is 63 lines, ~2,500 tokens). When multiple requests in a batch share the same system instruction, Google applies context caching automatically:
- Cached input tokens are priced at 75% off standard rates
- Combined with batch discount: effectively 87.5% off standard input pricing for cached tokens
This amplifies savings on high-volume tasks like notability_scoring and challenge_content_generation where every request shares the same system prompt.
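The compounding is worth checking explicitly. A few lines of arithmetic, using the gemini-2.5-flash input rate from the pricing table above:

```typescript
// Standard gemini-2.5-flash input price, per million tokens.
const STANDARD_INPUT = 0.15

// Context caching prices cached tokens at 75% off, i.e. pay 25%.
const cachedInput = STANDARD_INPUT * 0.25

// The batch discount halves that again: pay 12.5% of standard (87.5% off).
const cachedAndBatchedInput = cachedInput * 0.5 // ≈ $0.01875 per MTok
```

So a cached, batched input token costs about $0.01875/MTok versus $0.15/MTok on the synchronous path.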
Implementation Architecture
Approach: Batch Accumulator Layer
Add a thin layer between queue consumption and AI calls. The existing worker architecture stays intact.
┌─────────────────────────────────────────┐
│ Existing Pipeline (unchanged) │
│ │
Queue Messages ────►│ Worker Handler │
│ │ │
│ ▼ │
│ selectModelForTask(task) │
│ │ │
│ ├── Real-time task? ──► generateObject() (sync, as today)
│ │ │
│ └── Batch-eligible? ──► batchAccumulator.add(request)
│ │ │
└────────────────────────────────────│────┘
│
┌─────────▼──────────┐
│ Batch Accumulator │
│ │
│ Collects requests │
│ until: │
│ - count >= threshold │
│ - age >= maxWaitMs │
│ │
│ Then submits batch │
│ via @google/genai │
└─────────┬───────────┘
│
┌─────────▼───────────┐
│ Batch Job Poller │
│ │
│ Polls job status │
│ On completion: │
│ - Parse results │
│ - Route to handlers │
│ - Record costs │
│ - Update DB │
└─────────────────────┘
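The accumulator box above reduces to a small class: collect requests until either a count threshold or an age timer fires, then hand the buffer to a flush callback. A sketch with the submission side injected so it stays decoupled from @google/genai; the names here are illustrative, not the final batch-accumulator.ts API.

```typescript
interface PendingRequest {
  id: string
  prompt: string
}

// Collects requests and flushes when count >= minSize or age >= maxWaitMs.
class BatchAccumulator {
  private pending: PendingRequest[] = []
  private timer: ReturnType<typeof setTimeout> | null = null

  constructor(
    private flush: (batch: PendingRequest[]) => void,
    private minSize = 5,
    private maxWaitMs = 60_000,
  ) {}

  add(req: PendingRequest): void {
    this.pending.push(req)
    if (this.pending.length >= this.minSize) {
      this.flushNow()
    } else if (!this.timer) {
      // First request into an empty buffer starts the age timer.
      this.timer = setTimeout(() => this.flushNow(), this.maxWaitMs)
    }
  }

  flushNow(): void {
    if (this.timer) clearTimeout(this.timer)
    this.timer = null
    if (this.pending.length === 0) return
    const batch = this.pending
    this.pending = []
    this.flush(batch)
  }
}
```

In production the flush callback would submit via `ai.batches.create` and register the job with the poller.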
Configuration
| Parameter | Default | Purpose |
|---|---|---|
| BATCH_ENABLED | false | Feature flag to enable batch processing |
| BATCH_MIN_SIZE | 5 | Minimum requests before submitting a batch |
| BATCH_MAX_WAIT_MS | 60_000 | Maximum time to accumulate before flushing |
| BATCH_MAX_SIZE | 100 | Maximum requests per batch (inline limit) |
| BATCH_FILE_THRESHOLD | 50 | Switch from inline to JSONL file above this count |
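These parameters could be read from the environment along these lines. A hypothetical helper; the real config would live in @eko/config and may be shaped differently.

```typescript
interface BatchConfig {
  enabled: boolean
  minSize: number
  maxWaitMs: number
  maxSize: number
  fileThreshold: number
}

// Parse batch settings from env vars, falling back to the table defaults.
function loadBatchConfig(env: Record<string, string | undefined>): BatchConfig {
  const num = (key: string, fallback: number): number => {
    const raw = env[key]
    return raw === undefined ? fallback : Number(raw)
  }
  return {
    enabled: env.BATCH_ENABLED === 'true',
    minSize: num('BATCH_MIN_SIZE', 5),
    maxWaitMs: num('BATCH_MAX_WAIT_MS', 60_000),
    maxSize: num('BATCH_MAX_SIZE', 100),
    fileThreshold: num('BATCH_FILE_THRESHOLD', 50),
  }
}
```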
Schema Mapping
Eko uses Zod schemas via Vercel AI SDK's generateObject(). The Batch API requires JSON Schema format. The mapping is straightforward:
```typescript
import { zodToJsonSchema } from 'zod-to-json-schema'

// Current: Vercel AI SDK
generateObject({ model, schema: factExtractionSchema, ... })

// Batch equivalent: @google/genai
{
  contents: [{ parts: [{ text: prompt }], role: 'user' }],
  config: {
    systemInstruction: { parts: [{ text: systemPrompt }] },
    responseMimeType: 'application/json',
    responseSchema: zodToJsonSchema(factExtractionSchema),
    temperature: 0.5,
  }
}
```
Cost Tracking Integration
The existing recordCost() function works unchanged. Costs are recorded when batch results arrive:
```typescript
for (const response of batchJob.dest.inlinedResponses) {
  const usage = response.response.usageMetadata
  await recordCost({
    model: 'gemini-2.5-flash',
    feature: taskType,
    inputTokens: usage.promptTokenCount,
    outputTokens: usage.candidatesTokenCount,
  })
}
```
The estimateCost() function in cost-tracker.ts should be updated to apply the 50% batch discount when the call source is a batch job, so the ai_cost_log table accurately reflects actual spend.
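A sketch of what the batch-aware estimate could look like. The real estimateCost() signature in cost-tracker.ts may differ; the rates are the gemini-2.5-flash prices from the pricing table above.

```typescript
// $ per million tokens, standard synchronous rates.
const RATES = {
  'gemini-2.5-flash': { input: 0.15, output: 0.6 },
} as const

// Estimate the dollar cost of a call, halving the total for batch jobs.
function estimateCost(
  model: keyof typeof RATES,
  inputTokens: number,
  outputTokens: number,
  isBatch = false,
): number {
  const rate = RATES[model]
  const standard =
    (inputTokens / 1_000_000) * rate.input +
    (outputTokens / 1_000_000) * rate.output
  // The Batch API prices all tokens at 50% of standard rates.
  return isBatch ? standard * 0.5 : standard
}
```

With this shape, the batch poller passes `isBatch: true` when recording results, and ai_cost_log reflects actual spend.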
Implementation Phases
Phase 1: Seeding Pipeline (Lowest Risk, Immediate Value)
Scope: seed_explosion, super_fact_discovery, content_cleanup
These are fully offline, run in manual or scheduled bursts, and have no user-facing latency requirements. Perfect for proving out the batch integration.
Changes:
- Add the @google/genai SDK dependency to packages/ai
- Create packages/ai/src/batch-client.ts: thin wrapper around GoogleGenAI.batches
- Create packages/ai/src/batch-accumulator.ts: request collection + flush logic
- Modify seed-explosion.ts to optionally route through the batch accumulator
- Add the BATCH_ENABLED feature flag to @eko/config
- Update cost-tracker.ts to support the batch discount in estimates
Risk: Zero impact on live pipeline. Seeding is developer-initiated.
Phase 2: Daily Cron Tasks
Scope: evergreen_generation, notability_scoring, entity_classification
These run on the daily 3 AM UTC cron. The cron route already dispatches all topics as separate queue messages — instead, it could collect them into a single batch.
Changes:
- Modify the generate-evergreen cron route to support a batch dispatch mode
- Add batch result polling to worker-facts
- Extend the batch accumulator with JSONL file support for larger batches
Phase 3: Ingestion Pipeline
Scope: fact_extraction, challenge_content_generation
These are the highest-volume tasks. Fact extraction runs every 15 minutes (news cron), and challenge generation fans out after every validated fact.
Changes:
- Add batch accumulation to the worker-facts extraction handler
- Add batch accumulation to the challenge content handler
- Handle partial failures (some requests in a batch may fail while others succeed)
Phase 4: Validation (Selective)
Scope: fact_validation (seeding runs only)
Live ingestion validation stays synchronous for fastest time-to-feed. Seeding runs, which generate hundreds of facts at once, route validation through batch.
Changes:
- Add a batchMode flag to the VALIDATE_FACT queue message schema
- worker-validate checks the flag and routes accordingly
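The routing check itself is small. A sketch in which the message shape and handler names are illustrative, not the actual VALIDATE_FACT schema:

```typescript
// Hypothetical worker-validate routing: batch-mode messages go to the
// accumulator; everything else stays on the synchronous path.
interface ValidateFactMessage {
  factId: string
  batchMode?: boolean
}

function routeValidation(
  msg: ValidateFactMessage,
  syncValidate: (factId: string) => void,
  enqueueBatch: (factId: string) => void,
): 'batch' | 'sync' {
  if (msg.batchMode) {
    enqueueBatch(msg.factId)
    return 'batch'
  }
  syncValidate(msg.factId)
  return 'sync'
}
```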
Risks and Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| Batch job failure | Facts stuck in pending state | Automatic fallback to synchronous on batch failure; dead-letter queue for failed items |
| Latency variance | Batch SLO is 24h (typically minutes) | Time-box polling; escalate to sync after 30 min |
| SDK compatibility | @google/genai may conflict with @ai-sdk/google | Both use REST API with same key; tested coexistence in Node |
| Schema translation | Zod → JSON Schema may lose edge cases | Validate round-trip: generate via batch, validate output against Zod schema |
| Partial batch failure | Some requests succeed, others fail | Process successful results; re-enqueue failed items as individual sync calls |
| Cost tracking accuracy | Batch discount not reflected in estimates | Update estimateCost() with batchDiscount parameter |
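The schema-translation and partial-failure mitigations combine naturally into one result-processing step: parse each successful response, re-check it against the original Zod schema, and collect everything else for re-enqueueing. A sketch with the validator injected (standing in for something like factExtractionSchema.safeParse; the result shape is simplified):

```typescript
// One entry per request in the completed batch.
interface BatchResult {
  key: string
  text?: string  // JSON payload when the request succeeded
  error?: string // populated when the request failed
}

// Split results into schema-validated successes and keys to re-enqueue
// as individual synchronous calls.
function partitionResults<T>(
  results: BatchResult[],
  validate: (value: unknown) => T | null,
): { ok: { key: string; value: T }[]; retry: string[] } {
  const ok: { key: string; value: T }[] = []
  const retry: string[] = []
  for (const r of results) {
    if (r.error || r.text === undefined) {
      retry.push(r.key)
      continue
    }
    let value: T | null = null
    try {
      // Round-trip check: batch output must satisfy the same schema the
      // synchronous path enforces via generateObject().
      value = validate(JSON.parse(r.text))
    } catch {
      value = null
    }
    if (value === null) retry.push(r.key)
    else ok.push({ key: r.key, value })
  }
  return { ok, retry }
}
```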
Dependencies
| Dependency | Version | Purpose |
|---|---|---|
| @google/genai | latest | Batch API client (submit, poll, retrieve) |
| zod-to-json-schema | latest | Convert Zod schemas to JSON Schema for batch requests |
Decision Log
| Decision | Rationale |
|---|---|
| Use @google/genai SDK (not REST directly) | Official SDK handles auth, polling, and file upload; matches Batch API docs |
| Start with seeding pipeline | Zero risk to live users; highest batch density; proves integration |
| Keep synchronous path as fallback | Feature flag + automatic fallback ensures no regression |
| Inline requests before JSONL | Simpler; sufficient for batches under 50 requests |
| Don't batch conversational_turn | User-facing latency requirement incompatible with async processing |
Key Files
| File | Role |
|---|---|
| packages/ai/src/model-router.ts | Model selection and instantiation (add batch routing) |
| packages/ai/src/fact-engine.ts | AI functions (add batch dispatch option) |
| packages/ai/src/cost-tracker.ts | Cost recording (add batch discount) |
| packages/ai/src/challenge-content.ts | Challenge generation (batch candidate) |
| packages/ai/src/seed-explosion.ts | Seeding pipeline (Phase 1 target) |
| packages/ai/src/models/adapters/gemini-2.5-flash.ts | Prompt prefixes (reused in batch requests) |
| packages/config/src/index.ts | Feature flags and env config |
| apps/worker-facts/src/index.ts | Queue consumer (add batch accumulation) |
| apps/worker-validate/src/index.ts | Validation queue (Phase 4 selective batching) |
Related
- News & Fact Engine — System reference for the full pipeline
- Evergreen Ingestion — Daily cron pipeline (Phase 2 target)
- Gemini Batch API Docs — Official documentation
- APP-CONTROL.md — Operational manifest (crons, workers, queues)