Gemini Batch API Cost Optimization Strategy
How Eko can leverage Google's Gemini Batch API to reduce AI costs by 50% on eligible pipeline stages, without changing quality or validation guarantees.
Executive Summary
Google's Gemini Batch API offers a 50% cost reduction on all token pricing for asynchronous workloads. Eko's fact engine pipeline is predominantly queue-driven and offline, making it an ideal candidate. At current volumes ($14-15/day), batch processing could save up to roughly $2,400/year while maintaining identical output quality.
Current Architecture
How Eko Calls Gemini Today
All AI calls use the Vercel AI SDK (@ai-sdk/google) via generateObject():
Queue Message → Worker Handler → selectModelForTask(task)
→ resolveModelTier(tier) → createLanguageModel(resolved)
→ generateObject({ model, schema, system, prompt })
→ recordCost() → DB update
Every call is synchronous and one-at-a-time. The worker processes a single queue message, makes one API call to Gemini, waits for the response, then moves to the next message.
Current Model Routing
All three tiers currently resolve to Gemini 2.5 Flash:
| Tier | Model | Token Pricing (per MTok) |
|---|---|---|
| default | gemini-2.5-flash | $0.15 input / $0.60 output |
| mid | gemini-2.5-flash | $0.15 input / $0.60 output |
| high | gemini-2.5-flash | $0.15 input / $0.60 output |
Task-to-Tier Mapping
| Task | Tier | Volume | Latency Tolerance |
|---|---|---|---|
| notability_scoring | default | High | Hours |
| story_summary | default | Medium | Hours |
| fact_extraction | mid | Medium | Hours |
| fact_validation | mid | Medium | Hours (gates publication) |
| evergreen_generation | mid | Low-Medium | Hours (daily cron) |
| challenge_content_generation | default | High | Hours |
| seed_explosion | default | High (bursts) | Days |
| super_fact_discovery | default | Medium (bursts) | Days |
| content_cleanup | default | Medium (bursts) | Days |
| entity_classification | default | High | Hours |
| conversational_turn | default | High | Seconds (user-facing) |
| text_answer_moderation | default | High | Seconds (user-facing) |
| text_answer_scoring | default | High | Seconds (user-facing) |
| dispute_evaluation | mid | Low | Seconds (user-facing) |
Gemini Batch API Overview
How It Works
Instead of individual synchronous API calls, the Batch API accepts a collection of requests and processes them asynchronously:
- Submit a batch of `GenerateContentRequest` objects (inline or via JSONL file upload)
- Poll for job completion (typically minutes; the SLO is 24 hours)
- Retrieve all results at once
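The submit/poll/retrieve lifecycle reduces to a time-boxed polling loop. A minimal sketch: the job-state names follow the Batch API's `JOB_STATE_*` convention, and the status lookup is injected (in production it would wrap `ai.batches.get`), so the control flow here is an illustrative assumption, not a verified integration.

```typescript
// Poll until a batch job reaches a terminal state, with a time box.
// `getState` abstracts the SDK call so the loop has no network dependency.
type JobState =
  | 'JOB_STATE_PENDING'
  | 'JOB_STATE_RUNNING'
  | 'JOB_STATE_SUCCEEDED'
  | 'JOB_STATE_FAILED'

async function pollUntilDone(
  getState: () => Promise<JobState>,
  { intervalMs = 5_000, timeoutMs = 30 * 60_000 } = {},
): Promise<JobState> {
  const deadline = Date.now() + timeoutMs
  while (Date.now() < deadline) {
    const state = await getState()
    if (state === 'JOB_STATE_SUCCEEDED' || state === 'JOB_STATE_FAILED') {
      return state
    }
    // Still pending or running: wait before the next status check.
    await new Promise((resolve) => setTimeout(resolve, intervalMs))
  }
  throw new Error('batch poll timed out; fall back to synchronous calls')
}
```

The timeout maps onto the mitigation described later: if a job is still pending after the time box, the caller escalates to the synchronous path.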
Pricing
All batch requests are priced at 50% of standard rates:
| Model | Standard (per MTok) | Batch (per MTok) |
|---|---|---|
| gemini-2.5-flash input | $0.15 | $0.075 |
| gemini-2.5-flash output | $0.60 | $0.30 |
Submission Methods
| Method | Max Size | Best For |
|---|---|---|
| Inline requests | < 20 MB total | 10-50 requests per batch |
| JSONL file upload | Up to 2 GB | Hundreds to thousands of requests |
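For the file-upload path, each JSONL line pairs a caller-chosen key with a request body; the key is echoed back in the results so responses can be matched to their originating queue messages. A minimal sketch, assuming the documented `key`/`request` line shape (verify field names against the current Batch API docs):

```typescript
// Build a JSONL payload for file-based batch submission.
// One line per request; the key lets results be joined back to inputs.
interface BatchLine {
  key: string
  request: {
    contents: { role: string; parts: { text: string }[] }[]
  }
}

function toJsonl(prompts: { id: string; text: string }[]): string {
  return prompts
    .map(
      (p): BatchLine => ({
        key: p.id,
        request: {
          contents: [{ role: 'user', parts: [{ text: p.text }] }],
        },
      }),
    )
    .map((line) => JSON.stringify(line))
    .join('\n')
}
```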
Key Features
- Structured output supported via `response_mime_type: 'application/json'` + `response_schema`
- Context caching enabled for batch requests (shared system prompts get cached-token pricing)
- Per-request configuration (temperature, system instructions, tools) can vary within a batch
- System instructions can be set per-request, enabling different adapter prefixes per task type
SDK Requirement
The Batch API uses the @google/genai SDK, not the Vercel AI SDK (@ai-sdk/google). Both can coexist in the same project using the same GOOGLE_API_KEY.
```typescript
import { GoogleGenAI } from '@google/genai'

const ai = new GoogleGenAI({}) // Uses GOOGLE_API_KEY from env

// inlineRequests: an array of GenerateContentRequest objects built upstream
const batchJob = await ai.batches.create({
  model: 'gemini-2.5-flash',
  src: inlineRequests,
  config: { displayName: 'fact-extraction-batch-42' },
})
```
Eligibility Analysis
Batch-Eligible (No User-Facing Latency)
These tasks are fully offline, queue-driven, and tolerate minutes-to-hours of delay:
| Task | Current Call Pattern | Batch Strategy | Savings Potential |
|---|---|---|---|
| seed_explosion | 1 entry per call | JSONL file (hundreds per seeding run) | 50% |
| super_fact_discovery | 1 call per batch | Inline (5-20 per run) | 50% |
| evergreen_generation | 1 topic per call | Inline (20 topics per daily cron) | 50% |
| content_cleanup | 1 fact per call | JSONL file (bulk rewrite batches) | 50% |
| notability_scoring | 1 fact per call | Inline (20-50 per batch) | 50% |
| entity_classification | 1 entry per call | Inline (20-50 per batch) | 50% |
| challenge_content_generation | 1 fact per call (supports array) | Inline (10-30 per batch) | 50% |
| fact_extraction | 1 story per call | Inline (5-15 stories per ingestion run) | 50% |
Conditionally Eligible
| Task | Concern | Recommendation |
|---|---|---|
| fact_validation | Gates publication; batch adds hours of delay | Batch only for seeding runs (not live ingestion) |
Not Eligible
| Task | Reason |
|---|---|
| conversational_turn | User-facing, requires sub-second response |
| text_answer_moderation | Real-time safety gate during user interaction |
| text_answer_scoring | Real-time feedback during user interaction |
| dispute_evaluation | User-initiated, expects prompt resolution |
Projected Cost Savings
Daily Cost Model (Current vs. Batch)
Based on the $14.47 test day (Feb 23, 2026):
| Scenario | Batch Coverage | Daily Cost | Monthly Cost | Annual Cost |
|---|---|---|---|---|
| Current (all synchronous) | 0% | $14.47 | ~$434 | ~$5,280 |
| Conservative (seeding + evergreen only) | ~30% | $12.30 | ~$369 | ~$4,490 |
| Moderate (all offline tasks) | ~70% | $9.41 | ~$282 | ~$3,430 |
| Aggressive (everything except real-time) | ~90% | $7.96 | ~$239 | ~$2,900 |
Conservative annual savings: ~$790. Aggressive: ~$2,380.
These savings scale linearly with volume increases (e.g., doubling ingestion doubles savings).
Context Caching Bonus
Eko's pipeline uses large, shared system prompts per task type (the GEMINI_FACT_PREFIX is 63 lines, ~2,500 tokens). When multiple requests in a batch share the same system instruction, Google applies context caching automatically:
- Cached input tokens are priced at 75% off standard rates
- Combined with batch discount: effectively 87.5% off standard input pricing for cached tokens
This amplifies savings on high-volume tasks like notability_scoring and challenge_content_generation where every request shares the same system prompt.
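The compounding is worth checking explicitly. A few lines of arithmetic, using the gemini-2.5-flash input rate from the pricing table above:

```typescript
// Standard gemini-2.5-flash input price, per million tokens.
const STANDARD_INPUT = 0.15

// Context caching prices cached tokens at 75% off, i.e. pay 25%.
const cachedInput = STANDARD_INPUT * 0.25

// The batch discount halves that again: pay 12.5% of standard (87.5% off).
const cachedAndBatchedInput = cachedInput * 0.5 // ≈ $0.01875 per MTok
```

So a cached, batched input token costs about $0.01875/MTok versus $0.15/MTok on the synchronous path.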
Implementation Architecture
Approach: Batch Accumulator Layer
Add a thin layer between queue consumption and AI calls. The existing worker architecture stays intact.
┌─────────────────────────────────────────┐
│ Existing Pipeline (unchanged) │
│ │
Queue Messages ────►│ Worker Handler │
│ │ │
│ ▼ │
│ selectModelForTask(task) │
│ │ │
│ ├── Real-time task? ──► generateObject() (sync, as today)
│ │ │
│ └── Batch-eligible? ──► batchAccumulator.add(request)
│ │ │
└────────────────────────────────────│────┘
│
┌─────────▼──────────┐
│ Batch Accumulator │
│ │
│ Collects requests │
│ until: │
│ - count >= threshold │
│ - age >= maxWaitMs │
│ │
│ Then submits batch │
│ via @google/genai │
└─────────┬───────────┘
│
┌─────────▼───────────┐
│ Batch Job Poller │
│ │
│ Polls job status │
│ On completion: │
│ - Parse results │
│ - Route to handlers │
│ - Record costs │
│ - Update DB │
└─────────────────────┘
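The accumulator box above reduces to a small class: collect requests until either a count threshold or an age timer fires, then hand the buffer to a flush callback. A sketch with the submission side injected so it stays decoupled from @google/genai; the names here are illustrative, not the final batch-accumulator.ts API.

```typescript
interface PendingRequest {
  id: string
  prompt: string
}

// Collects requests and flushes when count >= minSize or age >= maxWaitMs.
class BatchAccumulator {
  private pending: PendingRequest[] = []
  private timer: ReturnType<typeof setTimeout> | null = null

  constructor(
    private flush: (batch: PendingRequest[]) => void,
    private minSize = 5,
    private maxWaitMs = 60_000,
  ) {}

  add(req: PendingRequest): void {
    this.pending.push(req)
    if (this.pending.length >= this.minSize) {
      this.flushNow()
    } else if (!this.timer) {
      // First request into an empty buffer starts the age timer.
      this.timer = setTimeout(() => this.flushNow(), this.maxWaitMs)
    }
  }

  flushNow(): void {
    if (this.timer) clearTimeout(this.timer)
    this.timer = null
    if (this.pending.length === 0) return
    const batch = this.pending
    this.pending = []
    this.flush(batch)
  }
}
```

In production the flush callback would submit via `ai.batches.create` and register the job with the poller.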
Configuration
| Parameter | Default | Purpose |
|---|---|---|
| BATCH_ENABLED | false | Feature flag to enable batch processing |
| BATCH_MIN_SIZE | 5 | Minimum requests before submitting a batch |
| BATCH_MAX_WAIT_MS | 60_000 | Maximum time to accumulate before flushing |
| BATCH_MAX_SIZE | 100 | Maximum requests per batch (inline limit) |
| BATCH_FILE_THRESHOLD | 50 | Switch from inline to JSONL file above this count |
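These parameters could be read from the environment along these lines. A hypothetical helper; the real config would live in @eko/config and may be shaped differently.

```typescript
interface BatchConfig {
  enabled: boolean
  minSize: number
  maxWaitMs: number
  maxSize: number
  fileThreshold: number
}

// Parse batch settings from env vars, falling back to the table defaults.
function loadBatchConfig(env: Record<string, string | undefined>): BatchConfig {
  const num = (key: string, fallback: number): number => {
    const raw = env[key]
    return raw === undefined ? fallback : Number(raw)
  }
  return {
    enabled: env.BATCH_ENABLED === 'true',
    minSize: num('BATCH_MIN_SIZE', 5),
    maxWaitMs: num('BATCH_MAX_WAIT_MS', 60_000),
    maxSize: num('BATCH_MAX_SIZE', 100),
    fileThreshold: num('BATCH_FILE_THRESHOLD', 50),
  }
}
```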
Schema Mapping
Eko uses Zod schemas via Vercel AI SDK's generateObject(). The Batch API requires JSON Schema format. The mapping is straightforward:
```typescript
import { zodToJsonSchema } from 'zod-to-json-schema'

// Current: Vercel AI SDK
generateObject({ model, schema: factExtractionSchema, ... })

// Batch equivalent: @google/genai
{
  contents: [{ parts: [{ text: prompt }], role: 'user' }],
  config: {
    systemInstruction: { parts: [{ text: systemPrompt }] },
    responseMimeType: 'application/json',
    responseSchema: zodToJsonSchema(factExtractionSchema),
    temperature: 0.5,
  }
}
```
Cost Tracking Integration
The existing recordCost() function works unchanged. Costs are recorded when batch results arrive:
```typescript
for (const response of batchJob.dest.inlinedResponses) {
  const usage = response.response.usageMetadata
  await recordCost({
    model: 'gemini-2.5-flash',
    feature: taskType,
    inputTokens: usage.promptTokenCount,
    outputTokens: usage.candidatesTokenCount,
  })
}
```
The estimateCost() function in cost-tracker.ts should be updated to apply the 50% batch discount when the call source is a batch job, so the ai_cost_log table accurately reflects actual spend.
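A sketch of what the batch-aware estimate could look like. The real estimateCost() signature in cost-tracker.ts may differ; the rates are the gemini-2.5-flash prices from the pricing table above.

```typescript
// $ per million tokens, standard synchronous rates.
const RATES = {
  'gemini-2.5-flash': { input: 0.15, output: 0.6 },
} as const

// Estimate the dollar cost of a call, halving the total for batch jobs.
function estimateCost(
  model: keyof typeof RATES,
  inputTokens: number,
  outputTokens: number,
  isBatch = false,
): number {
  const rate = RATES[model]
  const standard =
    (inputTokens / 1_000_000) * rate.input +
    (outputTokens / 1_000_000) * rate.output
  // The Batch API prices all tokens at 50% of standard rates.
  return isBatch ? standard * 0.5 : standard
}
```

With this shape, the batch poller passes `isBatch: true` when recording results, and ai_cost_log reflects actual spend.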
Implementation Phases
Phase 1: Seeding Pipeline (Lowest Risk, Immediate Value)
Scope: seed_explosion, super_fact_discovery, content_cleanup
These are fully offline, run in manual or scheduled bursts, and have no user-facing latency requirements. Perfect for proving out the batch integration.
Changes:
- Add the @google/genai SDK dependency to packages/ai
- Create packages/ai/src/batch-client.ts: thin wrapper around GoogleGenAI.batches
- Create packages/ai/src/batch-accumulator.ts: request collection + flush logic
- Modify seed-explosion.ts to optionally route through the batch accumulator
- Add the BATCH_ENABLED feature flag to @eko/config
- Update cost-tracker.ts to support the batch discount in estimates
Risk: Zero impact on live pipeline. Seeding is developer-initiated.
Phase 2: Daily Cron Tasks
Scope: evergreen_generation, notability_scoring, entity_classification
These run on the daily 3 AM UTC cron. The cron route already dispatches all topics as separate queue messages — instead, it could collect them into a single batch.
Changes:
- Modify the generate-evergreen cron route to support a batch dispatch mode
- Add batch result polling to worker-facts
- Extend the batch accumulator with JSONL file support for larger batches
Phase 3: Ingestion Pipeline
Scope: fact_extraction, challenge_content_generation
These are the highest-volume tasks. Fact extraction runs every 15 minutes (news cron), and challenge generation fans out after every validated fact.
Changes:
- Add batch accumulation to the worker-facts extraction handler
- Add batch accumulation to the challenge content handler
- Handle partial failures (some requests in a batch may fail while others succeed)
Phase 4: Validation (Selective)
Scope: fact_validation (seeding runs only)
Live ingestion validation stays synchronous for fastest time-to-feed. Seeding runs, which generate hundreds of facts at once, route validation through batch.
Changes:
- Add a batchMode flag to the VALIDATE_FACT queue message schema
- worker-validate checks the flag and routes accordingly
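The routing check itself is small. A sketch in which the message shape and handler names are illustrative, not the actual VALIDATE_FACT schema:

```typescript
// Hypothetical worker-validate routing: batch-mode messages go to the
// accumulator; everything else stays on the synchronous path.
interface ValidateFactMessage {
  factId: string
  batchMode?: boolean
}

function routeValidation(
  msg: ValidateFactMessage,
  syncValidate: (factId: string) => void,
  enqueueBatch: (factId: string) => void,
): 'batch' | 'sync' {
  if (msg.batchMode) {
    enqueueBatch(msg.factId)
    return 'batch'
  }
  syncValidate(msg.factId)
  return 'sync'
}
```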
Risks and Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| Batch job failure | Facts stuck in pending state | Automatic fallback to synchronous on batch failure; dead-letter queue for failed items |
| Latency variance | Batch SLO is 24h (typically minutes) | Time-box polling; escalate to sync after 30 min |
| SDK compatibility | @google/genai may conflict with @ai-sdk/google | Both use REST API with same key; tested coexistence in Node |
| Schema translation | Zod → JSON Schema may lose edge cases | Validate round-trip: generate via batch, validate output against Zod schema |
| Partial batch failure | Some requests succeed, others fail | Process successful results; re-enqueue failed items as individual sync calls |
| Cost tracking accuracy | Batch discount not reflected in estimates | Update estimateCost() with batchDiscount parameter |
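The schema-translation and partial-failure mitigations combine naturally into one result-processing step: parse each successful response, re-check it against the original Zod schema, and collect everything else for re-enqueueing. A sketch with the validator injected (standing in for something like factExtractionSchema.safeParse; the result shape is simplified):

```typescript
// One entry per request in the completed batch.
interface BatchResult {
  key: string
  text?: string  // JSON payload when the request succeeded
  error?: string // populated when the request failed
}

// Split results into schema-validated successes and keys to re-enqueue
// as individual synchronous calls.
function partitionResults<T>(
  results: BatchResult[],
  validate: (value: unknown) => T | null,
): { ok: { key: string; value: T }[]; retry: string[] } {
  const ok: { key: string; value: T }[] = []
  const retry: string[] = []
  for (const r of results) {
    if (r.error || r.text === undefined) {
      retry.push(r.key)
      continue
    }
    let value: T | null = null
    try {
      // Round-trip check: batch output must satisfy the same schema the
      // synchronous path enforces via generateObject().
      value = validate(JSON.parse(r.text))
    } catch {
      value = null
    }
    if (value === null) retry.push(r.key)
    else ok.push({ key: r.key, value })
  }
  return { ok, retry }
}
```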
Dependencies
| Dependency | Version | Purpose |
|---|---|---|
| @google/genai | latest | Batch API client (submit, poll, retrieve) |
| zod-to-json-schema | latest | Convert Zod schemas to JSON Schema for batch requests |
Decision Log
| Decision | Rationale |
|---|---|
| Use @google/genai SDK (not REST directly) | Official SDK handles auth, polling, and file upload; matches Batch API docs |
| Start with seeding pipeline | Zero risk to live users; highest batch density; proves integration |
| Keep synchronous path as fallback | Feature flag + automatic fallback ensures no regression |
| Inline requests before JSONL | Simpler; sufficient for batches under 50 requests |
| Don't batch conversational_turn | User-facing latency requirement incompatible with async processing |
Key Files
| File | Role |
|---|---|
| packages/ai/src/model-router.ts | Model selection and instantiation (add batch routing) |
| packages/ai/src/fact-engine.ts | AI functions (add batch dispatch option) |
| packages/ai/src/cost-tracker.ts | Cost recording (add batch discount) |
| packages/ai/src/challenge-content.ts | Challenge generation (batch candidate) |
| packages/ai/src/seed-explosion.ts | Seeding pipeline (Phase 1 target) |
| packages/ai/src/models/adapters/gemini-2.5-flash.ts | Prompt prefixes (reused in batch requests) |
| packages/config/src/index.ts | Feature flags and env config |
| apps/worker-facts/src/index.ts | Queue consumer (add batch accumulation) |
| apps/worker-validate/src/index.ts | Validation queue (Phase 4 selective batching) |
Related
- News & Fact Engine — System reference for the full pipeline
- Evergreen Ingestion — Daily cron pipeline (Phase 2 target)
- Gemini Batch API Docs — Official documentation
- APP-CONTROL.md — Operational manifest (crons, workers, queues)