Model Testing Prompts

Prompts for testing AI model adapters, comparing model quality, and validating new model integrations.

Prompts

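| # | Prompt | Cost | Duration |
|---|--------|------|----------|
| 1 | Simple Adapter Smoke Test | $0 | ~5-10s |
| 2 | Light Adapter Smoke Test | $0 | <5s |
| 3 | Iterative Eligibility Gate | ~$1-15 | 3-30 min |
| 4 | Full Adapter Test | ~$8-25/model | 10-30 min |
| 5 | Compare Models Head-to-Head | ~$16-50 | 20-60 min |
| 6 | Add New Model Adapter | $0-$25 | varies |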
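
The head-to-head range happens to equal the Full Adapter Test range applied to two models (2 × $8-25 = $16-50). Assuming cost scales roughly linearly with the number of models, which this document does not state, a budget estimate for a larger comparison could be sketched as follows. The helper name `estimateFullTestCost` is hypothetical and not part of the harness.

```typescript
// Rough budget sketch. The per-model range comes from the table above;
// linear scaling with model count is an assumption, not a documented rule.
interface CostRange {
  low: number;  // USD, lower bound
  high: number; // USD, upper bound
}

const FULL_ADAPTER_TEST_PER_MODEL: CostRange = { low: 8, high: 25 };

// Hypothetical helper: multiplies the per-model range by the model count.
function estimateFullTestCost(
  modelCount: number,
  perModel: CostRange = FULL_ADAPTER_TEST_PER_MODEL,
): CostRange {
  if (!Number.isInteger(modelCount) || modelCount < 0) {
    throw new RangeError(`modelCount must be a non-negative integer, got ${modelCount}`);
  }
  return { low: perModel.low * modelCount, high: perModel.high * modelCount };
}

const twoModels = estimateFullTestCost(2);
console.log(`~$${twoModels.low}-${twoModels.high}`); // prints "~$16-50"
```

Treat the result as an order-of-magnitude figure only; actual spend depends on the models chosen and how many topics are run.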

Available Models

All models registered in the test harness (scripts/seed/lib/llm-test-harness.ts):

| Model | Provider | API Key Env Var | Notes |
|-------|----------|-----------------|-------|
| gpt-5.4-mini | OpenAI | OPENAI_API_KEY | High-tier escalation model |
| gpt-5.4-nano | OpenAI | OPENAI_API_KEY | Excludes sports and music topics (evidence fabrication, validation failures) |
| gpt-5-mini | OpenAI | OPENAI_API_KEY | Deprecated; use gpt-5.4-mini |
| gpt-5-nano | OpenAI | OPENAI_API_KEY | Deprecated; use gpt-5.4-nano |
| gpt-4o-mini | OpenAI | OPENAI_API_KEY | |
| claude-haiku-4-5 | Anthropic | ANTHROPIC_API_KEY | |
| grok-4-1-fast-non-reasoning | xAI | XAI_API_KEY | |
| gemini-2.0-flash-lite | Google | GOOGLE_API_KEY | |
| gemini-2.5-flash | Google | GOOGLE_API_KEY | v5 adapter, most thoroughly tuned |
| gemini-3-flash-preview | Google | GOOGLE_API_KEY | Default signoff reviewer |
| MiniMax-M2.5 | MiniMax | MINIMAX_API_KEY | |
| deepseek-chat | DeepSeek | DEEPSEEK_API_KEY | |
| mistral-large-latest | Mistral | MISTRAL_API_KEY | |
| mistral-medium-latest | Mistral | MISTRAL_API_KEY | |
| mistral-small-latest | Mistral | MISTRAL_API_KEY | |
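
Before starting a paid run, it can help to confirm that the keys for the chosen models are actually set. The sketch below is a standalone pre-flight check, not code from the harness. The function `missingApiKeys` is hypothetical, and the lookup table copies only a subset of the rows above.

```typescript
// Subset of the model table above (model name -> API key env var).
// Extend with the remaining rows as needed.
const MODEL_KEY_ENV_VARS: Record<string, string> = {
  "gpt-5.4-mini": "OPENAI_API_KEY",
  "claude-haiku-4-5": "ANTHROPIC_API_KEY",
  "grok-4-1-fast-non-reasoning": "XAI_API_KEY",
  "gemini-2.5-flash": "GOOGLE_API_KEY",
  "MiniMax-M2.5": "MINIMAX_API_KEY",
  "deepseek-chat": "DEEPSEEK_API_KEY",
  "mistral-small-latest": "MISTRAL_API_KEY",
};

// Hypothetical helper: returns the env var names that are unset or empty
// for the requested models, without duplicates.
function missingApiKeys(
  models: string[],
  env: Record<string, string | undefined>,
): string[] {
  const missing = new Set<string>();
  for (const model of models) {
    const envVar = MODEL_KEY_ENV_VARS[model];
    if (envVar === undefined) {
      throw new Error(`Unknown model: ${model}`);
    }
    if (!env[envVar]) {
      missing.add(envVar);
    }
  }
  return Array.from(missing);
}

// Example: only the OpenAI key is set, so the DeepSeek key is reported.
const exampleMissing = missingApiKeys(
  ["gpt-5.4-mini", "deepseek-chat"],
  { OPENAI_API_KEY: "sk-example" },
);
console.log(exampleMissing); // prints [ "DEEPSEEK_API_KEY" ]
```

In a real script, `process.env` would be passed as the second argument.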

Provider Concurrency Caps

The test harness enforces per-provider concurrency limits to avoid rate limiting:

| Provider | Max Concurrent Calls |
|----------|----------------------|
| Google | 15 |
| OpenAI | 10 |
| DeepSeek | 10 |
| Mistral | 10 |
| Anthropic | 8 |
| xAI | 5 |
| MiniMax | 3 |
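
To illustrate what a per-provider cap means in practice, here is a minimal sketch of a promise-based limiter using the values from the table. It is one common way to enforce such limits and is not taken from the harness. `ConcurrencyLimiter` and `limiterFor` are hypothetical names, and the harness's actual mechanism may differ.

```typescript
// Caps copied from the table above.
const PROVIDER_CAPS: Record<string, number> = {
  Google: 15, OpenAI: 10, DeepSeek: 10, Mistral: 10,
  Anthropic: 8, xAI: 5, MiniMax: 3,
};

// Hypothetical limiter: at most `max` tasks run at once; the rest wait in FIFO order.
class ConcurrencyLimiter {
  private active = 0;
  private readonly max: number;
  private readonly waiters: Array<() => void> = [];

  constructor(max: number) {
    this.max = max;
  }

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.active >= this.max) {
      // All slots are taken: wait until a finishing task hands its slot over.
      await new Promise<void>((resolve) => this.waiters.push(resolve));
    } else {
      this.active++;
    }
    try {
      return await task();
    } finally {
      const next = this.waiters.shift();
      if (next) {
        next(); // Slot passes directly to the next waiter; `active` is unchanged.
      } else {
        this.active--;
      }
    }
  }
}

const limiters = new Map<string, ConcurrencyLimiter>();

// Returns the shared limiter for a provider, creating it on first use.
function limiterFor(provider: string): ConcurrencyLimiter {
  let limiter = limiters.get(provider);
  if (!limiter) {
    const cap = PROVIDER_CAPS[provider];
    if (cap === undefined) {
      throw new Error(`No concurrency cap for provider: ${provider}`);
    }
    limiter = new ConcurrencyLimiter(cap);
    limiters.set(provider, limiter);
  }
  return limiter;
}
```

A call site would wrap each request, for example `limiterFor("MiniMax").run(() => callModel(prompt))`, where `callModel` stands in for whatever function issues the API request.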

Local Supabase Testing

Tests that hit the database (prompts 3-5) support two modes:

  • JSONL-only (default): Results are written to scripts/seed/.llm-test-data/ as JSONL files. No database setup is required beyond Supabase credentials, which are used for schema/category lookups.
  • Local Supabase (--commit): Starts a local Supabase instance via Docker, seeds reference data, and writes fact records and challenge content to the local database. Useful for testing the full DB write path (inserts, RLS, constraints) without touching production.

    # JSONL-only (default)
    bun scripts/seed/llm-fact-quality-testing.ts --all --models deepseek-chat

    # With local Supabase writes
    bun scripts/seed/llm-fact-quality-testing.ts --all --models deepseek-chat --commit

Requirements for --commit: Docker running, supabase CLI installed. If Docker is unavailable, the pipeline falls back to JSONL-only mode with a warning.
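
The fallback rule above can be summarized as a small decision function. This is an illustrative sketch only. `resolveOutputMode`, the `OutputMode` type, and the warning text are hypothetical, and how the pipeline actually detects Docker is not described here.

```typescript
type OutputMode = "jsonl-only" | "local-supabase";

interface ModeDecision {
  mode: OutputMode;
  warning?: string; // Set only when --commit was requested but could not be honored.
}

// Hypothetical helper mirroring the documented behavior:
// - no --commit           -> JSONL-only
// - --commit, no Docker   -> JSONL-only plus a warning
// - --commit with Docker  -> local Supabase writes
function resolveOutputMode(
  commitRequested: boolean,
  dockerAvailable: boolean,
): ModeDecision {
  if (!commitRequested) {
    return { mode: "jsonl-only" };
  }
  if (!dockerAvailable) {
    return {
      mode: "jsonl-only",
      warning: "Docker unavailable; falling back to JSONL-only mode.",
    };
  }
  return { mode: "local-supabase" };
}

console.log(resolveOutputMode(true, false).mode); // prints "jsonl-only"
```

Keeping the decision separate from the Docker check makes the fallback easy to unit test without starting any containers.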