Model Testing Prompts

Prompts for testing AI model adapters, comparing model quality, and validating new model integrations.

Prompts

#	Prompt	Cost	Duration
1	Simple Adapter Smoke Test	$0	~5-10s
2	Light Adapter Smoke Test	$0	<5s
3	Iterative Eligibility Gate	~$1-15	3-30 min
4	Full Adapter Test	~$8-25/model	10-30 min
5	Compare Models Head-to-Head	~$16-50	20-60 min
6	Add New Model Adapter	$0-25	varies

Available Models

All models registered in the test harness (scripts/seed/lib/llm-test-harness.ts):

Model	Provider	API Key Env Var	Notes
`gpt-5.4-mini`	OpenAI	`OPENAI_API_KEY`	High-tier escalation model
`gpt-5.4-nano`	OpenAI	`OPENAI_API_KEY`	Not used for sports or music topics (evidence fabrication, validation failures)
`gpt-5-mini`	OpenAI	`OPENAI_API_KEY`	Deprecated; use `gpt-5.4-mini` instead
`gpt-5-nano`	OpenAI	`OPENAI_API_KEY`	Deprecated; use `gpt-5.4-nano` instead
`gpt-4o-mini`	OpenAI	`OPENAI_API_KEY`
`claude-haiku-4-5`	Anthropic	`ANTHROPIC_API_KEY`
`grok-4-1-fast-non-reasoning`	xAI	`XAI_API_KEY`
`gemini-2.0-flash-lite`	Google	`GOOGLE_API_KEY`
`gemini-2.5-flash`	Google	`GOOGLE_API_KEY`	v5 adapter, most thoroughly tuned
`gemini-3-flash-preview`	Google	`GOOGLE_API_KEY`	Default signoff reviewer
`MiniMax-M2.5`	MiniMax	`MINIMAX_API_KEY`
`deepseek-chat`	DeepSeek	`DEEPSEEK_API_KEY`
`mistral-large-latest`	Mistral	`MISTRAL_API_KEY`
`mistral-medium-latest`	Mistral	`MISTRAL_API_KEY`
`mistral-small-latest`	Mistral	`MISTRAL_API_KEY`
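
Before starting a run, it can help to check that the API keys for the selected models are set. The following is a minimal sketch, not the harness's own code: `MODEL_ENV_VARS` and `missingApiKeys` are hypothetical names, and the mapping is simply copied from the table above.

```typescript
// Hypothetical lookup table copied from the "Available Models" table.
// The real harness may store this mapping differently.
const MODEL_ENV_VARS: Record<string, string> = {
  "gpt-5.4-mini": "OPENAI_API_KEY",
  "gpt-5.4-nano": "OPENAI_API_KEY",
  "gpt-5-mini": "OPENAI_API_KEY",
  "gpt-5-nano": "OPENAI_API_KEY",
  "gpt-4o-mini": "OPENAI_API_KEY",
  "claude-haiku-4-5": "ANTHROPIC_API_KEY",
  "grok-4-1-fast-non-reasoning": "XAI_API_KEY",
  "gemini-2.0-flash-lite": "GOOGLE_API_KEY",
  "gemini-2.5-flash": "GOOGLE_API_KEY",
  "gemini-3-flash-preview": "GOOGLE_API_KEY",
  "MiniMax-M2.5": "MINIMAX_API_KEY",
  "deepseek-chat": "DEEPSEEK_API_KEY",
  "mistral-large-latest": "MISTRAL_API_KEY",
  "mistral-medium-latest": "MISTRAL_API_KEY",
  "mistral-small-latest": "MISTRAL_API_KEY",
};

// Returns the sorted, de-duplicated list of env var names that the given
// models need but that are unset or empty in `env`.
function missingApiKeys(
  models: string[],
  env: Record<string, string | undefined>,
): string[] {
  const missing = new Set<string>();
  for (const model of models) {
    const envVar = MODEL_ENV_VARS[model];
    if (envVar === undefined) {
      throw new Error(`Unknown model: ${model}`);
    }
    if (!env[envVar]) {
      missing.add(envVar);
    }
  }
  return Array.from(missing).sort();
}
```

In a script this would typically be called as `missingApiKeys(selectedModels, process.env)`, aborting with a clear message if the returned list is non-empty.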

Provider Concurrency Caps

The test harness enforces per-provider concurrency limits to avoid rate limiting:

Provider	Max Concurrent Calls
Google	15
OpenAI	10
DeepSeek	10
Mistral	10
Anthropic	8
xAI	5
MiniMax	3
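
A per-provider cap like this is commonly implemented as a counting semaphore around each API call. The sketch below illustrates that general technique under stated assumptions; it is not taken from the harness, and `PROVIDER_CAPS` and `ProviderLimiter` are hypothetical names. Only the cap values come from the table above.

```typescript
// Cap values copied from the table above; the constant name is hypothetical.
const PROVIDER_CAPS: Record<string, number> = {
  Google: 15,
  OpenAI: 10,
  DeepSeek: 10,
  Mistral: 10,
  Anthropic: 8,
  xAI: 5,
  MiniMax: 3,
};

// Minimal counting semaphore: at most `max` tasks run at the same time,
// and any further tasks wait in FIFO order.
class ProviderLimiter {
  private readonly max: number;
  private active = 0;
  private readonly waiting: Array<() => void> = [];

  constructor(max: number) {
    this.max = max;
  }

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.active >= this.max) {
      // At capacity: wait until a finishing task hands over its slot.
      await new Promise<void>((resolve) => this.waiting.push(resolve));
    } else {
      this.active++;
    }
    try {
      return await task();
    } finally {
      const next = this.waiting.shift();
      if (next) {
        next(); // transfer the slot directly to the next waiter
      } else {
        this.active--; // nobody waiting: free the slot
      }
    }
  }
}
```

With one limiter per provider, for example `new ProviderLimiter(PROVIDER_CAPS["MiniMax"])`, every model call is wrapped in `limiter.run(() => callModel(...))`, where `callModel` stands in for whatever function performs the request.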

Local Supabase Testing

Tests that hit the database (prompts 3-5) support two modes:

JSONL-only (default): Results are written to scripts/seed/.llm-test-data/ as JSONL files. No database is required beyond the Supabase credentials used for schema/category lookups.
Local Supabase (--commit): Starts a local Supabase instance via Docker, seeds reference data, and writes fact records and challenge content to the local database. Useful for testing the full DB write path (inserts, RLS, constraints) without touching production.

# JSONL-only (default)
bun scripts/seed/llm-fact-quality-testing.ts --all --models deepseek-chat

# With local Supabase writes
bun scripts/seed/llm-fact-quality-testing.ts --all --models deepseek-chat --commit

Requirements for --commit: Docker must be running and the supabase CLI must be installed. If Docker is unavailable, the pipeline falls back to JSONL-only mode with a warning.
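
The fallback rule can be expressed as a small decision function. This is an illustrative sketch only: `resolveOutputMode`, `OutputMode`, and `ModeDecision` are hypothetical names, and how the pipeline actually detects Docker is not described here, so availability is passed in as a plain boolean.

```typescript
// Hypothetical types for illustration; not the pipeline's real API.
type OutputMode = "jsonl" | "local-supabase";

interface ModeDecision {
  mode: OutputMode;
  warning?: string;
}

// Mirrors the rule described above: JSONL-only is the default, --commit
// selects local Supabase, and a missing Docker downgrades to JSONL-only
// with a warning.
function resolveOutputMode(
  commitRequested: boolean,
  dockerAvailable: boolean,
): ModeDecision {
  if (!commitRequested) {
    return { mode: "jsonl" };
  }
  if (!dockerAvailable) {
    return {
      mode: "jsonl",
      warning: "Docker unavailable; falling back to JSONL-only mode",
    };
  }
  return { mode: "local-supabase" };
}
```

Keeping the decision separate from the Docker check makes the three cases easy to test without starting any containers.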
