Model Testing Prompts
Prompts for testing AI model adapters, comparing model quality, and validating new model integrations.
Prompts
| # | Prompt | Cost | Duration |
|---|---|---|---|
| 1 | Simple Adapter Smoke Test | $0 | ~5-10s |
| 2 | Light Adapter Smoke Test | $0 | <5s |
| 3 | Iterative Eligibility Gate | ~$1-15 | 3-30 min |
| 4 | Full Adapter Test | ~$8-25/model | 10-30 min |
| 5 | Compare Models Head-to-Head | ~$16-50 | 20-60 min |
| 6 | Add New Model Adapter | $0-$25 | varies |
Available Models
All models registered in the test harness (scripts/seed/lib/llm-test-harness.ts):
| Model | Provider | API Key Env Var | Notes |
|---|---|---|---|
| gpt-5.4-mini | OpenAI | OPENAI_API_KEY | High-tier escalation model |
| gpt-5.4-nano | OpenAI | OPENAI_API_KEY | Excludes sports and music topics (evidence fabrication, validation failures) |
| gpt-5-mini | OpenAI | OPENAI_API_KEY | Deprecated; use gpt-5.4-mini |
| gpt-5-nano | OpenAI | OPENAI_API_KEY | Deprecated; use gpt-5.4-nano |
| gpt-4o-mini | OpenAI | OPENAI_API_KEY | |
| claude-haiku-4-5 | Anthropic | ANTHROPIC_API_KEY | |
| grok-4-1-fast-non-reasoning | xAI | XAI_API_KEY | |
| gemini-2.0-flash-lite | Google | GOOGLE_API_KEY | |
| gemini-2.5-flash | Google | GOOGLE_API_KEY | v5 adapter, most thoroughly tuned |
| gemini-3-flash-preview | Google | GOOGLE_API_KEY | Default signoff reviewer |
| MiniMax-M2.5 | MiniMax | MINIMAX_API_KEY | |
| deepseek-chat | DeepSeek | DEEPSEEK_API_KEY | |
| mistral-large-latest | Mistral | MISTRAL_API_KEY | |
| mistral-medium-latest | Mistral | MISTRAL_API_KEY | |
| mistral-small-latest | Mistral | MISTRAL_API_KEY | |
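
Before a paid run it can help to confirm that every requested model has its API key set. The following is a minimal sketch of such a pre-flight check, not code from the harness: the map is copied from the table above (a subset, for brevity), and `MODEL_ENV_VAR` and `findMissingKeys` are hypothetical names introduced for illustration.

```typescript
// Env var required by each model, copied from the table above (subset).
// Hypothetical helper data; the real harness may store this differently.
const MODEL_ENV_VAR: Record<string, string> = {
  "gpt-5.4-mini": "OPENAI_API_KEY",
  "gpt-5.4-nano": "OPENAI_API_KEY",
  "claude-haiku-4-5": "ANTHROPIC_API_KEY",
  "grok-4-1-fast-non-reasoning": "XAI_API_KEY",
  "gemini-2.5-flash": "GOOGLE_API_KEY",
  "MiniMax-M2.5": "MINIMAX_API_KEY",
  "deepseek-chat": "DEEPSEEK_API_KEY",
  "mistral-small-latest": "MISTRAL_API_KEY",
};

// Returns one message per problem: a model missing from the map,
// or a model whose env var is unset or empty in the given environment.
function findMissingKeys(
  models: string[],
  env: Record<string, string | undefined>,
): string[] {
  const problems: string[] = [];
  for (const model of models) {
    const envVar = MODEL_ENV_VAR[model];
    if (envVar === undefined) {
      problems.push(`${model}: not in the model table`);
    } else if (!env[envVar]) {
      problems.push(`${model}: ${envVar} is not set`);
    }
  }
  return problems;
}
```

In a Bun or Node script this could be called as `findMissingKeys(["deepseek-chat"], process.env)`, aborting the run if the returned list is non-empty.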
Provider Concurrency Caps
The test harness enforces per-provider concurrency limits to avoid rate limiting:
| Provider | Max Concurrent Calls |
|---|---|
| Google | 15 |
| OpenAI | 10 |
| DeepSeek | 10 |
| Mistral | 10 |
| Anthropic | 8 |
| xAI | 5 |
| MiniMax | 3 |
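
One common way to enforce caps like these is a small promise-based semaphore per provider. The sketch below illustrates that idea under stated assumptions; it is not the harness's actual implementation. `ConcurrencyLimiter`, `withProviderLimit`, and `CAPS` are hypothetical names, and only a few caps from the table are included.

```typescript
// A minimal counting semaphore for async tasks (illustrative only).
class ConcurrencyLimiter {
  private active = 0;
  private readonly waiting: Array<() => void> = [];
  private readonly max: number;

  constructor(max: number) {
    this.max = max;
  }

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.active >= this.max) {
      // At the cap: park until a finishing task hands over its slot.
      await new Promise<void>((resolve) => this.waiting.push(resolve));
    } else {
      this.active++;
    }
    try {
      return await task();
    } finally {
      const next = this.waiting.shift();
      if (next) {
        next(); // Transfer the slot directly; `active` stays the same.
      } else {
        this.active--;
      }
    }
  }
}

// A few caps copied from the table above (assumed to be keyed by provider name).
const CAPS: Record<string, number> = { OpenAI: 10, Anthropic: 8, xAI: 5, MiniMax: 3 };
const limiters = new Map<string, ConcurrencyLimiter>();

// Runs `task` under the named provider's cap, creating the limiter on first use.
function withProviderLimit<T>(provider: string, task: () => Promise<T>): Promise<T> {
  let limiter = limiters.get(provider);
  if (!limiter) {
    const cap = CAPS[provider];
    if (cap === undefined) throw new Error(`No cap configured for ${provider}`);
    limiter = new ConcurrencyLimiter(cap);
    limiters.set(provider, limiter);
  }
  return limiter.run(task);
}
```

Handing the slot straight to the next waiter, rather than decrementing and re-incrementing the counter, keeps a newly arriving call from jumping ahead of tasks that are already queued.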
Local Supabase Testing
Tests that hit the database (prompts 3-5) support two modes:
- JSONL-only (default): Results are written to scripts/seed/.llm-test-data/ as JSONL files. No database is required beyond Supabase credentials for schema/category lookups.
- Local Supabase (--commit): Starts a local Supabase instance via Docker, seeds reference data, and writes fact records and challenge content to the local database. Useful for testing the full DB write path (inserts, RLS, constraints) without touching production.
```bash
# JSONL-only (default)
bun scripts/seed/llm-fact-quality-testing.ts --all --models deepseek-chat

# With local Supabase writes
bun scripts/seed/llm-fact-quality-testing.ts --all --models deepseek-chat --commit
```
Requirements for --commit: Docker running, supabase CLI installed. If Docker is unavailable, the pipeline falls back to JSONL-only mode with a warning.
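
The mode selection and fallback described above can be sketched as a small pure function. This is an illustration under assumptions, not the pipeline's code: `resolveWriteMode` and `toJsonl` are hypothetical names, the warning text is invented, and how Docker availability is actually detected is left to the caller.

```typescript
type WriteMode = "jsonl" | "local-supabase";

interface ModeDecision {
  mode: WriteMode;
  warning?: string;
}

// Picks the write mode from the --commit flag and Docker availability.
// `dockerAvailable` is supplied by the caller; detection is out of scope here.
function resolveWriteMode(commitRequested: boolean, dockerAvailable: boolean): ModeDecision {
  if (!commitRequested) {
    return { mode: "jsonl" };
  }
  if (!dockerAvailable) {
    // Hypothetical warning text; the real message may differ.
    return { mode: "jsonl", warning: "Docker unavailable; falling back to JSONL-only mode" };
  }
  return { mode: "local-supabase" };
}

// JSONL is one JSON value per line; this serializes records in that shape.
function toJsonl(records: unknown[]): string {
  return records.map((record) => JSON.stringify(record) + "\n").join("");
}
```

Keeping the decision separate from the Docker probe makes the fallback easy to unit test without a container runtime.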