5. Compare Models Head-to-Head
Purpose: Run the quality pipeline on two models in parallel, then merge results into a side-by-side comparison report.
Prerequisites:
- Supabase credentials (`SUPABASE_URL`, `SUPABASE_SERVICE_ROLE_KEY`, `SUPABASE_ANON_KEY`)
- API keys for both target models (see Available Models for env var names)
- Active database with seeded entities
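A quick pre-flight check for the credentials above can catch a missing variable before a paid run starts. The following is a minimal sketch, not part of the repository: `check_env` is a hypothetical helper, and it only covers the three Supabase variables named here, since the model API key names are listed elsewhere.

```bash
# check_env NAME...
# Hypothetical helper (not part of the repo): reports every listed variable
# that is unset, empty, or not exported, and returns non-zero if any is missing.
check_env() {
  local missing=0 name
  for name in "$@"; do
    # printenv only sees exported variables, which is what a child
    # process such as bun will actually inherit.
    if [ -z "$(printenv "$name")" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
# check_env SUPABASE_URL SUPABASE_SERVICE_ROLE_KEY SUPABASE_ANON_KEY || exit 1
```

Append the API key variable names for the two chosen models to the same call once you have looked them up.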
Cost / Duration: ~$16-50 total (two models) | 20-60 minutes
Prompt
Run a head-to-head comparison between two models.
Pick two models from the available models list (see index).
Example pairings: gpt-5.4-nano vs deepseek-chat, gemini-2.5-flash vs mistral-small-latest
Step 1 — Run each model into its own output directory (can run in parallel terminals):
```bash
# Terminal 1
bun scripts/seed/llm-fact-quality-testing.ts --all --models <MODEL_A> --limit 50 --output-dir model-a-run
# Terminal 2
bun scripts/seed/llm-fact-quality-testing.ts --all --models <MODEL_B> --limit 50 --output-dir model-b-run
```
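If two terminals are inconvenient, the same two runs can be started from one shell with background jobs. This is a sketch under assumptions: `run_pair` is a hypothetical helper, not something the script provides, and it assumes the two runs do not interfere with each other when started from the same working directory.

```bash
# run_pair CMD_A CMD_B
# Hypothetical helper (not part of the repo): starts both commands in the
# background, waits for each, and returns non-zero if either one failed.
run_pair() {
  local status=0 pid_a pid_b
  sh -c "$1" &
  pid_a=$!
  sh -c "$2" &
  pid_b=$!
  # `wait PID` returns that job's exit status, so a failure in one run
  # is not masked by success in the other.
  wait "$pid_a" || status=1
  wait "$pid_b" || status=1
  return "$status"
}

# Usage (MODEL_A and MODEL_B are shell variables holding the two model IDs):
# run_pair \
#   "bun scripts/seed/llm-fact-quality-testing.ts --all --models $MODEL_A --limit 50 --output-dir model-a-run" \
#   "bun scripts/seed/llm-fact-quality-testing.ts --all --models $MODEL_B --limit 50 --output-dir model-b-run"
```

Output from both runs will interleave in one terminal, so separate terminals remain easier to read when debugging.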
Step 2 — Merge results into a comparison report:
```bash
bun scripts/seed/llm-fact-quality-testing.ts --report --merge-dirs model-a-run,model-b-run
```
Replace `<MODEL_A>` and `<MODEL_B>` with the two model IDs to compare.
The merged report shows side-by-side dimension scores, cost per fact,
latency, and an overall eligibility comparison.
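When the output directory names live in variables, the comma-separated value passed to `--merge-dirs` can be built rather than typed by hand. A minimal sketch, where `join_dirs` is a hypothetical helper and not part of the script:

```bash
# join_dirs DIR...
# Hypothetical helper (not part of the repo): joins its arguments with
# commas, matching the value format shown for --merge-dirs above.
join_dirs() {
  # "$*" joins the arguments using the first character of IFS.
  local IFS=,
  printf '%s\n' "$*"
}

# Usage:
# bun scripts/seed/llm-fact-quality-testing.ts --report \
#   --merge-dirs "$(join_dirs model-a-run model-b-run)"
```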
### With Local Supabase
Add `--commit` to both runs to also write to a local Supabase instance:
```bash
bun scripts/seed/llm-fact-quality-testing.ts --all --models <MODEL_A> --limit 50 --output-dir model-a-run --commit
bun scripts/seed/llm-fact-quality-testing.ts --all --models <MODEL_B> --limit 50 --output-dir model-b-run --commit
```
Verification
- Both model runs complete all 5 phases
- Each output directory contains its own JSONL data files
- Merged report generated with side-by-side dimension scores
- Cost-per-fact and latency comparison included
- Clear winner identified or parity noted per dimension
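The output-directory check in the list above can be scripted. The sketch below assumes the JSONL data files use a `.jsonl` extension, which this document does not state; `has_jsonl` is a hypothetical helper.

```bash
# has_jsonl DIR
# Hypothetical helper (not part of the repo): succeeds only if DIR exists
# and contains at least one file ending in .jsonl (assumed extension).
has_jsonl() {
  local dir="$1"
  [ -d "$dir" ] || return 1
  # Non-empty find output means at least one matching file was found.
  [ -n "$(find "$dir" -type f -name '*.jsonl' | head -n 1)" ]
}

# Usage:
# for d in model-a-run model-b-run; do
#   has_jsonl "$d" || echo "no JSONL files in $d" >&2
# done
```

This only confirms that files are present; it does not validate their contents or that all phases completed.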
Related Prompts
- Full Model Adapter Test — Single model eligibility run
- Light Model Adapter Smoke Test — Quick contract check (run first)