5. Compare Models Head-to-Head

Purpose: Run the quality pipeline on two models in parallel, then merge results into a side-by-side comparison report.

Prerequisites:

  • Supabase credentials (SUPABASE_URL, SUPABASE_SERVICE_ROLE_KEY, SUPABASE_ANON_KEY)
  • API keys for both target models (see Available Models for env var names)
  • Active database with seeded entities
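The Supabase variables can be exported before either run; a sketch with placeholder values (the URL shown is the Supabase CLI's default local API port, and the keys are stand-ins you would copy from `supabase status` on a local stack):

```shell
# Placeholder values; on a local stack, copy the real keys from
# `supabase status`. 54321 is the Supabase CLI's default API port.
export SUPABASE_URL="http://127.0.0.1:54321"
export SUPABASE_SERVICE_ROLE_KEY="<service-role-key>"
export SUPABASE_ANON_KEY="<anon-key>"
```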

Cost / Duration: ~$16-50 total (two models) | 20-60 minutes

Prompt

Run a head-to-head comparison between two models.

Pick two models from the available models list (see index).
Example pairings: gpt-5.4-nano vs deepseek-chat, gemini-2.5-flash vs mistral-small-latest

Step 1 — Run the pipeline for each model, writing to its own output directory (the two runs can go in parallel terminals):

```bash
# Terminal 1
bun scripts/seed/llm-fact-quality-testing.ts --all --models <MODEL_A> --limit 50 --output-dir model-a-run

# Terminal 2
bun scripts/seed/llm-fact-quality-testing.ts --all --models <MODEL_B> --limit 50 --output-dir model-b-run
```
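If you'd rather use a single terminal, the same two commands can be backgrounded and awaited. A minimal sketch of the pattern, with a stand-in function in place of the actual `bun scripts/seed/llm-fact-quality-testing.ts` invocations:

```shell
# Stand-in for one model run; substitute the real `bun scripts/...` command.
run_model() { echo "run complete: $1"; }

run_model MODEL_A &   # background the first run
run_model MODEL_B &   # background the second run
wait                  # block until both background jobs finish
echo "both runs finished"
```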

Step 2 — Merge results into a comparison report:

```bash
bun scripts/seed/llm-fact-quality-testing.ts --report --merge-dirs model-a-run,model-b-run
```

Replace `<MODEL_A>` and `<MODEL_B>` with the two model IDs to compare.

The merged report shows side-by-side dimension scores, cost-per-fact,
latency, and an overall eligibility comparison.
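For a rough sense of where cost-per-fact comes from, here is a sketch over sample per-call records; the `cost_usd` and `fact_count` field names are made up for illustration and are not the script's actual JSONL schema:

```shell
# Two fake per-call records; the real field names and schema may differ.
records='{"cost_usd":0.012,"fact_count":4}
{"cost_usd":0.020,"fact_count":6}'

# Split on : , and } so $2 is the cost and $4 the fact count,
# then print total cost divided by total facts.
printf '%s\n' "$records" |
  awk -F'[:,}]' '{cost+=$2; facts+=$4} END {printf "cost/fact: $%.4f\n", cost/facts}'
# → cost/fact: $0.0032
```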

### With Local Supabase

Add `--commit` to both runs so each one also writes its results to a local Supabase instance:

```bash
bun scripts/seed/llm-fact-quality-testing.ts --all --models <MODEL_A> --limit 50 --output-dir model-a-run --commit
bun scripts/seed/llm-fact-quality-testing.ts --all --models <MODEL_B> --limit 50 --output-dir model-b-run --commit
```

Verification

  • Both model runs complete all 5 phases
  • Each output directory contains its own JSONL data files
  • Merged report generated with side-by-side dimension scores
  • Cost-per-fact and latency comparison included
  • Clear winner identified or parity noted per dimension

Back to index