5. Compare Models Head-to-Head

Purpose: Run the quality pipeline on two models in parallel, then merge results into a side-by-side comparison report.

Prerequisites:

  • Supabase credentials (SUPABASE_URL, SUPABASE_SERVICE_ROLE_KEY, SUPABASE_ANON_KEY)
  • API keys for both target models (see Available Models for env var names)
  • Active database with seeded entities
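The Supabase variables can be exported before either run; a sketch with placeholder values (the URL shown is the Supabase CLI's default local API port, and the keys are stand-ins you would copy from `supabase status` on a local stack):

```shell
# Placeholder values; on a local stack, copy the real keys from
# `supabase status`. 54321 is the Supabase CLI's default API port.
export SUPABASE_URL="http://127.0.0.1:54321"
export SUPABASE_SERVICE_ROLE_KEY="<service-role-key>"
export SUPABASE_ANON_KEY="<anon-key>"
```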

Cost / Duration: ~$16-50 total (two models) | 20-60 minutes

Prompt

Run a head-to-head comparison between two models.

Pick two models from the available models list (see index).
Example pairings: gpt-5.4-nano vs deepseek-chat, gemini-2.5-flash vs mistral-small-latest

Step 1 — Run the pipeline for each model, writing to its own output directory (the two runs can go in parallel terminals):

```bash
# Terminal 1
bun scripts/seed/llm-fact-quality-testing.ts --all --models <MODEL_A> --limit 50 --output-dir model-a-run

# Terminal 2
bun scripts/seed/llm-fact-quality-testing.ts --all --models <MODEL_B> --limit 50 --output-dir model-b-run
```
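If you'd rather use a single terminal, the same two commands can be backgrounded and awaited. A minimal sketch of the pattern, with a stand-in function in place of the actual `bun scripts/seed/llm-fact-quality-testing.ts` invocations:

```shell
# Stand-in for one model run; substitute the real `bun scripts/...` command.
run_model() { echo "run complete: $1"; }

run_model MODEL_A &   # background the first run
run_model MODEL_B &   # background the second run
wait                  # block until both background jobs finish
echo "both runs finished"
```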

Step 2 — Merge results into a comparison report:

```bash
bun scripts/seed/llm-fact-quality-testing.ts --report --merge-dirs model-a-run,model-b-run
```

Replace `<MODEL_A>` and `<MODEL_B>` with the two model IDs to compare.

The merged report shows side-by-side dimension scores, cost-per-fact,
latency, and an overall eligibility comparison.
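For a rough sense of where cost-per-fact comes from, here is a sketch over sample per-call records; the `cost_usd` and `fact_count` field names are made up for illustration and are not the script's actual JSONL schema:

```shell
# Two fake per-call records; the real field names and schema may differ.
records='{"cost_usd":0.012,"fact_count":4}
{"cost_usd":0.020,"fact_count":6}'

# Split on : , and } so $2 is the cost and $4 the fact count,
# then print total cost divided by total facts.
printf '%s\n' "$records" |
  awk -F'[:,}]' '{cost+=$2; facts+=$4} END {printf "cost/fact: $%.4f\n", cost/facts}'
# → cost/fact: $0.0032
```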

### With Local Supabase

Add `--commit` to both runs so each one also writes its results to a local Supabase instance:

```bash
bun scripts/seed/llm-fact-quality-testing.ts --all --models <MODEL_A> --limit 50 --output-dir model-a-run --commit
bun scripts/seed/llm-fact-quality-testing.ts --all --models <MODEL_B> --limit 50 --output-dir model-b-run --commit
```

Verification

  • Both model runs complete all 5 phases
  • Each output directory contains its own JSONL data files
  • Merged report generated with side-by-side dimension scores
  • Cost-per-fact and latency comparison included
  • Clear winner identified or parity noted per dimension

Back to index