# 4. Full Adapter Test

**Purpose:** Run the complete LLM fact quality testing pipeline against a specific model to evaluate eligibility across 7 dimensions.

**Prerequisites:**

- Supabase credentials (`SUPABASE_URL`, `SUPABASE_SERVICE_ROLE_KEY`, `SUPABASE_ANON_KEY`)
- API key for the target model (see Available Models for env var names)
- Active database with seeded `topic_categories` and `fact_record_schemas`
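Before a run, it can save a failed invocation to confirm the required environment variables are set. A minimal pre-flight sketch (the variable names come from the list above; the model API key name depends on your target model, so it is not checked here):

```shell
# Hypothetical pre-flight check: warn about unset Supabase credentials
missing=0
for var in SUPABASE_URL SUPABASE_SERVICE_ROLE_KEY SUPABASE_ANON_KEY; do
  if [ -z "$(printenv "$var")" ]; then
    echo "missing: $var"
    missing=1
  fi
done
if [ "$missing" -eq 0 ]; then
  echo "all Supabase credentials set"
fi
```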

**Cost / Duration:** ~$8-25 per model (at `--limit 50`) | 10-30 minutes per model

## Prompt

Run the full LLM fact quality testing pipeline for a specific model.

Pick a model from the available models list (see index).
Common choices: `gpt-5.4-nano`, `gemini-2.5-flash`, `deepseek-chat`, `mistral-small-latest`

```bash
bun scripts/seed/llm-fact-quality-testing.ts --all --models <MODEL> --limit 50
```

Replace `<MODEL>` with the target model ID (e.g., `deepseek-chat`).

This runs all 5 phases:
1. **Generate** — Extract facts from DB entities using the target model
2. **Validate** — Run structural, consistency, cross-model, and evidence validation
3. **Challenge** — Generate challenge content (quiz/recall) for each fact
4. **Signoff** — AI-judge review of fact+challenge quality
5. **Report** — Produce eligibility report with dimension scores

After completion, review the generated report and verify that all 7 eligibility
dimensions meet their thresholds (≥97% for the structural dimensions, ≥90% for the
subjective ones):
- structural: validation, evidence, challenges
- subjective: schema_adherence, voice_adherence, style_adherence, token_efficiency
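A score-versus-threshold comparison can be done mechanically; a hedged sketch (the `passes` helper is illustrative, not part of the pipeline, and the scores below are made-up examples):

```shell
# Hypothetical helper: does a dimension score meet its threshold?
passes() { awk -v s="$1" -v t="$2" 'BEGIN { exit !(s >= t) }'; }

# e.g. comparing structural dimensions against the 97% bar
passes 98.5 97 && echo "validation: pass"
if passes 96.9 97; then echo "evidence: pass"; else echo "evidence: fail"; fi
```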

### Individual Phase Runs

You can run phases independently (useful for debugging or re-running a failed phase):

```bash
bun scripts/seed/llm-fact-quality-testing.ts --generate --models <MODEL> --limit 50
bun scripts/seed/llm-fact-quality-testing.ts --validate
bun scripts/seed/llm-fact-quality-testing.ts --challenge
bun scripts/seed/llm-fact-quality-testing.ts --signoff
bun scripts/seed/llm-fact-quality-testing.ts --report
```

### Local Supabase Mode

Add `--commit` to write results to a local Supabase instance (requires Docker):

```bash
bun scripts/seed/llm-fact-quality-testing.ts --all --models <MODEL> --limit 50 --commit
```

This starts/reuses a Docker-based local Supabase, seeds reference data (`topic_categories`,
`fact_record_schemas`), and writes generated fact records and challenge content to the local DB.
Useful for testing the full insert path (RLS policies, FK constraints, triggers) without
touching production. Falls back to JSONL-only output if Docker is unavailable.
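The fallback decision can be reproduced manually before a `--commit` run; a sketch of the kind of probe involved (the script's actual detection logic may differ):

```shell
# Hypothetical Docker availability probe, mirroring the --commit fallback
if docker info >/dev/null 2>&1; then
  mode="local-supabase"
  echo "Docker available: --commit can write to the local Supabase instance"
else
  mode="jsonl-only"
  echo "Docker unavailable: the pipeline falls back to JSONL output"
fi
```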

### Additional Options

| Flag | Default | Description |
|------|---------|-------------|
| `--signoff-model M` | `gemini-3-flash-preview` | Model used as AI-judge for quality review |
| `--concurrency N` | `8` | Max parallel API calls per phase |
| `--output-dir DIR` | `.llm-test-data/` | Custom output directory (relative to `.llm-test-data/`, or absolute) |
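For example, a run that swaps the judge model, lowers parallelism, and writes to a named output directory might look like this (flag values are illustrative, not recommendations):

```shell
bun scripts/seed/llm-fact-quality-testing.ts --all --models deepseek-chat --limit 50 \
  --signoff-model gpt-5.4-mini --concurrency 4 --output-dir deepseek-run1
```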

## Verification

- Pipeline completes all 5 phases without errors
- Report generated at `scripts/seed/.llm-test-data/report-<model>.md`
- All 7 eligibility dimensions meet their thresholds (structural ≥97%: validation, evidence, challenges; subjective ≥90%: schema_adherence, voice_adherence, style_adherence, token_efficiency)
- Cost tracked and reported in the final summary
- No validation failures in the signoff phase
- FCG diversity test passes: `bun scripts/seed/test-fcg-diversity.ts --limit 5 --models <MODEL>`
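The report-existence check can be scripted; a minimal sketch using the path above (the model ID is illustrative):

```shell
# Hypothetical check that the eligibility report was produced
model="deepseek-chat"
report="scripts/seed/.llm-test-data/report-${model}.md"
if [ -f "$report" ]; then
  echo "found: $report"
else
  echo "not found: $report (run the pipeline first)"
fi
```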

## Model-Specific Notes

### DeepSeek (`deepseek-chat`)

- **Free-text generation mode (V4):** DeepSeek does not use `json_schema` structured output. A custom fetch wrapper in `model-router.ts` downgrades `json_schema` → `json_object` and injects schema guidance into the system prompt as human-readable text.
- **Known T1 canary tendencies:** academic register, home/away team swaps in sports, overly narrative `fact_values`, date format inconsistency (natural language vs. ISO).
- **Cheapest model tested:** ~$0.0023/fact; excellent for high-volume canary runs.
- **Voice/style scores:** strong voice (92/100), moderate style (84/100) in initial testing.

### GPT-5.4 Nano (`gpt-5.4-nano`)

- **Topic exclusion:** excludes sports and music topics due to persistent quality failures; the model router falls back to `gemini-3-flash-preview` for excluded topics.
- **FCG key diversity:** historically defaulted to one "easy" key for 60%+ of challenges. Now mitigated by `validateGroupDiversity()` plus a retry with explicit key assignments in `generateGroupChallenges()`.
- **Replaces:** `gpt-5-nano` (deprecated).

### GPT-5.4 Mini (`gpt-5.4-mini`)

- **Role:** high-tier escalation model used when default-tier models fail quality gates.
- **Replaces:** `gpt-5-mini` (deprecated).

Back to index