4. Full Adapter Test
Purpose: Run the complete LLM quality testing pipeline against a specific model to evaluate eligibility across 7 dimensions.
Prerequisites:
- Supabase credentials (`SUPABASE_URL`, `SUPABASE_SERVICE_ROLE_KEY`, `SUPABASE_ANON_KEY`)
- API key for the target model (see Available Models for env var names)
- Active database with seeded `topic_categories` and `fact_record_schemas`
Cost / Duration: ~$8-25 per model (at `--limit 50`) | 10-30 minutes per model
### Prompt
Run the full LLM fact quality testing pipeline for a specific model.
Pick a model from the available models list (see index).
Common choices: `gpt-5.4-nano`, `gemini-2.5-flash`, `deepseek-chat`, `mistral-small-latest`
```bash
bun scripts/seed/llm-fact-quality-testing.ts --all --models <MODEL> --limit 50
```
Replace `<MODEL>` with the target model ID (e.g., `deepseek-chat`).
This runs all 5 phases:
1. **Generate** — Extract facts from DB entities using the target model
2. **Validate** — Run structural, consistency, cross-model, and evidence validation
3. **Challenge** — Generate challenge content (quiz/recall) for each fact
4. **Signoff** — AI-judge review of fact+challenge quality
5. **Report** — Produce eligibility report with dimension scores
After completion, review the generated report and verify all 7 eligibility
dimensions meet their thresholds (≥97% for structural dimensions, ≥90% for
subjective ones; see Verification below):
- Structural: validation, evidence, challenges
- Subjective: schema_adherence, voice_adherence, style_adherence, token_efficiency
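As a quick sanity check, the two-tier eligibility gate can be sketched in TypeScript. The report shape, dimension keys as score fields, and the `isEligible` helper are assumptions for illustration, not the pipeline's real API:

```typescript
// Hypothetical sketch: check a report's dimension scores against the
// two threshold tiers (structural >= 97%, subjective >= 90%).
type DimensionScores = Record<string, number>;

const STRUCTURAL = ["validation", "evidence", "challenges"];
const SUBJECTIVE = [
  "schema_adherence",
  "voice_adherence",
  "style_adherence",
  "token_efficiency",
];

function isEligible(scores: DimensionScores): boolean {
  return (
    STRUCTURAL.every((d) => (scores[d] ?? 0) >= 0.97) &&
    SUBJECTIVE.every((d) => (scores[d] ?? 0) >= 0.9)
  );
}
```

A missing dimension counts as a zero score here, so an incomplete report fails the gate rather than passing by omission.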
### Individual Phase Runs
You can run phases independently (useful for debugging or re-running a failed phase):
```bash
bun scripts/seed/llm-fact-quality-testing.ts --generate --models <MODEL> --limit 50
bun scripts/seed/llm-fact-quality-testing.ts --validate
bun scripts/seed/llm-fact-quality-testing.ts --challenge
bun scripts/seed/llm-fact-quality-testing.ts --signoff
bun scripts/seed/llm-fact-quality-testing.ts --report
```
### Local Supabase Mode
Add `--commit` to write results to a local Supabase instance (requires Docker):
```bash
bun scripts/seed/llm-fact-quality-testing.ts --all --models <MODEL> --limit 50 --commit
```
This starts/reuses a Docker-based local Supabase, seeds reference data (`topic_categories`,
`fact_record_schemas`), and writes generated fact records and challenge content to the local DB.
Useful for testing the full insert path (RLS policies, FK constraints, triggers) without
touching production. Falls back to JSONL-only if Docker is unavailable.
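A minimal sketch of that fallback decision, assuming a `docker info` probe (the probe, function names, and mode labels are hypothetical, not taken from the real script):

```typescript
// Hypothetical sketch: probe for a usable Docker daemon, and downgrade
// to JSONL-only output when --commit is requested but Docker is missing.
import { spawnSync } from "node:child_process";

function dockerAvailable(): boolean {
  // `docker info` exits 0 only when a daemon is reachable.
  const res = spawnSync("docker", ["info"], { stdio: "ignore" });
  return res.status === 0;
}

function resolveOutputMode(commitRequested: boolean): "supabase" | "jsonl" {
  if (commitRequested && dockerAvailable()) return "supabase";
  return "jsonl"; // no --commit, or Docker unavailable: JSONL-only
}
```

Without `--commit`, the probe is skipped entirely and results stay in JSONL files.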
### Additional Options
| Flag | Default | Description |
|------|---------|-------------|
| `--signoff-model M` | `gemini-3-flash-preview` | Model used as AI-judge for quality review |
| `--concurrency N` | `8` | Max parallel API calls per phase |
| `--output-dir DIR` | `.llm-test-data/` | Custom output directory (relative to `.llm-test-data/`, or absolute) |
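What `--concurrency N` implies can be sketched as a simple worker-pool limiter over per-item API calls. This is an illustrative pattern, not the script's actual implementation:

```typescript
// Hypothetical sketch: run `fn` over `items` with at most `limit`
// calls in flight at once, preserving result order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // single-threaded event loop: no race on next
      results[i] = await fn(items[i]);
    }
  }
  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    worker,
  );
  await Promise.all(workers);
  return results;
}
```

Bounding in-flight calls this way keeps a phase from hammering a provider's rate limits while still saturating the allowed parallelism.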
### Verification
- Pipeline completes all 5 phases without errors
- Report generated at `scripts/seed/.llm-test-data/report-<model>.md`
- All 7 eligibility dimensions meet thresholds (≥97% structural: validation/evidence/challenges; ≥90% subjective: schema/voice/style/token_efficiency)
- Cost tracked and reported in final summary
- No validation failures in the signoff phase
- FCG diversity test passes: `bun scripts/seed/test-fcg-diversity.ts --limit 5 --models <MODEL>`
### Model-Specific Notes
#### DeepSeek (`deepseek-chat`)
- Free-text generation mode (V4): DeepSeek does not use `json_schema` structured output. A custom fetch wrapper in `model-router.ts` downgrades `json_schema` → `json_object` and injects schema guidance into the system prompt as human-readable text.
- Known T1 canary tendencies: academic register, home/away team swaps in sports, overly narrative `fact_values`, date format inconsistency (natural language vs ISO).
- Cheapest model tested: ~$0.0023/fact — excellent for high-volume canary runs.
- Voice/style scores: strong voice (92/100), moderate style (84/100) in initial testing.
#### GPT-5.4 Nano (`gpt-5.4-nano`)
- Topic exclusion: excludes `sports` and `music` topics due to persistent quality failures. The model router falls back to `gemini-3-flash-preview` for excluded topics.
- FCG key diversity: historically defaulted to one "easy" key for 60%+ of challenges. Now mitigated by `validateGroupDiversity()` + retry with explicit key assignments in `generateGroupChallenges()`.
- Replaces: `gpt-5-nano` (deprecated).
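The key-diversity guard mentioned above can be sketched like this. The challenge shape and the 40% cap are assumptions for illustration; only the function name `validateGroupDiversity` comes from the source:

```typescript
// Hypothetical sketch: reject a challenge batch if any single fact key
// accounts for more than `maxShare` of the group, prompting a retry
// with explicit key assignments.
interface Challenge {
  factKey: string;
}

function validateGroupDiversity(
  challenges: Challenge[],
  maxShare = 0.4, // assumed cap, not the real threshold
): boolean {
  if (challenges.length === 0) return true;
  const counts = new Map<string, number>();
  for (const c of challenges) {
    counts.set(c.factKey, (counts.get(c.factKey) ?? 0) + 1);
  }
  const max = Math.max(...counts.values());
  return max / challenges.length <= maxShare;
}
```

A batch where one "easy" key covers 60% of challenges fails this check, triggering the retry path.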
#### GPT-5.4 Mini (`gpt-5.4-mini`)
- High-tier escalation model used when default-tier models fail quality gates.
- Replaces: `gpt-5-mini` (deprecated).
### Related Prompts
- Light Model Adapter Smoke Test — Quick contract check (run first)
- Iterative Eligibility Gate Test — Iterative eligibility tests
- Compare Models Head-to-Head — Run two models and compare
- Add a New Model Adapter — Scaffold a new adapter before testing