# 3. Iterative Eligibility Gate
Purpose: Validate that a single model meets eligibility thresholds (≥97% structural, ≥90% subjective) across all 7 dimensions via graduated volume tiers, catching failures early before committing to full pipeline cost.
Prerequisites:
- Supabase credentials (`SUPABASE_URL`, `SUPABASE_SERVICE_ROLE_KEY`, `SUPABASE_ANON_KEY`)
- API key for the target model (see Available Models for env var names)
- Active database with seeded `topic_categories` and `fact_record_schemas`
- Light Model Adapter Smoke Test passing for the target model
Cost / Duration: ~$1-3 (fast pass at T1) to ~$5-15 (all three tiers) | 3-30 minutes
## Tier Design
| Tier | --limit | Threshold | Est. Cost | Rationale |
|---|---|---|---|---|
| T1 Canary | 10 | ≥90% | ~$1-3 | Small N makes 97% brittle; 90% catches gross failures |
| T2 Coverage | 25 | ≥95% | ~$3-8 | Full entity coverage (≥1/entity); tighter threshold |
| T3 Production Gate | 50 | ≥97%/90% | ~$5-15 | Production threshold (97% structural, 90% subjective); statistically meaningful N |
Each tier runs in its own --output-dir for isolated data. If a tier fails its threshold, abort immediately — do not proceed to the next tier.
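The abort rule can be sketched as a small threshold check. A minimal sketch in TypeScript, assuming a report shaped as a map from dimension name to score in [0, 1] — the actual report format produced by `llm-fact-quality-testing.ts` may differ, and the dimension names below are made up:

```typescript
// Hypothetical report shape: dimension name -> score in [0, 1].
// The real report format from llm-fact-quality-testing.ts may differ.
type TierReport = Record<string, number>;

// Return a human-readable failure for every dimension under the tier threshold.
function failingDimensions(report: TierReport, threshold: number): string[] {
  return Object.entries(report)
    .filter(([, score]) => score < threshold)
    .map(
      ([dim, score]) =>
        `${dim}: ${(score * 100).toFixed(1)}% (needs >=${threshold * 100}%)`,
    );
}

// Made-up T1 scores for illustration (a real report has 7 dimensions).
const t1: TierReport = { schemaValidity: 0.95, sourceAccuracy: 0.88 };
const failures = failingDimensions(t1, 0.9);
if (failures.length > 0) {
  console.error(`ABORT tier:\n${failures.join("\n")}`);
}
```

An empty result means the tier passed and the next tier may run; any entry means abort and report.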
## Prompt
Run a graduated eligibility gate for a specific model.
Pick a model from the available models list (see index).
Common choices: gpt-5.4-nano, gemini-2.5-flash, deepseek-chat, mistral-small-latest
### Tier 1 — Canary (--limit 10, threshold ≥90%)
```bash
bun scripts/seed/llm-fact-quality-testing.ts --all --models <MODEL> --limit 10 --output-dir t1-canary
```
Review the report. If ANY of the 7 eligibility dimensions scores below 90%, STOP.
Report which dimensions failed and by how much.
### Tier 2 — Coverage (--limit 25, threshold ≥95%)
Only proceed if T1 passed.
```bash
bun scripts/seed/llm-fact-quality-testing.ts --all --models <MODEL> --limit 25 --output-dir t2-coverage
```
Review the report. If ANY dimension scores below 95%, STOP.
Report which dimensions failed and by how much.
### Tier 3 — Production Gate (--limit 50, thresholds ≥97% structural / ≥90% subjective)
Only proceed if T2 passed.
```bash
bun scripts/seed/llm-fact-quality-testing.ts --all --models <MODEL> --limit 50 --output-dir t3-production
```
Review the report. Structural dimensions must score ≥97% and subjective dimensions ≥90% to pass the production gate.
### Progression Report (optional)
After all tiers pass (or after the last completed tier), merge results
for a side-by-side progression view:
```bash
bun scripts/seed/llm-fact-quality-testing.ts --report --merge-dirs t1-canary,t2-coverage,t3-production
```
Replace `<MODEL>` with the target model ID throughout.
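The merged progression view can be thought of as a pivot from per-tier summaries to per-dimension rows. A hypothetical sketch — the file layout and summary shape that `--merge-dirs` actually consumes are internal to the script, so the shape below is an assumption:

```typescript
// Assumed shape: each tier produces a summary of dimension -> score.
type Summary = Record<string, number>;

// Pivot per-tier summaries into one row per dimension, one column per tier.
function mergeSummaries(
  tiers: Record<string, Summary>,
): Record<string, Record<string, number>> {
  const rows: Record<string, Record<string, number>> = {};
  for (const [tier, summary] of Object.entries(tiers)) {
    for (const [dimension, score] of Object.entries(summary)) {
      (rows[dimension] ??= {})[tier] = score;
    }
  }
  return rows;
}

// Made-up scores for illustration.
const merged = mergeSummaries({
  "t1-canary": { schemaValidity: 0.92 },
  "t3-production": { schemaValidity: 0.98 },
});
console.table(merged);
```

The side-by-side layout makes regressions between tiers (a dimension that dropped as volume grew) easy to spot.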
### Local Supabase Mode (optional)
To also write generated facts and challenges to a local Supabase database,
add `--commit` to any tier command:
```bash
bun scripts/seed/llm-fact-quality-testing.ts --all --models <MODEL> --limit 10 --output-dir t1-canary --commit
```
This starts a Docker-based local Supabase instance, seeds reference data,
and writes fact records + challenge content to the local DB. Requires Docker.
Falls back to JSONL-only mode if Docker is unavailable.
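To see which mode you will get before spending anything, you can probe the Docker daemon yourself. A minimal sketch that mirrors the fallback behavior described above but is not the script's actual detection code:

```typescript
import { execSync } from "node:child_process";

// Probe the local Docker daemon; `docker info` exits non-zero when the
// daemon is unreachable or the docker CLI is missing.
function dockerAvailable(): boolean {
  try {
    execSync("docker info", { stdio: "ignore" });
    return true;
  } catch {
    return false;
  }
}

console.log(
  dockerAvailable()
    ? "Docker available: --commit will write to local Supabase"
    : "Docker unavailable: expect JSONL-only fallback",
);
```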
## Verification
- T1 Canary: all 7 dimensions ≥90% (or abort reported)
- T2 Coverage: all 7 dimensions ≥95% (or abort reported)
- T3 Production Gate: structural dimensions ≥97%, subjective dimensions ≥90% (or abort reported)
- Each tier output is isolated in its own directory
- Cost tracked and reported per tier
- Progression report generated (if all tiers passed)
- Total cost stays within ~$5-15 for a full 3-tier pass
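A quick sanity check for the isolation item above, assuming the tier directories were created relative to the current working directory (where they actually land depends on how the script resolves `--output-dir`):

```typescript
import { existsSync } from "node:fs";

// Tier output directories named in the commands above.
const tiers = ["t1-canary", "t2-coverage", "t3-production"];

for (const dir of tiers) {
  console.log(`${dir}: ${existsSync(dir) ? "present" : "missing"}`);
}
```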
## When to Use This vs. Full Test
| Scenario | Use This (03) | Use Full Test (04) |
|---|---|---|
| New adapter, first eligibility check | Yes | No — too expensive for untested adapter |
| Quick regression check after prompt change | Yes (T1 only) | No |
| Final production sign-off | Yes (all 3 tiers) | Yes — as confirmation |
| Comparing two models | No | Use Compare Models Head-to-Head |
## Related Prompts
- Light Model Adapter Smoke Test — Run this first (free, in-memory)
- Full Model Adapter Test — Full pipeline at production volume
- Add a New Model Adapter — Scaffold and register a new adapter