3. Iterative Eligibility Gate

Purpose: Validate a single model meets eligibility thresholds (≥97% structural, ≥90% subjective) across all 7 dimensions via graduated volume tiers, catching failures early before committing full pipeline cost.

Prerequisites:

  • Supabase credentials (SUPABASE_URL, SUPABASE_SERVICE_ROLE_KEY, SUPABASE_ANON_KEY)
  • API key for the target model (see Available Models for env var names)
  • Active database with seeded topic_categories and fact_record_schemas
  • Light Model Adapter Smoke Test passing for the target model

Cost / Duration: ~$1-3 (fast pass at T1) to ~$5-15 (all three tiers) | 3-30 minutes

Tier Design

Tier--limitThresholdEst. CostRationale
T1 Canary10≥90%~$1-3Small N makes 97% brittle; 90% catches gross failures
T2 Coverage25≥95%~$3-8Full entity coverage (≥1/entity); tighter threshold
T3 Production Gate50≥97%/90%~$5-15Production threshold (97% structural, 90% subjective); statistically meaningful N

Each tier runs in its own --output-dir for isolated data. If a tier fails its threshold, abort immediately — do not proceed to the next tier.

Prompt

Run a graduated eligibility gate for a specific model.

Pick a model from the available models list (see index).
Common choices: gpt-5.4-nano, gemini-2.5-flash, deepseek-chat, mistral-small-latest

### Tier 1 — Canary (--limit 10, threshold ≥90%)

```bash
bun scripts/seed/llm-fact-quality-testing.ts --all --models <MODEL> --limit 10 --output-dir t1-canary
```

Review the report. If ANY of the 7 eligibility dimensions scores below 90%, STOP.
Report which dimensions failed and by how much.

### Tier 2 — Coverage (--limit 25, threshold ≥95%)

Only proceed if T1 passed.

```bash
bun scripts/seed/llm-fact-quality-testing.ts --all --models <MODEL> --limit 25 --output-dir t2-coverage
```

Review the report. If ANY dimension scores below 95%, STOP.
Report which dimensions failed and by how much.

### Tier 3 — Production Gate (--limit 50, threshold ≥97%)

Only proceed if T2 passed.

```bash
bun scripts/seed/llm-fact-quality-testing.ts --all --models <MODEL> --limit 50 --output-dir t3-production
```

Review the report. All 7 dimensions must score ≥97% to pass the production gate.

### Progression Report (optional)

After all tiers pass (or after the last completed tier), merge results
for a side-by-side progression view:

```bash
bun scripts/seed/llm-fact-quality-testing.ts --report --merge-dirs t1-canary,t2-coverage,t3-production
```

Replace `<MODEL>` with the target model ID throughout.

### Local Supabase Mode (optional)

To also write generated facts and challenges to a local Supabase database,
add `--commit` to any tier command:

```bash
bun scripts/seed/llm-fact-quality-testing.ts --all --models <MODEL> --limit 10 --output-dir t1-canary --commit
```

This starts a Docker-based local Supabase instance, seeds reference data,
and writes fact records + challenge content to the local DB. Requires Docker.
Falls back to JSONL-only mode if Docker is unavailable.

Verification

  • T1 Canary: all 7 dimensions ≥90% (or abort reported)
  • T2 Coverage: all 7 dimensions ≥95% (or abort reported)
  • T3 Production Gate: all 7 dimensions ≥97% (or abort reported)
  • Each tier output is isolated in its own directory
  • Cost tracked and reported per tier
  • Progression report generated (if all tiers passed)
  • Total cost stays within ~$5-15 for a full 3-tier pass

When to Use This vs. Full Test

ScenarioUse This (03)Use Full Test (04)
New adapter, first eligibility checkYesNo — too expensive for untested adapter
Quick regression check after prompt changeYes (T1 only)No
Final production sign-offYes (all 3 tiers)Yes — as confirmation
Comparing two modelsNoUse Compare Models Head-to-Head

Back to index