# 3. Iterative Eligibility Gate
Purpose: Validate that a single model meets eligibility thresholds (≥97% structural, ≥90% subjective) across all 7 dimensions via graduated volume tiers, catching failures early before committing to full pipeline cost.
Prerequisites:
- Supabase credentials (`SUPABASE_URL`, `SUPABASE_SERVICE_ROLE_KEY`, `SUPABASE_ANON_KEY`)
- API key for the target model (see Available Models for env var names)
- Active database with seeded `topic_categories` and `fact_record_schemas`
- Light Model Adapter Smoke Test passing for the target model
Cost / Duration: ~$1-3 (fast pass at T1) to ~$5-15 (all three tiers) | 3-30 minutes
## Tier Design
| Tier | --limit | Threshold | Est. Cost | Rationale |
|---|---|---|---|---|
| T1 Canary | 10 | ≥90% | ~$1-3 | Small N makes 97% brittle; 90% catches gross failures |
| T2 Coverage | 25 | ≥95% | ~$3-8 | Full entity coverage (≥1/entity); tighter threshold |
| T3 Production Gate | 50 | ≥97%/90% | ~$5-15 | Production threshold (97% structural, 90% subjective); statistically meaningful N |
Each tier runs in its own --output-dir for isolated data. If a tier fails its threshold, abort immediately — do not proceed to the next tier.
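The abort rule can be sketched as a small threshold check. A minimal sketch in TypeScript, assuming a report shaped as a map from dimension name to score in [0, 1] — the actual report format produced by `llm-fact-quality-testing.ts` may differ, and the dimension names below are made up:

```typescript
// Hypothetical report shape: dimension name -> score in [0, 1].
// The real report format from llm-fact-quality-testing.ts may differ.
type TierReport = Record<string, number>;

// Return a human-readable failure for every dimension under the tier threshold.
function failingDimensions(report: TierReport, threshold: number): string[] {
  return Object.entries(report)
    .filter(([, score]) => score < threshold)
    .map(
      ([dim, score]) =>
        `${dim}: ${(score * 100).toFixed(1)}% (needs >=${threshold * 100}%)`,
    );
}

// Made-up T1 scores for illustration (a real report has 7 dimensions).
const t1: TierReport = { schemaValidity: 0.95, sourceAccuracy: 0.88 };
const failures = failingDimensions(t1, 0.9);
if (failures.length > 0) {
  console.error(`ABORT tier:\n${failures.join("\n")}`);
}
```

An empty result means the tier passed and the next tier may run; any entry means abort and report.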
## Prompt
Run a graduated eligibility gate for a specific model.
Pick a model from the available models list (see index).
Common choices: gpt-5.4-nano, gemini-2.5-flash, deepseek-chat, mistral-small-latest
### Tier 1 — Canary (--limit 10, threshold ≥90%)
```bash
bun scripts/seed/llm-fact-quality-testing.ts --all --models <MODEL> --limit 10 --output-dir t1-canary
```
Review the report. If ANY of the 7 eligibility dimensions scores below 90%, STOP.
Report which dimensions failed and by how much.
### Tier 2 — Coverage (--limit 25, threshold ≥95%)
Only proceed if T1 passed.
```bash
bun scripts/seed/llm-fact-quality-testing.ts --all --models <MODEL> --limit 25 --output-dir t2-coverage
```
Review the report. If ANY dimension scores below 95%, STOP.
Report which dimensions failed and by how much.
### Tier 3 — Production Gate (--limit 50, thresholds ≥97% structural / ≥90% subjective)
Only proceed if T2 passed.
```bash
bun scripts/seed/llm-fact-quality-testing.ts --all --models <MODEL> --limit 50 --output-dir t3-production
```
Review the report. Structural dimensions must score ≥97% and subjective dimensions ≥90% to pass the production gate.
### Progression Report (optional)
After all tiers pass (or after the last completed tier), merge results
for a side-by-side progression view:
```bash
bun scripts/seed/llm-fact-quality-testing.ts --report --merge-dirs t1-canary,t2-coverage,t3-production
```
Replace `<MODEL>` with the target model ID throughout.
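The merged progression view can be thought of as a pivot from per-tier summaries to per-dimension rows. A hypothetical sketch — the file layout and summary shape that `--merge-dirs` actually consumes are internal to the script, so the shape below is an assumption:

```typescript
// Assumed shape: each tier produces a summary of dimension -> score.
type Summary = Record<string, number>;

// Pivot per-tier summaries into one row per dimension, one column per tier.
function mergeSummaries(
  tiers: Record<string, Summary>,
): Record<string, Record<string, number>> {
  const rows: Record<string, Record<string, number>> = {};
  for (const [tier, summary] of Object.entries(tiers)) {
    for (const [dimension, score] of Object.entries(summary)) {
      (rows[dimension] ??= {})[tier] = score;
    }
  }
  return rows;
}

// Made-up scores for illustration.
const merged = mergeSummaries({
  "t1-canary": { schemaValidity: 0.92 },
  "t3-production": { schemaValidity: 0.98 },
});
console.table(merged);
```

The side-by-side layout makes regressions between tiers (a dimension that dropped as volume grew) easy to spot.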
### Local Supabase Mode (optional)
To also write generated facts and challenges to a local Supabase database,
add `--commit` to any tier command:
```bash
bun scripts/seed/llm-fact-quality-testing.ts --all --models <MODEL> --limit 10 --output-dir t1-canary --commit
```
This starts a Docker-based local Supabase instance, seeds reference data,
and writes fact records + challenge content to the local DB. Requires Docker.
Falls back to JSONL-only mode if Docker is unavailable.
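To see which mode you will get before spending anything, you can probe the Docker daemon yourself. A minimal sketch that mirrors the fallback behavior described above but is not the script's actual detection code:

```typescript
import { execSync } from "node:child_process";

// Probe the local Docker daemon; `docker info` exits non-zero when the
// daemon is unreachable or the docker CLI is missing.
function dockerAvailable(): boolean {
  try {
    execSync("docker info", { stdio: "ignore" });
    return true;
  } catch {
    return false;
  }
}

console.log(
  dockerAvailable()
    ? "Docker available: --commit will write to local Supabase"
    : "Docker unavailable: expect JSONL-only fallback",
);
```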
## Verification
- T1 Canary: all 7 dimensions ≥90% (or abort reported)
- T2 Coverage: all 7 dimensions ≥95% (or abort reported)
- T3 Production Gate: structural dimensions ≥97%, subjective dimensions ≥90% (or abort reported)
- Each tier output is isolated in its own directory
- Cost tracked and reported per tier
- Progression report generated (if all tiers passed)
- Total cost stays within ~$5-15 for a full 3-tier pass
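A quick sanity check for the isolation item above, assuming the tier directories were created relative to the current working directory (where they actually land depends on how the script resolves `--output-dir`):

```typescript
import { existsSync } from "node:fs";

// Tier output directories named in the commands above.
const tiers = ["t1-canary", "t2-coverage", "t3-production"];

for (const dir of tiers) {
  console.log(`${dir}: ${existsSync(dir) ? "present" : "missing"}`);
}
```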
## When to Use This vs. Full Test
| Scenario | Use This (03) | Use Full Test (04) |
|---|---|---|
| New adapter, first eligibility check | Yes | No — too expensive for untested adapter |
| Quick regression check after prompt change | Yes (T1 only) | No |
| Final production sign-off | Yes (all 3 tiers) | Yes — as confirmation |
| Comparing two models | No | Use Compare Models Head-to-Head |
## Related Prompts
- Light Model Adapter Smoke Test — Run this first (free, in-memory)
- Full Model Adapter Test — Full pipeline at production volume
- Add a New Model Adapter — Scaffold and register a new adapter