2. Audit Fact Quality

Purpose: Run a read-only audit of fact corpus health, covering null rates, validation pass rates, topic distribution, challenge content coverage, and duplicate detection.

Prerequisites:

  • Supabase credentials (read-only access sufficient)

Cost / Duration: $0 (read-only queries) | 1-2 minutes

Prompt

Run a fact quality audit to assess corpus health.

Step 1 — Run the backfill audit (read-only, no modifications):

```bash
bun scripts/seed/backfill-fact-nulls.ts --audit
```

Step 2 — Run these SQL queries against the database to get a full health picture:

```sql
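-- Null counts and rates by field (validated facts only)
SELECT
  COUNT(*) FILTER (WHERE notability_score IS NULL) AS null_notability,
  COUNT(*) FILTER (WHERE image_url IS NULL) AS null_images,
  COUNT(*) AS total_facts,
  -- Percentages, so results can be compared directly with the null-rate threshold;
  -- NULLIF yields NULL instead of a division-by-zero error on an empty table
  ROUND(100.0 * COUNT(*) FILTER (WHERE notability_score IS NULL) / NULLIF(COUNT(*), 0), 1) AS null_notability_pct,
  ROUND(100.0 * COUNT(*) FILTER (WHERE image_url IS NULL) / NULLIF(COUNT(*), 0), 1) AS null_images_pct
FROM fact_records
WHERE status = 'validated';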

-- Validation pass rates
SELECT
  status,
  COUNT(*) AS count,
  ROUND(COUNT(*)::numeric / SUM(COUNT(*)) OVER () * 100, 1) AS pct
FROM fact_records
GROUP BY status
ORDER BY count DESC;

-- Topic distribution
SELECT
  tc.name AS topic,
  COUNT(f.id) AS fact_count
FROM fact_records f
JOIN topic_categories tc ON f.topic_category_id = tc.id
WHERE f.status = 'validated'
GROUP BY tc.name
ORDER BY fact_count DESC;

-- Challenge content coverage
SELECT
  COUNT(DISTINCT f.id) FILTER (WHERE fcc.id IS NOT NULL) AS facts_with_challenges,
  COUNT(DISTINCT f.id) AS total_validated
FROM fact_records f
LEFT JOIN fact_challenge_content fcc ON fcc.fact_record_id = f.id
WHERE f.status = 'validated';
```

Report the results as a health scorecard, with a recommendation for each
area that misses its threshold (e.g., a null rate above 5% or a topic imbalance greater than 3:1).
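
The threshold checks can be sketched in TypeScript as follows. This is a minimal illustration, not code from the repository: the names `NullCounts`, `TopicCount`, `ScorecardLine`, `pct`, `imbalanceRatio`, and `buildScorecard` are all hypothetical, the default limits (5% and 3:1) are taken from the examples in the prompt above, and "topic imbalance" is assumed to mean the largest topic's fact count divided by the smallest's.

```typescript
// Hypothetical shapes mirroring the columns returned by the SQL queries above.
interface NullCounts {
  nullNotability: number;
  nullImages: number;
  totalFacts: number;
}

interface TopicCount {
  topic: string;
  factCount: number;
}

interface ScorecardLine {
  metric: string;
  value: number;
  threshold: number;
  ok: boolean;
}

// Percentage of `part` in `total`; returns 0 for an empty corpus to avoid dividing by zero.
function pct(part: number, total: number): number {
  return total === 0 ? 0 : (part * 100) / total;
}

// Assumed definition of imbalance: largest topic count divided by smallest topic count.
function imbalanceRatio(topics: TopicCount[]): number {
  if (topics.length === 0) return 1;
  const counts = topics.map((t) => t.factCount);
  const min = Math.min(...counts);
  const max = Math.max(...counts);
  return min === 0 ? Infinity : max / min;
}

// Builds one scorecard line per metric; `ok` is false when the value exceeds its limit.
function buildScorecard(
  nulls: NullCounts,
  topics: TopicCount[],
  maxNullPct = 5,
  maxImbalance = 3,
): ScorecardLine[] {
  const line = (metric: string, value: number, threshold: number): ScorecardLine => ({
    metric,
    value,
    threshold,
    ok: value <= threshold,
  });
  return [
    line("null notability %", pct(nulls.nullNotability, nulls.totalFacts), maxNullPct),
    line("null image %", pct(nulls.nullImages, nulls.totalFacts), maxNullPct),
    line("topic imbalance ratio", imbalanceRatio(topics), maxImbalance),
  ];
}

// Example with made-up numbers (not real audit results).
const exampleCard = buildScorecard(
  { nullNotability: 30, nullImages: 2, totalFacts: 1000 },
  [
    { topic: "History", factCount: 400 },
    { topic: "Science", factCount: 100 },
  ],
);
for (const row of exampleCard) {
  console.log(`${row.ok ? "PASS" : "FAIL"} ${row.metric}: ${row.value} (limit ${row.threshold})`);
}
```

Any line with `ok` set to `false` is a candidate for a recommendation in the scorecard. The limits are parameters because the prompt gives them only as examples.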

Verification

  • Backfill audit completes without errors
  • Null rates reported for notability, challenges, images
  • Validation pass rate reported (target: >95% active)
  • Topic distribution reported with balance ratio
  • Duplicate detection stats reported
  • Recommendations provided for any below-threshold areas
