CSV → JSON Config Pipeline

Summary

Replace hardcoded TypeScript constants and YAML-in-markdown control files with a CSV → JSON pipeline. CSVs become the source of truth for all structured configuration. Individual sync scripts validate and compile CSVs to typed JSON. Existing TypeScript files become thin wrappers that import JSON and re-export the same typed constants — zero import changes across the codebase.

Motivation

~3,500 lines of structured configuration are currently spread across:

  • Embedded TypeScript constants (challenge-content-rules.ts, taxonomy-content-rules.ts, model-registry.ts, generate-curated-entries.ts)
  • YAML blocks inside markdown (SEED.md, APP-CONTROL.md)

Because of this, every config change requires a code deploy, produces a noisy diff, and cannot be edited in a grid. A CSV pipeline provides:

  • Visual grid editing via VS Code CSV extensions or Google Sheets
  • Clean git diffs — CSV line changes are readable and reviewable
  • Import/export flexibility — bulk edits, paste from other sources, analysis snapshots
  • Zod validation at sync time catches errors before they reach production

Architecture

File Layout

config/
  csv/                                    ← Source of truth (committed to Git)
    categories.csv                        ← slug, subcategory, count, prompt
    models.csv                            ← modelId, provider, status, input_price, output_price, deprecation_note
    model-tiers.csv                       ← tier, provider, model
    challenge-voices.csv                  ← key, value (tone_prefix, voice_constitution)
    format-voices.csv                     ← format_slug, personality, register, energy, excitement_driver, ...
    format-rules.csv                      ← format_slug, setup, challenge, reveal_correct, reveal_wrong, ...
    style-rules.csv                       ← style, setup, challenge, reveal_correct, reveal_wrong, correct_answer, style_data
    style-voices.csv                      ← style, mechanics, personality, register, phrasing_examples
    difficulty-levels.csv                 ← level, cognitive_demand, prompt_guidance, knowledge_scope, response_expectation
    banned-patterns.csv                   ← pattern, description
    taxonomy-rules.csv                    ← slug, extraction_guidance, challenge_guidance
    taxonomy-avoid.csv                    ← slug, pattern
    taxonomy-domain-terms.csv             ← slug, term, use
    taxonomy-expert-phrases.csv           ← slug, phrase
    taxonomy-prefer-over.csv              ← slug, instead_of, use
    taxonomy-voices.csv                   ← slug, register, energy, excitement_driver
    taxonomy-voice-pitfalls.csv           ← slug, pitfall
    seed-controls.csv                     ← section, key, value, description
    app-controls/
      crons.csv                           ← name, path, schedule, status, summary
      workers.csv                         ← name, app, status, health_port
      worker-queues.csv                   ← worker, queue_name, summary
      queues.csv                          ← name, consumer, status, trigger
      env-controls.csv                    ← section, key, value, required, description
    long-text/                            ← Multi-paragraph text values
      voice-constitution.txt
      super-fact-rules.txt
  generated/                              ← Derived artifacts (committed to Git)
    categories.json
    models.json
    challenge-rules.json
    taxonomy-rules.json
    seed-controls.json
    app-controls.json

Nested Data Strategy: Companion CSVs

Nested arrays are expressed as separate CSV files joined by a foreign key column.

Example — Taxonomy rules with vocabulary:

taxonomy-rules.csv (primary):

slug,extraction_guidance,challenge_guidance
sports,"Always list scores with the higher score first...","Lead with the winner..."

taxonomy-domain-terms.csv (companion, FK = slug):

slug,term,use
sports,roster depth,how many quality players a team has beyond starters
sports,triple-double,double digits in three statistical categories in one game

The sync script groups companion rows by slug and nests them into the primary record's JSON output.
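The grouping step can be sketched as follows; `joinCompanions` here is an illustrative reimplementation, not the actual `lib/join.ts` code:

```typescript
type Row = Record<string, string>

// Group companion rows by their foreign-key column, strip the FK, and
// nest each group into the matching primary record at a dot-notation path.
function joinCompanions(
  primaries: Row[],
  companions: Row[],
  fk: string,
  path: string,
): Record<string, unknown>[] {
  const byKey = new Map<string, Row[]>()
  for (const { [fk]: key, ...rest } of companions) {
    byKey.set(key, [...(byKey.get(key) ?? []), rest])
  }
  return primaries.map((primary) => {
    const record: Record<string, unknown> = { ...primary }
    const parts = path.split('.')
    let target = record
    for (const part of parts.slice(0, -1)) {
      target = (target[part] ??= {}) as Record<string, unknown>
    }
    target[parts[parts.length - 1]] = byKey.get(primary[fk]) ?? []
    return record
  })
}
```

With the sports example above, `joinCompanions(primaries, domainTerms, 'slug', 'vocabulary.domain_terms')` yields a `sports` record carrying a nested `vocabulary.domain_terms` array.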

Long-Text Fields: @file References

CSV cells containing @file:path/to/file.txt are resolved at sync time by reading the referenced file inline. This keeps CSVs clean while allowing multi-paragraph text like voice constitutions to live in dedicated editable files.

key,value
tone_prefix,"VOICE: You are writing for Eko..."
voice_constitution,@file:config/csv/long-text/voice-constitution.txt
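One plausible shape for the resolver (the `resolveFileRefs` name and the injectable reader are assumptions, not the real `file-ref.ts` API):

```typescript
import { readFileSync } from 'node:fs'
import { resolve } from 'node:path'

const FILE_REF = /^@file:(.+)$/

// Replace any `@file:...` cell value with the referenced file's contents,
// resolved relative to baseDir. The reader is injectable for testing.
function resolveFileRefs(
  row: Record<string, string>,
  baseDir: string,
  read: (path: string) => string = (p) => readFileSync(p, 'utf8'),
): Record<string, string> {
  const out: Record<string, string> = {}
  for (const [key, value] of Object.entries(row)) {
    const match = value.match(FILE_REF)
    out[key] = match ? read(resolve(baseDir, match[1])).trim() : value
  }
  return out
}
```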

Sync Scripts

Individual scripts per domain group, sharing common utilities.

Script Layout

scripts/config/
  sync.ts                              ← Orchestrator: runs all domain syncs
  sync-categories.ts                   ← categories.csv → categories.json
  sync-models.ts                       ← models.csv + model-tiers.csv → models.json
  sync-taxonomy.ts                     ← taxonomy-*.csv (7 files) → taxonomy-rules.json
  sync-challenge.ts                    ← format-voices, style-rules, etc. → challenge-rules.json
  sync-seed-controls.ts                ← seed-controls.csv → seed-controls.json
  sync-app-controls.ts                 ← app-controls/*.csv → app-controls.json
  lib/
    csv-reader.ts                      ← Parse CSV + row-level error reporting
    join.ts                            ← Companion CSV foreign-key joining
    validate.ts                        ← Zod validation with file:line:column errors
    checksum.ts                        ← Staleness detection for CI
    file-ref.ts                        ← @file: reference resolution

Shared Library

  • csv-reader.ts — Wraps csv-parse with typed output, tracks source file + row numbers for error reporting
  • join.ts — joinCompanions() groups companion rows by FK, nests them at dot-notation paths (e.g., vocabulary.domain_terms), supports a flatten option for single-value arrays
  • validate.ts — validateRows() runs Zod .parse() per row with file:line:column error formatting
  • checksum.ts — SHA-256 of CSV contents vs JSON contents for CI staleness detection
  • file-ref.ts — Detects @file: prefixes, resolves relative paths, reads file content inline

CLI Commands

# Day-to-day
bun run config:sync                    # Parse + validate + write all JSON
bun run config:sync --only taxonomy    # Sync one domain
bun run config:sync --dry-run          # Validate without writing
bun run config:check                   # CI: fail if JSON is stale vs CSV

# One-time migration
bun run config:export                  # Extract existing constants → CSV
bun run config:export --only taxonomy  # Export one domain
bun run config:verify                  # Prove round-trip matches original constants
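A minimal sketch of how the orchestrator could resolve the --only flag; `selectDomains` and its error handling are assumptions, though the domain names mirror the script layout above:

```typescript
const DOMAINS = [
  'categories',
  'models',
  'taxonomy',
  'challenge',
  'seed-controls',
  'app-controls',
] as const
type Domain = (typeof DOMAINS)[number]

// Resolve which domain syncs to run from CLI arguments; no --only flag
// means every domain runs.
function selectDomains(argv: string[]): Domain[] {
  const flag = argv.indexOf('--only')
  if (flag === -1) return [...DOMAINS]
  const picked = argv[flag + 1] as Domain | undefined
  if (!picked || !DOMAINS.includes(picked)) {
    throw new Error(`--only expects one of: ${DOMAINS.join(', ')}`)
  }
  return [picked]
}
```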

Integration: Thin Wrapper Pattern

Existing TypeScript files keep their types, interfaces, and functions. Only the hardcoded data is replaced with JSON imports. All existing imports across the codebase remain unchanged.

Before

// packages/ai/src/challenge-content-rules.ts
export const FORMAT_VOICE: Record<ChallengeFormatSlug, FormatVoice> = {
  big_fan_of: {
    personality: 'You are a fellow superfan...',
    // ... hundreds of lines
  }
}

After

// packages/ai/src/challenge-content-rules.ts
import { z } from 'zod'
import raw from '../../../config/generated/challenge-rules.json'

// Types stay here (unchanged); a Zod schema mirrors the interface
export type ChallengeStyle = 'multiple_choice' | ...
export interface FormatVoice { ... }
const formatVoiceSchema = z.object({ ... })

// Data comes from JSON, validated at import time
export const FORMAT_VOICE = z.record(
  formatVoiceSchema
).parse(raw.format_voices)

// Functions stay here (unchanged)
export function validateChallengeContent(...) { ... }

Migration Map

Each entry lists the file, the data removed from it, the JSON that replaces it, and what the file keeps:

  • packages/ai/src/challenge-content-rules.ts
    Data removed: CHALLENGE_TONE_PREFIX, CHALLENGE_VOICE_CONSTITUTION, FORMAT_VOICE, FORMAT_RULES, STYLE_RULES, STYLE_VOICE, DIFFICULTY_LEVELS, BANNED_PATTERNS, CQ002_PREFIXES
    JSON source: challenge-rules.json
    Keeps: types, interfaces, validation functions, CQ002_REGEX
  • packages/ai/src/taxonomy-content-rules.ts
    Data removed: TAXONOMY_CONTENT_RULES, TAXONOMY_VOICE
    JSON source: taxonomy-rules.json
    Keeps: types (TaxonomyContentRule, TaxonomyVocabulary, TaxonomyVoice)
  • packages/config/src/model-registry.ts
    Data removed: MODEL_REGISTRY, DEFAULT_TIER_CONFIG
    JSON source: models.json
    Keeps: types, getModelEntry(), isDeprecated()
  • scripts/seed/generate-curated-entries.ts
    Data removed: CATEGORY_SPECS (~350 lines)
    JSON source: categories.json
    Keeps: script logic, AI generation, DB insertion
  • docs/projects/seeding/SEED.md
    Data removed: all YAML control blocks
    JSON source: seed-controls.json
    Keeps: prose documentation (references JSON)
  • docs/APP-CONTROL.md
    Data removed: all YAML control blocks
    JSON source: app-controls.json
    Keeps: prose documentation (references JSON)

SEED.md and APP-CONTROL.md

These markdown files are replaced as config sources. They become lightweight documentation that references the JSON:

# Seeding Control

Configuration: `config/csv/seed-controls.csv`
Generated: `config/generated/seed-controls.json`
Sync: `bun run config:sync --only seed-controls`

One-Time Migration: Export + Verify

Export Phase

Export scripts read existing TypeScript constants and write pre-populated CSVs:

  • export-categories.ts — reads CATEGORY_SPECS in generate-curated-entries.ts, writes categories.csv
  • export-models.ts — reads MODEL_REGISTRY + DEFAULT_TIER_CONFIG, writes models.csv and model-tiers.csv
  • export-challenge.ts — reads FORMAT_VOICE, STYLE_RULES, STYLE_VOICE, DIFFICULTY_LEVELS, BANNED_PATTERNS, etc., writes 6 CSV files + long-text files
  • export-taxonomy.ts — reads TAXONOMY_CONTENT_RULES + TAXONOMY_VOICE, writes 7 CSV files
  • export-seed-controls.ts — reads YAML blocks from SEED.md, writes seed-controls.csv
  • export-app-controls.ts — reads YAML blocks from APP-CONTROL.md, writes 5 CSV files

Verification Phase

Round-trip verification proves the migration is lossless:

bun run config:export    # Existing constants → CSV
bun run config:sync      # CSV → JSON
bun run config:verify    # JSON === original constants (deep equality)

config:verify performs a deep-equality comparison between the generated JSON and the original TypeScript constants. Mismatches report the exact path:

✗ taxonomy-rules.json → sports.vocabulary.domain_terms[2].use
  Expected: "a game-ending play by the home team in baseball"
  Got:      "a game-ending play by the home team"

After verification passes, hardcoded data is removed from .ts files and replaced with JSON imports. Export and verify scripts become migration scaffolding (can be kept or removed).
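The path reporting behind that output can be sketched as a recursive deep diff; `diffPaths` is illustrative, not the actual verify script:

```typescript
// Return the dot/bracket path of every mismatched leaf between two
// plain-data values (objects, arrays, primitives).
function diffPaths(expected: unknown, got: unknown, path = ''): string[] {
  if (Object.is(expected, got)) return []
  const bothObjects =
    typeof expected === 'object' && expected !== null &&
    typeof got === 'object' && got !== null
  if (!bothObjects) return [path || '(root)']
  const keys = new Set([...Object.keys(expected as object), ...Object.keys(got as object)])
  const mismatches: string[] = []
  for (const key of keys) {
    // Arrays render as [i], object keys as .key
    const sub = Array.isArray(expected) ? `${path}[${key}]` : path ? `${path}.${key}` : key
    mismatches.push(
      ...diffPaths(
        (expected as Record<string, unknown>)[key],
        (got as Record<string, unknown>)[key],
        sub,
      ),
    )
  }
  return mismatches
}
```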

CI Integration

Add to the existing CI pipeline alongside migrations:check:

bun run config:check    # Fails if any CSV has changed without regenerating JSON

Uses SHA-256 checksums of CSV file contents compared against the committed JSON. Same pattern as bun run migrations:check.
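The staleness check can be sketched as follows, assuming the sync step records a checksum alongside the generated JSON (the function names and the recording mechanism are assumptions):

```typescript
import { createHash } from 'node:crypto'

// Hash the concatenated CSV sources; sync records this value next to the
// generated JSON, and CI recomputes and compares it.
function checksumSources(csvContents: string[]): string {
  const hash = createHash('sha256')
  for (const contents of csvContents) hash.update(contents)
  return hash.digest('hex')
}

function isStale(csvContents: string[], recorded: string): boolean {
  return checksumSources(csvContents) !== recorded
}
```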

Domain Inventory

Categories (1 CSV, ~156 rows)

30+ root topic slugs with 100+ subcategories. Each row: slug, subcategory, count, prompt.

Models (2 CSVs, ~35 rows)

31 AI models with provider, status, pricing. Separate CSV for 3 tier configs.

Challenge Rules (6 CSVs + 2 long-text files, ~120 rows)

8 format voices, 8 format rules, 6 style rules, 6 style voices, 5 difficulty levels, 10+ banned patterns. Long-text files for voice constitution and tone prefix.

Taxonomy Rules (7 CSVs, ~300 rows)

32 taxonomies with extraction guidance, challenge guidance, avoid patterns, vocabulary (domain terms, expert phrases, prefer-over mappings), and voice definitions.

Seed Controls (1 CSV, ~30 rows)

Key-value pairs for seeding mode, topic directives, volume controls, quality controls, execution controls, news controls.

App Controls (5 CSVs, ~90 rows)

Crons, workers, worker-queue subscriptions, queue definitions, environment variable controls.

Success Criteria

  1. bun run config:verify passes — round-trip is lossless
  2. All existing tests pass with JSON-backed constants
  3. bun run config:check integrated into CI
  4. Every CSV is pre-populated with current codebase values
  5. Editing a CSV + running bun run config:sync produces correct JSON
  6. Git diffs on CSV files are readable and reviewable