CSV → JSON Config Pipeline
Summary
Replace hardcoded TypeScript constants and YAML-in-markdown control files with a CSV → JSON pipeline. CSVs become the source of truth for all structured configuration. Individual sync scripts validate and compile CSVs to typed JSON. Existing TypeScript files become thin wrappers that import JSON and re-export the same typed constants — zero import changes across the codebase.
Motivation
~3,500 lines of structured configuration are currently spread across:
- Embedded TypeScript constants (`challenge-content-rules.ts`, `taxonomy-content-rules.ts`, `model-registry.ts`, `generate-curated-entries.ts`)
- YAML blocks inside markdown (`SEED.md`, `APP-CONTROL.md`)
This makes config changes require code deploys, produces noisy diffs, and prevents grid-based editing. A CSV pipeline provides:
- Visual grid editing via VS Code CSV extensions or Google Sheets
- Clean git diffs — CSV line changes are readable and reviewable
- Import/export flexibility — bulk edits, paste from other sources, analysis snapshots
- Zod validation at sync time catches errors before they reach production
Architecture
File Layout
config/
  csv/ ← Source of truth (committed to Git)
    categories.csv ← slug, subcategory, count, prompt
    models.csv ← modelId, provider, status, input_price, output_price, deprecation_note
    model-tiers.csv ← tier, provider, model
    challenge-voices.csv ← key, value (tone_prefix, voice_constitution)
    format-voices.csv ← format_slug, personality, register, energy, excitement_driver, ...
    format-rules.csv ← format_slug, setup, challenge, reveal_correct, reveal_wrong, ...
    style-rules.csv ← style, setup, challenge, reveal_correct, reveal_wrong, correct_answer, style_data
    style-voices.csv ← style, mechanics, personality, register, phrasing_examples
    difficulty-levels.csv ← level, cognitive_demand, prompt_guidance, knowledge_scope, response_expectation
    banned-patterns.csv ← pattern, description
    taxonomy-rules.csv ← slug, extraction_guidance, challenge_guidance
    taxonomy-avoid.csv ← slug, pattern
    taxonomy-domain-terms.csv ← slug, term, use
    taxonomy-expert-phrases.csv ← slug, phrase
    taxonomy-prefer-over.csv ← slug, instead_of, use
    taxonomy-voices.csv ← slug, register, energy, excitement_driver
    taxonomy-voice-pitfalls.csv ← slug, pitfall
    seed-controls.csv ← section, key, value, description
    app-controls/
      crons.csv ← name, path, schedule, status, summary
      workers.csv ← name, app, status, health_port
      worker-queues.csv ← worker, queue_name, summary
      queues.csv ← name, consumer, status, trigger
      env-controls.csv ← section, key, value, required, description
    long-text/ ← Multi-paragraph text values
      voice-constitution.txt
      super-fact-rules.txt
  generated/ ← Derived artifacts (committed to Git)
    categories.json
    models.json
    challenge-rules.json
    taxonomy-rules.json
    seed-controls.json
    app-controls.json
Nested Data Strategy: Companion CSVs
Nested arrays are expressed as separate CSV files joined by a foreign key column.
Example — Taxonomy rules with vocabulary:
taxonomy-rules.csv (primary):
slug,extraction_guidance,challenge_guidance
sports,"Always list scores with the higher score first...","Lead with the winner..."
taxonomy-domain-terms.csv (companion, FK = slug):
slug,term,use
sports,roster depth,how many quality players a team has beyond starters
sports,triple-double,double digits in three statistical categories in one game
The sync script groups companion rows by slug and nests them into the primary record's JSON output.
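The grouping-and-nesting step described above can be sketched roughly as follows. This is a minimal illustration, not the actual `join.ts`: the `Row` type, the four-argument signature, and dropping the FK column from nested rows are all assumptions made for the example.

```typescript
// Hypothetical row shape: every CSV cell arrives as a string keyed by column name.
type Row = Record<string, string>;

/**
 * Group companion rows by a foreign-key column and nest them under `path`
 * (dot notation) on each matching primary record.
 * ASSUMPTION: the FK column is dropped from the nested rows.
 */
function joinCompanions(
  primary: Row[],
  companion: Row[],
  fk: string,
  path: string,
): Record<string, unknown>[] {
  // Bucket companion rows by their FK value.
  const groups = new Map<string, Row[]>();
  for (const row of companion) {
    const { [fk]: key, ...rest } = row;
    const bucket = groups.get(key) ?? [];
    bucket.push(rest);
    groups.set(key, bucket);
  }

  return primary.map((record) => {
    const out: Record<string, unknown> = { ...record };
    // Create intermediate objects for every path segment except the last.
    const segments = path.split(".");
    let cursor = out;
    for (const segment of segments.slice(0, -1)) {
      cursor[segment] = (cursor[segment] as Record<string, unknown>) ?? {};
      cursor = cursor[segment] as Record<string, unknown>;
    }
    // Primary records with no companion rows get an empty array.
    cursor[segments[segments.length - 1]] = groups.get(record[fk]) ?? [];
    return out;
  });
}

// Rows taken from the example CSVs above (guidance strings are truncated there too).
const rules: Row[] = [
  {
    slug: "sports",
    extraction_guidance: "Always list scores with the higher score first...",
    challenge_guidance: "Lead with the winner...",
  },
];
const terms: Row[] = [
  { slug: "sports", term: "roster depth", use: "how many quality players a team has beyond starters" },
  { slug: "sports", term: "triple-double", use: "double digits in three statistical categories in one game" },
];

const joined = joinCompanions(rules, terms, "slug", "vocabulary.domain_terms");
console.log(JSON.stringify(joined[0], null, 2));
```

Keeping the join generic over the FK column and target path means the same helper could serve every companion file, with the per-domain sync script only declaring which CSV nests where.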
Long-Text Fields: @file References
CSV cells containing @file:path/to/file.txt are resolved at sync time by reading the referenced file inline. This keeps CSVs clean while allowing multi-paragraph text like voice constitutions to live in dedicated editable files.
key,value
tone_prefix,"VOICE: You are writing for Eko..."
voice_constitution,@file:config/csv/long-text/voice-constitution.txt
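A rough sketch of how such a reference might be resolved at sync time. The function name `resolveFileRef`, the `baseDir` parameter (the repo root in the example above), and trimming trailing whitespace from the file are assumptions for illustration, not the actual `file-ref.ts`.

```typescript
import { mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join, resolve } from "node:path";

const FILE_REF_PREFIX = "@file:";

/**
 * Return the cell unchanged unless it starts with "@file:", in which case
 * return the referenced file's content instead.
 * ASSUMPTION: paths resolve against `baseDir`, and trailing whitespace
 * (typically the final newline) is trimmed.
 */
function resolveFileRef(cell: string, baseDir: string): string {
  if (!cell.startsWith(FILE_REF_PREFIX)) return cell;
  const target = resolve(baseDir, cell.slice(FILE_REF_PREFIX.length).trim());
  return readFileSync(target, "utf8").trimEnd();
}

// Demo against a throwaway directory so the sketch runs anywhere.
const demoDir = mkdtempSync(join(tmpdir(), "file-ref-"));
writeFileSync(join(demoDir, "voice-constitution.txt"), "Paragraph one.\n\nParagraph two.\n");

const inlineCell = resolveFileRef("VOICE: You are writing for Eko...", demoDir);
const fileCell = resolveFileRef("@file:voice-constitution.txt", demoDir);
console.log({ inlineCell, fileCell });
```

A missing file makes `readFileSync` throw, which fits the goal of failing the sync rather than emitting JSON with an unresolved reference.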
Sync Scripts
Individual scripts per domain group, sharing common utilities.
Script Layout
scripts/config/
  sync.ts ← Orchestrator: runs all domain syncs
  sync-categories.ts ← categories.csv → categories.json
  sync-models.ts ← models.csv + model-tiers.csv → models.json
  sync-taxonomy.ts ← taxonomy-*.csv (7 files) → taxonomy-rules.json
  sync-challenge.ts ← format-voices, style-rules, etc. → challenge-rules.json
  sync-seed-controls.ts ← seed-controls.csv → seed-controls.json
  sync-app-controls.ts ← app-controls/*.csv → app-controls.json
  lib/
    csv-reader.ts ← Parse CSV + row-level error reporting
    join.ts ← Companion CSV foreign-key joining
    validate.ts ← Zod validation with file:line:column errors
    checksum.ts ← Staleness detection for CI
    file-ref.ts ← @file: reference resolution
Shared Library
- `csv-reader.ts`: wraps `csv-parse` with typed output and tracks source file and row numbers for error reporting
- `join.ts`: `joinCompanions()` groups companion rows by FK, nests them at dot-notation paths (e.g., `vocabulary.domain_terms`), and supports a `flatten` option for single-value arrays
- `validate.ts`: `validateRows()` runs Zod `.parse()` per row with file:line:column error formatting
- `checksum.ts`: SHA-256 of CSV contents vs JSON contents for CI staleness detection
- `file-ref.ts`: detects `@file:` prefixes, resolves relative paths, and reads file content inline
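To make the file:line:column error format concrete, here is a dependency-free sketch of per-row validation. It deliberately swaps Zod for plain check functions so it runs on its own; the `ColumnChecks` type, the signature, and the sample rows are hypothetical, and the line arithmetic assumes no cell contains an embedded newline.

```typescript
type Row = Record<string, string>;

// Stand-in for a Zod schema: each column maps to a check that returns
// an error message, or null when the value is acceptable.
type ColumnChecks = Record<string, (value: string) => string | null>;

/** Validate every row and report problems as "file:line:column name: message". */
function validateRows(
  file: string,
  header: string[],
  rows: Row[],
  checks: ColumnChecks,
): string[] {
  const errors: string[] = [];
  rows.forEach((row, index) => {
    // Line 1 is the header, so the first data row is line 2.
    // ASSUMPTION: one row per physical line (no multi-line quoted cells).
    const line = index + 2;
    for (const [column, check] of Object.entries(checks)) {
      const message = check(row[column] ?? "");
      if (message !== null) {
        const col = header.indexOf(column) + 1; // 1-based column position
        errors.push(`${file}:${line}:${col} ${column}: ${message}`);
      }
    }
  });
  return errors;
}

const isNumber = (v: string) =>
  v.trim() !== "" && Number.isFinite(Number(v)) ? null : "expected a number";
const isNonEmpty = (v: string) => (v.trim() !== "" ? null : "must not be empty");

// Header matches models.csv above; the two rows are made-up sample data.
const modelsHeader = ["modelId", "provider", "status", "input_price", "output_price", "deprecation_note"];
const modelRows: Row[] = [
  { modelId: "example-a", provider: "example", status: "active", input_price: "1.25", output_price: "5", deprecation_note: "" },
  { modelId: "example-b", provider: "example", status: "active", input_price: "abc", output_price: "5", deprecation_note: "" },
];

const validationErrors = validateRows("config/csv/models.csv", modelsHeader, modelRows, {
  modelId: isNonEmpty,
  input_price: isNumber,
  output_price: isNumber,
});
console.log(validationErrors.join("\n"));
```

Collecting every error instead of stopping at the first lets an editor fix a whole spreadsheet in one pass.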
CLI Commands
# Day-to-day
bun run config:sync # Parse + validate + write all JSON
bun run config:sync --only taxonomy # Sync one domain
bun run config:sync --dry-run # Validate without writing
bun run config:check # CI: fail if JSON is stale vs CSV
# One-time migration
bun run config:export # Extract existing constants → CSV
bun run config:export --only taxonomy # Export one domain
bun run config:verify # Prove round-trip matches original constants
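One way the orchestrator might interpret the `--only` and `--dry-run` flags is sketched below. The domain names are inferred from the `sync-*.ts` script names and the `SyncOptions` shape is hypothetical, so treat both as assumptions.

```typescript
// ASSUMPTION: domain names mirror the sync-*.ts script names.
const DOMAINS = [
  "categories",
  "models",
  "taxonomy",
  "challenge",
  "seed-controls",
  "app-controls",
] as const;
type Domain = (typeof DOMAINS)[number];

interface SyncOptions {
  domains: Domain[]; // which domain syncs to run
  dryRun: boolean; // validate only, write nothing
}

/** Parse `--only <domain>` and `--dry-run` from the argument list. */
function parseSyncArgs(argv: string[]): SyncOptions {
  const dryRun = argv.includes("--dry-run");
  const onlyIndex = argv.indexOf("--only");
  // Without --only, every domain is synced.
  if (onlyIndex === -1) return { domains: [...DOMAINS], dryRun };

  const requested = argv[onlyIndex + 1];
  const match = DOMAINS.find((domain) => domain === requested);
  if (match === undefined) {
    throw new Error(`Unknown domain "${requested}". Expected one of: ${DOMAINS.join(", ")}`);
  }
  return { domains: [match], dryRun };
}

const onlyTaxonomy = parseSyncArgs(["--only", "taxonomy"]);
const dryRunAll = parseSyncArgs(["--dry-run"]);
console.log(onlyTaxonomy, dryRunAll);
```

Rejecting unknown domain names up front keeps a typo in `--only` from silently syncing nothing.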
Integration: Thin Wrapper Pattern
Existing TypeScript files keep their types, interfaces, and functions. Only the hardcoded data is replaced with JSON imports. All existing imports across the codebase remain unchanged.
Before
// packages/ai/src/challenge-content-rules.ts
export const FORMAT_VOICE: Record<ChallengeFormatSlug, FormatVoice> = {
  big_fan_of: {
    personality: 'You are a fellow superfan...',
    // ... hundreds of lines
  }
}
After
// packages/ai/src/challenge-content-rules.ts
import { z } from 'zod'
import raw from '../../../config/generated/challenge-rules.json'
// Types stay here (unchanged)
export type ChallengeStyle = 'multiple_choice' | ...
export interface FormatVoice { ... }
// formatVoiceSchema is assumed to be a Zod schema mirroring FormatVoice, defined in this file
// Data comes from JSON, validated at import time
export const FORMAT_VOICE = z.record(
  formatVoiceSchema
).parse(raw.format_voices)
// Functions stay here (unchanged)
export function validateChallengeContent(...) { ... }
Migration Map
| File | Data removed | JSON source | Keeps |
|---|---|---|---|
| packages/ai/src/challenge-content-rules.ts | CHALLENGE_TONE_PREFIX, CHALLENGE_VOICE_CONSTITUTION, FORMAT_VOICE, FORMAT_RULES, STYLE_RULES, STYLE_VOICE, DIFFICULTY_LEVELS, BANNED_PATTERNS, CQ002_PREFIXES | challenge-rules.json | Types, interfaces, validation functions, CQ002_REGEX |
| packages/ai/src/taxonomy-content-rules.ts | TAXONOMY_CONTENT_RULES, TAXONOMY_VOICE | taxonomy-rules.json | Types (TaxonomyContentRule, TaxonomyVocabulary, TaxonomyVoice) |
| packages/config/src/model-registry.ts | MODEL_REGISTRY, DEFAULT_TIER_CONFIG | models.json | Types, getModelEntry(), isDeprecated() |
| scripts/seed/generate-curated-entries.ts | CATEGORY_SPECS (~350 lines) | categories.json | Script logic, AI generation, DB insertion |
| docs/projects/seeding/SEED.md | All YAML control blocks | seed-controls.json | Prose documentation (references JSON) |
| docs/APP-CONTROL.md | All YAML control blocks | app-controls.json | Prose documentation (references JSON) |
SEED.md and APP-CONTROL.md
These markdown files stop being config sources. They become lightweight documentation that points to the CSV source and the generated JSON:
# Seeding Control
Configuration: `config/csv/seed-controls.csv`
Generated: `config/generated/seed-controls.json`
Sync: `bun run config:sync --only seed-controls`
One-Time Migration: Export + Verify
Export Phase
Export scripts read existing TypeScript constants and write pre-populated CSVs:
| Export script | Reads | Writes |
|---|---|---|
| export-categories.ts | CATEGORY_SPECS in generate-curated-entries.ts | categories.csv |
| export-models.ts | MODEL_REGISTRY + DEFAULT_TIER_CONFIG | models.csv, model-tiers.csv |
| export-challenge.ts | FORMAT_VOICE, STYLE_RULES, STYLE_VOICE, DIFFICULTY_LEVELS, BANNED_PATTERNS, etc. | 6 CSV files + long-text files |
| export-taxonomy.ts | TAXONOMY_CONTENT_RULES + TAXONOMY_VOICE | 7 CSV files |
| export-seed-controls.ts | YAML blocks from SEED.md | seed-controls.csv |
| export-app-controls.ts | YAML blocks from APP-CONTROL.md | 5 CSV files |
Verification Phase
Round-trip verification proves the migration is lossless:
bun run config:export # Existing constants → CSV
bun run config:sync # CSV → JSON
bun run config:verify # JSON === original constants (deep equality)
config:verify performs deep equality comparison between generated JSON and original TypeScript constants. Mismatches report the exact path:
✗ taxonomy-rules.json → sports.vocabulary.domain_terms[2].use
Expected: "a game-ending play by the home team in baseball"
Got: "a game-ending play by the home team"
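A deep comparison that reports the path of the first difference could look roughly like this. `firstMismatch` is a hypothetical helper name, and the sample objects are trimmed-down stand-ins built around the mismatch shown above (the third term's name is a placeholder).

```typescript
function isPlainObject(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

/** Return the path of the first difference between two JSON-like values, or null if deeply equal. */
function firstMismatch(expected: unknown, actual: unknown, path = ""): string | null {
  if (Array.isArray(expected) && Array.isArray(actual)) {
    if (expected.length !== actual.length) return `${path}.length`;
    for (let i = 0; i < expected.length; i++) {
      const found = firstMismatch(expected[i], actual[i], `${path}[${i}]`);
      if (found !== null) return found;
    }
    return null;
  }
  if (isPlainObject(expected) && isPlainObject(actual)) {
    // Union of keys so that missing and extra properties are both detected.
    const keys = Array.from(new Set([...Object.keys(expected), ...Object.keys(actual)]));
    for (const key of keys) {
      const childPath = path === "" ? key : `${path}.${key}`;
      const found = firstMismatch(expected[key], actual[key], childPath);
      if (found !== null) return found;
    }
    return null;
  }
  // Primitives, null, or mismatched container types.
  return Object.is(expected, actual) ? null : path || "(root)";
}

const originalConstants = {
  sports: {
    vocabulary: {
      domain_terms: [
        { term: "roster depth", use: "how many quality players a team has beyond starters" },
        { term: "triple-double", use: "double digits in three statistical categories in one game" },
        { term: "example-term", use: "a game-ending play by the home team in baseball" },
      ],
    },
  },
};
// Simulate a lossy round trip by truncating one string in a deep copy.
const generatedJson = JSON.parse(JSON.stringify(originalConstants));
generatedJson.sports.vocabulary.domain_terms[2].use = "a game-ending play by the home team";

const mismatchPath = firstMismatch(originalConstants, generatedJson);
console.log(mismatchPath);
```

Stopping at the first mismatch keeps the output short; collecting all paths would be a small change if a full report turns out to be more useful during migration.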
After verification passes, hardcoded data is removed from .ts files and replaced with JSON imports. Export and verify scripts become migration scaffolding (can be kept or removed).
CI Integration
Add to the existing CI pipeline alongside migrations:check:
bun run config:check # Fails if any CSV has changed without regenerating JSON
The check uses SHA-256 checksums of the CSV file contents, compared against the committed JSON, following the same pattern as `bun run migrations:check`.
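The text does not spell out how a CSV checksum is compared with the JSON, so the sketch below makes one possible mechanism explicit. It assumes each generated JSON file records the checksum of the CSVs it was built from in a `sourceChecksum` field; that field and the helper names are hypothetical.

```typescript
import { createHash } from "node:crypto";

/** SHA-256 over all source CSV contents, visited in sorted file-name order for stability. */
function checksumSources(sources: Record<string, string>): string {
  const hash = createHash("sha256");
  for (const name of Object.keys(sources).sort()) {
    // NUL separators keep name/content boundaries unambiguous.
    hash.update(name);
    hash.update("\0");
    hash.update(sources[name]);
    hash.update("\0");
  }
  return hash.digest("hex");
}

// ASSUMPTION: generated JSON carries the checksum of its source CSVs.
interface GeneratedFile {
  sourceChecksum: string;
  [key: string]: unknown;
}

/** True when the CSVs have changed since the JSON was generated. */
function isStale(generated: GeneratedFile, sources: Record<string, string>): boolean {
  return generated.sourceChecksum !== checksumSources(sources);
}

const csvAtSync = { "taxonomy-avoid.csv": "slug,pattern\n" };
const committedJson: GeneratedFile = { sourceChecksum: checksumSources(csvAtSync) };
// Simulate someone editing the CSV without re-running the sync.
const csvEdited = { "taxonomy-avoid.csv": csvAtSync["taxonomy-avoid.csv"] + "sports,example pattern\n" };

console.log(isStale(committedJson, csvAtSync), isStale(committedJson, csvEdited));
```

An alternative with no stored field would be to re-run the sync in memory and compare its output against the committed JSON byte for byte; either approach fails CI on the same condition.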
Domain Inventory
Categories (1 CSV, ~156 rows)
30+ root topic slugs with 100+ subcategories. Each row: slug, subcategory, count, prompt.
Models (2 CSVs, ~35 rows)
31 AI models with provider, status, pricing. Separate CSV for 3 tier configs.
Challenge Rules (6 CSVs + 2 long-text files, ~120 rows)
8 format voices, 8 format rules, 6 style rules, 6 style voices, 5 difficulty levels, 10+ banned patterns. Long-text files for voice constitution and tone prefix.
Taxonomy Rules (7 CSVs, ~300 rows)
32 taxonomies with extraction guidance, challenge guidance, avoid patterns, vocabulary (domain terms, expert phrases, prefer-over mappings), and voice definitions.
Seed Controls (1 CSV, ~30 rows)
Key-value pairs for seeding mode, topic directives, volume controls, quality controls, execution controls, news controls.
App Controls (5 CSVs, ~90 rows)
Crons, workers, worker-queue subscriptions, queue definitions, environment variable controls.
Success Criteria
- `bun run config:verify` passes (round-trip is lossless)
- All existing tests pass with JSON-backed constants
- `bun run config:check` integrated into CI
- Every CSV is pre-populated with current codebase values
- Editing a CSV + running `bun run config:sync` produces correct JSON
- Git diffs on CSV files are readable and reviewable