Brand Library Master Plan (V1)
Path: /docs/dev/brand-library-master-plan.md
Purpose
This document defines the V1 master plan for building Eko’s Brand Library: a scalable, auditable system for collecting basic brand identity data and high-signal tracked URLs, then seeding them into the Eko database.
The Brand Library exists to:
- Power Tracking Suggestions
- Improve onboarding speed and trust
- Provide consistent brand context for URL change summaries
- Avoid crawling, scraping, or site-wide inference
Guiding Principles
- Entity-first, URL-second
  - Identify who the brand is before deciding what to track
- Deterministic before generative
  - Rules, heuristics, and validation precede LLM usage
- URL-scoped, non-substitutive
  - Summaries describe purpose and signal value, not content reproduction
- Confidence-aware by default
  - Every derived field carries an explicit confidence level
- Idempotent & repeatable
  - Seed runs can be safely re-executed without data corruption
High-Level Architecture
Two parallel pipelines converge on a shared brand_id:
- Brand Identity Pipeline — Who is this brand?
- URL Signal Pipeline — What pages are worth tracking?
[ Identity Sources ] ─┐
                      ├─► brand_id ◄─┐
[ URL Discovery ] ────┘              │
                                     ▼
                           Seeded Brand Library
Core Data Model (V1)
brand_library_sources
Tracks provenance of all ingested data.
- id
- name
- source_url?
- license?
- notes?
brand_library_brands
Represents a canonical brand entity.
- id
- brand_name
- canonical_domain (unique)
- domain_aliases[]?
- category_path
- hq_city?
- hq_region?
- hq_country?
- business_summary? (max 2 sentences)
- business_model? (SaaS, retailer, marketplace, etc.)
- audience? (B2B, B2C, hybrid)
- logo_url?
- confidence_identity (high | medium | low)
- confidence_category (high | medium | low)
- source_refs (json)
brand_library_urls
Represents a validated, trackable URL for a brand.
- id
- brand_id
- url_type (pricing | status | privacy | terms | security)
- tracked_url
- final_url
- http_status
- title?
- h1?
- why_track (1–2 sentences)
- summary (2–3 sentences)
- confidence_url_match (high | medium | low)
- confidence_summary (high | medium | low)
- source_refs (json)
brand_library_review_queue
Centralized queue for ambiguity and low-confidence items.
- id
- entity_type (brand | url)
- entity_id
- reason (enum)
- details (json)
- status (open | fixed | ignored)
- timestamps
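The data model and the idempotency principle can be illustrated together. The following is a minimal sketch, assuming SQLite as a stand-in for the Eko database (the plan does not name a database engine) and a reduced, hypothetical subset of the brand_library_brands columns. The column types, the `upsert_brand` helper, and the sample values are assumptions for illustration. It relies on SQLite's `ON CONFLICT ... DO UPDATE` clause, which requires SQLite 3.24 or newer.

```python
import json
import sqlite3

# Hypothetical subset of brand_library_brands; column types are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS brand_library_brands (
    id INTEGER PRIMARY KEY,
    brand_name TEXT NOT NULL,
    canonical_domain TEXT NOT NULL UNIQUE,
    category_path TEXT NOT NULL,
    confidence_identity TEXT NOT NULL
        CHECK (confidence_identity IN ('high', 'medium', 'low')),
    confidence_category TEXT NOT NULL
        CHECK (confidence_category IN ('high', 'medium', 'low')),
    source_refs TEXT NOT NULL DEFAULT '[]'
);
"""


def upsert_brand(conn, brand):
    """Insert or update one brand keyed on canonical_domain; return its id."""
    conn.execute(
        """
        INSERT INTO brand_library_brands
            (brand_name, canonical_domain, category_path,
             confidence_identity, confidence_category, source_refs)
        VALUES (?, ?, ?, ?, ?, ?)
        ON CONFLICT(canonical_domain) DO UPDATE SET
            brand_name = excluded.brand_name,
            category_path = excluded.category_path,
            confidence_identity = excluded.confidence_identity,
            confidence_category = excluded.confidence_category,
            source_refs = excluded.source_refs
        """,
        (
            brand["brand_name"],
            brand["canonical_domain"],
            brand["category_path"],
            brand["confidence_identity"],
            brand["confidence_category"],
            json.dumps(brand.get("source_refs", [])),
        ),
    )
    row = conn.execute(
        "SELECT id FROM brand_library_brands WHERE canonical_domain = ?",
        (brand["canonical_domain"],),
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
sample_brand = {
    "brand_name": "Example Co",
    "canonical_domain": "example.com",
    "category_path": "software/saas",
    "confidence_identity": "high",
    "confidence_category": "medium",
    "source_refs": [{"source": "sample-dataset"}],
}
first_id = upsert_brand(conn, sample_brand)
second_id = upsert_brand(conn, sample_brand)  # re-running the seed is a no-op
```

Keying the upsert on canonical_domain mirrors the uniqueness constraint in the model, so a repeated seed run updates the existing row instead of creating a duplicate.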
Pipeline A: Brand Identity Pipeline
A1 — Inputs
- Structured company datasets (e.g., PDL)
- Secondary datasets (e.g., Kaggle)
- Brand enrichment API (Brandfetch)
A2 — Normalize & Dedupe
Rules:
- Canonicalize domains (scheme-less, lowercase, strip www)
- Dedupe on canonical_domain
- Preserve alternate domains as aliases
- Flag suspicious hosts (github.io, notion.site, URL shorteners)
Review reasons:
- DOMAIN_AMBIGUOUS
- DOMAIN_MISSING
- CONFLICTING_DOMAINS
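The normalization rules above could be sketched roughly as follows. The function names, the input record shape (`brand_name`, `domain`, `aliases`), and the shortener list are assumptions for illustration; only github.io and notion.site come from this plan.

```python
from urllib.parse import urlsplit

SUSPICIOUS_SUFFIXES = ("github.io", "notion.site")
# Illustrative shortener hosts; the real list is not specified in this plan.
URL_SHORTENERS = {"bit.ly", "t.co", "tinyurl.com"}


def canonicalize_domain(raw):
    """Return a scheme-less, lowercase host without a leading www, or None."""
    value = raw.strip().lower()
    if not value:
        return None
    if "://" not in value:
        value = "//" + value  # lets urlsplit treat a bare host as a netloc
    host = urlsplit(value).hostname
    if not host:
        return None
    host = host.rstrip(".")
    if host.startswith("www."):
        host = host[4:]
    return host or None


def is_suspicious_host(host):
    """Flag shared hosting platforms and URL shorteners for review."""
    if host in URL_SHORTENERS:
        return True
    return any(host == s or host.endswith("." + s) for s in SUSPICIOUS_SUFFIXES)


def dedupe_brands(records):
    """Merge records that share a canonical domain, keeping other domains as aliases."""
    merged = {}
    for record in records:
        domain = canonicalize_domain(record.get("domain") or "")
        if domain is None:
            continue  # a real run would queue this as DOMAIN_MISSING
        entry = merged.setdefault(
            domain,
            {
                "brand_name": record["brand_name"],
                "canonical_domain": domain,
                "domain_aliases": set(),
            },
        )
        for alias in record.get("aliases", []):
            canonical_alias = canonicalize_domain(alias)
            if canonical_alias and canonical_alias != domain:
                entry["domain_aliases"].add(canonical_alias)
    return merged
```

In this sketch the first record seen for a domain supplies the brand name; which source should win on a name conflict is handled in the merge step, not here.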
A3 — Merge Identity Facts
Precedence:
- Structured datasets → HQ region/country
- Brandfetch → logo, display name
- LLM classification → business model & audience
Conflict handling:
- Conflicting HQ → null + review
- Conflicting names → choose dataset value, keep others as aliases
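One possible reading of the precedence and conflict rules is sketched below. It assumes that dataset names win over the Brandfetch display name whenever both exist, with the losers kept as aliases, and that a field with more than one distinct dataset value is nulled and sent to review. The `merge_identity` function, the input shapes, the `name_aliases` key, and the `CONFLICTING_<FIELD>` reason strings are hypothetical and are not defined by this plan.

```python
def merge_identity(dataset_records, brandfetch=None):
    """Merge identity facts for one brand from dataset rows and Brandfetch data."""
    brandfetch = brandfetch or {}
    review_reasons = []

    def resolve(field):
        # A single agreed value is kept; disagreement yields None plus a review item.
        values = {r[field] for r in dataset_records if r.get(field)}
        if len(values) > 1:
            review_reasons.append("CONFLICTING_" + field.upper())  # hypothetical reason
            return None
        return next(iter(values), None)

    # Assumption: dataset names take priority; the Brandfetch name is a fallback.
    names = [r["brand_name"] for r in dataset_records if r.get("brand_name")]
    if brandfetch.get("display_name"):
        names.append(brandfetch["display_name"])
    brand_name = names[0] if names else None
    name_aliases = sorted({n for n in names if n != brand_name})

    return {
        "brand_name": brand_name,
        "name_aliases": name_aliases,
        "hq_region": resolve("hq_region"),
        "hq_country": resolve("hq_country"),
        "logo_url": brandfetch.get("logo_url"),
        "review_reasons": review_reasons,
    }
```

Returning the review reasons alongside the merged record keeps the function pure, so the caller decides how to write them to the review queue.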
A4 — Business Summary (LLM-Assisted)
Inputs to model:
- brand_name
- canonical_domain
- industry/category hints
- Brandfetch description (if available)
- dataset description (if available)
Output:
- business_summary (≤2 sentences, neutral)
- business_model
- audience
- confidence_identity
Constraints:
- No marketing language
- No market claims or superlatives
- Hedge uncertainty explicitly
Review reasons:
- SUMMARY_LOW_CONFIDENCE
- SUMMARY_TOO_MARKETING
- INSUFFICIENT_INPUTS
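In keeping with "deterministic before generative", the model output can be screened with simple rules before it is accepted. The sketch below is illustrative only: the marketing term list, the sufficiency rule (a name, a domain, and at least one description or hint), the input key names, and the function names are assumptions. The sentence counter is naive and will miscount text containing abbreviations.

```python
import re

# Illustrative terms only; a real list would be tuned during QA.
MARKETING_TERMS = ("leading", "best", "world-class", "revolutionary", "cutting-edge")
MAX_SUMMARY_SENTENCES = 2


def count_sentences(text):
    """Naive sentence count: split after ., !, or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([p for p in parts if p])


def within_sentence_limit(summary):
    return count_sentences(summary) <= MAX_SUMMARY_SENTENCES


def inputs_are_sufficient(inputs):
    """Assumed rule: name and domain, plus at least one description or hint."""
    has_identity = bool(inputs.get("brand_name") and inputs.get("canonical_domain"))
    has_context = any(
        inputs.get(key)
        for key in ("category_hint", "brandfetch_description", "dataset_description")
    )
    return has_identity and has_context


def summary_review_reasons(inputs, summary, confidence):
    """Return the review reasons (possibly none) for one generated summary."""
    if not inputs_are_sufficient(inputs):
        return ["INSUFFICIENT_INPUTS"]
    reasons = []
    lowered = summary.lower()
    if any(re.search(r"\b" + re.escape(t) + r"\b", lowered) for t in MARKETING_TERMS):
        reasons.append("SUMMARY_TOO_MARKETING")
    if confidence == "low":
        reasons.append("SUMMARY_LOW_CONFIDENCE")
    return reasons
```

A summary that fails `within_sentence_limit` has no matching review reason in this plan, so the sketch leaves the handling of that case to the caller.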
Pipeline B: URL Signal Pipeline
Supported URL Types (V1)
- pricing
- status
- privacy
- terms
- security (optional but recommended for B2B)
B1 — Generate URL Candidates
Rule-based generation per domain:
- pricing: /pricing, /plans
- status: status.{domain}, /status
- privacy: /privacy, /privacy-policy
- terms: /terms, /terms-of-service
- security: /security, /trust, /security-and-privacy
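The candidate rules above map naturally onto a small template table. This is a minimal sketch; the `https://` scheme, the `URL_PATTERNS` name, and the `generate_candidates` function are assumptions, while the paths themselves are taken from the list above.

```python
# Templates per URL type; {domain} is replaced with the brand's canonical domain.
URL_PATTERNS = {
    "pricing": ["https://{domain}/pricing", "https://{domain}/plans"],
    "status": ["https://status.{domain}", "https://{domain}/status"],
    "privacy": ["https://{domain}/privacy", "https://{domain}/privacy-policy"],
    "terms": ["https://{domain}/terms", "https://{domain}/terms-of-service"],
    "security": [
        "https://{domain}/security",
        "https://{domain}/trust",
        "https://{domain}/security-and-privacy",
    ],
}


def generate_candidates(domain):
    """Return candidate URLs per url_type for one canonical domain."""
    return {
        url_type: [pattern.format(domain=domain) for pattern in patterns]
        for url_type, patterns in URL_PATTERNS.items()
    }
```

Because the output depends only on the domain and a fixed table, the same input always yields the same candidates, which keeps seed runs repeatable.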
B2 — Validate & Select Winner
For each candidate:
- Resolve redirects
- Capture final_url and HTTP status
- Extract <title> and first <h1> if cheap
- Reject URLs with tracking/session parameters
Selection heuristics:
- Keyword match in title/h1
- Short, stable paths preferred
- Same-domain final URL
Review reasons:
- NO_MATCH_FOR_URL_TYPE
- MULTIPLE_STRONG_CANDIDATES
- OFF_DOMAIN_REDIRECT
- URL_TYPE_MISMATCH
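The selection step can be sketched as a pure function over candidates that have already been fetched, so no network access is involved here. The scoring weights, the keyword table, the tracking parameter list, the treatment of subdomains as same-domain, and the candidate dict shape (`final_url`, `http_status`, `title`, `h1`) are all assumptions, not part of this plan.

```python
from urllib.parse import parse_qsl, urlsplit

# Illustrative lists; real values would be tuned against QA samples.
TRACKING_PARAMS = {"gclid", "fbclid", "sessionid", "sid", "phpsessid", "jsessionid"}
KEYWORDS = {
    "pricing": ("pricing", "plans", "price"),
    "status": ("status",),
    "privacy": ("privacy",),
    "terms": ("terms",),
    "security": ("security", "trust"),
}


def has_tracking_params(url):
    pairs = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    keys = [key.lower() for key, _ in pairs]
    return any(k.startswith("utm_") or k in TRACKING_PARAMS for k in keys)


def is_same_site(url, domain):
    # Assumption: subdomains such as status.{domain} count as same-domain.
    host = (urlsplit(url).hostname or "").lower()
    return host == domain or host.endswith("." + domain)


def score(candidate, url_type):
    """Return (points, keyword_hit): 2 for a keyword match, 1 for a short path."""
    text = f"{candidate.get('title') or ''} {candidate.get('h1') or ''}".lower()
    keyword_hit = any(keyword in text for keyword in KEYWORDS[url_type])
    segments = [s for s in urlsplit(candidate["final_url"]).path.split("/") if s]
    points = (2 if keyword_hit else 0) + (1 if len(segments) <= 1 else 0)
    return points, keyword_hit


def select_winner(candidates, url_type, domain):
    """Return (winner, None) or (None, review_reason)."""
    live = {}
    for c in candidates:
        if c["http_status"] == 200 and not has_tracking_params(c["final_url"]):
            live.setdefault(c["final_url"], c)  # collapse duplicate final URLs
    if not live:
        return None, "NO_MATCH_FOR_URL_TYPE"
    on_domain = [c for c in live.values() if is_same_site(c["final_url"], domain)]
    if not on_domain:
        return None, "OFF_DOMAIN_REDIRECT"
    scored = sorted(
        ((score(c, url_type), c) for c in on_domain),
        key=lambda item: item[0][0],
        reverse=True,
    )
    (best_points, best_hit), best = scored[0]
    if not best_hit:
        return None, "URL_TYPE_MISMATCH"
    if len(scored) > 1 and scored[1][0][0] == best_points:
        return None, "MULTIPLE_STRONG_CANDIDATES"
    return best, None
```

Ties go to the review queue instead of being broken arbitrarily, which matches the plan's preference for making uncertainty explicit.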
B3 — URL Summary (LLM-Assisted)
Inputs to model:
- url_type
- final_url
- title/h1
- small cleaned snippet (capped)
Outputs:
- why_track (1–2 sentences, change-focused)
- summary (2–3 sentences, purpose-oriented)
- confidence_summary
Hard rules:
- No quoting long text
- No pricing tables or policy clauses
- No inference beyond provided signals
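Two of these constraints lend themselves to a deterministic check: capping the snippet before it reaches the model, and detecting long verbatim quotes in the output. The sketch below is one way to do that; the 1200-character cap, the 8-word threshold, and both function names are assumptions chosen for illustration.

```python
import re


def _words(text):
    """Lowercased word tokens, ignoring punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def cap_snippet(text, max_chars=1200):
    """Collapse whitespace and truncate the snippet sent to the model."""
    cleaned = " ".join(text.split())
    return cleaned[:max_chars]


def quotes_long_run(output, snippet, max_run=8):
    """True if the output repeats max_run or more consecutive words from the snippet."""
    output_words = _words(output)
    haystack = " " + " ".join(_words(snippet)) + " "
    for start in range(len(output_words) - max_run + 1):
        window = " " + " ".join(output_words[start:start + max_run]) + " "
        if window in haystack:
            return True
    return False
```

An output that trips `quotes_long_run` could be regenerated or routed to the review queue; this plan does not define a dedicated review reason for that case.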
Acceptance Criteria (V1)
Brand is seed-ready if:
- canonical_domain present
- category confidence ≠ low
- ≥2 validated URL types
- identity confidence ≠ low (or explicitly marked incomplete)
URL is seed-ready if:
- final_url stable and allowed
- url_type match confidence ≠ low
- summary confidence ≠ low (or flagged for review)
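The acceptance criteria translate directly into two predicates. In this sketch, "stable and allowed" is approximated as a present final_url with HTTP 200, and the `flagged_for_review` and `marked_incomplete` keys are hypothetical flags standing in for the parenthetical exceptions above.

```python
ACCEPTABLE = ("high", "medium")  # i.e. confidence is present and not low


def url_is_seed_ready(url):
    """Apply the URL acceptance criteria to one brand_library_urls record."""
    stable = bool(url.get("final_url")) and url.get("http_status") == 200
    match_ok = url.get("confidence_url_match") in ACCEPTABLE
    summary_ok = (
        url.get("confidence_summary") in ACCEPTABLE
        or url.get("flagged_for_review", False)
    )
    return stable and match_ok and summary_ok


def brand_is_seed_ready(brand, urls):
    """Apply the brand acceptance criteria given the brand's URL records."""
    validated_types = {u["url_type"] for u in urls if url_is_seed_ready(u)}
    identity_ok = (
        brand.get("confidence_identity") in ACCEPTABLE
        or brand.get("marked_incomplete", False)
    )
    return (
        bool(brand.get("canonical_domain"))
        and brand.get("confidence_category") in ACCEPTABLE
        and len(validated_types) >= 2
        and identity_ok
    )
```

Treating a missing confidence value as not acceptable is a deliberate choice in this sketch, so that an unscored field can never pass as seed-ready.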
Execution Phases
Phase 0 — One-Time Setup
- Define taxonomy (categories.v1.json)
- Define URL types (url_types.v1.json)
- Create DB tables + enums
- Version prompt templates
Phase 1 — Identity MVP
- 2k–5k brands
- Logos + summaries
- Review queue operational
Phase 2 — URL Bundles MVP
- Pricing + privacy + terms (+ status for SaaS)
- Coverage metrics
Phase 3 — Scale & Harden
- Runbooks
- QA sampling
- Coverage dashboards
Tooling Responsibilities
Claude Code
- Ingestion and normalization scripts
- URL validation logic
- Review queue workflow
- Seed run reports
ChatGPT
- Taxonomy design
- Prompt authoring
- QA heuristics and thresholds
- Documentation
Metrics to Track
- % brands with confirmed canonical domains
- Category confidence distribution
- Avg URL types per brand
- % brands seed-ready
- Top review reasons
- QA sampling accuracy
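Most of these metrics can be computed from in-memory records after a seed run. The sketch below assumes a hypothetical record shape (`canonical_domain`, `confidence_category`, `url_types`, `seed_ready` per brand, and `reason` and `status` per review item) and treats a present canonical_domain as "confirmed". QA sampling accuracy is omitted because it depends on manual review results.

```python
from collections import Counter


def seed_metrics(brands, review_items, top_n=5):
    """Summarize one seed run from brand records and review queue items."""
    total = len(brands)

    def pct(count):
        return round(100.0 * count / total, 1) if total else 0.0

    url_type_counts = [len(set(b.get("url_types", []))) for b in brands]
    open_reasons = Counter(
        item["reason"] for item in review_items if item.get("status") == "open"
    )
    return {
        "pct_confirmed_domain": pct(sum(1 for b in brands if b.get("canonical_domain"))),
        "category_confidence": dict(
            Counter(b.get("confidence_category", "unknown") for b in brands)
        ),
        "avg_url_types": round(sum(url_type_counts) / total, 2) if total else 0.0,
        "pct_seed_ready": pct(sum(1 for b in brands if b.get("seed_ready"))),
        "top_review_reasons": open_reasons.most_common(top_n),
    }
```

Only open review items are counted, on the assumption that fixed and ignored items no longer indicate a current problem.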
Outcome
This plan produces a clean, trustworthy Brand Library that:
- Scales without crawling
- Aligns with Eko’s URL-scoped intelligence model
- Supports onboarding, discovery, and future automation
- Makes uncertainty explicit instead of hiding it
This document is the source of truth for V1 Brand Library seeding.
Notes
Recommended next steps (in order)
- Lock the taxonomy
  - Create categories.v1.json (allowed category paths only).
  - This prevents drift once seeding starts.
- Define prompt contracts
  - Extract the two LLM prompts (brand categorization, URL summary) into versioned files.
  - Treat them like APIs.
- Create a seed-run checklist
  - A short operational doc: “If a seed run looks wrong, check these 7 things first.”
- Implement Identity → URL convergence
  - Start with Identity-only runs to validate brand quality before URL work.