Brand Library Master Plan (V1)


Path: /docs/dev/brand-library-master-plan.md

Purpose

This document defines the V1 master plan for building Eko’s Brand Library: a scalable, auditable system for collecting basic brand identity data and high-signal tracked URLs, then seeding them into the Eko database.

The Brand Library exists to:

  • Power Tracking Suggestions
  • Improve onboarding speed and trust
  • Provide consistent brand context for URL change summaries
  • Avoid crawling, scraping, or site-wide inference

Guiding Principles

  1. Entity-first, URL-second

    • Identify who the brand is before deciding what to track
  2. Deterministic before generative

    • Rules, heuristics, and validation precede LLM usage
  3. URL-scoped, non-substitutive

    • Summaries describe purpose and signal value, not content reproduction
  4. Confidence-aware by default

    • Every derived field carries an explicit confidence level
  5. Idempotent & repeatable

    • Seed runs can be safely re-executed without data corruption

High-Level Architecture

Two parallel pipelines converge on a shared brand_id:

  1. Brand Identity Pipeline: Who is this brand?
  2. URL Signal Pipeline: What pages are worth tracking?
[ Identity Sources ] ─┐
                      ├─► brand_id ◄─┐
[ URL Discovery ] ────┘               │
                                      ▼
                              Seeded Brand Library

Core Data Model (V1)

brand_library_sources

Tracks provenance of all ingested data.

  • id
  • name
  • source_url?
  • license?
  • notes?

brand_library_brands

Represents a canonical brand entity.

  • id
  • brand_name
  • canonical_domain (unique)
  • domain_aliases[]?
  • category_path
  • hq_city?
  • hq_region?
  • hq_country?
  • business_summary? (max 2 sentences)
  • business_model? (SaaS, retailer, marketplace, etc.)
  • audience? (B2B, B2C, hybrid)
  • logo_url?
  • confidence_identity (high | medium | low)
  • confidence_category (high | medium | low)
  • source_refs (json)

brand_library_urls

Represents a validated, trackable URL for a brand.

  • id
  • brand_id
  • url_type (pricing | status | privacy | terms | security)
  • tracked_url
  • final_url
  • http_status
  • title?
  • h1?
  • why_track (1–2 sentences)
  • summary (2–3 sentences)
  • confidence_url_match (high | medium | low)
  • confidence_summary (high | medium | low)
  • source_refs (json)

brand_library_review_queue

Centralized queue for ambiguity and low-confidence items.

  • id
  • entity_type (brand | url)
  • entity_id
  • reason (enum)
  • details (json)
  • status (open | fixed | ignored)
  • timestamps
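For implementers, the tables above can be mirrored as typed records. A minimal Python sketch follows; the dataclass and the `Confidence` enum are illustrative shapes, not the actual ORM models:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class Confidence(str, Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"

@dataclass
class BrandLibraryUrl:
    """Mirrors the brand_library_urls table described above."""
    id: str
    brand_id: str
    url_type: str                 # pricing | status | privacy | terms | security
    tracked_url: str
    final_url: str
    http_status: int
    why_track: str                # 1-2 sentences, change-focused
    summary: str                  # 2-3 sentences, purpose-oriented
    confidence_url_match: Confidence
    confidence_summary: Confidence
    source_refs: dict = field(default_factory=dict)
    title: Optional[str] = None
    h1: Optional[str] = None
```

Keeping the confidence values as a shared enum (rather than free strings) lets the seed scripts and the review queue agree on the same three levels.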

Pipeline A: Brand Identity Pipeline

A1 — Inputs

  • Structured company datasets (e.g., PDL)
  • Secondary datasets (e.g., Kaggle)
  • Brand enrichment API (Brandfetch)

A2 — Normalize & Dedupe

Rules:

  • Canonicalize domains (scheme-less, lowercase, strip www)
  • Dedupe on canonical_domain
  • Preserve alternate domains as aliases
  • Flag suspicious hosts (github.io, notion.site, URL shorteners)

Review reasons:

  • DOMAIN_AMBIGUOUS
  • DOMAIN_MISSING
  • CONFLICTING_DOMAINS
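The canonicalization and flagging rules above are deterministic and can be sketched directly; the suspicious-host list here is illustrative, not exhaustive:

```python
from urllib.parse import urlparse

# Illustrative list: shared hosting platforms and URL shorteners.
SUSPICIOUS_HOSTS = {"github.io", "notion.site", "bit.ly", "t.co"}

def canonicalize_domain(raw: str) -> str:
    """Scheme-less, lowercase, strip leading www. (the A2 rules)."""
    host = urlparse(raw if "://" in raw else f"https://{raw}").netloc or raw
    host = host.lower().split(":")[0]          # drop any port
    if host.startswith("www."):
        host = host[len("www."):]
    return host

def is_suspicious(host: str) -> bool:
    """Flag hosts that live on shared platforms or shorteners."""
    return any(host == s or host.endswith("." + s) for s in SUSPICIOUS_HOSTS)
```

Deduping on the output of `canonicalize_domain` keeps `https://www.acme.com/` and `acme.com` from seeding two brands.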

A3 — Merge Identity Facts

Precedence:

  1. Structured datasets → HQ region/country
  2. Brandfetch → logo, display name
  3. LLM classification → business model & audience

Conflict handling:

  • Conflicting HQ → null + review
  • Conflicting names → choose dataset value, keep others as aliases
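A sketch of the precedence and conflict rules, assuming each source arrives as a plain dict; the `name_aliases` field is hypothetical, used here only to show alias retention:

```python
from typing import Optional

def merge_identity_facts(dataset: dict, brandfetch: dict, llm: dict) -> tuple[dict, list[str]]:
    """Apply the A3 precedence order; conflicting HQ values go to review."""
    merged: dict = {}
    review: list[str] = []

    # 1. Structured datasets win for HQ fields; conflicts null out + review.
    for key in ("hq_city", "hq_region", "hq_country"):
        ds, bf = dataset.get(key), brandfetch.get(key)
        if ds and bf and ds != bf:
            merged[key] = None
            review.append(f"CONFLICTING_{key.upper()}")
        else:
            merged[key] = ds or bf

    # 2. Brandfetch supplies the logo; on name conflict the dataset value
    #    wins and the other is kept as an alias.
    merged["logo_url"] = brandfetch.get("logo_url")
    ds_name: Optional[str] = dataset.get("brand_name")
    bf_name: Optional[str] = brandfetch.get("brand_name")
    merged["brand_name"] = ds_name or bf_name
    merged["name_aliases"] = [bf_name] if (ds_name and bf_name and ds_name != bf_name) else []

    # 3. LLM classification fills business model and audience.
    merged["business_model"] = llm.get("business_model")
    merged["audience"] = llm.get("audience")
    return merged, review
```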

A4 — Business Summary (LLM-Assisted)

Inputs to model:

  • brand_name
  • canonical_domain
  • industry/category hints
  • Brandfetch description (if available)
  • dataset description (if available)

Output:

  • business_summary (≤2 sentences, neutral)
  • business_model
  • audience
  • confidence_identity

Constraints:

  • No marketing language
  • No market claims or superlatives
  • Hedge uncertainty explicitly

Review reasons:

  • SUMMARY_LOW_CONFIDENCE
  • SUMMARY_TOO_MARKETING
  • INSUFFICIENT_INPUTS
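A deterministic QA pass can map these constraints to review reasons before any human look. The marketing wordlist below is a placeholder to be tuned, not a complete detector:

```python
import re

# Illustrative superlative/marketing wordlist; the production list would be tuned.
MARKETING_TERMS = re.compile(
    r"\b(leading|best-in-class|world-class|revolutionary|award-winning|unrivaled)\b",
    re.IGNORECASE,
)

def review_reasons_for_summary(summary: str, confidence: str) -> list[str]:
    """Map A4 constraint violations to review-queue reasons."""
    reasons = []
    if not summary.strip():
        reasons.append("INSUFFICIENT_INPUTS")
    elif MARKETING_TERMS.search(summary):
        reasons.append("SUMMARY_TOO_MARKETING")
    if confidence == "low":
        reasons.append("SUMMARY_LOW_CONFIDENCE")
    return reasons
```

Running this after every LLM call keeps the review queue populated without manual triage.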

Pipeline B: URL Signal Pipeline

Supported URL Types (V1)

  • pricing
  • status
  • privacy
  • terms
  • security (optional but recommended for B2B)

B1 — Generate URL Candidates

Rule-based generation per domain:

  • pricing: /pricing, /plans
  • status: status.{domain}, /status
  • privacy: /privacy, /privacy-policy
  • terms: /terms, /terms-of-service
  • security: /security, /trust, /security-and-privacy
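These rules expand mechanically per domain. A sketch, assuming HTTPS and a canonical (scheme-less) domain as input:

```python
# Path patterns per url_type; "status.{domain}" is a subdomain pattern.
CANDIDATE_RULES = {
    "pricing":  ["/pricing", "/plans"],
    "status":   ["status.{domain}", "/status"],
    "privacy":  ["/privacy", "/privacy-policy"],
    "terms":    ["/terms", "/terms-of-service"],
    "security": ["/security", "/trust", "/security-and-privacy"],
}

def generate_candidates(domain: str) -> dict[str, list[str]]:
    """Expand the B1 rules into absolute candidate URLs for one domain."""
    out: dict[str, list[str]] = {}
    for url_type, patterns in CANDIDATE_RULES.items():
        urls = []
        for p in patterns:
            if p.startswith("/"):
                urls.append(f"https://{domain}{p}")
            else:
                urls.append(f"https://{p.format(domain=domain)}")
        out[url_type] = urls
    return out
```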

B2 — Validate & Select Winner

For each candidate:

  • Resolve redirects
  • Capture final_url and HTTP status
  • Extract <title> and the first <h1> when cheaply available
  • Reject URLs with tracking/session parameters

Selection heuristics:

  • Keyword match in title/h1
  • Short, stable paths preferred
  • Same-domain final URL
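The rejection rules and selection heuristics above can be expressed as a pure scoring function, separate from the network fetch; the tracking-parameter list and score weights here are illustrative:

```python
from typing import Optional
from urllib.parse import urlparse, parse_qs

# Illustrative tracking/session parameters that disqualify a candidate.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid", "sessionid"}

TYPE_KEYWORDS = {
    "pricing": ("pricing", "plans"),
    "status": ("status", "uptime"),
    "privacy": ("privacy",),
    "terms": ("terms",),
    "security": ("security", "trust"),
}

def has_tracking_params(url: str) -> bool:
    return bool(TRACKING_PARAMS & set(parse_qs(urlparse(url).query)))

def score_candidate(url_type: str, final_url: str, title: Optional[str],
                    h1: Optional[str], canonical_domain: str) -> int:
    """Higher is better; a score <= 0 means reject the candidate."""
    if has_tracking_params(final_url):
        return -1
    parsed = urlparse(final_url)
    host = parsed.netloc.lower().removeprefix("www.")
    if not (host == canonical_domain or host.endswith("." + canonical_domain)):
        return -1                          # off-domain redirect
    score = 0
    text = f"{title or ''} {h1 or ''}".lower()
    if any(k in text for k in TYPE_KEYWORDS[url_type]):
        score += 2                         # keyword match in title/h1
    if parsed.path.count("/") <= 1:
        score += 1                         # short, stable path preferred
    return score
```

A tie between two candidates with equal top scores is exactly the MULTIPLE_STRONG_CANDIDATES review case.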

Review reasons:

  • NO_MATCH_FOR_URL_TYPE
  • MULTIPLE_STRONG_CANDIDATES
  • OFF_DOMAIN_REDIRECT
  • URL_TYPE_MISMATCH

B3 — URL Summary (LLM-Assisted)

Inputs to model:

  • url_type
  • final_url
  • title/h1
  • small cleaned snippet (capped)

Outputs:

  • why_track (1–2 sentences, change-focused)
  • summary (2–3 sentences, purpose-oriented)
  • confidence_summary

Hard rules:

  • No quoting long text
  • No pricing tables or policy clauses
  • No inference beyond provided signals
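The "no quoting long text" rule is easiest to enforce upstream by capping what the model ever sees. A sketch with an illustrative character cap:

```python
import re

SNIPPET_CAP = 1500  # characters; illustrative cap, not a tuned value

def clean_snippet(raw_text: str, cap: int = SNIPPET_CAP) -> str:
    """Collapse whitespace and hard-cap the text passed to the summarizer,
    so the model never sees enough content to reproduce pricing tables
    or policy clauses verbatim."""
    text = re.sub(r"\s+", " ", raw_text).strip()
    return text[:cap]
```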

Acceptance Criteria (V1)

Brand is seed-ready if:

  • canonical_domain present
  • category confidence ≠ low
  • ≥2 validated URL types
  • identity confidence ≠ low (or explicitly marked incomplete)

URL is seed-ready if:

  • final_url stable and allowed
  • url_type match confidence ≠ low
  • summary confidence ≠ low (or flagged for review)
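The brand-level criteria translate directly into a gate function; `marked_incomplete` is a hypothetical flag standing in for the "explicitly marked incomplete" escape hatch:

```python
def brand_is_seed_ready(brand: dict, validated_url_types: set[str]) -> bool:
    """Direct translation of the V1 brand acceptance criteria."""
    return (
        bool(brand.get("canonical_domain"))
        and brand.get("confidence_category") != "low"
        and len(validated_url_types) >= 2
        and (brand.get("confidence_identity") != "low"
             or brand.get("marked_incomplete", False))
    )
```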

Execution Phases

Phase 0 — One-Time Setup

  • Define taxonomy (categories.v1.json)
  • Define URL types (url_types.v1.json)
  • Create DB tables + enums
  • Version prompt templates

Phase 1 — Identity MVP

  • 2k–5k brands
  • Logos + summaries
  • Review queue operational

Phase 2 — URL Bundles MVP

  • Pricing + privacy + terms (+ status for SaaS)
  • Coverage metrics

Phase 3 — Scale & Harden

  • Runbooks
  • QA sampling
  • Coverage dashboards

Tooling Responsibilities

Claude Code

  • Ingestion and normalization scripts
  • URL validation logic
  • Review queue workflow
  • Seed run reports

ChatGPT

  • Taxonomy design
  • Prompt authoring
  • QA heuristics and thresholds
  • Documentation

Metrics to Track

  • % brands with confirmed canonical domains
  • Category confidence distribution
  • Avg URL types per brand
  • % brands seed-ready
  • Top review reasons
  • QA sampling accuracy

Outcome

This plan produces a clean, trustworthy Brand Library that:

  • Scales without crawling
  • Aligns with Eko’s URL-scoped intelligence model
  • Supports onboarding, discovery, and future automation
  • Makes uncertainty explicit instead of hiding it

This document is the source of truth for V1 Brand Library seeding.

Notes

This plan is scoped, enforceable, and implementation-ready for Claude Code without ambiguity.

Recommended next steps (in order)

  1. Lock the taxonomy

    • Create categories.v1.json (allowed category paths only).
    • This prevents drift once seeding starts.
  2. Define prompt contracts

    • Extract the two LLM prompts (brand categorization, URL summary) into versioned files.
    • Treat them like APIs.
  3. Create a seed-run checklist

    • A short operational doc: “If a seed run looks wrong, check these 7 things first.”
  4. Implement Identity → URL convergence

    • Start with Identity-only runs to validate brand quality before URL work.

Candidate follow-up artifacts:

  • A draft categories.v1.json (starter taxonomy, consumer + business)
  • The exact prompt files Claude Code should load
  • A V1 seed-run QA report template (what shows up in /reports/)