Brand Library Master Plan (V1)


Path: /docs/dev/brand-library-master-plan.md

Purpose

This document defines the V1 master plan for building Eko’s Brand Library: a scalable, auditable system for collecting basic brand identity data and high-signal tracked URLs, then seeding them into the Eko database.

The Brand Library exists to:

  • Power Tracking Suggestions
  • Improve onboarding speed and trust
  • Provide consistent brand context for URL change summaries
  • Avoid crawling, scraping, or site-wide inference

Guiding Principles

  1. Entity-first, URL-second

    • Identify who the brand is before deciding what to track
  2. Deterministic before generative

    • Rules, heuristics, and validation precede LLM usage
  3. URL-scoped, non-substitutive

    • Summaries describe purpose and signal value, not content reproduction
  4. Confidence-aware by default

    • Every derived field carries an explicit confidence level
  5. Idempotent & repeatable

    • Seed runs can be safely re-executed without data corruption

High-Level Architecture

Two parallel pipelines converge on a shared brand_id:

  1. Brand Identity Pipeline: Who is this brand?
  2. URL Signal Pipeline: What pages are worth tracking?
[ Identity Sources ] ─┐
                      ├─► brand_id ◄─┐
[ URL Discovery ] ────┘               │
                                      ▼
                              Seeded Brand Library

Core Data Model (V1)

brand_library_sources

Tracks provenance of all ingested data.

  • id
  • name
  • source_url?
  • license?
  • notes?

brand_library_brands

Represents a canonical brand entity.

  • id
  • brand_name
  • canonical_domain (unique)
  • domain_aliases[]?
  • category_path
  • hq_city?
  • hq_region?
  • hq_country?
  • business_summary? (max 2 sentences)
  • business_model? (SaaS, retailer, marketplace, etc.)
  • audience? (B2B, B2C, hybrid)
  • logo_url?
  • confidence_identity (high | medium | low)
  • confidence_category (high | medium | low)
  • source_refs (json)

brand_library_urls

Represents a validated, trackable URL for a brand.

  • id
  • brand_id
  • url_type (pricing | status | privacy | terms | security)
  • tracked_url
  • final_url
  • http_status
  • title?
  • h1?
  • why_track (1–2 sentences)
  • summary (2–3 sentences)
  • confidence_url_match (high | medium | low)
  • confidence_summary (high | medium | low)
  • source_refs (json)

brand_library_review_queue

Centralized queue for ambiguity and low-confidence items.

  • id
  • entity_type (brand | url)
  • entity_id
  • reason (enum)
  • details (json)
  • status (open | fixed | ignored)
  • timestamps
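For implementers, the tables above can be mirrored as typed records. A minimal Python sketch follows; the dataclass and the `Confidence` enum are illustrative shapes, not the actual ORM models:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class Confidence(str, Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"

@dataclass
class BrandLibraryUrl:
    """Mirrors the brand_library_urls table described above."""
    id: str
    brand_id: str
    url_type: str                 # pricing | status | privacy | terms | security
    tracked_url: str
    final_url: str
    http_status: int
    why_track: str                # 1-2 sentences, change-focused
    summary: str                  # 2-3 sentences, purpose-oriented
    confidence_url_match: Confidence
    confidence_summary: Confidence
    source_refs: dict = field(default_factory=dict)
    title: Optional[str] = None
    h1: Optional[str] = None
```

Keeping the confidence values as a shared enum (rather than free strings) lets the seed scripts and the review queue agree on the same three levels.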

Pipeline A: Brand Identity Pipeline

A1 — Inputs

  • Structured company datasets (e.g., PDL)
  • Secondary datasets (e.g., Kaggle)
  • Brand enrichment API (Brandfetch)

A2 — Normalize & Dedupe

Rules:

  • Canonicalize domains (scheme-less, lowercase, strip www)
  • Dedupe on canonical_domain
  • Preserve alternate domains as aliases
  • Flag suspicious hosts (github.io, notion.site, URL shorteners)

Review reasons:

  • DOMAIN_AMBIGUOUS
  • DOMAIN_MISSING
  • CONFLICTING_DOMAINS
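The canonicalization and flagging rules above are deterministic and can be sketched directly; the suspicious-host list here is illustrative, not exhaustive:

```python
from urllib.parse import urlparse

# Illustrative list: shared hosting platforms and URL shorteners.
SUSPICIOUS_HOSTS = {"github.io", "notion.site", "bit.ly", "t.co"}

def canonicalize_domain(raw: str) -> str:
    """Scheme-less, lowercase, strip leading www. (the A2 rules)."""
    host = urlparse(raw if "://" in raw else f"https://{raw}").netloc or raw
    host = host.lower().split(":")[0]          # drop any port
    if host.startswith("www."):
        host = host[len("www."):]
    return host

def is_suspicious(host: str) -> bool:
    """Flag hosts that live on shared platforms or shorteners."""
    return any(host == s or host.endswith("." + s) for s in SUSPICIOUS_HOSTS)
```

Deduping on the output of `canonicalize_domain` keeps `https://www.acme.com/` and `acme.com` from seeding two brands.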

A3 — Merge Identity Facts

Precedence:

  1. Structured datasets → HQ region/country
  2. Brandfetch → logo, display name
  3. LLM classification → business model & audience

Conflict handling:

  • Conflicting HQ → null + review
  • Conflicting names → choose dataset value, keep others as aliases
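A sketch of the precedence and conflict rules, assuming each source arrives as a plain dict; the `name_aliases` field is hypothetical, used here only to show alias retention:

```python
from typing import Optional

def merge_identity_facts(dataset: dict, brandfetch: dict, llm: dict) -> tuple[dict, list[str]]:
    """Apply the A3 precedence order; conflicting HQ values go to review."""
    merged: dict = {}
    review: list[str] = []

    # 1. Structured datasets win for HQ fields; conflicts null out + review.
    for key in ("hq_city", "hq_region", "hq_country"):
        ds, bf = dataset.get(key), brandfetch.get(key)
        if ds and bf and ds != bf:
            merged[key] = None
            review.append(f"CONFLICTING_{key.upper()}")
        else:
            merged[key] = ds or bf

    # 2. Brandfetch supplies the logo; on name conflict the dataset value
    #    wins and the other is kept as an alias.
    merged["logo_url"] = brandfetch.get("logo_url")
    ds_name: Optional[str] = dataset.get("brand_name")
    bf_name: Optional[str] = brandfetch.get("brand_name")
    merged["brand_name"] = ds_name or bf_name
    merged["name_aliases"] = [bf_name] if (ds_name and bf_name and ds_name != bf_name) else []

    # 3. LLM classification fills business model and audience.
    merged["business_model"] = llm.get("business_model")
    merged["audience"] = llm.get("audience")
    return merged, review
```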

A4 — Business Summary (LLM-Assisted)

Inputs to model:

  • brand_name
  • canonical_domain
  • industry/category hints
  • Brandfetch description (if available)
  • dataset description (if available)

Output:

  • business_summary (≤2 sentences, neutral)
  • business_model
  • audience
  • confidence_identity

Constraints:

  • No marketing language
  • No market claims or superlatives
  • Hedge uncertainty explicitly

Review reasons:

  • SUMMARY_LOW_CONFIDENCE
  • SUMMARY_TOO_MARKETING
  • INSUFFICIENT_INPUTS
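A deterministic QA pass can map these constraints to review reasons before any human look. The marketing wordlist below is a placeholder to be tuned, not a complete detector:

```python
import re

# Illustrative superlative/marketing wordlist; the production list would be tuned.
MARKETING_TERMS = re.compile(
    r"\b(leading|best-in-class|world-class|revolutionary|award-winning|unrivaled)\b",
    re.IGNORECASE,
)

def review_reasons_for_summary(summary: str, confidence: str) -> list[str]:
    """Map A4 constraint violations to review-queue reasons."""
    reasons = []
    if not summary.strip():
        reasons.append("INSUFFICIENT_INPUTS")
    elif MARKETING_TERMS.search(summary):
        reasons.append("SUMMARY_TOO_MARKETING")
    if confidence == "low":
        reasons.append("SUMMARY_LOW_CONFIDENCE")
    return reasons
```

Running this after every LLM call keeps the review queue populated without manual triage.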

Pipeline B: URL Signal Pipeline

Supported URL Types (V1)

  • pricing
  • status
  • privacy
  • terms
  • security (optional but recommended for B2B)

B1 — Generate URL Candidates

Rule-based generation per domain:

  • pricing: /pricing, /plans
  • status: status.{domain}, /status
  • privacy: /privacy, /privacy-policy
  • terms: /terms, /terms-of-service
  • security: /security, /trust, /security-and-privacy
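These rules expand mechanically per domain. A sketch, assuming HTTPS and a canonical (scheme-less) domain as input:

```python
# Path patterns per url_type; "status.{domain}" is a subdomain pattern.
CANDIDATE_RULES = {
    "pricing":  ["/pricing", "/plans"],
    "status":   ["status.{domain}", "/status"],
    "privacy":  ["/privacy", "/privacy-policy"],
    "terms":    ["/terms", "/terms-of-service"],
    "security": ["/security", "/trust", "/security-and-privacy"],
}

def generate_candidates(domain: str) -> dict[str, list[str]]:
    """Expand the B1 rules into absolute candidate URLs for one domain."""
    out: dict[str, list[str]] = {}
    for url_type, patterns in CANDIDATE_RULES.items():
        urls = []
        for p in patterns:
            if p.startswith("/"):
                urls.append(f"https://{domain}{p}")
            else:
                urls.append(f"https://{p.format(domain=domain)}")
        out[url_type] = urls
    return out
```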

B2 — Validate & Select Winner

For each candidate:

  • Resolve redirects
  • Capture final_url and HTTP status
  • Extract <title> and the first <h1> when cheaply available
  • Reject URLs with tracking/session parameters

Selection heuristics:

  • Keyword match in title/h1
  • Short, stable paths preferred
  • Same-domain final URL
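The rejection rules and selection heuristics above can be expressed as a pure scoring function, separate from the network fetch; the tracking-parameter list and score weights here are illustrative:

```python
from typing import Optional
from urllib.parse import urlparse, parse_qs

# Illustrative tracking/session parameters that disqualify a candidate.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid", "sessionid"}

TYPE_KEYWORDS = {
    "pricing": ("pricing", "plans"),
    "status": ("status", "uptime"),
    "privacy": ("privacy",),
    "terms": ("terms",),
    "security": ("security", "trust"),
}

def has_tracking_params(url: str) -> bool:
    return bool(TRACKING_PARAMS & set(parse_qs(urlparse(url).query)))

def score_candidate(url_type: str, final_url: str, title: Optional[str],
                    h1: Optional[str], canonical_domain: str) -> int:
    """Higher is better; a score <= 0 means reject the candidate."""
    if has_tracking_params(final_url):
        return -1
    parsed = urlparse(final_url)
    host = parsed.netloc.lower().removeprefix("www.")
    if not (host == canonical_domain or host.endswith("." + canonical_domain)):
        return -1                          # off-domain redirect
    score = 0
    text = f"{title or ''} {h1 or ''}".lower()
    if any(k in text for k in TYPE_KEYWORDS[url_type]):
        score += 2                         # keyword match in title/h1
    if parsed.path.count("/") <= 1:
        score += 1                         # short, stable path preferred
    return score
```

A tie between two candidates with equal top scores is exactly the MULTIPLE_STRONG_CANDIDATES review case.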

Review reasons:

  • NO_MATCH_FOR_URL_TYPE
  • MULTIPLE_STRONG_CANDIDATES
  • OFF_DOMAIN_REDIRECT
  • URL_TYPE_MISMATCH

B3 — URL Summary (LLM-Assisted)

Inputs to model:

  • url_type
  • final_url
  • title/h1
  • small cleaned snippet (capped)

Outputs:

  • why_track (1–2 sentences, change-focused)
  • summary (2–3 sentences, purpose-oriented)
  • confidence_summary

Hard rules:

  • No quoting long text
  • No pricing tables or policy clauses
  • No inference beyond provided signals
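The "no quoting long text" rule is easiest to enforce upstream by capping what the model ever sees. A sketch with an illustrative character cap:

```python
import re

SNIPPET_CAP = 1500  # characters; illustrative cap, not a tuned value

def clean_snippet(raw_text: str, cap: int = SNIPPET_CAP) -> str:
    """Collapse whitespace and hard-cap the text passed to the summarizer,
    so the model never sees enough content to reproduce pricing tables
    or policy clauses verbatim."""
    text = re.sub(r"\s+", " ", raw_text).strip()
    return text[:cap]
```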

Acceptance Criteria (V1)

Brand is seed-ready if:

  • canonical_domain present
  • category confidence ≠ low
  • ≥2 validated URL types
  • identity confidence ≠ low (or explicitly marked incomplete)

URL is seed-ready if:

  • final_url stable and allowed
  • url_type match confidence ≠ low
  • summary confidence ≠ low (or flagged for review)
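The brand-level criteria translate directly into a gate function; `marked_incomplete` is a hypothetical flag standing in for the "explicitly marked incomplete" escape hatch:

```python
def brand_is_seed_ready(brand: dict, validated_url_types: set[str]) -> bool:
    """Direct translation of the V1 brand acceptance criteria."""
    return (
        bool(brand.get("canonical_domain"))
        and brand.get("confidence_category") != "low"
        and len(validated_url_types) >= 2
        and (brand.get("confidence_identity") != "low"
             or brand.get("marked_incomplete", False))
    )
```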

Execution Phases

Phase 0 — One-Time Setup

  • Define taxonomy (categories.v1.json)
  • Define URL types (url_types.v1.json)
  • Create DB tables + enums
  • Version prompt templates

Phase 1 — Identity MVP

  • 2k–5k brands
  • Logos + summaries
  • Review queue operational

Phase 2 — URL Bundles MVP

  • Pricing + privacy + terms (+ status for SaaS)
  • Coverage metrics

Phase 3 — Scale & Harden

  • Runbooks
  • QA sampling
  • Coverage dashboards

Tooling Responsibilities

Claude Code

  • Ingestion and normalization scripts
  • URL validation logic
  • Review queue workflow
  • Seed run reports

ChatGPT

  • Taxonomy design
  • Prompt authoring
  • QA heuristics and thresholds
  • Documentation

Metrics to Track

  • % brands with confirmed canonical domains
  • Category confidence distribution
  • Avg URL types per brand
  • % brands seed-ready
  • Top review reasons
  • QA sampling accuracy

Outcome

This plan produces a clean, trustworthy Brand Library that:

  • Scales without crawling
  • Aligns with Eko’s URL-scoped intelligence model
  • Supports onboarding, discovery, and future automation
  • Makes uncertainty explicit instead of hiding it

This document is the source of truth for V1 Brand Library seeding.

Notes

This plan is scoped, enforceable, and implementation-ready for Claude Code without ambiguity.

Recommended next steps (in order)

  1. Lock the taxonomy

    • Create categories.v1.json (allowed category paths only).
    • This prevents drift once seeding starts.
  2. Define prompt contracts

    • Extract the two LLM prompts (brand categorization, URL summary) into versioned files.
    • Treat them like APIs.
  3. Create a seed-run checklist

    • A short operational doc: “If a seed run looks wrong, check these 7 things first.”
  4. Implement Identity → URL convergence

    • Start with Identity-only runs to validate brand quality before URL work.

Candidate follow-up artifacts:

  • A draft categories.v1.json (starter taxonomy, consumer + business)
  • The exact prompt files Claude Code should load
  • A V1 seed-run QA report template (what shows up in /reports/)