Tracker Runbook

Purpose: Fetch a single URL, normalize content, and produce stable section hashes suitable for diffing.


First Diagnostic Question

Is the tracker feeding bad input downstream? (fetch failure, DOM parsing failure, or unstable normalization)

If yes, stop here and resolve tracker issues before any downstream investigation.


Inputs

  • url_check (URL, cadence, last_seen_hashes)
  • Fetch config (timeouts, UA, headers)

Process Steps

1. Fetch

  • Use a consistent UA and timeout budget
  • Respect robots/blocks at request level (do not bypass)
  • Capture HTTP status and response size
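
The fetch step can be sketched as below. The UA string, timeout value, and function names are placeholders for illustration, not the configured values:

```python
# Sketch of step 1 (Fetch) using stdlib urllib; the real fetch config
# (UA, timeout budget, extra headers) comes from Fetch config inputs.
import urllib.request

UA = "tracker-bot/1.0"   # placeholder; use the configured UA
TIMEOUT_SECONDS = 15     # placeholder timeout budget

def build_request(url: str) -> urllib.request.Request:
    """Build a request with a consistent UA; robots/blocks are honored upstream."""
    return urllib.request.Request(url, headers={"User-Agent": UA})

def fetch(url: str):
    """Return (http_status, body_bytes); callers record status and size."""
    req = build_request(url)
    with urllib.request.urlopen(req, timeout=TIMEOUT_SECONDS) as resp:
        body = resp.read()
        return resp.status, body
```

Capturing status and response size at this layer is what makes the `url_checks` queries at the end of this runbook possible.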

2. Normalize

  • Parse DOM (server-side, no JS execution)
  • Remove boilerplate (nav, footer, cookie banners where detectable)
  • Canonicalize whitespace and attribute order
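
A minimal sketch of the whitespace side of normalization, assuming stdlib `html.parser`; boilerplate removal and attribute ordering are omitted here:

```python
# Sketch of step 2 (Normalize): extract visible text server-side (no JS
# execution) and canonicalize whitespace so hashes stay stable.
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style subtrees."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)

def normalize_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    # Canonicalize whitespace: collapse runs to one space, trim ends.
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()
```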

3. Sectioning

  • Segment by semantic headers (h1-h4) and stable containers
  • Assign deterministic IDs per section (path-based + ordinal)
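
The deterministic-ID rule can be illustrated with a simplified scheme (tag plus per-tag ordinal); the real path-based IDs may encode more structure:

```python
# Sketch of step 3 (Sectioning) ID assignment: same input order always
# yields the same IDs, which is what makes diffing across runs possible.
def assign_section_ids(headers):
    """headers: ordered (tag, text) pairs, e.g. [('h1', 'Intro'), ('h2', 'Setup')].
    Returns IDs like 'h1:1', 'h2:1' (tag + per-tag ordinal)."""
    counts = {}
    ids = []
    for tag, _text in headers:
        counts[tag] = counts.get(tag, 0) + 1
        ids.append(f"{tag}:{counts[tag]}")
    return ids
```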

4. Hashing

  • Hash normalized text per section (content-only)
  • Produce ordered section_hashes[]
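
Content-only hashing per section, sketched here with SHA-256 (the runbook does not specify the actual algorithm):

```python
# Sketch of step 4 (Hashing): hash only the normalized text, never the
# section IDs, so an ID renumbering does not produce false diffs.
import hashlib

def section_hashes(sections):
    """sections: ordered list of normalized section texts.
    Returns the ordered section_hashes[] as hex digests."""
    return [hashlib.sha256(s.encode("utf-8")).hexdigest() for s in sections]
```

Because the input is normalized first, the same content must always produce the same hash; if it does not, normalization is non-deterministic (see the decision tree below).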

5. Persist

  • Store latest hashes and minimal metadata (status, bytes, timing)
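
A persistence sketch using an in-memory SQLite table; column names mirror the `url_checks` queries at the end of this runbook, but the exact schema is an assumption:

```python
# Sketch of step 5 (Persist): store latest hashes plus minimal metadata.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE url_checks (
        url_id TEXT,
        url TEXT,
        status INTEGER,
        response_size_bytes INTEGER,
        section_hashes TEXT,              -- JSON-encoded ordered list
        created_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

def persist_check(url_id, url, status, size_bytes, hashes):
    """Record one check; parameterized to avoid SQL injection."""
    conn.execute(
        "INSERT INTO url_checks (url_id, url, status, response_size_bytes, section_hashes) "
        "VALUES (?, ?, ?, ?, ?)",
        (url_id, url, status, size_bytes, json.dumps(hashes)),
    )
```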

Diagnostic Decision Tree

Tracker issue suspected
    │
    ├─ Fetch failing?
    │   ├─ HTTP 4xx → Check URL validity, access restrictions
    │   ├─ HTTP 5xx → Site down, retry later
    │   ├─ Timeout → Increase timeout budget or flag as slow
    │   └─ Network error → Check connectivity, DNS
    │
    ├─ DOM parsing failing?
    │   ├─ Empty document → Site may require JS rendering
    │   ├─ Malformed HTML → Parser tolerance issue
    │   └─ Encoding error → Check charset detection
    │
    ├─ Normalization unstable?
    │   ├─ Section count varies between runs → Boilerplate detection issue
    │   ├─ Hashes change without visible content change → Whitespace/attribute normalization
    │   └─ Sections missing → Container detection failure
    │
    └─ Hashing inconsistent?
        ├─ Same content, different hash → Normalization not deterministic
        └─ Hash collision → Increase hash length or switch to a stronger algorithm

Common Failure Scenarios

Fetch Failures

Symptoms:

  • HTTP errors (4xx, 5xx)
  • Timeouts
  • Connection refused

Actions:

  1. Verify URL is still valid and accessible
  2. Check if site is blocking our UA
  3. Review timeout configuration
  4. If persistent, mark URL as unreachable and notify user

Unstable Section Hashes

Symptoms:

  • Section count swings between checks
  • Hashes change without meaningful content change
  • False positives downstream

Actions:

  1. Inspect raw HTML for dynamic content (timestamps, session IDs)
  2. Review boilerplate removal rules
  3. Add URL-specific normalization rules if needed
  4. Consider marking as "unstable" to suppress noisy alerts
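
Action 1 can be backed by stripping known dynamic tokens before hashing. These patterns are illustrative assumptions, not the production normalization rules:

```python
# Sketch of URL-specific normalization: remove timestamps and session
# tokens that change every fetch without any meaningful content change.
import re

DYNAMIC_PATTERNS = [
    re.compile(r"\b\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}(:\d{2})?\b"),  # ISO-ish timestamps
    re.compile(r"\b(sessionid|sid|csrf)=[A-Za-z0-9]+", re.I),       # session tokens
]

def strip_dynamic(text: str) -> str:
    """Remove matched dynamic tokens, then re-collapse whitespace."""
    for pattern in DYNAMIC_PATTERNS:
        text = pattern.sub("", text)
    return re.sub(r"\s+", " ", text).strip()
```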

JS-Dependent Content

Symptoms:

  • Empty or minimal content returned
  • Content doesn't match what browser shows

Actions:

  1. Confirm site requires JavaScript for content
  2. Flag for potential render-based fetching (future capability)
  3. Document limitation for user

Stop Conditions

Hard Stop

Trigger immediately if any are true:

  • Fetch or normalization failures corrupt section hashes at scale
  • Parser errors produce empty or unstable DOM output across many URLs

Action: Pause tracker jobs for affected URLs and suppress downstream stages.

Degrade Mode

  • Record HTTP status and metadata only
  • Skip section hashing and diffing for the affected run

Resume full processing only after sample fetches return stable, parseable content.


Signals to Watch

Signal                                     Indicates
------                                     ---------
Sudden section count swings                Boilerplate detection instability
Large text deltas with identical layout    Normalization issue
Fetch error rate spike                     Upstream site issues or rate limiting
Timeout rate increase                      Performance degradation or blocking

Database Queries

Check recent fetch status for a URL

SELECT url, status, created_at, response_size_bytes
FROM url_checks
WHERE url_id = '<url_id>'
ORDER BY created_at DESC
LIMIT 10;

Find URLs with high fetch error rates

SELECT url_id, COUNT(*) as errors
FROM url_checks
WHERE status >= 400
  AND created_at > NOW() - INTERVAL '24 hours'
GROUP BY url_id
ORDER BY errors DESC
LIMIT 20;