# Tracker Runbook
Purpose: Fetch a single URL, normalize content, and produce stable section hashes suitable for diffing.
## First Diagnostic Question

Is the input wrong? (fetch, DOM parsing, normalization)

If yes, stop here and resolve tracker issues before any downstream investigation.
## Inputs

- `url_check(URL, cadence, last_seen_hashes)`
- Fetch config (timeouts, UA, headers)
## Process Steps

1. Fetch
   - Use a consistent UA and timeout budget
   - Respect robots/blocks at request level (do not bypass)
   - Capture HTTP status and response size
2. Normalize
   - Parse DOM (server-side, no JS execution)
   - Remove boilerplate (nav, footer, cookie banners where detectable)
   - Canonicalize whitespace and attribute order
3. Sectioning
   - Segment by semantic headers (h1-h4) and stable containers
   - Assign deterministic IDs per section (path-based + ordinal)
4. Hashing
   - Hash normalized text per section (content-only)
   - Produce ordered `section_hashes[]`
5. Persist
   - Store latest hashes and minimal metadata (status, bytes, timing)
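The normalization, sectioning, and hashing steps above can be sketched as follows. This is a minimal illustration, not the tracker's actual implementation: the function names, the SHA-256 choice, and the `path#ordinal` ID format are assumptions.

```python
import hashlib
import re

def normalize_text(text: str) -> str:
    # Canonicalize whitespace so cosmetic changes do not alter hashes
    return re.sub(r"\s+", " ", text).strip()

def section_id(dom_path: str, ordinal: int) -> str:
    # Deterministic per-section ID: DOM path plus ordinal (assumed format)
    return f"{dom_path}#{ordinal}"

def hash_section(text: str) -> str:
    # Content-only hash of the normalized section text
    return hashlib.sha256(normalize_text(text).encode("utf-8")).hexdigest()

def section_hashes(sections: list[tuple[str, str]]) -> list[tuple[str, str]]:
    # sections: (dom_path, raw_text) pairs in document order;
    # output order is preserved so downstream diffing stays stable
    return [(section_id(path, i), hash_section(text))
            for i, (path, text) in enumerate(sections)]
```

Because hashing happens after whitespace canonicalization, two fetches that differ only in indentation or line breaks produce identical `section_hashes[]`.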
## Diagnostic Decision Tree
```
Tracker issue suspected
│
├─ Fetch failing?
│   ├─ HTTP 4xx → Check URL validity, access restrictions
│   ├─ HTTP 5xx → Site down, retry later
│   ├─ Timeout → Increase timeout budget or flag as slow
│   └─ Network error → Check connectivity, DNS
│
├─ DOM parsing failing?
│   ├─ Empty document → Site may require JS rendering
│   ├─ Malformed HTML → Parser tolerance issue
│   └─ Encoding error → Check charset detection
│
├─ Normalization unstable?
│   ├─ Section count varies between runs → Boilerplate detection issue
│   ├─ Hashes change without visible content change → Whitespace/attribute normalization
│   └─ Sections missing → Container detection failure
│
└─ Hashing inconsistent?
    ├─ Same content, different hash → Normalization not deterministic
    └─ Hash collision → Increase hash length or switch hash algorithm
```
## Common Failure Scenarios

### Fetch Failures
Symptoms:
- HTTP errors (4xx, 5xx)
- Timeouts
- Connection refused
Actions:
- Verify URL is still valid and accessible
- Check if site is blocking our UA
- Review timeout configuration
- If persistent, mark URL as unreachable and notify user
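A minimal standard-library sketch of a fetch that follows the rules above (consistent UA, timeout budget, status and size capture). The UA string and timeout value are assumed placeholders; real values come from the Fetch config.

```python
from urllib import request, error

USER_AGENT = "tracker-bot/1.0"   # assumed placeholder; use the configured UA
TIMEOUT_SECONDS = 10             # assumed placeholder; use the configured budget

def fetch(url: str) -> dict:
    """Fetch with a consistent UA and timeout, capturing status and size."""
    req = request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with request.urlopen(req, timeout=TIMEOUT_SECONDS) as resp:
            body = resp.read()
            return {"status": resp.status, "bytes": len(body), "body": body}
    except error.HTTPError as exc:
        # 4xx/5xx: record the status so the decision tree can branch on it
        return {"status": exc.code, "bytes": 0, "error": "http"}
    except (error.URLError, TimeoutError) as exc:
        # DNS failure, refused connection, or timeout
        return {"status": None, "bytes": 0, "error": str(exc)}
```

Returning a status/error dict instead of raising keeps the run's metadata recordable even when the fetch fails, which Degrade Mode below depends on.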
### Unstable Section Hashes
Symptoms:
- Section count swings between checks
- Hashes change without meaningful content change
- False positives downstream
Actions:
- Inspect raw HTML for dynamic content (timestamps, session IDs)
- Review boilerplate removal rules
- Add URL-specific normalization rules if needed
- Consider marking as "unstable" to suppress noisy alerts
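One way to sketch a URL-specific normalization rule: strip known-volatile substrings before hashing. The patterns below (timestamps, session IDs, nonces) are assumed examples of dynamic content, not the tracker's actual rule set.

```python
import re

# Illustrative patterns for volatile content that causes hash churn
# without meaningful change (assumed formats, not real tracker rules)
VOLATILE_PATTERNS = [
    re.compile(r"\b\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}(:\d{2})?\b"),  # timestamps
    re.compile(r"\bsessionid=[0-9a-f]+\b", re.IGNORECASE),         # session IDs
    re.compile(r"\bnonce-[A-Za-z0-9+/=]+\b"),                      # CSP nonces
]

def strip_volatile(text: str) -> str:
    """Remove known-dynamic substrings so hashes stay stable across runs."""
    for pattern in VOLATILE_PATTERNS:
        text = pattern.sub("", text)
    return text
```

Applying this before the content-only hash suppresses false positives from embedded timestamps while leaving genuinely changed text detectable.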
### JS-Dependent Content
Symptoms:
- Empty or minimal content returned
- Content doesn't match what browser shows
Actions:
- Confirm site requires JavaScript for content
- Flag for potential render-based fetching (future capability)
- Document limitation for user
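A rough heuristic for confirming JS-dependence: compare the size of the extractable text to the size of the raw HTML payload. The regex-based tag stripping and the 2% threshold are assumptions for illustration; a real check would use the tracker's DOM parser.

```python
import re

# Assumed threshold: pages whose visible text is a tiny fraction of the
# HTML payload are likely rendered client-side
MIN_TEXT_RATIO = 0.02

def looks_js_dependent(html: str) -> bool:
    """Rough heuristic: strip tags/scripts, compare text size to HTML size."""
    if not html:
        return True  # empty document: JS rendering (or a fetch failure)
    # Drop script/style bodies, then all remaining tags, then whitespace
    stripped = re.sub(r"(?is)<(script|style)[^>]*>.*?</\1>", "", html)
    text = re.sub(r"(?s)<[^>]+>", "", stripped)
    text = re.sub(r"\s+", "", text)
    return len(text) / len(html) < MIN_TEXT_RATIO
```

An app shell (large script payload, empty root div) scores near zero and gets flagged; an article page with real body text does not.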
## Stop Conditions

### Hard Stop
Trigger immediately if any are true:
- Fetch or normalization failures corrupt section hashes at scale
- Parser errors produce empty or unstable DOM output across many URLs
Action: Pause tracker jobs for affected URLs and suppress downstream stages.
### Degrade Mode
- Record HTTP status and metadata only
- Skip section hashing and diffing for the affected run
Resume only after sample fetches normalize.
## Signals to Watch
| Signal | Indicates |
|---|---|
| Sudden section count swings | Boilerplate detection instability |
| Large text deltas with identical layout | Normalization issue |
| Fetch error rate spike | Upstream site issues or rate limiting |
| Timeout rate increase | Performance degradation or blocking |
## Database Queries

### Check recent fetch status for a URL

```sql
SELECT url, status, created_at, response_size_bytes
FROM url_checks
WHERE url_id = '<url_id>'
ORDER BY created_at DESC
LIMIT 10;
```
### Find URLs with high fetch error rates

```sql
SELECT url_id, COUNT(*) AS errors
FROM url_checks
WHERE status >= 400
  AND created_at > NOW() - INTERVAL '24 hours'
GROUP BY url_id
ORDER BY errors DESC
LIMIT 20;
```
## Related Runbooks
- Incident Playbook - Master triage
- Render - If delta is wrong after tracker succeeds
- Queue - If tracker jobs aren't running