Eko — Incident Playbook
This document defines how to identify, triage, and resolve incidents in the Eko system.
An incident is any situation where Eko fails to meet its product guarantees: accuracy, reliability, or trust.
Incident Severity Levels
SEV-1 — Critical
User trust or core functionality is actively compromised.
Examples:
- False positives flooding users
- Meaningful changes not being detected across many URLs
- Summaries violating fair-use constraints
- System-wide worker failure
Response: Immediate
SEV-2 — Degraded
Partial loss of functionality or quality.
Examples:
- Delayed checks
- Increased false positives on a subset of URLs
- One worker type failing while others continue
Response: Same day
SEV-3 — Minor
Localized or low-impact issues.
Examples:
- Single URL behaving unexpectedly
- Intermittent fetch failures
- Cosmetic admin issues
Response: Next scheduled maintenance window
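The severity levels above can be sketched as a small lookup. This is a minimal illustration, not Eko's actual code: the `IncidentReport` fields and the `classify` function are hypothetical names, and the two boolean inputs are a simplification of the judgment a responder would make.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = "SEV-1"
    SEV2 = "SEV-2"
    SEV3 = "SEV-3"


# Response targets copied from the severity definitions above.
RESPONSE_TARGET = {
    Severity.SEV1: "Immediate",
    Severity.SEV2: "Same day",
    Severity.SEV3: "Next scheduled maintenance window",
}


@dataclass(frozen=True)
class IncidentReport:
    # Hypothetical fields; a real report would carry more context.
    trust_or_core_compromised: bool  # e.g. alert floods, fair-use violations
    partial_degradation: bool        # e.g. delayed checks, one worker type down


def classify(report: IncidentReport) -> Severity:
    """Pick the highest severity whose condition holds."""
    if report.trust_or_core_compromised:
        return Severity.SEV1
    if report.partial_degradation:
        return Severity.SEV2
    return Severity.SEV3
```

Checking the SEV-1 condition first means an incident that is both degraded and trust-compromising is never under-classified.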
Incident Triage Checklist
When an incident is reported or detected:
- Confirm scope
  - Single URL or many?
  - One user or systemic?
- Classify severity (SEV-1 / SEV-2 / SEV-3)
- Identify failure class
  - Ingestion (fetch, clustering, images)
  - Fact extraction (AI output, schema, cost)
  - Validation (tier checks, confidence, status)
  - Queue / scheduling
- Stabilize first
  - Pause affected jobs if needed
  - Prefer suppression over noisy output
Fast Diagnostic Routing
Before deep investigation, identify which one of these questions applies:
- Is the input wrong? (fetch, clustering, images) → Ingestion worker
- Are facts wrong? (AI output, schema, cost) → Fact extraction worker
- Is verification wrong? (tier checks, confidence, status) → Validation worker
- Are jobs not flowing? (queues, retries, stuck) → Queue
This prevents debugging the wrong layer.
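The routing questions above amount to a lookup from symptom area to owning layer. A minimal sketch, assuming symptom areas are passed as the short phrases used in the list; `Owner` and `route` are hypothetical names introduced for illustration.

```python
from enum import Enum
from typing import Optional


class Owner(Enum):
    INGESTION = "Ingestion worker"
    FACT_EXTRACTION = "Fact extraction worker"
    VALIDATION = "Validation worker"
    QUEUE = "Queue"


# Symptom areas taken from the routing questions above.
_ROUTES = {
    Owner.INGESTION: ("fetch", "clustering", "images"),
    Owner.FACT_EXTRACTION: ("ai output", "schema", "cost"),
    Owner.VALIDATION: ("tier checks", "confidence", "status"),
    Owner.QUEUE: ("queues", "retries", "stuck"),
}

# Flatten to {symptom area: owner} for direct lookup.
SYMPTOM_TO_OWNER = {
    area: owner for owner, areas in _ROUTES.items() for area in areas
}


def route(symptom_area: str) -> Optional[Owner]:
    """Return the layer to investigate first, or None if the area is unknown."""
    return SYMPTOM_TO_OWNER.get(symptom_area.strip().lower())
```

Returning `None` for an unrecognized area leaves the decision with the responder instead of guessing a layer.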
Common Failure Classes
False Positives
Symptoms:
- Users receive alerts for trivial or cosmetic changes
Actions:
- Verify against the meaningful change spec
- Inspect section hashes for instability
- Increase suppression thresholds if needed
Principle: Silence is preferable to noise.
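One way to inspect section hashes for instability is to measure how often each section's hash changes between consecutive checks. The sketch below assumes a hypothetical history format (one `{section_id: hash}` mapping per check, oldest first) and an illustrative threshold of 0.8; neither comes from Eko's implementation.

```python
from typing import Dict, List


def hash_flip_rates(history: List[Dict[str, str]]) -> Dict[str, float]:
    """Fraction of consecutive check pairs in which each section's hash changed."""
    flips: Dict[str, int] = {}
    comparisons: Dict[str, int] = {}
    for prev, curr in zip(history, history[1:]):
        # Only compare sections present in both checks.
        for section_id in prev.keys() & curr.keys():
            comparisons[section_id] = comparisons.get(section_id, 0) + 1
            if prev[section_id] != curr[section_id]:
                flips[section_id] = flips.get(section_id, 0) + 1
    return {s: flips.get(s, 0) / n for s, n in comparisons.items()}


def unstable_sections(history: List[Dict[str, str]], threshold: float = 0.8) -> List[str]:
    """Sections whose hash changes on at least `threshold` of checks."""
    rates = hash_flip_rates(history)
    return sorted(s for s, rate in rates.items() if rate >= threshold)


if __name__ == "__main__":
    checks = [
        {"header": "a1", "body": "b1"},
        {"header": "a2", "body": "b1"},
        {"header": "a3", "body": "b1"},
        {"header": "a4", "body": "b2"},
    ]
    print(unstable_sections(checks))  # the header changes on every check
```

A section that changes on nearly every check (a timestamp or rotating banner, for example) is a candidate for suppression or for a normalization rule, rather than a source of alerts.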
Missed Changes
Symptoms:
- Known page changes not surfaced
Actions:
- Confirm change persisted between checks
- Review normalization rules
- Verify that the check cadence aligns with how often the page changes
Worker Failures
Symptoms:
- Checks not running
- Backlog in queues
Actions:
- Inspect queue depth
- Restart affected workers
- Confirm environment variables and credentials
Fair-Use Violations
Symptoms:
- Summaries contain excessive quotation
- Output reads like a replacement for the source page
Actions:
- Immediately halt affected summaries
- Review summarization prompts
- Reduce excerpt limits
Priority: Always treat as SEV-1
Communication Guidelines
- Be factual and calm
- Avoid speculation
- Acknowledge uncertainty when present
- Never blame monitored sites
User trust is preserved through transparency, not speed alone.
Post-Incident Review
After resolution:
- Document what happened
- Identify root cause
- Note prevention steps
- Update runbooks or specs if needed
Incidents are learning opportunities, not failures.
Final Rule
When in doubt:
Protect user trust first, completeness second.
Stop Conditions (Safety First)
Halt Immediately (Hard Stop)
Trigger if any of the following is true:
- Summaries violate fair-use constraints at scale
- Output risks replacing the source page
- Storage or diff corruption is possible
- Retry storms threaten platform stability
Action: Pause affected workers and suppress output.
Degrade Gracefully
Use when harm can be contained:
- Ingestion unstable → skip downstream extraction
- Fact extraction unreliable → hold facts in draft status
- Validation suspect → suppress publication, keep facts gated
Resume only after spot-checking correctness.
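The stop conditions above can be read as a two-step decision: any hard-stop signal overrides everything else, and otherwise each contained problem maps to its own degradation. A minimal sketch, assuming signals arrive as a set of strings; the signal names, `Action`, and `decide` are all hypothetical.

```python
from enum import Enum
from typing import List, Set


class Action(Enum):
    HARD_STOP = "pause affected workers and suppress output"
    SKIP_EXTRACTION = "skip downstream extraction"
    HOLD_DRAFT = "hold facts in draft status"
    SUPPRESS_PUBLICATION = "suppress publication, keep facts gated"
    CONTINUE = "no action required"


# Hypothetical signal names mirroring the hard-stop triggers above.
HARD_STOP_SIGNALS = frozenset({
    "fair_use_violation_at_scale",
    "page_replacement_risk",
    "storage_or_diff_corruption",
    "retry_storm",
})

# Hypothetical signal names mirroring the graceful-degradation cases above.
DEGRADE_ACTIONS = {
    "ingestion_unstable": Action.SKIP_EXTRACTION,
    "extraction_unreliable": Action.HOLD_DRAFT,
    "validation_suspect": Action.SUPPRESS_PUBLICATION,
}


def decide(signals: Set[str]) -> List[Action]:
    """Hard stop wins over everything; otherwise degrade per signal."""
    if signals & HARD_STOP_SIGNALS:
        return [Action.HARD_STOP]
    actions = [DEGRADE_ACTIONS[s] for s in sorted(signals) if s in DEGRADE_ACTIONS]
    return actions or [Action.CONTINUE]
```

Resuming is deliberately left out of the sketch: the playbook requires a manual spot-check of correctness before any paused or gated path is re-enabled.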
Related Runbooks
Use these runbooks for deeper, component-level response once the incident is classified:
- Queue health issues (backlogs, retries, stuck jobs): queue.md
- Scheduling issues (cron, dispatch timing): scheduling.md
For V1 subsystem references (tracker, render, summarization), see the archived runbooks at docs/docs_archive/runbooks-v1/.