Eko — Incident Playbook

This document defines how to identify, triage, and resolve incidents in the Eko system.

An incident is any situation where Eko fails to meet its product guarantees: accuracy, reliability, or trust.

Incident Severity Levels

SEV-1 — Critical

User trust or core functionality is actively compromised.

Examples:

False positives flooding users
Meaningful changes not being detected across many URLs
Summaries violating fair-use constraints
System-wide worker failure

Response: Immediate

SEV-2 — Degraded

Partial loss of functionality or quality.

Examples:

Delayed checks
Increased false positives on a subset of URLs
One worker type failing while others continue

Response: Same day

SEV-3 — Minor

Localized or low-impact issues.

Examples:

Single URL behaving unexpectedly
Intermittent fetch failures
Cosmetic admin issues

Response: Next scheduled maintenance window
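
The severity levels and response targets above can be encoded so that tooling and humans classify the same way. This is a minimal sketch, not Eko's implementation: the `IncidentReport` fields and the `classify` function are hypothetical names introduced for illustration, and only the level names and response targets come from this playbook.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = "SEV-1"  # Critical
    SEV2 = "SEV-2"  # Degraded
    SEV3 = "SEV-3"  # Minor


# Response targets copied from the severity definitions in this playbook.
RESPONSE_TARGET = {
    Severity.SEV1: "Immediate",
    Severity.SEV2: "Same day",
    Severity.SEV3: "Next scheduled maintenance window",
}


@dataclass
class IncidentReport:
    # Hypothetical fields; the playbook does not define a report schema.
    trust_or_core_compromised: bool  # e.g. fair-use violation, system-wide worker failure
    partial_degradation: bool        # e.g. delayed checks, one worker type failing


def classify(report: IncidentReport) -> Severity:
    """Return the highest severity whose description matches the report."""
    if report.trust_or_core_compromised:
        return Severity.SEV1
    if report.partial_degradation:
        return Severity.SEV2
    return Severity.SEV3
```

Checking the SEV-1 condition first means an incident that matches several levels is handled at the most urgent one.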

Incident Triage Checklist

When an incident is reported or detected:

1. Confirm scope
   - Single URL or many?
   - One user or systemic?
2. Classify severity (SEV-1 / SEV-2 / SEV-3)
3. Identify failure class
   - Ingestion (fetch, clustering, images)
   - Fact extraction (AI output, schema, cost)
   - Validation (tier checks, confidence, status)
   - Queue / scheduling
4. Stabilize first
   - Pause affected jobs if needed
   - Prefer suppression over noisy output

Fast Diagnostic Routing

Before deep investigation, identify which one of these questions describes the problem:

Is the input wrong? (fetch, clustering, images) → Ingestion worker
Are facts wrong? (AI output, schema, cost) → Fact extraction worker
Is verification wrong? (tier checks, confidence, status) → Validation worker
Are jobs not flowing? (queues, retries, stuck) → Queue

This prevents debugging the wrong layer.
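
The routing questions above amount to a small lookup from symptom to owning layer. The following is a rough sketch under stated assumptions: the keyword matching is a deliberately naive stand-in for human judgment, and `FailureClass`, `ROUTING_KEYWORDS`, and `route` are hypothetical names. The keywords are adapted from the parenthetical hints in the routing list.

```python
from enum import Enum
from typing import Optional


class FailureClass(Enum):
    INGESTION = "Ingestion worker"
    FACT_EXTRACTION = "Fact extraction worker"
    VALIDATION = "Validation worker"
    QUEUE = "Queue"


# Keywords adapted from the hints in the routing list (hypothetical matching rule).
ROUTING_KEYWORDS = {
    FailureClass.INGESTION: ("fetch", "clustering", "images"),
    FailureClass.FACT_EXTRACTION: ("ai output", "schema", "cost"),
    FailureClass.VALIDATION: ("tier checks", "confidence", "status"),
    FailureClass.QUEUE: ("queue", "retries", "stuck"),
}


def route(symptom: str) -> Optional[FailureClass]:
    """Return the first failure class whose keyword appears in the symptom, else None."""
    text = symptom.lower()
    for failure_class, keywords in ROUTING_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return failure_class
    return None
```

A `None` result means the symptom did not match any layer and needs a human to classify it before anyone starts debugging.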

Common Failure Classes

False Positives

Symptoms:

Users receive alerts for trivial or cosmetic changes

Actions:

Verify against the meaningful change spec
Inspect section hashes for instability
Increase suppression thresholds if needed

Principle: Silence is preferable to noise.
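
One way to inspect section hashes for instability is to hash each section across several snapshots taken while no meaningful change occurred, then flag the sections whose hash still varies. This is a sketch under assumptions: the snapshot shape (section id mapped to text), the whitespace-only normalization, and the function names are all hypothetical, and Eko's actual hashing and normalization may differ.

```python
import hashlib
from collections import defaultdict


def section_hash(text: str) -> str:
    """Hash a section after collapsing whitespace (a stand-in for real normalization)."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def unstable_sections(snapshots: list[dict[str, str]]) -> dict[str, int]:
    """Map section id to its number of distinct hashes, keeping only counts above 1."""
    seen: dict[str, set[str]] = defaultdict(set)
    for snapshot in snapshots:
        for section_id, text in snapshot.items():
            seen[section_id].add(section_hash(text))
    return {sid: len(hashes) for sid, hashes in seen.items() if len(hashes) > 1}
```

Sections that show up here while the page is known to be unchanged (timestamps, rotating banners, session tokens) are candidates for stricter normalization or suppression.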

Missed Changes

Symptoms:

Known page changes not surfaced

Actions:

Confirm change persisted between checks
Review normalization rules
Verify cadence alignment
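
A change that appears and is reverted between two checks cannot be detected, so it helps to compare the change window against the actual check times before suspecting normalization. A minimal sketch, assuming the change window and check timestamps are known; `change_was_observable` is a hypothetical helper, not part of Eko.

```python
from datetime import datetime
from typing import Optional, Sequence


def change_was_observable(
    change_start: datetime,
    change_end: Optional[datetime],
    check_times: Sequence[datetime],
) -> bool:
    """True if at least one check ran while the change was live.

    change_end=None means the change is still live.
    """
    for checked_at in check_times:
        started = checked_at >= change_start
        not_reverted = change_end is None or checked_at < change_end
        if started and not_reverted:
            return True
    return False
```

If this returns False, the miss is a cadence issue rather than a detection bug. If it returns True, move on to the normalization rules.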

Worker Failures

Symptoms:

Checks not running
Backlog in queues

Actions:

Inspect queue depth
Restart affected workers
Confirm environment variables and credentials
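
Confirming environment variables can be partly automated by listing the names a worker needs and reporting any that are unset or empty. This is a sketch only: the variable names in the usage note are placeholders, since this playbook does not list the real ones.

```python
import os
from typing import Iterable, Mapping, Optional


def missing_env_vars(
    required: Iterable[str],
    environ: Optional[Mapping[str, str]] = None,
) -> list[str]:
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]
```

For example, `missing_env_vars(["QUEUE_URL", "API_KEY"])` (hypothetical names) returns the subset that the current process cannot see. It checks presence only, not whether a credential is still valid.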

Fair-Use Violations

Symptoms:

Summaries contain excessive quotation
Output reads as a replacement for the source page

Actions:

Immediately halt affected summaries
Review summarization prompts
Reduce excerpt limits

Priority: Always treat as SEV-1
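
Excessive quotation can be estimated by measuring how much of a summary is copied verbatim from the source. The sketch below uses word n-gram overlap as a rough proxy. The window size `n`, the `limit` value, and both function names are assumptions for illustration, not Eko's actual excerpt limits or a legal test of fair use.

```python
def quoted_fraction(summary: str, source: str, n: int = 5) -> float:
    """Fraction of the summary's n-word windows that appear verbatim in the source."""
    summary_words = summary.lower().split()
    source_words = source.lower().split()
    if len(summary_words) < n:
        return 0.0  # too short to contain an n-word excerpt
    source_ngrams = {
        tuple(source_words[i:i + n]) for i in range(len(source_words) - n + 1)
    }
    windows = [
        tuple(summary_words[i:i + n]) for i in range(len(summary_words) - n + 1)
    ]
    copied = sum(1 for window in windows if window in source_ngrams)
    return copied / len(windows)


def should_halt(summary: str, source: str, limit: float = 0.3) -> bool:
    """True when the copied share exceeds the (hypothetical) excerpt limit."""
    return quoted_fraction(summary, source) > limit
```

A check like this only flags candidates for review. A high score means the summary leans heavily on source wording, which is the cue to halt it and inspect the prompt.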

Communication Guidelines

Be factual and calm
Avoid speculation
Acknowledge uncertainty when present
Never blame monitored sites

User trust is preserved through transparency, not speed alone.

Post-Incident Review

After resolution:

Document what happened
Identify root cause
Note prevention steps
Update runbooks or specs if needed

Incidents are learning opportunities, not failures.

Final Rule

When in doubt:

Protect user trust first, completeness second.

Stop Conditions (Safety First)

Halt Immediately (Hard Stop)

Trigger if any are true:

Summaries violate fair-use constraints at scale
Output risks replacing the source page
Storage or diff corruption is possible
Retry storms threaten platform stability

Action: Pause affected workers and suppress output.

Degrade Gracefully

Use when harm can be contained:

Ingestion unstable → skip downstream extraction
Fact extraction unreliable → hold facts in draft status
Validation suspect → suppress publication, keep facts gated

Resume only after spot-checking correctness.
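
The stop conditions above reduce to a small decision rule: any hard-stop trigger overrides everything else, and otherwise each contained failure maps to its own degradation. A minimal sketch, assuming the triggers are available as boolean flags; the `Signals` fields and the `decide` function are hypothetical names, while the actions mirror the wording of this section.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    HARD_STOP = "pause affected workers and suppress output"
    SKIP_EXTRACTION = "skip downstream extraction"
    HOLD_DRAFT = "hold facts in draft status"
    SUPPRESS_PUBLICATION = "suppress publication, keep facts gated"


@dataclass
class Signals:
    # Hypothetical flags, one per trigger listed in this section.
    fair_use_violation_at_scale: bool = False
    output_replaces_source: bool = False
    corruption_possible: bool = False
    retry_storm: bool = False
    ingestion_unstable: bool = False
    extraction_unreliable: bool = False
    validation_suspect: bool = False


def decide(signals: Signals) -> list[Action]:
    """Hard-stop triggers win; otherwise collect every applicable degradation."""
    hard_stop = (
        signals.fair_use_violation_at_scale
        or signals.output_replaces_source
        or signals.corruption_possible
        or signals.retry_storm
    )
    if hard_stop:
        return [Action.HARD_STOP]
    actions = []
    if signals.ingestion_unstable:
        actions.append(Action.SKIP_EXTRACTION)
    if signals.extraction_unreliable:
        actions.append(Action.HOLD_DRAFT)
    if signals.validation_suspect:
        actions.append(Action.SUPPRESS_PUBLICATION)
    return actions
```

An empty list means no stop condition applies. Resuming after any non-empty result still requires the manual spot-check described above.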

Related Runbooks

Use these runbooks for deeper, component-level response once the incident is classified:

Queue health issues (backlogs, retries, stuck jobs): queue.md
Scheduling issues (cron, dispatch timing): scheduling.md

For V1 subsystem references (tracker, render, summarization), see the archived runbooks at docs/docs_archive/runbooks-v1/.
