Eko — Incident Playbook

This document defines how to identify, triage, and resolve incidents in the Eko system.

An incident is any situation where Eko fails to meet its product guarantees: accuracy, reliability, or trust.


Incident Severity Levels

SEV-1 — Critical

User trust or core functionality is actively compromised.

Examples:

  • False positives flooding users
  • Meaningful changes not being detected across many URLs
  • Summaries violating fair-use constraints
  • System-wide worker failure

Response: Immediate


SEV-2 — Degraded

Partial loss of functionality or quality.

Examples:

  • Delayed checks
  • Increased false positives on a subset of URLs
  • One worker type failing while others continue

Response: Same day


SEV-3 — Minor

Localized or low-impact issues.

Examples:

  • Single URL behaving unexpectedly
  • Intermittent fetch failures
  • Cosmetic admin issues

Response: Next scheduled maintenance window


Incident Triage Checklist

When an incident is reported or detected:

  1. Confirm scope

    • Single URL or many?
    • One user or systemic?

  2. Classify severity (SEV-1 / SEV-2 / SEV-3)

  3. Identify failure class

    • Ingestion (fetch, clustering, images)
    • Fact extraction (AI output, schema, cost)
    • Validation (tier checks, confidence, status)
    • Queue / scheduling

  4. Stabilize first

    • Pause affected jobs if needed
    • Prefer suppression over noisy output

Fast Diagnostic Routing

Before deep investigation, ask which one of these questions applies:

  • Is the input wrong? (fetch, clustering, images) → Ingestion worker
  • Are facts wrong? (AI output, schema, cost) → Fact extraction worker
  • Is verification wrong? (tier checks, confidence, status) → Validation worker
  • Are jobs not flowing? (queues, retries, stuck) → Queue

This prevents debugging the wrong layer.
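
The routing list above can be written down as a lookup table so that an on-call script or alert handler points at the same layer every time. This is a minimal sketch: the symptom tags are hypothetical names derived from the parentheticals in the list, and `route` is an illustrative helper, not part of Eko.

```python
from enum import Enum


class Layer(Enum):
    INGESTION = "Ingestion worker"
    FACT_EXTRACTION = "Fact extraction worker"
    VALIDATION = "Validation worker"
    QUEUE = "Queue"


# Hypothetical symptom tags, one per keyword in the routing list.
SYMPTOM_TO_LAYER = {
    "fetch": Layer.INGESTION,
    "clustering": Layer.INGESTION,
    "images": Layer.INGESTION,
    "ai_output": Layer.FACT_EXTRACTION,
    "schema": Layer.FACT_EXTRACTION,
    "cost": Layer.FACT_EXTRACTION,
    "tier_checks": Layer.VALIDATION,
    "confidence": Layer.VALIDATION,
    "status": Layer.VALIDATION,
    "queues": Layer.QUEUE,
    "retries": Layer.QUEUE,
    "stuck": Layer.QUEUE,
}


def route(symptom: str) -> Layer:
    """Return the layer to investigate first for a symptom tag."""
    try:
        return SYMPTOM_TO_LAYER[symptom]
    except KeyError:
        # An unknown tag is reported rather than guessed at.
        raise ValueError(f"unknown symptom tag: {symptom!r}") from None
```

For example, `route("schema")` returns `Layer.FACT_EXTRACTION`, and an unrecognized tag raises `ValueError` instead of silently picking a layer.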


Common Failure Classes

False Positives

Symptoms:

  • Users receive alerts for trivial or cosmetic changes

Actions:

  • Verify against the meaningful change spec
  • Inspect section hashes for instability
  • Increase suppression thresholds if needed

Principle: Silence is preferable to noise.
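
One way to inspect section hashes for instability is to hash each section across several fetches taken close together, when no real change is expected, and flag any section whose hash differs. The sketch below assumes a snapshot is a mapping from a section id to its text, and that hashing uses SHA-256 over whitespace-collapsed text; both are illustrative assumptions, not a description of how Eko computes hashes.

```python
import hashlib


def section_hash(text: str) -> str:
    """Hash a section after collapsing runs of whitespace (assumed normalization)."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def unstable_sections(snapshots: list[dict[str, str]]) -> set[str]:
    """Return ids of sections whose hash differs across the given snapshots."""
    seen: dict[str, set[str]] = {}
    for snapshot in snapshots:
        for section_id, text in snapshot.items():
            seen.setdefault(section_id, set()).add(section_hash(text))
    return {section_id for section_id, hashes in seen.items() if len(hashes) > 1}
```

Sections that show up here on back-to-back fetches (rotating ads, timestamps, session tokens) are candidates for suppression or stricter normalization.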


Missed Changes

Symptoms:

  • Known page changes not surfaced

Actions:

  • Confirm the change persisted long enough to span at least one check
  • Review normalization rules
  • Verify that the check cadence matches how often the page changes
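
The persistence and cadence checks above come down to one question: did any check run while the change was live? A minimal sketch, assuming the check timestamps and the time the change appeared (and, if known, reverted) are available; the function and parameter names are hypothetical.

```python
from datetime import datetime
from typing import Iterable, Optional


def change_was_observable(
    check_times: Iterable[datetime],
    changed_at: datetime,
    reverted_at: Optional[datetime] = None,
) -> bool:
    """True if at least one check ran while the change was live.

    A change that appeared and was reverted between two consecutive
    checks could not have been detected, whatever the diff logic does.
    """
    return any(
        t >= changed_at and (reverted_at is None or t < reverted_at)
        for t in check_times
    )
```

If this returns False, the miss is a cadence problem rather than a detection problem.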

Worker Failures

Symptoms:

  • Checks not running
  • Backlog in queues

Actions:

  • Inspect queue depth
  • Restart affected workers
  • Confirm environment variables and credentials
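
Confirming environment variables can be done with a short check run inside the worker's environment before restarting it. The variable names below are placeholders, not Eko's actual configuration; substitute the names the affected worker really needs.

```python
import os
from typing import Iterable, Mapping, Optional

# Placeholder names for illustration only.
REQUIRED_VARS = ("EKO_QUEUE_URL", "EKO_API_KEY")


def missing_env_vars(
    required: Iterable[str],
    environ: Optional[Mapping[str, str]] = None,
) -> list[str]:
    """Return required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars(REQUIRED_VARS)
    if missing:
        print("Missing or empty:", ", ".join(missing))
    else:
        print("All required variables are set.")
```

Empty values are reported as missing on purpose, since a blank credential usually fails the same way as an absent one.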

Fair-Use Violations

Symptoms:

  • Summaries contain excessive quotation
  • Output could substitute for reading the source page

Actions:

  • Immediately halt affected summaries
  • Review summarization prompts
  • Reduce excerpt limits

Priority: Always treat as SEV-1
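
To put a number on "excessive quotation" while reviewing prompts and excerpt limits, one rough signal is the share of summary words that sit inside runs copied verbatim from the source. The sketch below uses word n-gram overlap with an assumed run length of 5 words. It is a triage heuristic with hypothetical names, not a fair-use test and not Eko's actual measurement.

```python
def verbatim_fraction(summary: str, source: str, n: int = 5) -> float:
    """Fraction of summary words covered by n-word runs found verbatim in source."""
    summary_words = summary.lower().split()
    source_words = source.lower().split()
    if len(summary_words) < n:
        return 0.0

    # Every n-word window that occurs in the source.
    source_ngrams = {
        tuple(source_words[i:i + n])
        for i in range(len(source_words) - n + 1)
    }

    # Mark each summary word that falls inside a matching window.
    covered = [False] * len(summary_words)
    for i in range(len(summary_words) - n + 1):
        if tuple(summary_words[i:i + n]) in source_ngrams:
            for j in range(i, i + n):
                covered[j] = True

    return sum(covered) / len(summary_words)
```

A value near 1.0 means the summary is mostly copied text. Any cutoff for halting a summary is a policy decision and is not defined here.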


Communication Guidelines

  • Be factual and calm
  • Avoid speculation
  • Acknowledge uncertainty when present
  • Never blame monitored sites

User trust is preserved through transparency, not speed alone.


Post-Incident Review

After resolution:

  • Document what happened
  • Identify root cause
  • Note prevention steps
  • Update runbooks or specs if needed

Incidents are learning opportunities, not failures.


Final Rule

When in doubt:

Protect user trust first, completeness second.


Stop Conditions (Safety First)

Halt Immediately (Hard Stop)

Trigger if any are true:

  • Summaries violate fair-use constraints at scale
  • Output risks replacing the source page
  • Storage or diff corruption is possible
  • Retry storms threaten platform stability

Action: Pause affected workers and suppress output.

Degrade Gracefully

Use when harm can be contained:

  • Ingestion unstable → skip downstream extraction
  • Fact extraction unreliable → hold facts in draft status
  • Validation suspect → suppress publication, keep facts gated

Resume only after spot-checking correctness.
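
The two stop modes above can be sketched as one decision function: any hard-stop trigger wins, otherwise each unstable stage maps to its containment action. The signal names are hypothetical tags for the conditions listed in this section.

```python
# Hypothetical tags for the hard-stop triggers listed above.
HARD_STOP_TRIGGERS = frozenset({
    "fair_use_at_scale",
    "page_replacement_risk",
    "storage_or_diff_corruption",
    "retry_storm",
})

# Hypothetical tags for unstable stages, mapped to the containment actions above.
DEGRADE_ACTIONS = {
    "ingestion": "skip downstream extraction",
    "fact_extraction": "hold facts in draft status",
    "validation": "suppress publication, keep facts gated",
}


def plan_response(signals: set[str]) -> list[str]:
    """Return the actions to take for the observed signals."""
    if signals & HARD_STOP_TRIGGERS:
        # A hard stop overrides any graceful degradation.
        return ["pause affected workers", "suppress output"]
    return [DEGRADE_ACTIONS[s] for s in sorted(signals) if s in DEGRADE_ACTIONS]
```

For example, `plan_response({"retry_storm", "ingestion"})` returns the hard-stop actions, because a single hard-stop trigger takes precedence over containment.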


Related Runbooks

Once the incident is classified, use these runbooks for deeper, component-level response:

  • Queue health issues (backlogs, retries, stuck jobs): queue.md
  • Scheduling issues (cron, dispatch timing): scheduling.md

For V1 subsystem references (tracker, render, summarization), see the archived runbooks at docs/docs_archive/runbooks-v1/.