{{TITLE}}

Operational runbook for diagnosing and resolving issues with [component].


Overview

Brief description of what this runbook covers and when to use it.

Owner: [Agent or team responsible]

Escalation Path: [Who to contact if this runbook doesn't resolve the issue]


Quick Reference

Metric            Healthy     Warning       Critical
Example metric    < 100 ms    100-500 ms    > 500 ms
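
As a rough illustration of how the thresholds in the example row could be applied in a script, the sketch below maps a latency reading to one of the three states. The function name `classify_latency` and its parameters are hypothetical, and the boundary handling (exactly 100 ms and exactly 500 ms both count as Warning) is an assumption read off the example ranges; adjust it to the real metric.

```python
def classify_latency(latency_ms: float,
                     warning_at_ms: float = 100.0,
                     critical_above_ms: float = 500.0) -> str:
    """Map a latency reading to Healthy / Warning / Critical.

    Defaults mirror the example row: < 100 ms is healthy,
    100-500 ms is a warning, > 500 ms is critical.
    (Hypothetical helper; thresholds are placeholders.)
    """
    if latency_ms < 0:
        raise ValueError("latency must be non-negative")
    if latency_ms < warning_at_ms:
        return "Healthy"
    if latency_ms <= critical_above_ms:
        return "Warning"
    return "Critical"


if __name__ == "__main__":
    for sample in (42.0, 100.0, 500.0, 750.0):
        print(f"{sample:>6.1f} ms -> {classify_latency(sample)}")
```

Keeping the thresholds as parameters means the same helper can be reused for other rows of the table once real metrics are filled in.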

Diagnostic Decision Tree

Is the service responding?
├── No → Check if container is running
│   ├── Container down → Restart container
│   └── Container up → Check logs for errors
└── Yes → Check response times
    ├── Slow → Investigate database/external dependencies
    └── Normal → Check for specific error patterns
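
The tree above can also be written as a small function, which may be handy if the first triage step is ever scripted. This is only a sketch: `next_step` and its three boolean inputs are hypothetical names, and how each input is actually measured (health check, container status, latency threshold) is left to the component this runbook is filled in for.

```python
def next_step(responding: bool,
              container_running: bool = True,
              response_slow: bool = False) -> str:
    """Return the next diagnostic action from the decision tree.

    The inputs are assumed to be gathered elsewhere; this function
    only encodes the branching, not the checks themselves.
    """
    if not responding:
        # Service is not responding: look at the container first.
        if not container_running:
            return "Restart container"
        return "Check logs for errors"
    # Service is responding: look at response times.
    if response_slow:
        return "Investigate database/external dependencies"
    return "Check for specific error patterns"


if __name__ == "__main__":
    print(next_step(responding=False, container_running=False))
    print(next_step(responding=True, response_slow=True))
```

Each return value is the leaf text from the tree, so the function and the diagram stay easy to compare when either one changes.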

Common Issues

Issue 1: [Problem Description]

Symptoms:

  • Observable symptom 1
  • Observable symptom 2

Cause: Root cause explanation

Resolution:

  1. Step 1
  2. Step 2
  3. Step 3

Prevention: How to prevent this in the future


Issue 2: [Problem Description]

Symptoms:

  • Observable symptom 1

Cause: Root cause explanation

Resolution:

  1. Step 1
  2. Step 2

Monitoring & Alerts

Alert         Threshold    Action
Alert name    Condition    What to do

Recovery Procedures

Full Service Restart

# Commands to restart the service

Rollback Procedure

# Commands to roll back to the previous version

Post-Incident Checklist

  • Confirm service is healthy
  • Verify no data loss
  • Update monitoring if needed
  • Document any new learnings
  • Schedule follow-up if root cause unclear