{{TITLE}}
Operational runbook for diagnosing and resolving issues with [component].
Overview
Brief description of what this runbook covers and when to use it.
Owner: [Agent or team responsible]
Escalation Path: [Who to contact if this runbook doesn't resolve the issue]
Quick Reference
| Metric | Healthy | Warning | Critical |
|---|---|---|---|
| Example metric | < 100ms | 100-500ms | > 500ms |
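
The threshold bands in the example row can be expressed as a small classifier, which helps when the same cut-offs feed both dashboards and alert rules. This is a minimal sketch, not part of the template: the function name `classify_latency` is hypothetical, and the boundary handling (100 ms and 500 ms both counted as Warning) is an assumption read off the example row.

```python
def classify_latency(latency_ms: float) -> str:
    """Map a latency reading in milliseconds to a status band.

    Bands follow the example row (assumed boundaries):
    healthy < 100 ms, warning 100-500 ms inclusive, critical > 500 ms.
    """
    if latency_ms < 0:
        raise ValueError("latency must be non-negative")
    if latency_ms < 100:
        return "healthy"
    if latency_ms <= 500:
        return "warning"
    return "critical"


if __name__ == "__main__":
    # Print the band for a few sample readings.
    for sample in (42, 100, 500, 750):
        print(sample, classify_latency(sample))
```

Replace the cut-offs with the real values for your component once the table is filled in.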
Diagnostic Decision Tree
Is the service responding?
├── No → Check if container is running
│   ├── Container down → Restart container
│   └── Container up → Check logs for errors
└── Yes → Check response times
    ├── Slow → Investigate database/external dependencies
    └── Normal → Check for specific error patterns
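
The tree above can also be encoded as a function so that tooling or an on-call bot can suggest the next action. The sketch below is illustrative only: `next_step` and its three boolean inputs are hypothetical names, and how each input is actually measured (health check, container status, latency probe) is left to the component being documented.

```python
def next_step(responding: bool, container_running: bool = True, slow: bool = False) -> str:
    """Return the next action from the diagnostic decision tree.

    responding        -- does the service answer requests?
    container_running -- only consulted when the service is not responding
    slow              -- only consulted when the service is responding
    """
    if not responding:
        if not container_running:
            return "Restart container"
        return "Check logs for errors"
    if slow:
        return "Investigate database/external dependencies"
    return "Check for specific error patterns"


if __name__ == "__main__":
    # Service is down and the container is not running.
    print(next_step(responding=False, container_running=False))
```

Keeping the branches in the same order as the tree makes it easy to update both together when a new branch is added.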
Common Issues
Issue 1: [Problem Description]
Symptoms:
- Observable symptom 1
- Observable symptom 2
Cause: Root cause explanation
Resolution:
- Step 1
- Step 2
- Step 3
Prevention: How to prevent this in the future
Issue 2: [Problem Description]
Symptoms:
- Observable symptom 1
Cause: Root cause explanation
Resolution:
- Step 1
- Step 2
Monitoring & Alerts
| Alert | Threshold | Action |
|---|---|---|
| Alert name | Condition | What to do |
Recovery Procedures
Full Service Restart
# Commands to restart the service
Rollback Procedure
# Commands to rollback
Post-Incident Checklist
- Confirm service is healthy
- Verify no data loss
- Update monitoring if needed
- Document any new learnings
- Schedule a follow-up if the root cause is unclear