Queue Runbook
Purpose: Keep workers flowing; resolve backlogs, retries, and stuck jobs.
First Diagnostic Question
Are jobs not flowing? (backlogs, retries, stuck leases)
If yes, resolve queue health before touching worker logic.
Queues
| Queue | Purpose |
|---|---|
ingest | News ingestion jobs (fetch, clustering, images) |
facts | Fact extraction jobs (AI extraction, evergreen) |
validate | Fact validation jobs (multi-tier verification) |
Normal Operations
- Enforce per-URL concurrency = 1
- Apply exponential backoff on transient failures
- Jobs should complete within timeout budget
Diagnostic Decision Tree
Queue issue suspected
│
├─ Jobs not starting?
│ ├─ Worker not running → Check deployment, restart worker
│ ├─ No jobs enqueued → Scheduler issue, check cron
│ └─ Jobs visible but not picked up → Worker connection issue
│
├─ Jobs stuck/not completing?
│ ├─ Timeout exceeded → Job taking too long, investigate worker
│ ├─ Lease not released → Worker crashed, needs cleanup
│ └─ Infinite retry loop → Permanent failure not handled
│
├─ Backlog growing?
│ ├─ Enqueue rate > drain rate → Scale workers or throttle intake
│ ├─ Jobs failing and retrying → Fix underlying failure
│ └─ Burst of new URLs → Expected, monitor drain rate
│
└─ Retry storms?
├─ Same job retrying rapidly → Backoff not applied
├─ Many jobs retrying → Systemic failure, pause and investigate
└─ External dependency down → Wait for recovery, pause retries
Common Failure Scenarios
Worker Not Processing
Symptoms:
- Queue depth increasing
- No job completions logged
- Worker appears idle
Actions:
- Verify worker process is running
- Check worker logs for errors
- Verify database/queue connection
- Restart worker if unresponsive
- Check environment variables and credentials
Stuck Jobs (Lease Not Released)
Symptoms:
- Jobs show as "in progress" but not completing
- Same job stuck for extended period
- No progress in worker logs
Actions:
- Check if worker crashed mid-job
- Release stuck leases manually if safe
- Mark job for retry or dead-letter
- Investigate cause of worker crash
- Add better timeout handling if needed
Retry Storms
Symptoms:
- Same jobs retrying rapidly
- High retry count on multiple jobs
- System load increasing
Actions:
- Identify root cause of failures
- Pause enqueues if storm threatens stability
- Verify exponential backoff is applied
- Cap maximum retries
- Dead-letter persistently failing jobs
Backlog Buildup
Symptoms:
- Queue depth growing over time
- Jobs completing but not fast enough
- User-visible delays
Actions:
- Identify if enqueue rate spiked (new users, bulk imports)
- Check if drain rate decreased (slow jobs, fewer workers)
- Scale workers if needed
- Throttle low-priority cadences
- Monitor until backlog clears
Stop Conditions
Hard Stop
Trigger immediately if any are true:
- Retry storms threaten system stability
- Queue depth grows faster than drain rate for extended period
Action: Pause enqueues and cap retries.
Degrade Mode
- Throttle low-priority cadences
- Allow high-priority jobs only
- Extend job timeouts if needed
Resume once backlog stabilizes.
Signals to Watch
| Signal | Indicates |
|---|---|
| Retry storms | Systemic failure or missing backoff |
| Queue depth vs cadence mismatch | Capacity issue |
| Job duration increasing | Worker performance issue |
| Lease expiration rate | Worker crashes or timeouts |
Monitoring Queries
Check queue depths
SELECT queue_name, status, COUNT(*) as job_count
FROM jobs
GROUP BY queue_name, status
ORDER BY queue_name, status;
Find stuck jobs (in progress too long)
SELECT id, queue_name, url_id, started_at,
NOW() - started_at as duration
FROM jobs
WHERE status = 'in_progress'
AND started_at < NOW() - INTERVAL '10 minutes'
ORDER BY started_at;
Check retry patterns
SELECT url_id, queue_name, retry_count, COUNT(*) as jobs
FROM jobs
WHERE retry_count > 0
AND created_at > NOW() - INTERVAL '24 hours'
GROUP BY url_id, queue_name, retry_count
ORDER BY retry_count DESC
LIMIT 20;
Queue throughput (last hour)
SELECT queue_name,
COUNT(*) FILTER (WHERE status = 'completed') as completed,
COUNT(*) FILTER (WHERE status = 'failed') as failed,
AVG(EXTRACT(EPOCH FROM (completed_at - started_at))) as avg_duration_sec
FROM jobs
WHERE completed_at > NOW() - INTERVAL '1 hour'
GROUP BY queue_name;
Related Runbooks
- Incident Playbook - Master triage
- Scheduling - If jobs aren't being enqueued on time