Queue Runbook

Purpose: Keep workers flowing; resolve backlogs, retries, and stuck jobs.


First Diagnostic Question

Are jobs not flowing? (backlogs, retries, stuck leases)

If yes, resolve queue health before touching worker logic.


Queues

Queue       Purpose
ingest      News ingestion jobs (fetch, clustering, images)
facts       Fact extraction jobs (AI extraction, evergreen)
validate    Fact validation jobs (multi-tier verification)

Normal Operations

  • Enforce per-URL concurrency = 1
  • Apply exponential backoff on transient failures
  • Jobs should complete within their timeout budget
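The backoff rule above can be sketched as follows. This is a minimal illustration, not this system's implementation; the base delay, cap, and use of full jitter are assumptions.

```python
import random

def backoff_delay(retry_count: int,
                  base_sec: float = 2.0,
                  cap_sec: float = 300.0) -> float:
    """Exponential backoff with full jitter: the delay ceiling doubles
    per retry up to a cap, then a uniform random value below the ceiling
    is taken so simultaneous retries spread out instead of thundering."""
    ceiling = min(cap_sec, base_sec * (2 ** retry_count))
    return random.uniform(0, ceiling)
```

Jitter matters here: without it, jobs that failed together retry together, which is one way retry storms start.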

Diagnostic Decision Tree

Queue issue suspected
    │
    ├─ Jobs not starting?
    │   ├─ Worker not running → Check deployment, restart worker
    │   ├─ No jobs enqueued → Scheduler issue, check cron
    │   └─ Jobs visible but not picked up → Worker connection issue
    │
    ├─ Jobs stuck/not completing?
    │   ├─ Timeout exceeded → Job taking too long, investigate worker
    │   ├─ Lease not released → Worker crashed, needs cleanup
    │   └─ Infinite retry loop → Permanent failure not handled
    │
    ├─ Backlog growing?
    │   ├─ Enqueue rate > drain rate → Scale workers or throttle intake
    │   ├─ Jobs failing and retrying → Fix underlying failure
    │   └─ Burst of new URLs → Expected, monitor drain rate
    │
    └─ Retry storms?
        ├─ Same job retrying rapidly → Backoff not applied
        ├─ Many jobs retrying → Systemic failure, pause and investigate
        └─ External dependency down → Wait for recovery, pause retries

Common Failure Scenarios

Worker Not Processing

Symptoms:

  • Queue depth increasing
  • No job completions logged
  • Worker appears idle

Actions:

  1. Verify worker process is running
  2. Check worker logs for errors
  3. Verify database/queue connection
  4. Restart worker if unresponsive
  5. Check environment variables and credentials

Stuck Jobs (Lease Not Released)

Symptoms:

  • Jobs show as "in progress" but not completing
  • Same job stuck for an extended period
  • No progress in worker logs

Actions:

  1. Check if worker crashed mid-job
  2. Release stuck leases manually if safe
  3. Mark job for retry or dead-letter
  4. Investigate cause of worker crash
  5. Add better timeout handling if needed
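Steps 2 and 3 above can be sketched as a lease reaper. Field names, the 10-minute lease timeout, and the 5-retry cap are assumptions for illustration, not values confirmed by this system.

```python
import datetime as dt

LEASE_TIMEOUT = dt.timedelta(minutes=10)  # assumed threshold
MAX_RETRIES = 5                           # assumed retry cap

def find_expired_leases(jobs, now):
    """Return jobs still marked in_progress whose lease started longer
    ago than LEASE_TIMEOUT — candidates for manual release."""
    return [j for j in jobs
            if j["status"] == "in_progress"
            and now - j["started_at"] > LEASE_TIMEOUT]

def release(job):
    """Requeue a stuck job for retry, or dead-letter it once the
    retry cap is reached so it stops consuming worker capacity."""
    if job["retry_count"] >= MAX_RETRIES:
        job["status"] = "dead_letter"
    else:
        job["status"] = "pending"
        job["retry_count"] += 1
    return job
```

Only release leases this way when you are confident the original worker is dead; otherwise two workers may process the same job.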

Retry Storms

Symptoms:

  • Same jobs retrying rapidly
  • High retry count on multiple jobs
  • System load increasing

Actions:

  1. Identify root cause of failures
  2. Pause enqueues if storm threatens stability
  3. Verify exponential backoff is applied
  4. Cap maximum retries
  5. Dead-letter persistently failing jobs
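A storm detector for step 2 might look like the sketch below: count retry events in a trailing window and signal a pause when the count crosses a threshold. The window and threshold values are assumptions to tune against your normal retry rate.

```python
def detect_retry_storm(retry_times, window_sec, now, threshold):
    """Return True if the number of retries in the trailing window
    exceeds the threshold — the signal to pause enqueues."""
    recent = [t for t in retry_times if now - t <= window_sec]
    return len(recent) >= threshold
```

Wiring this to an alert (rather than an automatic pause) is the safer first step, since a burst of legitimate retries after a brief outage can look like a storm.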

Backlog Buildup

Symptoms:

  • Queue depth growing over time
  • Jobs completing but not fast enough
  • User-visible delays

Actions:

  1. Identify if enqueue rate spiked (new users, bulk imports)
  2. Check if drain rate decreased (slow jobs, fewer workers)
  3. Scale workers if needed
  4. Throttle low-priority cadences
  5. Monitor until backlog clears
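Steps 1–3 above reduce to comparing enqueue rate against drain rate. A minimal projection, with all rates and the horizon as inputs you would measure yourself:

```python
def backlog_trend(enqueued_per_min, completed_per_min, depth_now, horizon_min):
    """Project queue depth over a horizon. A positive net rate means the
    backlog grows: scale workers or throttle intake. Depth is floored
    at zero since a queue cannot go negative."""
    net = enqueued_per_min - completed_per_min
    projected = max(0, depth_now + net * horizon_min)
    return net, projected
```

For example, enqueueing 100 jobs/min while draining 80 jobs/min from a depth of 500 projects to 1100 jobs after 30 minutes.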

Stop Conditions

Hard Stop

Trigger immediately if any are true:

  • Retry storms threaten system stability
  • Queue depth grows faster than it drains for an extended period

Action: Pause enqueues and cap retries.

Degrade Mode

  • Throttle low-priority cadences
  • Allow high-priority jobs only
  • Extend job timeouts if needed

Resume once backlog stabilizes.
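The admission rule for degrade mode can be sketched as a simple gate; the `priority` field and its values are assumptions about how jobs are labeled here.

```python
def admit(job: dict, degrade_mode: bool) -> bool:
    """Gate at enqueue time: in degrade mode, only high-priority jobs
    are admitted; in normal operation, everything passes."""
    if not degrade_mode:
        return True
    return job.get("priority") == "high"
```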


Signals to Watch

Signal                             Indicates
Retry storms                       Systemic failure or missing backoff
Queue depth vs cadence mismatch    Capacity issue
Job duration increasing            Worker performance issue
Lease expiration rate              Worker crashes or timeouts

Monitoring Queries

Check queue depths

SELECT queue_name, status, COUNT(*) as job_count
FROM jobs
GROUP BY queue_name, status
ORDER BY queue_name, status;

Find stuck jobs (in progress too long)

SELECT id, queue_name, url_id, started_at,
       NOW() - started_at as duration
FROM jobs
WHERE status = 'in_progress'
  AND started_at < NOW() - INTERVAL '10 minutes'
ORDER BY started_at;

Check retry patterns

SELECT url_id, queue_name, retry_count, COUNT(*) as jobs
FROM jobs
WHERE retry_count > 0
  AND created_at > NOW() - INTERVAL '24 hours'
GROUP BY url_id, queue_name, retry_count
ORDER BY retry_count DESC
LIMIT 20;

Queue throughput (last hour)

SELECT queue_name,
       COUNT(*) FILTER (WHERE status = 'completed') as completed,
       COUNT(*) FILTER (WHERE status = 'failed') as failed,
       AVG(EXTRACT(EPOCH FROM (completed_at - started_at))) as avg_duration_sec
FROM jobs
WHERE completed_at > NOW() - INTERVAL '1 hour'
GROUP BY queue_name;