Scheduling Runbook

Purpose: Keep checks firing on time; resolve drift, missed windows, and dispatch failures.


First Diagnostic Question

Are checks not happening on time? (missed windows, drift, dispatch failures)

If yes, resolve scheduling issues before investigating tracker or queue problems.


Components

Component          Location                                    Purpose
Cron trigger       Better Stack Heartbeat                      Calls enqueue endpoint hourly
Cron endpoint      apps/web/app/api/enqueue-due-urls/route.ts  Queries due URLs and enqueues jobs
Ingest worker      apps/worker-ingest/src/index.ts             Processes news ingestion queue messages
Schedule metadata  tracked_urls.next_check_at                  Determines when URL is due

Better Stack Heartbeat Setup

The URL enqueue scheduler is triggered by Better Stack Heartbeats (not Vercel CRON).

Configuration

  1. Log into Better Stack → Heartbeats → Create Heartbeat

  2. Settings:

    Field         Value
    Name          eko-enqueue-due-urls
    URL           https://eko.day/api/enqueue-due-urls
    Method        GET
    Schedule      Every 1 hour
    Grace period  5 minutes
  3. Headers:

    Authorization: Bearer $CRON_SECRET
    

    (Use the same CRON_SECRET value from Vercel environment variables)

  4. Expected response: HTTP 200 with JSON body

  5. Alerting: Configure alerts for missed heartbeats
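On the endpoint side, the header from step 3 has to match the configured secret exactly. The following is a minimal sketch of such a check, not the actual route handler: the function name isAuthorized and its signature are hypothetical, and the real code in route.ts may validate the token differently.

```typescript
// Hypothetical helper: compares an incoming Authorization header against
// the expected "Bearer <CRON_SECRET>" value.
function isAuthorized(authHeader: string | null, cronSecret: string): boolean {
  // Reject a missing header, and reject an unset secret so that an
  // empty CRON_SECRET can never authorize a request.
  if (!authHeader || cronSecret.length === 0) return false;

  const expected = `Bearer ${cronSecret}`;
  if (authHeader.length !== expected.length) return false;

  // Compare every character instead of returning at the first mismatch,
  // so response timing does not reveal where the values differ.
  let diff = 0;
  for (let i = 0; i < expected.length; i++) {
    diff |= authHeader.charCodeAt(i) ^ expected.charCodeAt(i);
  }
  return diff === 0;
}
```

A request that fails this kind of check would be answered with HTTP 401, which Better Stack records as a failed heartbeat.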

Benefits over Vercel CRON

  • Failure alerting: Notified if endpoint returns non-2xx
  • Execution history: Dashboard shows success/failure timeline
  • Retry logic: Automatic retries on transient failures
  • Platform-agnostic: Works if you migrate off Vercel

Invariants

  • Batch dispatch: Due URLs queried in batches (currently 100 at a time)
  • Idempotency: URL skipped if already checked today (UTC)
  • Deterministic advancement: next_check_at advances based on check_frequency after job completion
  • Authorization: Cron endpoint requires bearer token (CRON_SECRET)

Source of truth: STACK.md
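
The deterministic-advancement invariant can be illustrated with a small sketch. It assumes that check_frequency takes the values "daily" and "weekly" and that the next check is computed from the job completion time; the type and function names are hypothetical, and STACK.md remains the authority on the real rule.

```typescript
// Assumed set of check_frequency values; the real schema may define others.
type CheckFrequency = "daily" | "weekly";

const INTERVAL_MS: Record<CheckFrequency, number> = {
  daily: 24 * 60 * 60 * 1000,
  weekly: 7 * 24 * 60 * 60 * 1000,
};

// Hypothetical helper: the same completion time and frequency always
// produce the same next_check_at, which is what makes advancement deterministic.
function computeNextCheckAt(completedAt: Date, frequency: CheckFrequency): Date {
  return new Date(completedAt.getTime() + INTERVAL_MS[frequency]);
}
```

For example, a daily URL whose job completes at 2024-01-01T10:00:00Z would next be due at 2024-01-02T10:00:00Z under these assumptions.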


Diagnostic Decision Tree

Scheduling issue suspected
    │
    ├─ Checks not starting?
    │   ├─ Cron not triggering → Check cron job config, verify CRON_SECRET
    │   ├─ Endpoint returning 401 → Authorization header missing or wrong
    │   └─ No due URLs found → Check next_check_at values in DB
    │
    ├─ Jobs enqueued but not processed?
    │   └─ See [Queue Runbook](./queue.md)
    │
    ├─ Schedule drift accumulating?
    │   ├─ next_check_at not advancing → Worker not updating after completion
    │   ├─ Jobs completing but late → Processing backlog
    │   └─ Frequency mismatch → check_frequency not matching expectations
    │
    └─ Duplicate checks happening?
        ├─ Idempotency guard failing → hasCheckedToday returning false incorrectly
        └─ Multiple cron triggers → Cron running too frequently
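
When the idempotency guard is suspected, it helps to recall what a UTC same-day check looks like. The sketch below is an assumption about the shape of hasCheckedToday, not its actual implementation: the parameters and the utcDay helper are hypothetical. A guard that compares dates in local time instead of UTC is one way it could return false incorrectly around midnight.

```typescript
// Hypothetical helper: calendar day of a timestamp in UTC, as "YYYY-MM-DD".
function utcDay(d: Date): string {
  return d.toISOString().slice(0, 10);
}

// Assumed signature: true when the last check falls on the same UTC day as `now`.
function hasCheckedToday(lastCheckedAt: Date | null, now: Date): boolean {
  if (lastCheckedAt === null) return false; // never checked
  return utcDay(lastCheckedAt) === utcDay(now);
}
```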

Common Failure Scenarios

Cron Not Triggering

Symptoms:

  • Queue depth at zero
  • No recent enqueue-due-urls logs
  • URLs overdue but not being checked

Actions:

  1. Check Better Stack Heartbeats dashboard for failures
  2. Verify heartbeat is configured and active
  3. Verify CRON_SECRET environment variable matches Better Stack header
  4. Manually trigger: curl -X GET -H "Authorization: Bearer $CRON_SECRET" https://eko.day/api/enqueue-due-urls
  5. Check Vercel function logs for 401/500 errors
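
The status code from the manual trigger in step 4 narrows down the cause. The mapping below is a rough sketch of how to read it, based only on the actions listed in this runbook; the function name diagnoseEnqueueStatus is hypothetical and the endpoint may return other codes.

```typescript
// Hypothetical helper: maps an HTTP status from the enqueue endpoint
// to the next thing to look at.
function diagnoseEnqueueStatus(status: number): string {
  if (status >= 200 && status < 300) {
    return "ok: endpoint reachable and authorized; check the heartbeat schedule";
  }
  if (status === 401) {
    return "auth: Authorization header missing or CRON_SECRET mismatch";
  }
  if (status >= 500) {
    return "server: check Vercel function logs and the queue connection";
  }
  return "unexpected: review endpoint logs";
}
```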

Dispatch Failures

Symptoms:

  • Cron endpoint returning errors
  • URLs due but not being enqueued
  • Queue connection errors in logs

Actions:

  1. Check Upstash Redis connection
  2. Verify UPSTASH_REDIS_REST_URL and UPSTASH_REDIS_REST_TOKEN
  3. Check for rate limiting on queue service
  4. Review error logs from enqueue-due-urls endpoint

Schedule Drift

Symptoms:

  • URLs checked later than expected
  • next_check_at values in the past
  • Growing gap between expected and actual check times

Actions:

  1. Check if worker is updating next_check_at after completion
  2. Verify updateNextCheckTime function is being called
  3. Check for processing backlogs (queue depth growing)
  4. Consider reducing batch size if the system is overloaded

Stop Conditions

Hard Stop

Trigger immediately if any are true:

  • Queue depth growing faster than drain rate for an extended period
  • Duplicate checks causing data corruption

Action: Pause cron trigger and investigate.
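
One way to make the first trigger concrete is to sample queue depth at a fixed interval and look for uninterrupted growth. This is only a sketch under that assumption: the function name, the sampling approach, and the threshold are hypothetical and not part of the existing tooling.

```typescript
// Hypothetical helper: returns true if queue depth rose for at least
// `minConsecutive` consecutive samples, i.e. enqueue outpaced drain
// for that whole stretch.
function isSustainedGrowth(depthSamples: number[], minConsecutive: number): boolean {
  let streak = 0;
  for (let i = 1; i < depthSamples.length; i++) {
    // A sample that is not higher than the previous one resets the streak.
    streak = depthSamples[i] > depthSamples[i - 1] ? streak + 1 : 0;
    if (streak >= minConsecutive) return true;
  }
  return false;
}
```

With hourly samples and minConsecutive set to 3, for instance, this would flag a queue that has grown for three hours in a row; the right threshold depends on normal traffic.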

Degrade Mode

  • Reduce batch size (from 100 to smaller number)
  • Throttle low-priority cadences (weekly before daily)
  • Extend cron interval temporarily

Resume once backlog stabilizes.
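
The first two degrade levers can be sketched as a batch-selection function. It assumes check_frequency is "daily" or "weekly" and that due URLs are otherwise served oldest first; the DueUrl shape and selectBatch are hypothetical names, not the endpoint's actual code.

```typescript
// Assumed shape of a due URL row; field names are illustrative.
interface DueUrl {
  id: string;
  checkFrequency: "daily" | "weekly";
  nextCheckAt: Date;
}

// Hypothetical helper: picks up to `batchSize` URLs. In degraded mode,
// weekly URLs are pushed behind daily ones so they are throttled first.
function selectBatch(due: DueUrl[], batchSize: number, degraded: boolean): DueUrl[] {
  const rank = (u: DueUrl): number =>
    degraded && u.checkFrequency === "weekly" ? 1 : 0;
  return [...due] // copy so the caller's array is not reordered
    .sort(
      (a, b) =>
        rank(a) - rank(b) ||
        a.nextCheckAt.getTime() - b.nextCheckAt.getTime(),
    )
    .slice(0, batchSize);
}
```

Lowering batchSize below 100 and passing degraded as true applies both levers in a single run.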


Monitoring Queries

Check overdue tracked URLs

SELECT id, canonical_url, check_frequency, next_check_at,
       NOW() - next_check_at as overdue_by
FROM tracked_urls
WHERE next_check_at < NOW()
  AND is_active = true
ORDER BY next_check_at
LIMIT 20;

Find URLs that haven't been checked recently

SELECT tu.id, tu.canonical_url, tu.check_frequency,
       MAX(uc.checked_at) as last_checked
FROM tracked_urls tu
LEFT JOIN url_checks uc ON tu.id = uc.tracked_url_id
WHERE tu.is_active = true
GROUP BY tu.id
HAVING MAX(uc.checked_at) < NOW() - INTERVAL '2 days'
   OR MAX(uc.checked_at) IS NULL
ORDER BY last_checked NULLS FIRST
LIMIT 20;

Check dispatch lag (time between due and enqueue)

Review logs from the api:enqueue-due-urls component for these messages:

  • "Found due URLs": the count of URLs that were due
  • "Enqueue complete": the enqueued vs. skipped counts
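
Dispatch lag for a single URL is the gap between its next_check_at and the time it was actually enqueued. The sketch below assumes both timestamps are available, for example from the database and the log entry; the function name is hypothetical.

```typescript
// Hypothetical helper: milliseconds between when a URL became due and
// when it was enqueued. A URL enqueued early counts as zero lag.
function dispatchLagMs(nextCheckAt: Date, enqueuedAt: Date): number {
  return Math.max(0, enqueuedAt.getTime() - nextCheckAt.getTime());
}
```

Because the heartbeat fires hourly, a lag of up to roughly one hour is expected for a URL that becomes due just after a run; lag well beyond that points to missed heartbeats or dispatch failures.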