Service Level Expectations

Purpose

This document defines internal service level expectations for Eko operations. These are engineering targets, not customer SLAs. Use them for capacity planning, monitoring alert thresholds, and incident severity classification.


Page Observation Timeliness

| Cadence | Target | Tolerance | Escalation |
| --- | --- | --- | --- |
| Daily | Within 24h of next_check_at | +6h jitter acceptable | P2 if >30h |
| Weekly | Within 7d of next_check_at | +12h jitter acceptable | P2 if >8d |

Implementation Notes

  • next_check_at includes 0-6h of random jitter to prevent a thundering herd
  • One-check-per-day constraint: UNIQUE (page_id, checked_day)
  • Workers poll the queue continuously; a growing backlog indicates a capacity issue
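
The jitter and one-check-per-day notes above can be sketched as follows. This is a minimal TypeScript illustration, not the worker's actual code. The function names, the uniform jitter distribution, the assumption that next_check_at equals the last check time plus the cadence interval plus jitter, and the use of a UTC calendar day for checked_day are all assumptions made for the example.

```typescript
// Sketch only: names and the exact scheduling formula are assumptions.
type Cadence = "daily" | "weekly";

const HOUR_MS = 60 * 60 * 1000;

const BASE_INTERVAL_MS: Record<Cadence, number> = {
  daily: 24 * HOUR_MS,
  weekly: 7 * 24 * HOUR_MS,
};

// Upper bound of the random jitter (0-6h, per the implementation notes).
const MAX_JITTER_MS = 6 * HOUR_MS;

// `random` is injectable so the schedule can be tested deterministically.
function computeNextCheckAt(
  lastCheckedAt: Date,
  cadence: Cadence,
  random: () => number = Math.random,
): Date {
  const jitterMs = Math.floor(random() * MAX_JITTER_MS);
  return new Date(lastCheckedAt.getTime() + BASE_INTERVAL_MS[cadence] + jitterMs);
}

// Value for the checked_day column, assuming days are bucketed in UTC.
function checkedDay(checkedAt: Date): string {
  return checkedAt.toISOString().slice(0, 10);
}
```

Spreading checks across a six-hour window means pages scheduled at the same moment do not all become due at once, which is the thundering-herd case the jitter is meant to avoid.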

Monitoring Signals

-- Pages overdue for check
SELECT COUNT(*) FROM pages
WHERE is_active = TRUE
  AND next_check_at < NOW() - INTERVAL '6 hours';

Notification Delivery

| Mode | Target | Tolerance | Escalation |
| --- | --- | --- | --- |
| Immediate | Within 5 minutes of change detection | 15 min | P2 if >30 min |
| Daily Digest | Before 9:00 AM user timezone | 1 hour | P3 if missed |

Delivery States

| Status | Target Duration |
| --- | --- |
| queued → sending | < 30 seconds |
| sending → sent | < 60 seconds |
| sending → failed | Retry 3x, then mark failed |
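
The delivery states above can be read as a small state machine. The sketch below is illustrative only: the event names and the transition function are hypothetical, and it assumes "Retry 3x" means up to three retries after the first failed attempt.

```typescript
// Sketch only: event names and retry semantics are assumptions.
type DeliveryStatus = "queued" | "sending" | "sent" | "failed";
type DeliveryEvent = "picked_up" | "send_ok" | "send_error";

interface DeliveryState {
  status: DeliveryStatus;
  retries: number; // retries used so far
}

const MAX_RETRIES = 3;

function transition(state: DeliveryState, deliveryEvent: DeliveryEvent): DeliveryState {
  if (state.status === "queued" && deliveryEvent === "picked_up") {
    return { status: "sending", retries: state.retries };
  }
  if (state.status === "sending" && deliveryEvent === "send_ok") {
    return { status: "sent", retries: state.retries };
  }
  if (state.status === "sending" && deliveryEvent === "send_error") {
    // Stay in "sending" while retries remain; otherwise mark as failed.
    return state.retries < MAX_RETRIES
      ? { status: "sending", retries: state.retries + 1 }
      : { status: "failed", retries: state.retries };
  }
  // Terminal states and unexpected events leave the state unchanged.
  return state;
}
```

Keeping the transition logic pure makes it straightforward to unit test separately from the code that actually sends notifications.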

Deduplication

  • A unique constraint on (page_change_event_id, channel) prevents duplicate deliveries
  • If the constraint is violated, skip silently (the notification was already delivered)
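
As a rough illustration of the deduplication rule, the sketch below uses an in-memory set as a stand-in for the unique constraint. The class and function names are hypothetical. In PostgreSQL the same skip-on-conflict behavior is commonly expressed with INSERT ... ON CONFLICT DO NOTHING, though whether Eko does it that way is not stated here.

```typescript
// Sketch only: an in-memory stand-in for UNIQUE (page_change_event_id, channel).
class DeliveryLog {
  private readonly seen = new Set<string>();

  // Returns true if the row was recorded, false if it already existed.
  tryInsert(pageChangeEventId: string, channel: string): boolean {
    const key = JSON.stringify([pageChangeEventId, channel]);
    if (this.seen.has(key)) {
      return false;
    }
    this.seen.add(key);
    return true;
  }
}

// Sends at most once per (event, channel); duplicates are skipped silently.
function notifyOnce(
  log: DeliveryLog,
  pageChangeEventId: string,
  channel: string,
  send: () => void,
): boolean {
  if (!log.tryInsert(pageChangeEventId, channel)) {
    return false;
  }
  send();
  return true;
}
```

Recording the row before sending means a retried job cannot deliver the same change twice on the same channel, while a different channel for the same event is still delivered.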

Monitoring Signals

-- Notifications stuck in queue
SELECT COUNT(*) FROM notification_delivery_log
WHERE status = 'queued'
  AND created_at < NOW() - INTERVAL '5 minutes';

Summary Quality

| Metric | Target | Verification |
| --- | --- | --- |
| Confidence | ≥ 0.7 for published summaries | AI self-reported confidence |
| Fair-use compliance | Non-substitutive | Manual audit (sampling) |
| No hallucination | 0 fabricated facts | User reports, spot checks |
| Meaningful change only | No summary without change | change_detected = TRUE required |

Quality Gates

  1. Pre-summary: Meaningful change detection must pass
  2. Post-summary: Confidence threshold check
  3. Audit: Sample review for fair-use compliance
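
The first two gates are mechanical and can be sketched in code; the third is a manual sampling audit and is not modeled. The sketch below is an assumption-laden illustration: the type and function names and the reason strings are hypothetical, and only the 0.7 threshold and the change_detected requirement come from the table above.

```typescript
// Sketch only: names and reason strings are assumptions.
interface SummaryCandidate {
  changeDetected: boolean; // result of meaningful change detection
  confidence: number;      // AI self-reported confidence, 0 to 1
}

interface GateResult {
  publish: boolean;
  reason?: string;
}

const MIN_CONFIDENCE = 0.7;

function evaluateGates(candidate: SummaryCandidate): GateResult {
  // Gate 1 (pre-summary): no summary without a meaningful change.
  if (!candidate.changeDetected) {
    return { publish: false, reason: "no_meaningful_change" };
  }
  // Gate 2 (post-summary): confidence must meet the threshold.
  if (candidate.confidence < MIN_CONFIDENCE) {
    return { publish: false, reason: "low_confidence" };
  }
  return { publish: true };
}
```

In a real pipeline the first gate would presumably run before the summary is generated at all, so the two checks are combined here only for brevity.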

Non-substitutive Criteria

  • The summary must not reproduce source content verbatim
  • The user must still need to visit the source for full context
  • Focus on the delta, not the full page content

System Availability

| Component | Target Uptime | Degradation Mode |
| --- | --- | --- |
| Web app | 99.5% | Static error page |
| Tracker worker | 99% | Queue backlog grows |
| Render worker | 95% | Fallback to text-only |
| Queue (Upstash) | 99.9% | Managed service |
| Database (Supabase) | 99.9% | Managed service |
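
To make the uptime targets above concrete for capacity planning, they can be converted into a downtime budget. The measurement window is not specified in this document, so the 30-day window below is an assumption, as is the function name.

```typescript
// Sketch only: the 30-day window is an assumption, not a stated policy.
const MINUTES_PER_30_DAYS = 30 * 24 * 60;

// Allowed downtime in minutes for a given uptime target, rounded to 0.1 min.
function allowedDowntimeMinutes(
  uptimePercent: number,
  windowMinutes: number = MINUTES_PER_30_DAYS,
): number {
  const raw = (windowMinutes * (100 - uptimePercent)) / 100;
  return Math.round(raw * 10) / 10;
}
```

Under that assumption, 99.5% allows 216 minutes (3.6 hours) of downtime per 30 days, 99% allows 432 minutes (7.2 hours), 95% allows 2,160 minutes (36 hours), and 99.9% allows 43.2 minutes.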

Worker Health

| Signal | Healthy | Warning | Critical |
| --- | --- | --- | --- |
| Queue depth | < 100 | 100-500 | > 500 |
| Check latency p95 | < 30s | 30-60s | > 60s |
| Error rate | < 1% | 1-5% | > 5% |
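
The worker health thresholds above translate directly into a classification function. The following is a minimal sketch; the signal keys and function names are hypothetical, and treating the warning band as inclusive at both ends (for example, a queue depth of exactly 500 is a warning) is one reading of the table.

```typescript
// Sketch only: names are assumptions; thresholds come from the table above.
type Health = "healthy" | "warning" | "critical";
type SignalName = "queueDepth" | "checkLatencyP95Seconds" | "errorRatePercent";

interface Thresholds {
  warningFrom: number;   // inclusive lower bound of the warning band
  criticalAbove: number; // values strictly above this are critical
}

const THRESHOLDS: Record<SignalName, Thresholds> = {
  queueDepth: { warningFrom: 100, criticalAbove: 500 },
  checkLatencyP95Seconds: { warningFrom: 30, criticalAbove: 60 },
  errorRatePercent: { warningFrom: 1, criticalAbove: 5 },
};

const RANK: Record<Health, number> = { healthy: 0, warning: 1, critical: 2 };

function classifySignal(signal: SignalName, value: number): Health {
  const limits = THRESHOLDS[signal];
  if (value > limits.criticalAbove) return "critical";
  if (value >= limits.warningFrom) return "warning";
  return "healthy";
}

// Overall worker health is the worst of the individual signals.
function overallHealth(readings: Record<SignalName, number>): Health {
  let worst: Health = "healthy";
  for (const signal of Object.keys(readings) as SignalName[]) {
    const current = classifySignal(signal, readings[signal]);
    if (RANK[current] > RANK[worst]) worst = current;
  }
  return worst;
}
```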

Degradation Modes

| Scenario | Behavior |
| --- | --- |
| Render worker down | Fall back to text fetch; queue renders for retry |
| AI provider down | Queue summaries for retry; notify ops |
| Queue service down | Workers idle; no data loss (DB is source of truth) |
| Database read-only | Read operations continue; writes fail gracefully |

Non-Commitments

These behaviors are explicitly not guaranteed:

| Behavior | Reason |
| --- | --- |
| Real-time change detection | Cadence-based polling only |
| Manual check triggers | Prevents abuse; respects cadence |
| Check on demand for free users | Resource constraints |
| Guaranteed render success | External sites may block |
| Summary for all changes | Meaningful change filter may reject |

Incident Severity

| Severity | Criteria | Response Time |
| --- | --- | --- |
| P1 | Data loss, security breach, full outage | < 1 hour |
| P2 | Degraded service, >10% users affected | < 4 hours |
| P3 | Minor degradation, workaround exists | < 24 hours |
| P4 | Cosmetic, single-user impact | Next sprint |
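
For alerting or ticketing automation, the response times above can be turned into a deadline. This is a small sketch with hypothetical names. P4 has no fixed hour budget ("Next sprint"), so it is represented as null here, which is an assumption about how such a case might be handled.

```typescript
// Sketch only: names are assumptions; hour budgets come from the table above.
type Severity = "P1" | "P2" | "P3" | "P4";

const HOUR_IN_MS = 60 * 60 * 1000;

// null means no fixed deadline (P4 is scheduled into the next sprint).
const RESPONSE_HOURS: Record<Severity, number | null> = {
  P1: 1,
  P2: 4,
  P3: 24,
  P4: null,
};

function responseDeadline(severity: Severity, detectedAt: Date): Date | null {
  const hours = RESPONSE_HOURS[severity];
  if (hours === null) {
    return null;
  }
  return new Date(detectedAt.getTime() + hours * HOUR_IN_MS);
}
```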

Reference: Incident Playbook


Implementation References

| Component | File |
| --- | --- |
| Queue configuration | packages/queue/ |
| Worker implementation | apps/worker-tracker/, apps/worker-render/ |
| Notification system | notification_delivery_log table |
| Monitoring | packages/observability/ |