# Service Level Expectations
# Purpose
This document defines internal service level expectations for Eko operations. These are engineering targets, not customer SLAs. Use them for capacity planning, monitoring alerts, and incident severity classification.
# Page Observation Timeliness
| Cadence | Target | Tolerance | Escalation |
|---|---|---|---|
| Daily | Within 24h of next_check_at | +6h jitter acceptable | P2 if >30h |
| Weekly | Within 7d of next_check_at | +12h jitter acceptable | P2 if >8d |
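
The thresholds in the table above can be expressed as a small classifier. This is a minimal sketch, not the production logic: the function name `classify_check_delay`, the status labels, and the reading of "delay" as the time elapsed since `next_check_at` are all assumptions made for illustration.

```python
from datetime import timedelta

# Thresholds copied from the table above; the dictionary layout is illustrative.
THRESHOLDS = {
    "daily": {
        "target": timedelta(hours=24),
        "tolerance": timedelta(hours=6),
        "p2_after": timedelta(hours=30),
    },
    "weekly": {
        "target": timedelta(days=7),
        "tolerance": timedelta(hours=12),
        "p2_after": timedelta(days=8),
    },
}


def classify_check_delay(cadence: str, delay: timedelta) -> str:
    """Classify a check's delay (assumed: time elapsed since next_check_at)."""
    t = THRESHOLDS[cadence]
    if delay > t["p2_after"]:
        return "p2"
    if delay > t["target"] + t["tolerance"]:
        return "out_of_tolerance"
    if delay > t["target"]:
        return "within_tolerance"
    return "on_target"
```

Under this reading, the daily cadence never reports `out_of_tolerance`, because target plus tolerance (30h) coincides with the P2 threshold; the weekly cadence has a 12-hour gap between the two.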
# Implementation Notes
- `next_check_at` includes 0-6h of random jitter to prevent a thundering herd
- One-check-per-day constraint: `UNIQUE (page_id, checked_day)`
- Workers poll the queue continuously; a backlog indicates a capacity issue
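
To illustrate the jitter note above, here is a hedged sketch of how a scheduler might compute `next_check_at`. The function name `schedule_next_check`, the uniform distribution, and the choice to add jitter on top of the full cadence interval are assumptions; the source only states that the value includes 0-6h of random jitter.

```python
import random
from datetime import datetime, timedelta, timezone

MAX_JITTER = timedelta(hours=6)  # upper bound of the 0-6h jitter window
CADENCE_INTERVAL = {"daily": timedelta(days=1), "weekly": timedelta(days=7)}


def schedule_next_check(last_check: datetime, cadence: str, rng=random) -> datetime:
    """Return last_check + cadence interval + uniform jitter in [0, 6h]."""
    jitter_seconds = rng.uniform(0, MAX_JITTER.total_seconds())
    return last_check + CADENCE_INTERVAL[cadence] + timedelta(seconds=jitter_seconds)
```

Spreading checks across a six-hour window means pages created at the same moment do not all become due at the same moment, which is what keeps the workers from being hit by a thundering herd.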
# Monitoring Signals

    -- Active pages whose next check is more than 6 hours overdue
    SELECT COUNT(*) FROM pages
    WHERE is_active = TRUE
      AND next_check_at < NOW() - INTERVAL '6 hours';
# Notification Delivery
| Mode | Target | Tolerance | Escalation |
|---|---|---|---|
| Immediate | Within 5 minutes of change detection | 15 min | P2 if >30 min |
| Daily Digest | Before 9:00 AM in the user's timezone | 1 hour | P3 if missed |
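
The daily digest target depends on each user's local time. The following sketch shows one way to compute the next 9:00 AM deadline in UTC; the function name `next_digest_deadline` and its signature are hypothetical and not taken from the Eko codebase.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo

DIGEST_TIME = time(9, 0)  # 9:00 AM local time, from the table above


def next_digest_deadline(now_utc: datetime, user_tz: tzinfo) -> datetime:
    """Return the next 9:00 AM in the user's timezone, expressed in UTC."""
    local_now = now_utc.astimezone(user_tz)
    deadline = datetime.combine(local_now.date(), DIGEST_TIME, tzinfo=user_tz)
    if deadline <= local_now:
        # Today's 9:00 AM has already passed locally; use tomorrow's.
        deadline += timedelta(days=1)
    return deadline.astimezone(timezone.utc)
```

In practice `user_tz` would likely be a `zoneinfo.ZoneInfo` built from the user's stored timezone name; a fixed-offset `timezone` also works for testing.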
# Delivery States
| Transition | Target |
|---|---|
| queued → sending | < 30 seconds |
| sending → sent | < 60 seconds |
| sending → failed | Retry 3x, then mark failed |
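
The "Retry 3x, then mark failed" rule can be sketched as follows. This is an illustrative outline only: the `deliver` function is hypothetical, and it assumes "3x" means three retries after the initial attempt (four attempts in total), which the source does not spell out.

```python
from typing import Callable

MAX_RETRIES = 3  # "Retry 3x" from the table above


def deliver(send: Callable[[], None], max_retries: int = MAX_RETRIES) -> str:
    """Try once, retry up to max_retries times, and return the final status."""
    attempts = 1 + max_retries  # initial attempt plus retries (assumed reading)
    for _ in range(attempts):
        try:
            send()
            return "sent"
        except Exception:
            # A real worker would log the error and back off before retrying.
            continue
    return "failed"
```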
# Deduplication
- A unique constraint on `(page_change_event_id, channel)` prevents duplicates
- If the constraint is violated, skip silently (already delivered)
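
The deduplication behavior above can be demonstrated with a self-contained sketch. It uses an in-memory SQLite database as a stand-in for the production database; the `status` column default and the `enqueue_notification` helper are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE notification_delivery_log (
        page_change_event_id INTEGER NOT NULL,
        channel TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'queued',
        UNIQUE (page_change_event_id, channel)
    )
    """
)


def enqueue_notification(event_id: int, channel: str) -> bool:
    """Insert a delivery row; return False if the pair already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO notification_delivery_log "
                "(page_change_event_id, channel) VALUES (?, ?)",
                (event_id, channel),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate: skip silently
```

On PostgreSQL, the same effect is usually achieved without exception handling by appending `ON CONFLICT DO NOTHING` to the insert.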
# Monitoring Signals

    -- Notifications that have been stuck in 'queued' for more than 5 minutes
    SELECT COUNT(*) FROM notification_delivery_log
    WHERE status = 'queued'
      AND created_at < NOW() - INTERVAL '5 minutes';
# Summary Quality
| Metric | Target | Verification |
|---|---|---|
| Confidence | ≥ 0.7 for published summaries | AI self-reported confidence |
| Fair-use compliance | Non-substitutive | Manual audit (sampling) |
| No hallucination | 0 fabricated facts | User reports, spot checks |
| Meaningful change only | No summary without change | change_detected = TRUE required |
# Quality Gates
- Pre-summary: Meaningful change detection must pass
- Post-summary: Confidence threshold check
- Audit: Sample review for fair-use compliance
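
The two automated gates above (pre-summary and post-summary) might be combined as in the sketch below. The function `publishable_summary` and the shape of the `generate_summary` callback are hypothetical; only the 0.7 threshold and the change-detection requirement come from this document. The audit gate is manual and is not modeled here.

```python
from typing import Callable, Optional, Tuple

CONFIDENCE_THRESHOLD = 0.7  # target from the Summary Quality table


def publishable_summary(
    change_detected: bool,
    generate_summary: Callable[[], Tuple[str, float]],
) -> Optional[str]:
    """Apply the pre- and post-summary gates; return the summary or None."""
    # Pre-summary gate: without a meaningful change, no summary is generated.
    if not change_detected:
        return None
    summary, confidence = generate_summary()
    # Post-summary gate: drop summaries below the confidence threshold.
    if confidence < CONFIDENCE_THRESHOLD:
        return None
    return summary
```

Checking `change_detected` first avoids calling the AI provider at all when there is nothing to summarize.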
# Non-substitutive Criteria
- The summary must not reproduce source content verbatim
- Users must still need to visit the source for full context
- Focus on the delta (what changed), not on the page content as a whole
# System Availability
| Component | Target Uptime | Degradation Mode |
|---|---|---|
| Web app | 99.5% | Static error page |
| Tracker worker | 99% | Queue backlog grows |
| Render worker | 95% | Fallback to text-only |
| Queue (Upstash) | 99.9% | Managed service |
| Database (Supabase) | 99.9% | Managed service |
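
For capacity planning it can help to translate the uptime targets above into a downtime budget. The sketch below does that arithmetic; the 30-day period and the helper name `allowed_downtime_hours` are assumptions chosen for illustration.

```python
# Uptime targets (percent) copied from the table above.
UPTIME_TARGETS = {
    "Web app": 99.5,
    "Tracker worker": 99.0,
    "Render worker": 95.0,
    "Queue (Upstash)": 99.9,
    "Database (Supabase)": 99.9,
}


def allowed_downtime_hours(uptime_percent: float, period_hours: float = 30 * 24) -> float:
    """Return the downtime budget in hours for one period (default: 30 days)."""
    return period_hours * (100.0 - uptime_percent) / 100.0


for component, target in UPTIME_TARGETS.items():
    print(f"{component}: {allowed_downtime_hours(target):.2f} h per 30 days")
```

Over a 30-day period, 99.5% allows about 3.6 hours of downtime, 99% about 7.2 hours, 95% about 36 hours, and 99.9% about 43 minutes.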
# Worker Health
| Signal | Healthy | Warning | Critical |
|---|---|---|---|
| Queue depth | < 100 | 100-500 | > 500 |
| Check latency p95 | < 30s | 30-60s | > 60s |
| Error rate | < 1% | 1-5% | > 5% |
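
The health bands above map naturally onto a threshold lookup. This is a minimal sketch: the signal keys, the `classify_signal` and `overall_health` helpers, and the rule that the worst signal determines overall health are assumptions, not the actual alerting implementation.

```python
# (warning_from, critical_above) per signal, taken from the table above.
HEALTH_THRESHOLDS = {
    "queue_depth": (100, 500),
    "check_latency_p95_seconds": (30, 60),
    "error_rate_percent": (1, 5),
}

SEVERITY_ORDER = ["healthy", "warning", "critical"]


def classify_signal(signal: str, value: float) -> str:
    """Return 'healthy', 'warning', or 'critical' for a single reading."""
    warning_from, critical_above = HEALTH_THRESHOLDS[signal]
    if value > critical_above:
        return "critical"
    if value >= warning_from:
        return "warning"
    return "healthy"


def overall_health(readings: dict) -> str:
    """Return the worst status across all readings (assumed aggregation rule)."""
    statuses = [classify_signal(name, value) for name, value in readings.items()]
    return max(statuses, key=SEVERITY_ORDER.index, default="healthy")
```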
# Degradation Modes
| Scenario | Behavior |
|---|---|
| Render worker down | Fall back to text fetch; queue renders for retry |
| AI provider down | Queue summaries for retry; notify ops |
| Queue service down | Workers idle; no data loss (DB is source of truth) |
| Database read-only | Read operations continue; writes fail gracefully |
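
As an illustration of the first row above, the render fallback could look like the sketch below. Everything here is hypothetical: the `RenderUnavailable` exception, the `fetch_content` function, and the in-memory retry list stand in for whatever the workers actually use.

```python
from typing import Callable, List, Tuple


class RenderUnavailable(Exception):
    """Hypothetical error raised when the render worker cannot be reached."""


def fetch_content(
    url: str,
    render: Callable[[str], str],
    fetch_text: Callable[[str], str],
    retry_queue: List[str],
) -> Tuple[str, str]:
    """Return (mode, content), falling back to text fetch if rendering fails."""
    try:
        return "rendered", render(url)
    except RenderUnavailable:
        retry_queue.append(url)  # queue the render for a later retry
        return "text_only", fetch_text(url)
```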
# Non-Commitments
These behaviors are explicitly not guaranteed:

| Behavior | Reason |
|---|---|
| Real-time change detection | Cadence-based polling only |
| Manual check triggers | Not offered, to prevent abuse and preserve the cadence |
| Check on demand for free users | Resource constraints |
| Guaranteed render success | External sites may block |
| Summary for all changes | Meaningful change filter may reject |
# Incident Severity
| Severity | Criteria | Response Time |
|---|---|---|
| P1 | Data loss, security breach, full outage | < 1 hour |
| P2 | Degraded service, >10% users affected | < 4 hours |
| P3 | Minor degradation, workaround exists | < 24 hours |
| P4 | Cosmetic, single-user impact | Next sprint |

Reference: Incident Playbook
# Implementation References
| Component | Location |
|---|---|
| Queue configuration | `packages/queue/` |
| Worker implementation | `apps/worker-tracker/`, `apps/worker-render/` |
| Notification system | `notification_delivery_log` table |
| Monitoring | `packages/observability/` |