Service Level Objectives (SLOs)
This document outlines our defined SLOs, alerting thresholds, and how we validate system performance under load.
SLO Definitions
Run Start Latency (p95)
- Target: ≤ 2s to enqueue a workflow run after API POST /runs.
- Alert: p95 latency > 5s sustained for 5 minutes.
Run Success Rate
- Target: ≥ 99% of runs succeed end-to-end.
- Alert: success rate < 95% for 10 minutes.
Stream Gaps (SSE)
- Target: zero dropped or out-of-order events.
- Alert: >10 SSE gaps/minute per instance.
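One way to measure the stream gap SLO is to scan the event ids a client receives. The sketch below assumes each SSE event carries an integer id that increases by exactly 1; that id scheme and the function name `scan_event_ids` are illustrative assumptions, not something this document specifies.

```python
def scan_event_ids(event_ids):
    """Return (gaps, out_of_order) for a sequence of integer event ids.

    Assumption: ids are expected to increase by exactly 1 per event.
    A jump past the next expected id counts as one gap (dropped events).
    An id at or below the highest id seen so far counts as out-of-order
    (this also covers duplicates).
    """
    gaps = 0
    out_of_order = 0
    highest = None
    for event_id in event_ids:
        if highest is None:
            # First event: nothing to compare against yet.
            highest = event_id
            continue
        if event_id <= highest:
            out_of_order += 1
        else:
            if event_id > highest + 1:
                gaps += 1
            highest = event_id
    return gaps, out_of_order
```

A per-instance counter built on these two numbers could then be rated per minute and compared against the alert threshold above.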
Alerts & Thresholds
- Run start latency: Alert critical > 5s, warning > 3s.
- Run success rate: Alert critical < 95%, warning < 98%.
- Stream gap rate: Alert critical > 50/min, warning > 10/min.
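The thresholds above can be expressed as a small lookup table. This is a minimal sketch rather than the actual alerting rules: the metric keys and the `severity` function are hypothetical names, and the "sustained for N minutes" condition from the SLO definitions is not modeled here.

```python
# Threshold values are taken from the list above; the key names are illustrative.
THRESHOLDS = {
    "run_start_latency_p95_s": {"warning": 3.0, "critical": 5.0, "direction": "above"},
    "run_success_rate_pct": {"warning": 98.0, "critical": 95.0, "direction": "below"},
    "stream_gaps_per_min": {"warning": 10.0, "critical": 50.0, "direction": "above"},
}


def severity(metric, value):
    """Classify a single metric reading as "ok", "warning", or "critical"."""
    rule = THRESHOLDS[metric]

    def breached(limit):
        # Comparisons are strict, matching the ">" and "<" in the thresholds.
        if rule["direction"] == "above":
            return value > limit
        return value < limit

    if breached(rule["critical"]):
        return "critical"
    if breached(rule["warning"]):
        return "warning"
    return "ok"
```

Because the comparisons are strict, a p95 of exactly 5s is still a warning, not a critical alert.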
Load Testing
We use load tests (e.g. Locust, k6) to validate SLOs at scale:
- Ramp up to 1000 concurrent runs.
- Measure latency distribution, stream continuity, and worker throughput.
- Compare against defined targets.
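The comparison step can be sketched as a post-processing function over raw load-test results. This example assumes the load tool exports per-run enqueue latencies in seconds plus success counts; the function names and the result layout are hypothetical. It uses the nearest-rank percentile method, which may differ slightly from the interpolated p95 a monitoring backend reports.

```python
def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def evaluate_load_test(latencies_s, succeeded, total):
    """Compare load-test results against the run start and success targets."""
    p95 = percentile(latencies_s, 95)
    success_pct = 100.0 * succeeded / total
    return {
        "p95_latency_s": p95,
        "latency_ok": p95 <= 2.0,         # target: p95 <= 2s
        "success_pct": success_pct,
        "success_ok": success_pct >= 99.0,  # target: >= 99% of runs succeed
    }
```

For example, `evaluate_load_test(latencies, succeeded=990, total=1000)` reports `success_ok` as true, since 99.0% meets the target exactly.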
Backpressure Behavior
When the system is overloaded:
- API returns 429 Too Many Requests with a Retry-After header.
- Clients must back off and retry after the suggested interval.
- This ensures event queues and API servers are not overwhelmed.
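A client-side retry loop that honors Retry-After might look like the sketch below. It is transport-agnostic: `send` is a hypothetical callable that performs the request and returns `(status_code, headers)`, with headers as a plain dict keyed by `"Retry-After"`. The attempt limit and the default wait are assumptions, not values defined by this document.

```python
import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value, now=None):
    """Convert a Retry-After value (delta-seconds or HTTP-date) to seconds."""
    value = value.strip()
    if value.isdigit():
        return float(value)
    now = now or datetime.now(timezone.utc)
    retry_at = parsedate_to_datetime(value)
    # Never return a negative wait if the date is already in the past.
    return max(0.0, (retry_at - now).total_seconds())


def post_with_backoff(send, max_attempts=5, default_wait_s=1.0, sleep=time.sleep):
    """Call send() until it stops returning 429 or attempts run out.

    Returns the final status code.
    """
    for attempt in range(1, max_attempts + 1):
        status, headers = send()
        if status != 429:
            return status
        if attempt == max_attempts:
            break
        raw = headers.get("Retry-After")
        try:
            wait_s = parse_retry_after(raw) if raw else default_wait_s
        except (TypeError, ValueError):
            # Unparseable header: fall back to the assumed default wait.
            wait_s = default_wait_s
        sleep(wait_s)
    return 429
```

Injecting `sleep` keeps the loop easy to test without real delays. A production client would likely also add jitter and an upper bound on the wait.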
- SLO compliance is reviewed quarterly.
- Dashboards and alerting rules in Prometheus/Grafana enforce these thresholds.
- SLO outcomes feed into error budgets for operational planning.
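As a rough illustration of how the run success target translates into an error budget, the sketch below computes the fraction of budget left in a review window. The function name and the per-window run counts are hypothetical; only the 99% target comes from this document.

```python
def error_budget_remaining(total_runs, failed_runs, target_pct=99.0):
    """Return the fraction of the error budget still unspent.

    1.0 means no budget used, 0.0 means fully spent, and negative values
    mean the SLO was missed for the window.
    """
    allowed_failures = total_runs * (100.0 - target_pct) / 100.0
    if allowed_failures == 0:
        # No runs, or a 100% target: there is no budget to spend.
        return 0.0
    return 1.0 - failed_runs / allowed_failures
```

With 10,000 runs in a window, the 99% target allows 100 failures, so 50 failed runs leave half of the budget.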