Service Level Objectives (SLOs)

This document outlines our defined SLOs, alerting thresholds, and how we validate system performance under load.

SLO Definitions

Run Start Latency (p95)
- Target: ≤ 2s from API POST /runs to the workflow run being enqueued.
- Alert: p95 latency > 5s sustained for 5 minutes.
Run Success Rate
- Target: ≥ 99% of runs succeed end-to-end.
- Alert: success rate < 95% for 10 minutes.
Stream Gaps (SSE)
- Target: zero dropped or out-of-order events.
- Alert: > 10 SSE gaps/minute per instance (warning); > 50 gaps/minute is critical.
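
As an illustration of how a stream gap could be counted, the sketch below assumes each SSE event carries an integer id that increases by one per event on a given stream. That id scheme and the name `count_stream_gaps` are assumptions for the example, not a description of our event format.

```python
from typing import Iterable


def count_stream_gaps(event_ids: Iterable[int]) -> int:
    """Count positions where an event id is not exactly previous + 1.

    Assumption: ids are integers that increase by one per event, so any
    other step means a dropped, duplicated, or out-of-order event.
    """
    gaps = 0
    previous = None
    for event_id in event_ids:
        if previous is not None and event_id != previous + 1:
            gaps += 1
        previous = event_id
    return gaps
```

A per-minute rate would then be this count taken over the events received in a one-minute window on one instance.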

Alerts & Thresholds

- Run start latency: critical > 5s, warning > 3s.
- Run success rate: critical < 95%, warning < 98%.
- Stream gap rate: critical > 50/min, warning > 10/min.
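
The thresholds above can be read as a small lookup table. The following is a minimal sketch of that mapping in Python. The metric keys and the `severity` helper are hypothetical names chosen for the example, and the actual rules live in the alerting system rather than in code like this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Threshold:
    warning: float
    critical: float
    higher_is_worse: bool = True  # False for success rate, where low is bad


# Values copied from the thresholds listed above; the keys are hypothetical.
THRESHOLDS = {
    "run_start_latency_s": Threshold(warning=3.0, critical=5.0),
    "run_success_rate_pct": Threshold(warning=98.0, critical=95.0,
                                      higher_is_worse=False),
    "stream_gaps_per_min": Threshold(warning=10.0, critical=50.0),
}


def severity(metric: str, value: float) -> Optional[str]:
    """Return "critical", "warning", or None for a single metric reading."""
    t = THRESHOLDS[metric]
    if t.higher_is_worse:
        if value > t.critical:
            return "critical"
        if value > t.warning:
            return "warning"
    else:
        if value < t.critical:
            return "critical"
        if value < t.warning:
            return "warning"
    return None
```

For example, `severity("run_success_rate_pct", 97.0)` returns `"warning"`, while a reading exactly at a threshold (such as 5.0s latency) does not cross the strict `>` comparison for that level.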

Load Testing

We run load tests (e.g., with Locust or k6) to validate the SLOs at scale:

- Ramp up to 1000 concurrent runs.
- Measure latency distribution, stream continuity, and worker throughput.
- Compare the results against the defined targets.
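
The comparison step can be sketched as follows, assuming the load test exports a list of run start latencies in seconds plus success and total run counts. The function names and the nearest-rank percentile method are choices made for this example; Locust and k6 report percentiles with their own methods.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sample."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_targets(start_latencies_s: Sequence[float],
                  succeeded: int, total: int) -> dict:
    """Compare load-test results with the latency and success-rate targets."""
    p95 = percentile(start_latencies_s, 95)
    return {
        "p95_start_latency_s": p95,
        "latency_ok": p95 <= 2.0,                    # target: p95 <= 2s
        "success_ok": succeeded * 100 >= 99 * total,  # target: >= 99%
    }
```

The success check uses integer arithmetic so that a result of exactly 99% is not affected by floating-point rounding.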

Backpressure Behavior

When the system is overloaded:

- The API returns 429 Too Many Requests with a Retry-After header.
- Clients must back off and retry after the suggested interval.
- This keeps event queues and API servers from being overwhelmed.
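
A client honoring this contract might look like the sketch below. It uses only the standard library, and the HTTP call is passed in as a `send` callable so the example does not assume any particular client library. The names `retry_after_seconds` and `post_with_backoff`, the one-second fallback, and the attempt limit are all assumptions for illustration.

```python
import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Callable, Mapping, Optional, Tuple


def retry_after_seconds(value: Optional[str], default: float = 1.0) -> float:
    """Parse Retry-After as delay seconds or an HTTP-date, else use default."""
    if value is None:
        return default
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default  # unparseable header: fall back to the assumed default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    return max(0.0, (when - datetime.now(timezone.utc)).total_seconds())


def post_with_backoff(send: Callable[[], Tuple[int, Mapping[str, str]]],
                      max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> int:
    """Call send() until it returns a non-429 status or attempts run out.

    send() is a hypothetical callable returning (status_code, headers);
    headers are looked up with the exact key "Retry-After".
    """
    status = 429
    for attempt in range(max_attempts):
        status, headers = send()
        if status != 429:
            return status
        if attempt < max_attempts - 1:
            sleep(retry_after_seconds(headers.get("Retry-After")))
    return status
```

Injecting `sleep` keeps the loop easy to test without real delays. A production client would likely also add jitter and an upper bound on the wait.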

Notes

- SLO compliance is reviewed quarterly.
- Dashboards and alerting rules in Prometheus/Grafana enforce these thresholds.
- SLO outcomes feed into error budgets for operational planning.
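
To make the error-budget idea concrete, here is a minimal sketch using the common convention that the budget is the share of runs allowed to fail under the target (1% for a 99% success-rate SLO). This document does not define how budgets are calculated, so the function name, the review window, and the formula are assumptions.

```python
def error_budget(total_runs: int, failed_runs: int,
                 slo_target_pct: int = 99) -> dict:
    """Summarize error-budget use for a success-rate SLO over one window."""
    allowed_failures = total_runs * (100 - slo_target_pct) / 100
    remaining = allowed_failures - failed_runs
    consumed = failed_runs / allowed_failures if allowed_failures else 0.0
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": remaining,
        "budget_consumed": consumed,  # 1.0 means the budget is fully spent
    }
```

For example, with 100,000 runs in the window and 250 failures, the 99% target allows 1,000 failures, so a quarter of the budget is consumed and 750 failures remain.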