SLAs + SLOs
📖 6 min read · Updated 2026-04-18
Every service, whether customer-facing, internal, or vendor-provided, is either explicitly measured or implicitly evaluated on vibes. SLAs and SLOs are how you move from vibes to measured. Good ones align expectations; bad ones become legal liabilities; missing ones leave everyone disappointed for different reasons.
The vocabulary
- SLI (Service Level Indicator), the actual metric. "Request latency p95." "Uptime %." "First-response time."
- SLO (Service Level Objective), your internal target. "p95 latency < 300ms." "Uptime ≥ 99.9%." "First-response < 1 business hour."
- SLA (Service Level Agreement), your external, contractual commitment. Usually looser than the SLO. "99.5% uptime, with service credits if missed."
SLO is the target you manage to. SLA is the number you're willing to be sued over. Never commit your SLO target as your SLA.
Why the gap matters
SLO: 99.95% uptime internal target.
SLA: 99.5% uptime contractual commitment.
99.95% ≈ 21 minutes downtime/month.
99.5% ≈ 3.6 hours downtime/month.
Gap: ~3 hours of cushion per month. That's your error budget.
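The downtime math above can be sketched in a few lines. A minimal, dependency-free helper (names hypothetical) that converts an availability target into allowed downtime per 30-day month:

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

def allowed_downtime_minutes(availability: float) -> float:
    """Minutes of downtime permitted per month at a given availability."""
    return (1 - availability) * MINUTES_PER_MONTH

slo_budget = allowed_downtime_minutes(0.9995)  # internal target
sla_budget = allowed_downtime_minutes(0.995)   # contractual commitment
cushion = sla_budget - slo_budget

print(f"SLO budget: {slo_budget:.1f} min/month")      # 21.6
print(f"SLA budget: {sla_budget:.1f} min/month")      # 216.0
print(f"Cushion:    {cushion / 60:.1f} hours/month")  # 3.2
```

The cushion is the room you have to miss your internal target without breaching the contract.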
Choosing SLIs
Good SLIs measure the thing customers experience, not the thing that's easiest to measure. For an API:
- Availability. % of requests that returned a valid response (not an HTTP error)
- Latency. % of requests served below threshold
- Correctness. % of responses matching expected output
Don't measure CPU utilization or database connections as an SLI; those are system metrics, not customer-experience metrics.
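Computing the first two SLIs from raw request records is straightforward. A minimal sketch, assuming each record carries a status code and a latency (field names hypothetical):

```python
def availability_sli(requests):
    """% of requests that returned a non-error response (status < 500)."""
    ok = sum(1 for r in requests if r["status"] < 500)
    return 100.0 * ok / len(requests)

def latency_sli(requests, threshold_ms=300):
    """% of requests served below the latency threshold."""
    fast = sum(1 for r in requests if r["latency_ms"] < threshold_ms)
    return 100.0 * fast / len(requests)

requests = [
    {"status": 200, "latency_ms": 120},
    {"status": 200, "latency_ms": 450},
    {"status": 503, "latency_ms": 30},
    {"status": 200, "latency_ms": 95},
]
print(availability_sli(requests))  # 75.0
print(latency_sli(requests))       # 75.0 (3 of 4 under 300ms)
```

Both SLIs are expressed as "% of requests that were good", which makes them directly comparable against an SLO target like 99.9%.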
For support teams
SLIs that matter:
- First response time. Initial reply, from a human or automation
- Resolution time. By severity (P1 < 4 hours, P2 < 1 business day, P3 < 5 business days)
- Customer satisfaction on resolution (CSAT)
- Escalation rate. % of tickets escalated beyond first-line
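The per-severity resolution targets above can be checked mechanically. A hedged sketch (business-hour math simplified to fixed hour counts: 1 business day = 8 hours, 5 business days = 40 hours):

```python
# Resolution SLOs in hours, per the severity targets above
RESOLUTION_SLO_HOURS = {"P1": 4, "P2": 8, "P3": 40}

def within_slo(severity: str, hours_to_resolve: float) -> bool:
    """True if a ticket was resolved inside its severity's SLO."""
    return hours_to_resolve < RESOLUTION_SLO_HOURS[severity]

tickets = [("P1", 3.5), ("P2", 10.0), ("P3", 20.0)]
misses = [(sev, h) for sev, h in tickets if not within_slo(sev, h)]
print(misses)  # [('P2', 10.0)]
```

A real implementation would track business hours against a calendar; the point is that each miss is detectable and attributable per severity.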
For internal services
Internal SLAs (sometimes called OLAs, Operational Level Agreements) matter too:
- Finance closes the books by day 10 of each month
- Recruiting presents first-pass candidates within 5 business days of job opening
- IT resolves laptop issues within 4 business hours
- Legal reviews standard NDA within 2 business days
Internal teams without SLAs will always be the bottleneck. Internal SLAs force throughput expectations into the open.
The error budget
If your SLO is 99.9%, you have 0.1% of error budget, about 43 minutes of "downtime" per month. That budget is a resource:
- Spend it on planned maintenance
- Spend it on deployments (riskier deploys that move the product forward)
- Spend it on experimentation (A/B tests that affect performance)
When the budget is exhausted, stop. Freeze non-critical deploys. Focus on reliability until the budget recovers. This is the discipline that keeps engineering from shipping endlessly at the cost of stability.
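The budget-and-freeze discipline reduces to a simple check. A minimal sketch for the 99.9% SLO above (function names hypothetical):

```python
# ~43.2 minutes of error budget per 30-day month at a 99.9% SLO
BUDGET_MINUTES = (1 - 0.999) * 30 * 24 * 60

def budget_remaining(downtime_minutes: float) -> float:
    """Error budget left this month, in minutes."""
    return BUDGET_MINUTES - downtime_minutes

def should_freeze(downtime_minutes: float) -> bool:
    """Freeze non-critical deploys once the budget is exhausted."""
    return budget_remaining(downtime_minutes) <= 0

print(round(budget_remaining(10), 1))  # 33.2
print(should_freeze(50))               # True
```

In practice this check runs continuously against the SLI pipeline, so the freeze is triggered by data rather than by argument.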
The review cadence
- Weekly. Engineering reviews SLI performance and incidents, and adjusts next week's priorities
- Monthly. SLA performance is reviewed with account teams; any customer-facing misses escalate
- Quarterly. SLO targets are reviewed: are they still right for the business stage?
The common mistakes
- SLO = SLA. No cushion. First miss becomes a contractual breach.
- Too aggressive. "100% uptime" is not a target; it's a fantasy.
- Measuring averages. Use percentiles (p95, p99). Averages hide the worst customer experiences.
- No consequences for missing. SLOs with no operational consequence are reports, not commitments.
- Service credits that never get issued. If you miss the SLA and the customer has to fight to get their credit, you've lost them anyway.
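The "measuring averages" mistake is worth a concrete demonstration. With an illustrative sample where 10% of requests are very slow, the average looks acceptable while the p95 exposes the tail:

```python
def percentile(values, p):
    """Nearest-rank percentile (simple, dependency-free)."""
    ordered = sorted(values)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Illustrative latencies in ms: 90 fast requests, 10 very slow ones
latencies = [100] * 90 + [3000] * 10

avg = sum(latencies) / len(latencies)
print(avg)                        # 390.0 -- "looks fine"
print(percentile(latencies, 95))  # 3000 -- the real tail experience
```

One in ten customers waits 3 seconds, and the average never shows it. Python's stdlib `statistics.quantiles` offers a more rigorous implementation.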
What good looks like
- Customer-facing services have documented SLAs and internal SLOs
- Internal teams have committed response/resolution times for their consumers
- Error budgets are tracked and exhausted budgets trigger freeze
- SLI measurement is automated and visible on a live dashboard
- Missed SLAs trigger service credits automatically without customer intervention
Related: Vendor management · Risk management · Customer success ops