SLAs + SLOs

Every service, whether customer-facing, internal, or vendor-provided, is either explicitly measured or implicitly evaluated on vibes. SLAs and SLOs are how you move from vibes to measurement. Good ones align expectations; bad ones become legal liabilities; missing ones leave everyone disappointed for different reasons.

The vocabulary

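SLI (service level indicator) is the thing you measure. SLO (service level objective) is the target you manage to. SLA (service level agreement) is the number you're willing to be sued over. Never commit your SLO target as your SLA.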

Why the gap matters

SLO: 99.95% uptime internal target.
SLA: 99.5% uptime contractual commitment.

99.95% ≈ 21 minutes downtime/month.
99.5% ≈ 3.6 hours downtime/month.

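Gap: ~3 hours of cushion per month. That's your safety margin: you can miss the internal target without breaching the contract. (The error budget is measured against the SLO, not the SLA.)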
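
The downtime figures above are simple arithmetic on the availability percentage. A minimal sketch, assuming a 30-day month; the function name is illustrative, not part of any library:

```python
def allowed_downtime_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of downtime permitted in a period at a given availability target.

    Assumes a 30-day month by default; adjust `days` for other periods.
    """
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


slo_minutes = allowed_downtime_minutes(99.95)  # internal target
sla_minutes = allowed_downtime_minutes(99.5)   # contractual commitment

# Cushion between missing the internal target and breaching the contract.
cushion_hours = (sla_minutes - slo_minutes) / 60
```

With these inputs the SLO allows roughly 21.6 minutes, the SLA roughly 216 minutes (3.6 hours), and the cushion comes to a little over 3 hours.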

Choosing SLIs

Good SLIs measure the thing customers experience, not the thing that is easy to measure. For an API:

Don't measure CPU utilization or database connections as an SLI. Those are system metrics, not customer-experience metrics.
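
As an illustration of customer-experience SLIs for an API, the sketch below computes two ratios from request records: the share of requests that did not fail, and the share answered quickly enough. The record shape, the choice to count only 5xx responses as failures, and the 300 ms threshold are all assumptions made for the example, not recommendations from this guide.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Request:
    status: int        # HTTP status code returned to the caller
    latency_ms: float  # response time as the caller experienced it


def availability_sli(requests: Sequence[Request]) -> Optional[float]:
    """Fraction of requests that did not fail with a server error (5xx)."""
    if not requests:
        return None  # no traffic, so the ratio is undefined
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: Sequence[Request], threshold_ms: float = 300.0) -> Optional[float]:
    """Fraction of requests answered within the latency threshold."""
    if not requests:
        return None
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return fast / len(requests)


# Hypothetical sample data.
sample = [
    Request(200, 120), Request(200, 450), Request(503, 80),
    Request(404, 95), Request(200, 150), Request(200, 900),
]
availability = availability_sli(sample)  # 5 of 6 requests were not 5xx
latency = latency_sli(sample)            # 4 of 6 requests were within 300 ms
```

Both ratios are "good events over total events", which makes them straightforward to compare against an SLO expressed as a percentage.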

For support teams

SLIs that matter:

For internal services

Internal SLAs (sometimes called "OLAs", operational level agreements) matter too:

Internal teams without SLAs tend to become the bottleneck, because nobody has agreed on what "fast enough" means. Internal SLAs force throughput expectations into the open.

The error budget

If your SLO is 99.9%, you have 0.1% of error budget, about 43 minutes of "downtime" per month. That budget is a resource:

When the budget is exhausted, stop. Freeze non-critical deploys. Focus on reliability until the budget recovers. This is the discipline that keeps engineering from shipping endlessly at the cost of stability.
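
The budget rule above can be sketched as a small calculation. This assumes the SLI is counted as good versus bad events over the period; the function names and the policy strings are hypothetical, and a real policy would likely have more than two states.

```python
def error_budget_remaining(slo_pct: float, total_events: int, bad_events: int) -> float:
    """Fraction of the period's error budget still unspent.

    1.0 means untouched, 0.0 means exhausted, negative means overspent.
    """
    allowed_bad = total_events * (1 - slo_pct / 100)
    if allowed_bad <= 0:
        raise ValueError("no error budget: SLO is 100% or there were no events")
    return 1 - bad_events / allowed_bad


def deploy_policy(remaining: float) -> str:
    """Map remaining budget to a release posture (illustrative two-state policy)."""
    if remaining <= 0:
        return "freeze non-critical deploys"
    return "ship normally"


# A 99.9% SLO over 1,000,000 requests allows about 1,000 bad requests.
healthy = error_budget_remaining(99.9, 1_000_000, 400)     # about 60% of budget left
overspent = error_budget_remaining(99.9, 1_000_000, 1200)  # budget exceeded
```

Here `deploy_policy(healthy)` returns "ship normally" and `deploy_policy(overspent)` returns "freeze non-critical deploys".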

The review cadence

The common mistakes

What good looks like

Related: Vendor management · Risk management · Customer success ops