
SLA vs. SLO vs. SLI: What Actually Matters (With Templates)

SLI = what you measure. SLO = your target. SLA = your promise. Here's how to set realistic targets, use error budgets to prioritize, and avoid the 99.9% trap.

Niketa Sharma · Jan 26, 2026 · 14 min read

You've seen the sales deck: "99.9% uptime guaranteed."

Ask the engineering team what that means. What happens when you miss it? Who decides what counts as downtime?

Often, nobody can answer quickly.

SLA, SLO, and SLI get used interchangeably. Teams set arbitrary targets ("let's do 99.9% because everyone else does"), then wonder why customers are angry when "nothing technically broke."

These aren't synonyms. They serve completely different purposes.

Here's what each one actually means and how to use them without creating busywork.

What You'll Learn

  • What SLI, SLO, and SLA actually mean (and why the order matters)
  • How to pick SLIs that customers care about (not just what's easy to measure)
  • How to set realistic SLO targets (not copy-paste 99.9%)
  • Error budgets: the framework that stops "is this urgent?" arguments
  • Copy-paste SLO template (30-minute setup)
  • Common mistakes and how to avoid them

SLI: What You Measure

Service Level Indicator. The actual metric you track.

SLI is the measurement. SLO is the target. SLA is the promise.

Good SLIs: Error rate, latency (p95, p99), availability. Things customers notice.

Bad SLIs: CPU utilization, memory usage, disk space. Things ops teams notice but users don't.

The trap: picking SLIs because they're easy to measure, not because they matter.

Track CPU as your SLI and you'll spend months optimizing it. Meanwhile, API latency spikes to 5 seconds and customers can't log in. Your dashboard looks perfect. Customers are furious.

The rule: If a user wouldn't notice it breaking, it's not an SLI. It's just a metric.

Common SLIs by Service Type

| Service Type | Good SLI | Why It Matters |
|---|---|---|
| API | Success rate (2xx/total requests) | Users see errors directly |
| API | Latency (p95 < 500ms) | Slow = broken for users |
| Database | Query success rate | Failed queries = broken features |
| Frontend | Time to interactive | Users abandon slow pages |
| Background jobs | Processing time per job | Delayed jobs = broken workflows |

Pick 1-2 SLIs per service. More than that and you're tracking everything, optimizing nothing.
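
To sanity-check an SLI before wiring up dashboards, the math fits in a few lines. Here's a minimal Python sketch; the `(status_code, latency_ms)` record format is illustrative, not any vendor's schema.

```python
import math

def compute_slis(requests):
    """Compute success rate and p95 latency from raw request records.

    requests: list of (status_code, latency_ms) tuples -- an
    illustrative shape, not a real monitoring export format.
    """
    total = len(requests)
    successes = sum(1 for status, _ in requests if 200 <= status < 300)
    latencies = sorted(latency for _, latency in requests)
    # p95 via the nearest-rank method: the latency 95% of requests stayed under.
    p95 = latencies[max(0, math.ceil(0.95 * total) - 1)]
    return {"success_rate": successes / total, "p95_latency_ms": p95}

sample = [(200, 120), (200, 340), (500, 90), (200, 610), (201, 95)]
print(compute_slis(sample))  # {'success_rate': 0.8, 'p95_latency_ms': 610}
```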

SLO: Your Internal Target

Service Level Objective. The number you're aiming for.

SLOs are internal targets. SLAs are external promises.

Example: "99.5% of API requests succeed within 500ms."

  • SLI = request success rate + latency
  • SLO = 99.5% threshold

SLOs are internal. You don't publish them to customers. They're how engineering defines "good enough" and aligns with incident response playbooks.

How to Pick an SLO (Don't Copy-Paste 99.9%)

Step 1: Look at your last 30 days

What are you actually delivering right now?

If you're at 99.3%, don't set a target of 99.9%. You'll miss it immediately and the number becomes meaningless.

Step 2: Set the target slightly below current reality

Give yourself room for bad days.

  • Current performance: 99.7%
  • Target SLO: 99.5%
  • Buffer: 0.2% for unexpected issues

Step 3: Validate it maps to user experience

Ask: "If we hit 99.5%, will customers be happy?"

If the answer is no, your SLI is wrong (not your target).
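
The first two steps reduce to a couple of lines. A sketch, with the 0.2% buffer as an illustrative default you'd tune:

```python
def suggest_slo(current_performance, buffer=0.002):
    """Suggest an SLO target slightly below measured reality.

    current_performance: last-30-day success rate as a fraction (e.g. 0.997).
    buffer: headroom for bad days -- 0.2% is an illustrative default.
    """
    # Round to a tenth of a percent so the target reads cleanly.
    return round(current_performance - buffer, 3)

print(suggest_slo(0.997))  # 0.995
```

Step 3 stays a human judgment call; no code replaces it.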

Monthly vs Weekly SLOs

Most teams use monthly SLOs because:

  • SLAs (contracts) are typically monthly
  • Industry standard for reporting
  • Easier to absorb bad days

But track weekly burn rate to avoid surprises:

  • Monthly SLO: 99.5% = 216 minutes allowed downtime
  • Weekly burn rate: 216 ÷ 4.33 ≈ 50 minutes/week
  • If you burn 200 minutes in week 1, you're in trouble

Policy example:

  • Track monthly SLO (99.5%)
  • Review weekly burn rate
  • Trigger escalation at 50% of monthly budget burned
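
In code, that policy is a quick sketch. The 50% threshold and 30-day month come from the example above; everything else is illustrative:

```python
MINUTES_PER_MONTH = 43_200  # 30-day month

def monthly_budget_minutes(slo_percent):
    """Total allowed downtime per month for a given SLO."""
    return (100 - slo_percent) / 100 * MINUTES_PER_MONTH

def should_escalate(downtime_so_far_min, slo_percent=99.5, threshold=0.5):
    """Trigger escalation once half the monthly budget is burned."""
    budget = monthly_budget_minutes(slo_percent)  # 216 min at 99.5%
    return downtime_so_far_min / budget >= threshold

print(should_escalate(200))  # True -- 200 of 216 minutes burned is trouble
```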

The Cost of Nines

Each additional "9" typically costs an order of magnitude more in effort and infrastructure, depending on architecture and org maturity.

| Uptime Target | Downtime/Year | Downtime/Month | What It Takes |
|---|---|---|---|
| 99% | 3.65 days | ~7.2 hours | Basic monitoring, manual responses |
| 99.5% | 1.83 days | ~3.6 hours | Automated alerts, on-call rotation |
| 99.9% | 8.77 hours | ~43 minutes | Redundancy, automated failover |
| 99.99% | 52 minutes | ~4 minutes | Multi-region, chaos engineering |

Promise 99.99% to win a deal and you might spend $50k/month on infrastructure for a $5k/month customer.

Sales shouldn't set SLOs without engineering sign-off.

SLA: Your External Promise

Service Level Agreement. The contract with consequences.

SLAs are external. They define what happens when you miss your target.

Example: "We commit to 99.5% monthly uptime. If we fall below, you get a 10% service credit."

Who Needs an SLA?

Yes:

  • B2B selling to enterprises
  • Contracts with procurement teams
  • Customers who require guaranteed uptime

No:

  • Early-stage startups (under 50 customers)
  • Internal tools
  • Self-serve products with monthly billing

A 20-person startup calculating SLA credits for $50/month customers is creating accounting busywork without meaningful upside.

Smart Buffer: Internal SLO > External SLA

Don't promise externally what you barely deliver internally.

Example setup:

  • Internal SLO: 99.7% (what engineering targets)
  • External SLA: 99.5% (what customers get promised)
  • Buffer: 0.2% for unexpected issues

Gives you room to have a bad week without breaching customer contracts.

Error Budget: What Makes This Actually Useful

Error budget is how teams decide: ship features, or pay down reliability debt.

SLOs without error budgets are just numbers on a dashboard.

Error budgets turn SLOs into a prioritization framework.

The Math

Error budget = 100% - SLO target

If your SLO is 99.5%, your error budget is 0.5%.

| SLO Target | Error Budget/Month | Weekly Burn Rate Estimate |
|---|---|---|
| 99.9% | ~43 minutes | ~10 minutes |
| 99.5% | ~3.6 hours | ~50 minutes |
| 99% | ~7.2 hours | ~1.7 hours |

Weekly burn rate = monthly budget ÷ 4.33 weeks. Track it weekly to avoid burning the entire monthly budget early.

How Teams Use Error Budgets

The rule: If you have budget left, ship features. If you're burning budget, stop shipping and fix reliability.

Example policy:

  • Weekly error budget drops below 50%? → Triage. Identify root cause.
  • Weekly error budget drops below 20%? → Feature freeze. Reliability becomes priority #1.
  • Error budget refills weekly. Start fresh every Monday.

No more arguments about "is this urgent?"

Burning error budget = urgent. Not burning = queue it.
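
The whole policy fits in one function. A sketch mirroring the thresholds above (the weekly refill is an assumption; some teams prefer a rolling window):

```python
def error_budget_action(remaining_fraction):
    """Map remaining weekly error budget to an action.

    remaining_fraction: share of this week's budget still unspent (0.0-1.0).
    """
    if remaining_fraction < 0.2:
        return "feature freeze -- reliability is priority #1"
    if remaining_fraction < 0.5:
        return "triage -- identify root cause"
    return "ship features"

print(error_budget_action(0.80))  # ship features
print(error_budget_action(0.35))  # triage -- identify root cause
print(error_budget_action(0.10))  # feature freeze -- reliability is priority #1
```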

How to Set Your First SLO in 30 Minutes

Here's the step-by-step process.

Step 1: Pick Your Most Important Service (5 minutes)

Start with one service. The one customers complain about when it breaks.

API? Database? Frontend?

Step 2: Choose 1-2 SLIs (10 minutes)

Ask: "What do users notice when this breaks?"

For an API:

  • Success rate (requests returning 2xx / total requests)
  • Latency (p95 response time)

For a database:

  • Query success rate
  • Query latency (p99)

For a frontend:

  • Page load time (p95)
  • Time to interactive

Pick the one that matters most. Don't track everything.

Step 3: Measure Current Performance (10 minutes)

Pull the last 30 days of data.

What's your actual success rate? 99.2%? 99.7%? 98.5%?

Be honest. No aspirational numbers.
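
If you can export raw request records from your logs or metrics store, the measurement is a filter and a ratio. A sketch, assuming a simple `(timestamp, status_code)` shape:

```python
from datetime import datetime, timedelta, timezone

def last_30_day_success_rate(records):
    """Success rate over the trailing 30 days.

    records: list of (timestamp, status_code) tuples -- the shape is
    illustrative, not a real export format.
    """
    cutoff = datetime.now(timezone.utc) - timedelta(days=30)
    window = [code for ts, code in records if ts >= cutoff]
    if not window:
        return None  # no traffic in the window; nothing to measure
    ok = sum(1 for code in window if 200 <= code < 300)
    return ok / len(window)
```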

Step 4: Set Target Slightly Below Reality (5 minutes)

  • Current: 99.7%
  • Target SLO: 99.5%

Give yourself buffer.

Done. You Have an SLO.

Now track it weekly. When you burn error budget, investigate. When you have budget, ship features.

SLO Template (Copy-Paste)

Use this to document your first SLO.

## SLO: [Service Name]

**Service:** [e.g., Payment API]
**Owner:** [Team name]
**Last updated:** [Date]

### SLI (What We Measure)
- Metric: [e.g., Request success rate]
- Definition: [e.g., HTTP 2xx responses / total requests]
- Measurement window: [e.g., Monthly, evaluated weekly]

### SLO (Our Target)
- Target: [e.g., 99.5% success rate]
- Current performance (last 30 days): [e.g., 99.7%]
- Error budget: [e.g., 0.5% = 216 minutes/month or ~50 minutes/week burn rate]

### SLA (External Promise) - Optional
- Customer promise: [e.g., 99.5% monthly uptime]
- Consequence: [e.g., 10% service credit if breached]
- Measurement period: [e.g., Monthly]

### Escalation Policy
- Error budget < 50%: Triage, identify root cause
- Error budget < 20%: Feature freeze, fix reliability
- Error budget refills: Weekly (every Monday)

Combine with [incident severity levels](/blog/incident-severity-levels) to align response urgency.

### How We Measure
- Dashboard: [Link to dashboard]
- Alert: [Link to alert config]
- On-call: [Link to on-call schedule]

Copy this. Fill in the blanks. You're done.

Real Examples (What This Looks Like in Practice)

Here are common patterns.

Example 1: API Service (B2B SaaS)

Service: User authentication API
SLI: Request success rate
Internal SLO: 99.7% weekly
External SLA: 99.5% monthly
Error budget: ~30 min/week (internal), ~3.6 hours/month (external)

How they use it:

  • Daily dashboard shows weekly SLO burn rate
  • If weekly drops below 99.5%, all-hands triage
  • Sales can't promise below 99.5% without engineering sign-off
  • If error budget hits 20%, feature work pauses

Why it works: Clear line between "we're fine" and "drop everything."

Example 2: Background Job Processing

Service: Email sending queue
SLI: Processing time per job
Internal SLO: 95% of jobs processed within 5 minutes
External SLA: None (internal tool)
Error budget: 5% of jobs can exceed 5 minutes

How they use it:

  • Jobs taking > 5 minutes get logged
  • If more than 5% exceed threshold in a day, investigate
  • No external SLA because it's internal tooling

Why it works: Simple threshold, no customer promises needed.
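
The daily check is a few lines. A sketch, assuming job durations in minutes come from your queue's metrics:

```python
def jobs_slo_breached(durations_min, threshold_min=5, budget=0.05):
    """True if more than 5% of jobs exceeded the 5-minute threshold."""
    slow = sum(1 for d in durations_min if d > threshold_min)
    return slow / len(durations_min) > budget

today = [1.2, 3.4, 6.1, 2.0, 4.9, 7.5, 0.8, 2.2, 3.1, 1.9]
print(jobs_slo_breached(today))  # True -- 2 of 10 jobs (20%) ran long
```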

Example 3: The Team That Set 99.99% and Regretted It

A startup promised 99.99% uptime to land an enterprise deal.

The contract was $10k/month. The infrastructure to deliver 99.99%? $30k/month in redundancy, multi-region failover, and 24/7 on-call.

Six months in, they renegotiated down to 99.5%. The customer didn't care (they never checked the SLA). Engineering stopped hemorrhaging budget.

The lesson: Don't promise nines you can't afford.

What Teams Get Wrong

Mistake 1: Copying 99.9% Without Doing the Math

99.9% uptime = ~8.7 hours/year downtime allowed
99.99% uptime = ~52 minutes/year downtime allowed

Closing that gap is often an order of magnitude more expensive.

Chase 99.99% because a competitor claimed it and you'll discover they measured it differently.

Mistake 2: Setting SLOs You Can't Measure

Team sets 99.9% uptime but doesn't have:

  • Automated monitoring
  • Clear definition of what counts as "down"
  • Alerting when they're out of SLO

Your SLO is 99.9%. Someone asks "how did we do last month?" and the answer is "we haven't set that up yet."

That's not an SLO. That's a goal written on a napkin.

Mistake 3: No Buffer Between Internal and External

Team sets:

  • Internal SLO: 99.5%
  • External SLA: 99.5%

First bad week? Immediate SLA breach. Customer credits. Angry emails.

Better:

  • Internal SLO: 99.7%
  • External SLA: 99.5%
  • Buffer: 0.2% wiggle room

Gives you space to have a bad week without breaching contracts.

Mistake 4: Too Many SLOs

Team tracks 15 SLOs across 3 services.

Result: Everything's yellow. Nothing's a priority. Analysis paralysis.

Better: 1-2 SLOs per service. Track what matters. Ignore the rest.

Mistake 5: SLOs Nobody Checks

Team sets SLOs in a wiki. Nobody looks at them until a customer complains.

Better: Daily dashboard. Weekly review. Automated alerts when burning error budget.

If nobody's checking your SLO, you don't have an SLO.

Error Budget Calculator

Use this to calculate your error budget.

Formula:

Error budget (minutes/month) = (100% - SLO%) × 43,200 (the minutes in a 30-day month)

Examples:

| SLO | Calculation | Error Budget/Month |
|---|---|---|
| 99.9% | (100% - 99.9%) × 43,200 | 43.2 minutes |
| 99.5% | (100% - 99.5%) × 43,200 | 216 minutes (3.6 hours) |
| 99% | (100% - 99%) × 43,200 | 432 minutes (7.2 hours) |
| 95% | (100% - 95%) × 43,200 | 2,160 minutes (36 hours) |

Weekly estimate (from a monthly SLO):
Divide the monthly minutes by 4.33 (weeks per month)

99.5% monthly SLO = ~50 minutes/week error budget
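
Or let code do the arithmetic. This sketch regenerates the numbers above for a 30-day month:

```python
MINUTES_PER_30_DAY_MONTH = 43_200
WEEKS_PER_MONTH = 4.33

def error_budget_minutes(slo_percent):
    """Monthly error budget in minutes for a 30-day month."""
    return (100 - slo_percent) / 100 * MINUTES_PER_30_DAY_MONTH

for slo in (99.9, 99.5, 99.0, 95.0):
    monthly = error_budget_minutes(slo)
    print(f"{slo}% -> {monthly:.0f} min/month, {monthly / WEEKS_PER_MONTH:.0f} min/week")
# 99.9% -> 43 min/month, 10 min/week
# 99.5% -> 216 min/month, 50 min/week
# 99.0% -> 432 min/month, 100 min/week
# 95.0% -> 2160 min/month, 499 min/week
```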

Quick Reference

| Term | What It Is | Who Sets It | Example | Public? |
|---|---|---|---|---|
| SLI | The metric you track | Engineering | Error rate, latency | No |
| SLO | Your internal target | Engineering | 99.5% success rate | No |
| SLA | Your external promise | Business/Legal | "99.5% uptime or 10% credit" | Yes |

Key insight: SLIs and SLOs are for engineering. SLAs are for customers and contracts.

The Bottom Line

  • SLI = what you measure (pick what users notice, not what's easy)
  • SLO = your internal target (set it below current reality, not aspirational)
  • SLA = your external promise (only if selling to enterprises)

Use error budgets to drive prioritization. Stop arguing about "is this urgent?" Let your error budget decide.

Start with 1 service, 1-2 SLIs, 1 SLO. Add complexity only when needed.

If you're setting SLOs based on competitor claims, you'll end up optimizing the wrong thing. Set them based on what you can actually deliver, then improve.

Common Questions

What's the difference between SLO and SLA?
SLO = internal target (what engineering aims for). SLA = external promise with contractual consequences (what customers get). Your SLO should be stricter than your SLA to give yourself buffer.

What SLO should I set?
Look at your last 30 days of actual performance. Set the target slightly below that (0.2-0.5% buffer). Don't copy-paste 99.9% because it sounds good.

Do I need an SLA?
Only if you're selling to enterprises that require contractual guarantees. Most startups don't need SLAs until Series B+. Internal tools never need SLAs.

How many SLOs should I have?
Start with 1-2 per service. More than that and you're tracking everything, prioritizing nothing. Focus beats coverage.

What if we miss our SLO?
Nothing happens contractually (that's what SLAs are for). But if you miss consistently, either (1) you have a reliability problem, or (2) your target is wrong. Investigate which.

How do I calculate error budget?
Error budget = 100% - SLO target. For a 99.5% SLO, the error budget is 0.5%. In a 30-day month (43,200 minutes), that's 216 minutes or 3.6 hours of allowed downtime.

What's a realistic SLO for a startup?
99% to 99.5% is realistic for most startups. 99.9% requires significant investment. 99.99% is overkill unless you're in fintech, healthcare, or selling to enterprises with hard requirements.

Should internal tools have SLOs?
Only if they're critical. Your deployment pipeline? Maybe. Your internal wiki? Probably not. Don't create SLO overhead for tools that don't need it.

How often should I review SLOs?
Weekly for error budget burn. Quarterly for target adjustments. If your performance drifts significantly (up or down), update the SLO target.
