
SLA vs. SLO vs. SLI: What Actually Matters (With Templates)

SLI = what you measure. SLO = your target. SLA = your promise. Here's how to set realistic targets, use error budgets to prioritize, and avoid the 99.9% trap.

Niketa Sharma · Jan 26, 2026 · 14 min read

You've seen the sales deck: "99.9% uptime guaranteed."

Ask the engineering team what that means. What happens when you miss it? Who decides what counts as downtime?

Often, nobody can answer quickly.

SLA, SLO, and SLI get used interchangeably. Teams set arbitrary targets ("let's do 99.9% because everyone else does"), then wonder why customers are angry when "nothing technically broke."

These aren't synonyms. They serve completely different purposes.

Here's what each one actually means and how to use them without creating busywork.

What You'll Learn

  • What SLI, SLO, and SLA actually mean (and why the order matters)
  • How to pick SLIs that customers care about (not just what's easy to measure)
  • How to set realistic SLO targets (not copy-paste 99.9%)
  • Error budgets: the framework that stops "is this urgent?" arguments
  • Copy-paste SLO template (30-minute setup)
  • Common mistakes and how to avoid them

SLI: What You Measure

Service Level Indicator. The actual metric you track.

SLI is the measurement. SLO is the target. SLA is the promise.

Good SLIs: Error rate, latency (p95, p99), availability. Things customers notice.

Bad SLIs: CPU utilization, memory usage, disk space. Things ops teams notice but users don't.

The trap: picking SLIs because they're easy to measure, not because they matter.

Track CPU as your SLI and you'll spend months optimizing it. Meanwhile, API latency spikes to 5 seconds and customers can't log in. Your dashboard looks perfect. Customers are furious.

The rule: If a user wouldn't notice it breaking, it's not an SLI. It's just a metric.

Common SLIs by Service Type

| Service Type | Good SLI | Why It Matters |
|---|---|---|
| API | Success rate (2xx/total requests) | Users see errors directly |
| API | Latency (p95 < 500ms) | Slow = broken for users |
| Database | Query success rate | Failed queries = broken features |
| Frontend | Time to interactive | Users abandon slow pages |
| Background jobs | Processing time per job | Delayed jobs = broken workflows |

Pick 1-2 SLIs per service. More than that and you're tracking everything, optimizing nothing.
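
To sanity-check an SLI before wiring up dashboards, the math fits in a few lines. Here's a minimal Python sketch; the `(status_code, latency_ms)` record format is illustrative, not any vendor's schema.

```python
import math

def compute_slis(requests):
    """Compute success rate and p95 latency from raw request records.

    requests: list of (status_code, latency_ms) tuples -- an
    illustrative shape, not a real monitoring export format.
    """
    total = len(requests)
    successes = sum(1 for status, _ in requests if 200 <= status < 300)
    latencies = sorted(latency for _, latency in requests)
    # p95 via the nearest-rank method: the latency 95% of requests stayed under.
    p95 = latencies[max(0, math.ceil(0.95 * total) - 1)]
    return {"success_rate": successes / total, "p95_latency_ms": p95}

sample = [(200, 120), (200, 340), (500, 90), (200, 610), (201, 95)]
print(compute_slis(sample))  # {'success_rate': 0.8, 'p95_latency_ms': 610}
```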

SLO: Your Internal Target

Service Level Objective. The number you're aiming for.

SLOs are internal targets. SLAs are external promises.

Example: "99.5% of API requests succeed within 500ms."

  • SLI = request success rate + latency
  • SLO = 99.5% threshold

SLOs are internal. You don't publish them to customers. They're how engineering defines "good enough" and aligns with incident response playbooks.

How to Pick an SLO (Don't Copy-Paste 99.9%)

Step 1: Look at your last 30 days

What are you actually delivering right now?

If you're at 99.3%, don't set a target of 99.9%. You'll miss it immediately and the number becomes meaningless.

Step 2: Set the target slightly below current reality

Give yourself room for bad days.

  • Current performance: 99.7%
  • Target SLO: 99.5%
  • Buffer: 0.2% for unexpected issues

Step 3: Validate it maps to user experience

Ask: "If we hit 99.5%, will customers be happy?"

If the answer is no, your SLI is wrong (not your target).
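
The first two steps reduce to a couple of lines. A sketch, with the 0.2% buffer as an illustrative default you'd tune:

```python
def suggest_slo(current_performance, buffer=0.002):
    """Suggest an SLO target slightly below measured reality.

    current_performance: last-30-day success rate as a fraction (e.g. 0.997).
    buffer: headroom for bad days -- 0.2% is an illustrative default.
    """
    # Round to a tenth of a percent so the target reads cleanly.
    return round(current_performance - buffer, 3)

print(suggest_slo(0.997))  # 0.995
```

Step 3 stays a human judgment call; no code replaces it.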

Monthly vs Weekly SLOs

Most teams use monthly SLOs because:

  • SLAs (contracts) are typically monthly
  • Industry standard for reporting
  • Easier to absorb bad days

But track weekly burn rate to avoid surprises:

  • Monthly SLO: 99.5% = 216 minutes allowed downtime
  • Weekly burn rate: 216 ÷ 4.33 ≈ 50 minutes/week
  • If you burn 200 minutes in week 1, you're in trouble

Policy example:

  • Track monthly SLO (99.5%)
  • Review weekly burn rate
  • Trigger escalation at 50% of monthly budget burned
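
In code, that policy is a quick sketch. The 50% threshold and 30-day month come from the example above; everything else is illustrative:

```python
MINUTES_PER_MONTH = 43_200  # 30-day month

def monthly_budget_minutes(slo_percent):
    """Total allowed downtime per month for a given SLO."""
    return (100 - slo_percent) / 100 * MINUTES_PER_MONTH

def should_escalate(downtime_so_far_min, slo_percent=99.5, threshold=0.5):
    """Trigger escalation once half the monthly budget is burned."""
    budget = monthly_budget_minutes(slo_percent)  # 216 min at 99.5%
    return downtime_so_far_min / budget >= threshold

print(should_escalate(200))  # True -- 200 of 216 minutes burned is trouble
```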

The Cost of Nines

Each additional "9" typically costs an order of magnitude more in effort and infrastructure, depending on architecture and org maturity.

| Uptime Target | Downtime/Year | Downtime/Month | What It Takes |
|---|---|---|---|
| 99% | 3.65 days | ~7.2 hours | Basic monitoring, manual responses |
| 99.5% | 1.83 days | ~3.6 hours | Automated alerts, on-call rotation |
| 99.9% | 8.77 hours | ~43 minutes | Redundancy, automated failover |
| 99.99% | 52 minutes | ~4 minutes | Multi-region, chaos engineering |

Promise 99.99% to win a deal and you might spend $50k/month on infrastructure for a $5k/month customer.

Sales shouldn't set SLOs without engineering sign-off.

SLA: Your External Promise

Service Level Agreement. The contract with consequences.

SLAs are external. They define what happens when you miss your target.

Example: "We commit to 99.5% monthly uptime. If we fall below, you get a 10% service credit."

Who Needs an SLA?

Yes:

  • B2B selling to enterprises
  • Contracts with procurement teams
  • Customers who require guaranteed uptime

No:

  • Early-stage startups (under 50 customers)
  • Internal tools
  • Self-serve products with monthly billing

A 20-person startup calculating SLA credits for $50/month customers is creating accounting busywork without meaningful upside.

Smart Buffer: Internal SLO > External SLA

Don't promise externally what you barely deliver internally.

Example setup:

  • Internal SLO: 99.7% (what engineering targets)
  • External SLA: 99.5% (what customers get promised)
  • Buffer: 0.2% for unexpected issues

Gives you room to have a bad week without breaching customer contracts.

Error Budget: What Makes This Actually Useful

Error budget is how teams decide: ship features, or pay down reliability debt.

SLOs without error budgets are just numbers on a dashboard.

Error budgets turn SLOs into a prioritization framework.

The Math

Error budget = 100% - SLO target

If your SLO is 99.5%, your error budget is 0.5%.

| SLO Target | Error Budget/Month | Weekly Burn Rate Estimate |
|---|---|---|
| 99.9% | ~43 minutes | ~10 minutes |
| 99.5% | ~3.6 hours | ~50 minutes |
| 99% | ~7.2 hours | ~1.7 hours |

Weekly burn rate = monthly budget ÷ 4.33 weeks. Track it weekly to avoid burning the entire monthly budget early.

How Teams Use Error Budgets

The rule: If you have budget left, ship features. If you're burning budget, stop shipping and fix reliability.

Example policy:

  • Weekly error budget drops below 50%? → Triage. Identify root cause.
  • Weekly error budget drops below 20%? → Feature freeze. Reliability becomes priority #1.
  • Error budget refills weekly. Start fresh every Monday.

No more arguments about "is this urgent?"

Burning error budget = urgent. Not burning = queue it.
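
The whole policy fits in one function. A sketch mirroring the thresholds above (the weekly refill is an assumption; some teams prefer a rolling window):

```python
def error_budget_action(remaining_fraction):
    """Map remaining weekly error budget to an action.

    remaining_fraction: share of this week's budget still unspent (0.0-1.0).
    """
    if remaining_fraction < 0.2:
        return "feature freeze -- reliability is priority #1"
    if remaining_fraction < 0.5:
        return "triage -- identify root cause"
    return "ship features"

print(error_budget_action(0.80))  # ship features
print(error_budget_action(0.35))  # triage -- identify root cause
print(error_budget_action(0.10))  # feature freeze -- reliability is priority #1
```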

How to Set Your First SLO in 30 Minutes

Here's the step-by-step process.

Step 1: Pick Your Most Important Service (5 minutes)

Start with one service. The one customers complain about when it breaks.

API? Database? Frontend?

Step 2: Choose 1-2 SLIs (10 minutes)

Ask: "What do users notice when this breaks?"

For an API:

  • Success rate (requests returning 2xx / total requests)
  • Latency (p95 response time)

For a database:

  • Query success rate
  • Query latency (p99)

For a frontend:

  • Page load time (p95)
  • Time to interactive

Pick the one that matters most. Don't track everything.

Step 3: Measure Current Performance (10 minutes)

Pull the last 30 days of data.

What's your actual success rate? 99.2%? 99.7%? 98.5%?

Be honest. No aspirational numbers.
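
If you can export raw request records from your logs or metrics store, the measurement is a filter and a ratio. A sketch, assuming a simple `(timestamp, status_code)` shape:

```python
from datetime import datetime, timedelta, timezone

def last_30_day_success_rate(records):
    """Success rate over the trailing 30 days.

    records: list of (timestamp, status_code) tuples -- the shape is
    illustrative, not a real export format.
    """
    cutoff = datetime.now(timezone.utc) - timedelta(days=30)
    window = [code for ts, code in records if ts >= cutoff]
    if not window:
        return None  # no traffic in the window; nothing to measure
    ok = sum(1 for code in window if 200 <= code < 300)
    return ok / len(window)
```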

Step 4: Set Target Slightly Below Reality (5 minutes)

  • Current: 99.7%
  • Target SLO: 99.5%

Give yourself buffer.

Done. You Have an SLO.

Now track it weekly. When you burn error budget, investigate. When you have budget, ship features.

SLO Template (Copy-Paste)

Use this to document your first SLO.

## SLO: [Service Name]

**Service:** [e.g., Payment API]
**Owner:** [Team name]
**Last updated:** [Date]

### SLI (What We Measure)
- Metric: [e.g., Request success rate]
- Definition: [e.g., HTTP 2xx responses / total requests]
- Measurement window: [e.g., Monthly, evaluated weekly]

### SLO (Our Target)
- Target: [e.g., 99.5% success rate]
- Current performance (last 30 days): [e.g., 99.7%]
- Error budget: [e.g., 0.5% = 216 minutes/month or ~50 minutes/week burn rate]

### SLA (External Promise) - Optional
- Customer promise: [e.g., 99.5% monthly uptime]
- Consequence: [e.g., 10% service credit if breached]
- Measurement period: [e.g., Monthly]

### Escalation Policy
- Error budget < 50%: Triage, identify root cause
- Error budget < 20%: Feature freeze, fix reliability
- Error budget refills: Weekly (every Monday)

Combine with [incident severity levels](/blog/incident-severity-levels) to align response urgency.

### How We Measure
- Dashboard: [Link to dashboard]
- Alert: [Link to alert config]
- On-call: [Link to on-call schedule]

Copy this. Fill in the blanks. You're done.

Real Examples (What This Looks Like in Practice)

Here are common patterns.

Example 1: API Service (B2B SaaS)

Service: User authentication API
SLI: Request success rate
Internal SLO: 99.7% weekly
External SLA: 99.5% monthly
Error budget: ~30 min/week (internal), ~3.6 hours/month (external)

How they use it:

  • Daily dashboard shows weekly SLO burn rate
  • If weekly drops below 99.5%, all-hands triage
  • Sales can't promise below 99.5% without engineering sign-off
  • If error budget hits 20%, feature work pauses

Why it works: Clear line between "we're fine" and "drop everything."

Example 2: Background Job Processing

Service: Email sending queue
SLI: Processing time per job
Internal SLO: 95% of jobs processed within 5 minutes
External SLA: None (internal tool)
Error budget: 5% of jobs can exceed 5 minutes

How they use it:

  • Jobs taking > 5 minutes get logged
  • If more than 5% exceed threshold in a day, investigate
  • No external SLA because it's internal tooling

Why it works: Simple threshold, no customer promises needed.
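
The daily check is a few lines. A sketch, assuming job durations in minutes come from your queue's metrics:

```python
def jobs_slo_breached(durations_min, threshold_min=5, budget=0.05):
    """True if more than 5% of jobs exceeded the 5-minute threshold."""
    slow = sum(1 for d in durations_min if d > threshold_min)
    return slow / len(durations_min) > budget

today = [1.2, 3.4, 6.1, 2.0, 4.9, 7.5, 0.8, 2.2, 3.1, 1.9]
print(jobs_slo_breached(today))  # True -- 2 of 10 jobs (20%) ran long
```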

Example 3: The Team That Set 99.99% and Regretted It

A startup promised 99.99% uptime to land an enterprise deal.

The contract was $10k/month. The infrastructure to deliver 99.99%? $30k/month in redundancy, multi-region failover, and 24/7 on-call.

Six months in, they renegotiated down to 99.5%. The customer didn't care (they never checked the SLA). Engineering stopped hemorrhaging budget.

The lesson: Don't promise nines you can't afford.

What Teams Get Wrong

Mistake 1: Copying 99.9% Without Doing the Math

99.9% uptime = ~8.7 hours/year downtime allowed
99.99% uptime = ~52 minutes/year downtime allowed

Closing that gap is often an order of magnitude more expensive.

Chase 99.99% because a competitor claimed it and you'll discover they measured it differently.

Mistake 2: Setting SLOs You Can't Measure

Team sets 99.9% uptime but doesn't have:

  • Automated monitoring
  • Clear definition of what counts as "down"
  • Alerting when they're out of SLO

Your SLO is 99.9%. Someone asks "how did we do last month?" and the answer is "we haven't set that up yet."

That's not an SLO. That's a goal written on a napkin.

Mistake 3: No Buffer Between Internal and External

Team sets:

  • Internal SLO: 99.5%
  • External SLA: 99.5%

First bad week? Immediate SLA breach. Customer credits. Angry emails.

Better:

  • Internal SLO: 99.7%
  • External SLA: 99.5%
  • Buffer: 0.2% wiggle room

Gives you space to have a bad week without breaching contracts.

Mistake 4: Too Many SLOs

Team tracks 15 SLOs across 3 services.

Result: Everything's yellow. Nothing's a priority. Analysis paralysis.

Better: 1-2 SLOs per service. Track what matters. Ignore the rest.

Mistake 5: SLOs Nobody Checks

Team sets SLOs in a wiki. Nobody looks at them until a customer complains.

Better: Daily dashboard. Weekly review. Automated alerts when burning error budget.

If nobody's checking your SLO, you don't have an SLO.

Error Budget Calculator

Use this to calculate your error budget.

Formula:

Error budget (minutes/month) = (100% - SLO%) × 43,200 (the minutes in a 30-day month)

Examples:

| SLO | Calculation | Error Budget/Month |
|---|---|---|
| 99.9% | (100% - 99.9%) × 43,200 | 43.2 minutes |
| 99.5% | (100% - 99.5%) × 43,200 | 216 minutes (3.6 hours) |
| 99% | (100% - 99%) × 43,200 | 432 minutes (7.2 hours) |
| 95% | (100% - 95%) × 43,200 | 2,160 minutes (36 hours) |

Weekly estimate (from a monthly SLO):
Divide the monthly minutes by 4.33 (weeks per month)

99.5% monthly SLO = ~50 minutes/week error budget
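
Or let code do the arithmetic. This sketch regenerates the numbers above for a 30-day month:

```python
MINUTES_PER_30_DAY_MONTH = 43_200
WEEKS_PER_MONTH = 4.33

def error_budget_minutes(slo_percent):
    """Monthly error budget in minutes for a 30-day month."""
    return (100 - slo_percent) / 100 * MINUTES_PER_30_DAY_MONTH

for slo in (99.9, 99.5, 99.0, 95.0):
    monthly = error_budget_minutes(slo)
    print(f"{slo}% -> {monthly:.0f} min/month, {monthly / WEEKS_PER_MONTH:.0f} min/week")
# 99.9% -> 43 min/month, 10 min/week
# 99.5% -> 216 min/month, 50 min/week
# 99.0% -> 432 min/month, 100 min/week
# 95.0% -> 2160 min/month, 499 min/week
```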

Quick Reference

| Term | What It Is | Who Sets It | Example | Public? |
|---|---|---|---|---|
| SLI | The metric you track | Engineering | Error rate, latency | No |
| SLO | Your internal target | Engineering | 99.5% success rate | No |
| SLA | Your external promise | Business/Legal | "99.5% uptime or 10% credit" | Yes |

Key insight: SLIs and SLOs are for engineering. SLAs are for customers and contracts.

The Bottom Line

  • SLI = what you measure (pick what users notice, not what's easy)
  • SLO = your internal target (set it below current reality, not aspirational)
  • SLA = your external promise (only if selling to enterprises)

Use error budgets to drive prioritization. Stop arguing about "is this urgent?" Let your error budget decide.

Start with 1 service, 1-2 SLIs, 1 SLO. Add complexity only when needed.

If you're setting SLOs based on competitor claims, you'll end up optimizing the wrong thing. Set them based on what you can actually deliver, then improve.

Common Questions

What's the difference between SLO and SLA?
SLO = internal target (what engineering aims for). SLA = external promise with contractual consequences (what customers get). Your SLO should be stricter than your SLA to give yourself buffer.

What SLO should I set?
Look at your last 30 days of actual performance. Set the target slightly below that (0.2-0.5% buffer). Don't copy-paste 99.9% because it sounds good.

Do I need an SLA?
Only if you're selling to enterprises that require contractual guarantees. Most startups don't need SLAs until Series B+. Internal tools never need SLAs.

How many SLOs should I have?
Start with 1-2 per service. More than that and you're tracking everything, prioritizing nothing. Focus beats coverage.

What if we miss our SLO?
Nothing happens contractually (that's what SLAs are for). But if you miss consistently, either (1) you have a reliability problem, or (2) your target is wrong. Investigate which.

How do I calculate error budget?
Error budget = 100% - SLO target. For a 99.5% SLO, the error budget is 0.5%. In a 30-day month (43,200 minutes), that's 216 minutes or 3.6 hours of allowed downtime.

What's a realistic SLO for a startup?
99% to 99.5% is realistic for most startups. 99.9% requires significant investment. 99.99% is overkill unless you're in fintech, healthcare, or selling to enterprises with hard requirements.

Should internal tools have SLOs?
Only if they're critical. Your deployment pipeline? Maybe. Your internal wiki? Probably not. Don't create SLO overhead for tools that don't need it.

How often should I review SLOs?
Weekly for error budget burn. Quarterly for target adjustments. If your performance drifts significantly (up or down), update the SLO target.
