<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Runframe Blog</title>
    <link>https://runframe.io/blog</link>
    <description>Research-based insights on incident management from 30+ engineering teams</description>
    <language>en-us</language>
    <copyright>Copyright 2026 Runframe</copyright>
    <managingEditor>hello@runframe.io (Runframe Team)</managingEditor>
    <webMaster>hello@runframe.io (Runframe Team)</webMaster>
    <lastBuildDate>Wed, 08 Apr 2026 12:14:37 GMT</lastBuildDate>
    <generator>Runframe Blog RSS</generator>
    <ttl>1440</ttl>

    <item>
      <title><![CDATA[Your AI agent already knows your system better than ours ever will]]></title>
      <link>https://runframe.io/blog/your-ai-already-knows-your-system-better-than-ours</link>
      <guid>https://runframe.io/blog/your-ai-already-knows-your-system-better-than-ours</guid>
      <description><![CDATA[Every incident management vendor just shipped an AI agent. PagerDuty has one. incident.io has one. Even Linear just announced that agents are their entire future.
The pitch is always the same: "Our AI...]]></description>
      <content:encoded><![CDATA[<p>Every incident management vendor just shipped an AI agent. PagerDuty has one. incident.io has one. Even <a href="https://linear.app/next" target="_blank" rel="noopener noreferrer">Linear just announced</a> that agents are their entire future.</p>
<p>The pitch is always the same: "Our AI understands your incidents."</p>
<p>Here's the problem. Their AI doesn't know your codebase. It doesn't know that your payments service was rewritten last month, or that deploy #4,271 changed the retry logic, or that the last three outages were all caused by the same Redis connection pool. Their AI reads your incident titles and severity levels. That's it.</p>
<p>Your agent, the one running in Cursor or Claude Code or your custom pipeline, already knows all of that. It's read your code. It's seen your commits. It's helped you debug at 2 AM.</p>
<p>It just can't create an incident, page someone, or update a timeline. That's not an AI problem. That's an API problem.</p>

<h2 id="the-captive-agent-trap">The captive agent trap</h2>
<p>Here's what's happening across the industry right now.</p>
<p><a href="/comparisons/runframe-vs-pagerduty">PagerDuty</a> builds an AI agent that lives inside PagerDuty. It can summarize incidents and suggest runbooks, but only PagerDuty incidents, only PagerDuty runbooks. It doesn't know your deploy pipeline or your architecture.</p>
<p>incident.io builds a copilot that helps during incidents. It's useful inside their product. But it doesn't connect to your IDE, your CI/CD, your monitoring dashboards, or the agent that already knows your system.</p>
<p>Linear built agents as a core part of their product. Skills, automations, code intelligence, all built into Linear. Their framing is "the shared product system that turns context into execution."</p>
<p>Each of these is a captive agent. It lives inside the vendor's product, operates on the vendor's data, and sees your world through the vendor's lens.</p>
<p>The pitch sounds good in a demo. In practice, you end up with five different AI agents across five different tools, none of which talk to each other, each with a partial view of what's actually happening.</p>
<h2 id="why-your-own-agent-has-more-context">Why your own agent has more context</h2>
<p>Think about what your agent already knows when an alert fires.</p>
<p>It's read the code for the service that's failing. It knows the recent changes, can grep for the function that's throwing errors, and can tell you what changed in the last three deploys. It knows the payments service calls the billing service, which calls Stripe. If it's been in your repo for a few weeks, it's picked up your deploy cadence, your branch strategy, and how you test things. It's seen your postmortems.</p>
<p>An agent with access to Datadog or Grafana can correlate the alert with metrics, logs, and traces before anyone opens a browser tab.</p>
<p>No vendor-built AI will ever have this context. They'd need access to your entire codebase, your deploy history, your monitoring stack, and your team's communication patterns. That's not something you hand to every SaaS tool you use.</p>
<h2 id="the-api-problem-not-the-ai-problem">The API problem, not the AI problem</h2>
<p>When your agent sees an alert, it can diagnose what's wrong. What it can't do without the right API is act on it.</p>
<p>It can't create an incident in your system of record. It can't check who's on call and page them. It can't escalate when no one responds. It can't log what it found to the timeline so the human responder walks in with full context.</p>
<p>This is an integration problem. The agent needs an API that lets it participate in the incident lifecycle the same way a human would.</p>
<p>That's what we built. (If you're weighing whether to build this yourself, we wrote up the <a href="/blog/incident-management-build-or-buy">three-year TCO math on build vs buy</a>.)</p>
<pre><code class="language-bash">npx @runframe/mcp-server --setup
</code></pre>
<p>A tightly scoped <a href="https://github.com/runframe/runframe-mcp-server" target="_blank" rel="noopener noreferrer">MCP server</a> for the incident lifecycle, plus a full REST API. Your agent creates incidents, acknowledges them, pages responders, logs findings, escalates, and drafts postmortems. Doesn't matter if the agent is Claude, GPT, a custom model, or something that doesn't exist yet.</p>
<h2 id="what-this-looks-like-in-practice">What this looks like in practice</h2>
<p>An engineer is working in Cursor. A Datadog alert fires for elevated latency on the payments service.</p>
<p>Their agent, which already has the repo open, checks recent deploys, finds a retry logic change merged two hours ago, and creates an incident in Runframe with the relevant context. It checks who's on call, pages them with a summary that includes the suspected commit, and logs everything to the incident timeline.</p>
<p>The on-call engineer opens Slack, sees the page, and finds a timeline that already contains the alert details, the suspected root cause, the relevant commit, and a link to the diff. They're diagnosing in 30 seconds instead of 10 minutes.</p>
<p>No vendor AI did this. The engineer's own agent did, because it had the context and the API to act.</p>
<h2 id="captive-vs-open-the-architectural-bet">Captive vs. open: the architectural bet</h2>
<p>This is a real architectural decision, not a marketing angle.</p>
<p>Captive agents are built by the vendor, trained on the vendor's data, and locked to the vendor's product. Easy to demo. Hard to extend. When you switch tools, the AI doesn't come with you.</p>
<p>Open agents are yours. They run in your IDE, your CI/CD, your custom pipelines. They use whatever model you want. When you switch vendors, the agent stays.</p>
<table><caption class="sr-only">Comparison of captive and open agents</caption>
<thead>
<tr>
<th></th>
<th>Captive agent</th>
<th>Open agent</th>
</tr>
</thead>
<tbody><tr>
<td>Context</td>
<td>Only what the vendor sees</td>
<td>Your entire codebase + infra</td>
</tr>
<tr>
<td>Model</td>
<td>Vendor's choice</td>
<td>Your choice</td>
</tr>
<tr>
<td>Portability</td>
<td>Locked to vendor</td>
<td>Works across tools</td>
</tr>
<tr>
<td>Customization</td>
<td>Vendor's features</td>
<td>Your workflows</td>
</tr>
<tr>
<td>Cost</td>
<td>Bundled (opaque)</td>
<td>You control spend</td>
</tr>
</tbody></table>
<p>Cursor, Claude Code, VS Code, Windsurf all support MCP. The agent that helps you write code is the same agent that should help you respond to incidents. The industry is heading that direction whether any individual vendor likes it or not.</p>
<h2 id="quotbut-isn39t-mcp-deadquot">"But isn't MCP dead?"</h2>
<p>You've seen the posts. Perplexity's CTO moved away from MCP. Eric Holmes wrote "MCP is dead. Long live the CLI." A database MCP server with 106 tools burned 54,600 tokens just on tool discovery before doing anything useful. Security researchers found OAuth flaws, prompt injection vectors, and tool poisoning across open MCP servers.</p>
<p>These are real criticisms. And they mostly apply to MCP servers that shouldn't be MCP servers.</p>
<p>A database with 106 query tools? That's a bad MCP server. Of course the token overhead is brutal. You're asking the agent to discover and evaluate 106 tools it probably doesn't need. A CLI wrapper for <code>git</code> commands? Probably better as a CLI.</p>
<p>Runframe's MCP server is tightly scoped to one domain: the incident lifecycle. Create, acknowledge, escalate, page, resolve. An agent doesn't need to evaluate 106 options. It needs to manage an incident. The tool discovery overhead is minimal because the tool set is focused.</p>
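<p>To make the scoping argument concrete, here's a back-of-envelope sketch of a focused incident-lifecycle tool surface. The tool names and per-tool token estimate are illustrative assumptions, not Runframe's actual MCP schema; the 515 tokens/tool figure is derived from the 106-tool example above (54,600 / 106).</p>
<pre><code class="language-python"># Illustrative incident-lifecycle tool set; names and parameters are
# hypothetical, not Runframe's actual schema.
INCIDENT_TOOLS = {
    "create_incident": ["title", "severity", "summary"],
    "acknowledge":     ["incident_id"],
    "page_oncall":     ["incident_id", "message"],
    "escalate":        ["incident_id", "reason"],
    "log_finding":     ["incident_id", "note"],
    "resolve":         ["incident_id", "resolution"],
}

def discovery_overhead(tool_count, tokens_per_tool=515):
    """Rough prompt-token cost of advertising a tool set to an agent.
    515 tokens/tool is back-of-envelope from the 106-tool example."""
    return tool_count * tokens_per_tool

print(discovery_overhead(len(INCIDENT_TOOLS)))  # 3090
print(discovery_overhead(106))                  # 54590
</code></pre>
<p>Six tools versus 106 is an order-of-magnitude difference in discovery overhead before the agent does anything useful.</p>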
<p>The critics are right that MCP isn't the answer for everything. But they're wrong that it's dead. 97 million monthly SDK downloads. 17,000+ servers. OpenAI, Google, Microsoft, and AWS all adopted it. The Linux Foundation is stewarding it as an open standard. Bloomberg cut deployment timelines from days to minutes.</p>
<p>What actually died is the hype phase. The "just add MCP to everything" era. What replaced it is pragmatic adoption: use MCP where agent-driven tool discovery matters, use direct APIs where the workflow is stable and known.</p>
<p>Incident management is one of the places where MCP fits well. An agent doesn't know ahead of time whether it'll need to create an incident, or just check who's on call, or escalate. The workflow depends on what's happening. That's what tool discovery is for.</p>
<p>And for teams that prefer direct API calls? We ship a full REST API too. Same capabilities, different interface. Use whatever your agent prefers.</p>
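<p>For the direct-API route, the agent's side is ordinary HTTP. A sketch of the request an agent might build to open an incident; the base URL, endpoint path, and payload fields here are assumptions for illustration, so check the actual API reference for the real shapes.</p>
<pre><code class="language-python">import json

API_BASE = "https://api.runframe.example/v1"  # hypothetical base URL

def create_incident_request(token, title, severity, summary):
    """Build the HTTP request an agent would send to open an incident.
    Endpoint and payload shape are illustrative, not the real API."""
    return {
        "method": "POST",
        "url": API_BASE + "/incidents",
        "headers": {
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json",
        },
        "body": json.dumps({
            "title": title,
            "severity": severity,
            "summary": summary,
        }),
    }

req = create_incident_request(
    "YOUR_API_TOKEN",
    "Elevated latency on payments service",
    "SEV2",
    "Suspected cause: retry-logic change deployed two hours ago.",
)
# Send req with whatever HTTP client your agent already uses.
</code></pre>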
<h2 id="what-we39re-building-first">What we're building first</h2>
<p>We're not starting with a Runframe AI agent. We're starting with the API and MCP server that lets your agent operate through us.</p>
<p>Your team is already choosing its AI stack. Claude, GPT, open-source models, custom agents wired into deploy pipelines. The bigger gap right now isn't another vendor AI — it's that your agent can't create an incident, page someone, or write a postmortem.</p>
<p>That's what we're fixing first. An incident management platform that your existing agent can operate through. A system of record with a clean API and MCP support, so the agent you already trust can participate in the incident lifecycle.</p>
<p>That's Runframe.</p>
<h2 id="the-bottom-line">The bottom line</h2>
<p>Linear calls themselves "the shared product system that turns context into execution." That's a good line. Here's ours: Runframe is the incident system of record that your agents operate through.</p>
<p>Not our agent. Yours. We provide the API, the MCP server, the data model, and the notification system. Your agent provides the context.</p>
<p>Every incident management vendor is racing to build their own AI. PagerDuty, incident.io, Rootly, they're all shipping captive agents that live inside their products. We think this gets the architecture wrong. The best AI for your incidents is the one that already knows your code, your deploys, and your team's patterns. That's your agent, not ours.</p>
<p>What your agent needs is access. A clean API and MCP server that lets it participate in the incident lifecycle. We built that.</p>
<pre><code class="language-bash">npx @runframe/mcp-server --setup
</code></pre>
<p>One MCP server, scoped to the incident lifecycle. Works with Cursor, Claude Code, VS Code, and Claude Desktop. <a href="/blog/your-agent-can-manage-incidents-now">Here's how to set it up</a>.</p>

<h2 id="common-questions">Common questions</h2>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Doesn't this mean Runframe has no AI features?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    We have AI-powered postmortem drafts (your choice of Claude or GPT). But our primary AI strategy is being the platform your agents interact with, not building a competing agent.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What if I don't use AI agents yet?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Runframe works the same as any incident management tool. Slack integration, on-call scheduling, escalation policies, postmortems. The MCP server and API are there when you're ready.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Which AI models work with the MCP server?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Any model that supports MCP or can make API calls. Claude, GPT, Gemini, open-source models, custom agents. The MCP server is model-agnostic.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Isn't MCP dead?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    The hype phase is over. The "add MCP to everything" era. What's left is pragmatic adoption: 97M monthly SDK downloads, Linux Foundation governance, adoption by every major AI provider. MCP makes sense where workflows are dynamic and tool discovery matters. Incident management is exactly that. For stable pipelines, we also ship a full REST API.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    How is this different from just having an API?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    The MCP server handles tool discovery and structured inputs/outputs — it's a higher-level interface than raw REST calls, built for how agents actually work. If your agent prefers direct API calls, the v1 REST API covers the same capabilities.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What about data privacy? Does my agent send incident data to a model?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Your agent, your model, your data flow. We don't sit in the middle. The MCP server talks to Runframe's API. What your agent does with the data depends on your model provider and your configuration.
  </div>
</details>
]]></content:encoded>
      <pubDate>Sat, 28 Mar 2026 00:00:00 GMT</pubDate>
      <dc:creator>Niketa Sharma</dc:creator>
      <category><![CDATA[ai-agents]]></category>
      <category><![CDATA[mcp]]></category>
      <category><![CDATA[incident-management]]></category>
      <category><![CDATA[architecture]]></category>
    </item>
    <item>
      <title><![CDATA[Incident management for early-stage engineering teams]]></title>
      <link>https://runframe.io/blog/incident-management-for-early-stage-teams</link>
      <guid>https://runframe.io/blog/incident-management-for-early-stage-teams</guid>
      <description><![CDATA[At 15 engineers, you can get away without formal incident management. Someone posts in #engineering, the right person sees it, they fix it. It works until it doesn't.
Then you hit 20. Maybe 30. Someon...]]></description>
      <content:encoded><![CDATA[<p>At 15 engineers, you can get away without formal incident management. Someone posts in #engineering, the right person sees it, they fix it. It works until it doesn't.</p>
<p>Then you hit 20. Maybe 30. Someone pages the entire team at 2 AM because a staging dashboard loaded slowly. The last real incident took 45 minutes before anyone figured out who should even be looking at it.</p>
<p>That's the inflection point. Not when things break (things always break) but when the coordination around the break starts costing more than the break itself. Some teams hit it at 15 people. Most feel it by 30.</p>
<p>This is the setup guide for that moment. What to set up, in what order, with opinionated defaults that work whether you're 15 engineers or 100.</p>
<p><strong>TL;DR:</strong> Start with three severity levels (SEV1-3), set up weekly on-call with primary + backup, create a dedicated Slack channel per incident, wire automatic multi-channel escalation with a 5-minute timeout, and do one-page blameless postmortems within 48 hours. Skip everything else until one of these breaks.</p>
<h2 id="what-you39ll-set-up">What you'll set up</h2>
<ul>
<li><a href="#start-with-three-severity-levels-not-five">Three severity levels</a>, enough to triage, not enough to argue about</li>
<li><a href="#put-someone-on-call-before-you-need-to">On-call rotation</a>, primary + backup, weekly, with real escalation</li>
<li><a href="#one-channel-per-incident">Incident channels</a>, dedicated Slack channel per incident</li>
<li><a href="#escalation-is-not-optional">Escalation that works</a>, multi-channel, automatic, no gaps</li>
<li><a href="#postmortems-that-people-actually-read">Short postmortems</a>, one page, 48 hours, blameless</li>
<li><a href="#what-to-skip-for-now">What to skip</a>, the stuff that doesn't matter yet</li>
</ul>

<h2 id="start-with-three-severity-levels-not-five">Start with three severity levels, not five</h2>
<p>You need enough levels to make decisions, not so many that you start arguments about classification.</p>
<p><strong>SEV1</strong>: Customers can't use the product. Revenue is affected. Drop everything.</p>
<p><strong>SEV2</strong>: Something is degraded and customers notice, but there's a workaround. Painful, but not down.</p>
<p><strong>SEV3</strong>: Minor or internal. Fix it during business hours.</p>
<p>Three levels. You can add SEV0 (apocalypse scenario) later when you have 50+ engineers and genuinely need a level above "drop everything." You can add SEV4 (proactive work) when you have enough incident volume to categorize prevention separately.</p>
<p>The mistake teams make is copying Google's severity framework on day one. They end up with five levels nobody can distinguish and spend the first 10 minutes of every incident arguing about whether it's a SEV2 or a SEV3.</p>
<p>When in doubt, classify higher. A SEV1 that turns out to be a SEV2 wastes some attention. A SEV2 that was actually a SEV1 wastes customer trust.</p>
<p>Use the severity level to decide two things: who gets paged, and how fast you need to respond. Everything else is overhead at this stage.</p>
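<p>The whole decision fits in a few lines if you want it as something executable. A sketch using the definitions above; the paging-policy values are illustrative defaults, not prescriptions.</p>
<pre><code class="language-python"># Three-level severity classifier plus what each level triggers.
# Policy values here are illustrative defaults.
SEVERITY_POLICY = {
    "SEV1": {"page": "primary and backup", "respond_within_min": 5,    "after_hours": True},
    "SEV2": {"page": "primary",            "respond_within_min": 30,   "after_hours": True},
    "SEV3": {"page": "nobody",             "respond_within_min": None, "after_hours": False},
}

def classify(customers_blocked, customers_notice):
    """When in doubt, classify higher."""
    if customers_blocked:
        return "SEV1"   # customers can't use the product
    if customers_notice:
        return "SEV2"   # degraded, but there's a workaround
    return "SEV3"       # minor or internal
</code></pre>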
<p><strong>Related:</strong> <a href="/blog/incident-severity-levels">Incident severity levels: SEV0-SEV4 matrix</a> | <a href="/tools/incident-severity-matrix-generator">Severity matrix generator</a></p>

<h2 id="put-someone-on-call-before-you-need-to">Put someone on-call before you need to</h2>
<p>The worst time to figure out who's responsible is during an incident.</p>
<p>Most teams wait until after a bad incident to set up on-call. Then they scramble to build a rotation while half the team is still stressed about the last outage. Do it before you need it.</p>
<h3 id="start-simple">Start simple</h3>
<p>Weekly rotation. Primary + backup. That's the minimum.</p>
<p>Primary is the person who gets paged first. Backup is the person who gets paged if primary doesn't respond. Without a backup, a single person in the shower or on a flight means nobody responds for 30 minutes.</p>
<p>Weekly works for most teams. Daily rotations are exhausting, nobody gets into a rhythm. Monthly rotations are too long, the on-call person burns out by week three and starts ignoring alerts.</p>
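<p>The rotation itself is just modular arithmetic. A sketch with a hypothetical four-person roster, making next week's primary this week's backup so everyone shadows the role before carrying the pager.</p>
<pre><code class="language-python">from datetime import date

ROSTER = ["alice", "bob", "carol", "dave"]  # hypothetical names
ROTATION_START = date(2026, 1, 5)           # any Monday works

def oncall_pair(today):
    """Weekly rotation: backup is next week's primary, so handoffs
    overlap and nobody walks in cold."""
    week = (today - ROTATION_START).days // 7
    primary = ROSTER[week % len(ROSTER)]
    backup = ROSTER[(week + 1) % len(ROSTER)]
    return primary, backup

print(oncall_pair(date(2026, 1, 7)))   # ('alice', 'bob')
print(oncall_pair(date(2026, 1, 14)))  # ('bob', 'carol')
</code></pre>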
<h3 id="cover-business-hours-first">Cover business hours first</h3>
<p>If your customers are mostly in one timezone, start with business-hours on-call. You don't need 24/7 coverage on day one. Add it when your customer base or your SLAs demand it.</p>
<h3 id="acknowledge-the-burden">Acknowledge the burden</h3>
<p>On-call is work. Engineers who carry pagers outside working hours deserve recognition. Some teams pay $200-500/week. Others give comp time. The specific mechanism matters less than the acknowledgment that being on-call is a real cost.</p>
<p>Treat on-call as free and the good engineers leave. It doesn't take long.</p>
<p><strong>Related:</strong> <a href="/blog/on-call-rotation-guide">On-call rotation guide</a> | <a href="/tools/oncall-builder">On-call schedule builder</a></p>

<h2 id="one-channel-per-incident">One channel per incident</h2>
<p>Slack is where your team already works. Use it.</p>
<p>When an incident fires, create a dedicated channel for it. Not a thread in #engineering. Not a DM group. A channel named something obvious, like <code>inc-42-checkout-api-down</code>, where everything about this incident happens. The first responder creates it using a standard naming format, so there's no ambiguity about where to go.</p>
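<p>A standard format is easiest to enforce with a small helper. A sketch that produces names like the one above (Slack channel names must be lowercase and are capped at 80 characters):</p>
<pre><code class="language-python">import re

def incident_channel(incident_id, title):
    """Build a channel name like inc-42-checkout-api-down:
    lowercase, runs of non-alphanumerics collapsed to single hyphens,
    truncated to Slack's 80-character channel-name limit."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"inc-{incident_id}-{slug}"[:80]

print(incident_channel(42, "Checkout API down!"))  # inc-42-checkout-api-down
</code></pre>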
<h3 id="why-this-matters">Why this matters</h3>
<p>Without a dedicated channel, updates scatter across DMs, threads, and the wrong channels. Someone asks "what's the latest?" and three people answer with three different versions. The CEO finds a 20-minute-old message and panics.</p>
<p>With one, there's one place to look. Status updates, debugging notes, decisions, all in the same channel. If the update isn't in the incident channel, it didn't happen.</p>
<h3 id="how-it-works-in-practice">How it works in practice</h3>
<p>Alert fires, incident channel gets created, responders get pulled in. All updates go there. When it's resolved, archive the channel.</p>
<p>Keep the channel public. Leadership will check it during a SEV1 whether you invite them or not. Better they read a clean timeline than ping engineers for updates mid-debug.</p>
<p><strong>Related:</strong> <a href="/blog/slack-incident-management">Slack incident management: what works and what breaks</a></p>

<h2 id="escalation-is-not-optional">Escalation is not optional</h2>
<p>This is where most DIY setups fail. They page once and hope.</p>
<p>The failure mode looks like this: an alert fires at 2 AM. The on-call engineer's phone is on silent. Or they're sick. Or they looked at the notification and fell back asleep. Nobody else knows. Twenty minutes later, customers are complaining on Twitter and your CEO is texting the CTO asking what's happening.</p>
<h3 id="automatic-not-manual">Automatic, not manual</h3>
<p>If the on-call person doesn't acknowledge within 5 minutes, escalate. Automatically. Don't rely on someone noticing and manually paging the backup. At 2 AM, nobody is watching.</p>
<p>What you want is an escalation chain where each step gets harder to ignore:</p>
<ol>
<li><strong>0 min</strong>: Slack DM + push notification to primary on-call</li>
<li><strong>2-5 min</strong>: SMS and voice call to primary if still unacknowledged</li>
<li><strong>5 min</strong>: Page the backup on-call, all channels</li>
<li><strong>If neither responds</strong>: Escalate to engineering manager</li>
</ol>
<p>Notice each step uses a more interruptive channel than the last. If your escalation sends another Slack message to someone who already missed the first one, you haven't escalated. You've just been louder in the same room.</p>
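<p>The chain above is just data plus one loop, which also makes it easy to test your escalation logic before 2 AM tests it for you. A sketch with illustrative timings and targets; in production, the waits are scheduler timers, not sleeps.</p>
<pre><code class="language-python"># Each step is more interruptive than the last. Times are minutes
# from the initial alert; targets and channels are illustrative.
ESCALATION_CHAIN = [
    {"at_min": 0,  "target": "primary",     "channels": ["slack_dm", "push"]},
    {"at_min": 2,  "target": "primary",     "channels": ["sms", "voice"]},
    {"at_min": 5,  "target": "backup",      "channels": ["slack_dm", "push", "sms", "voice"]},
    {"at_min": 10, "target": "eng_manager", "channels": ["voice"]},
]

def pages_fired(ack_at_min=None):
    """Which pages go out if the first acknowledgment lands at
    ack_at_min minutes (None = nobody ever acknowledges)."""
    fired = []
    for step in ESCALATION_CHAIN:
        if ack_at_min is not None and ack_at_min <= step["at_min"]:
            break  # acknowledged before this step was due
        fired.append((step["at_min"], step["target"]))
    return fired

print(pages_fired(1))     # [(0, 'primary')]
print(pages_fired(None))  # all four steps, ending with eng_manager
</code></pre>
<p>The useful property: escalation stops the moment someone acknowledges, and keeps climbing on its own when nobody does.</p>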

<h2 id="postmortems-that-people-actually-read">Postmortems that people actually read</h2>
<h3 id="one-page">One page</h3>
<p>Keep the postmortem to one page. Nobody reads the five-page ones, so they're worse than useless. They consume time to write and teach nothing because nobody opens them.</p>
<p>Answer three questions:</p>
<ol>
<li><strong>What happened?</strong> Timeline. What broke, when, what was the impact.</li>
<li><strong>Why did it happen?</strong> Root cause. Not "the server crashed" but why the server crashed and why you didn't catch it earlier.</li>
<li><strong>What are we changing?</strong> 1-3 specific action items with owners and deadlines.</li>
</ol>
<p>If you need more detail for a major incident, add an appendix. But the core document that people read should fit on one page.</p>
<h3 id="48-hour-rule">48-hour rule</h3>
<p>If the postmortem isn't written within 48 hours, it won't get written. Details fade, people move on, the next sprint starts and nobody circles back.</p>
<p>Assign an owner immediately after the incident resolves. Not "the team," a specific person with a specific deadline.</p>
<h3 id="blameless-is-not-optional">Blameless is not optional</h3>
<p>The first time someone gets called out in a postmortem, nobody writes honest ones again. Engineers will sanitize everything. The postmortem becomes theater, a document that exists to prove you did a postmortem, not to prevent the next incident.</p>
<p>Focus on systems, not people. "The deploy went out without a canary" not "Alex deployed without checking."</p>
<p>Not every incident needs one. SEV1: always. SEV2: judgment call, did you learn something? SEV3: a brief note in the incident timeline is enough.</p>
<p><strong>Related:</strong> <a href="/blog/post-incident-review-template">Post-incident review templates (3 ready-to-use)</a></p>

<h2 id="what-to-skip-for-now">What to skip (for now)</h2>
<p>The biggest risk at this stage isn't missing a feature. It's overbuilding process that nobody follows.</p>
<p>Runbooks and playbooks can wait. You don't have enough incident patterns yet. After you've seen the same type of incident three times, write a runbook for it. Before that, you're writing fiction.</p>
<p>Don't bother with workflow automation either. Run the process manually for 20 incidents first. You'll learn what actually needs automating versus what you only assumed would need it.</p>
<p>SLOs and error budgets? At 30 engineers, you already know your service is unreliable; you don't need a dashboard to confirm it. If you're selling to enterprise or running infra-heavy systems, basic SLO thinking earlier doesn't hurt, but formal error budgets can wait until 100+ engineers, when you need to make real tradeoffs between reliability and shipping.</p>
<p>For most B2B teams at this stage, reliable escalation matters more than a status page. If your customers expect proactive comms, use a hosted service. Don't build one.</p>
<p>And skip incident analytics for now. MTTR dashboards are meaningless if your escalation doesn't work and your postmortems aren't happening. Fix the process first.</p>

<h2 id="incident-management-setup-checklist">Incident management setup checklist</h2>
<p>Set these up in this order:</p>
<ol>
<li>Three severity levels (SEV1, SEV2, SEV3). Classify fast, default higher.</li>
<li>On-call rotation with primary + backup, weekly. Acknowledge the burden.</li>
<li>One dedicated Slack channel per incident, kept public.</li>
<li>Automatic escalation across multiple channels. 5-minute timeout before it moves up.</li>
<li>One-page postmortems within 48 hours. Blameless. Specific owner.</li>
</ol>
<p>Skip everything else until one of these breaks.</p>
<p>The goal is making the next incident less chaotic than the last one. Run these for a few months and you'll know what needs to change, because you'll have real incidents telling you.</p>
<p>If you want this setup without building it yourself, <a href="/">Runframe</a> handles severity levels, on-call scheduling, multi-channel escalation, and postmortems out of the box. Free to start.</p>
<p>Once your process is running, read <a href="/blog/scaling-incident-management">how teams scale incident management past 50 engineers</a> for what comes next.</p>

<h2 id="common-questions">Common questions</h2>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    When does a team need formal incident management?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    When coordination during incidents starts costing more time than the incident itself. For most teams, that's somewhere between 20 and 40 engineers. If two people debugged the same thing independently, or if leadership asked for updates that nobody could provide, you're there.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    How many people should be on an on-call rotation?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Minimum four for a weekly rotation, so each person is on-call one week per month. Fewer than that and burnout becomes real. If you only have two or three people who can respond, start with business-hours-only and staff up.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Do we need a tool or can we use Slack?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Slack handles coordination well. It doesn't handle paging, escalation, on-call scheduling, or audit trails. Most teams outgrow pure-Slack incident management around 20-25 engineers, sometimes earlier if you have enterprise customers or SLA commitments. At that point, you need something that pages people reliably through multiple channels, tracks who's on-call, and escalates automatically when nobody responds. That's the gap <a href="/">Runframe</a> is built for.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    How often should we do postmortems?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Every SEV1 gets a postmortem. SEV2 gets one if you learned something or if it affected customers. SEV3 doesn't need a formal postmortem; a note in the incident timeline is fine. Don't postmortem everything or the team will burn out on process.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Should we build our own incident management tooling?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    At 20-50 engineers, almost certainly not. The cost of building and maintaining incident tooling (Slack bots, paging logic, escalation chains, on-call scheduling) adds up faster than a subscription. We broke down the real costs in our <a href="/blog/incident-management-build-or-buy">build, open source, or buy guide</a>.
  </div>
</details>
]]></content:encoded>
      <pubDate>Tue, 24 Mar 2026 00:00:00 GMT</pubDate>
      <dc:creator>Niketa Sharma</dc:creator>
      <category><![CDATA[incident-management]]></category>
      <category><![CDATA[on-call]]></category>
      <category><![CDATA[escalation]]></category>
      <category><![CDATA[growing-teams]]></category>
      <category><![CDATA[startups]]></category>
      <category><![CDATA[engineering-leadership]]></category>
      <category><![CDATA[slack]]></category>
      <category><![CDATA[postmortem]]></category>
    </item>
    <item>
      <title><![CDATA[Your Agent Can Manage Incidents Now]]></title>
      <link>https://runframe.io/blog/your-agent-can-manage-incidents-now</link>
      <guid>https://runframe.io/blog/your-agent-can-manage-incidents-now</guid>
      <description><![CDATA[An engineer on your team gets a Datadog alert while writing code in Cursor. Without switching tabs, their agent checks who's on call, acknowledges the incident, investigates recent deploys, pages the...]]></description>
      <content:encoded><![CDATA[<p>An engineer on your team gets a Datadog alert while writing code in Cursor. Without switching tabs, their agent checks who's on call, acknowledges the incident, investigates recent deploys, pages the right responder, and logs everything to the timeline.</p>
<p>That's not a demo. That's what Runframe's MCP server does in Cursor and Claude Code today.</p>
<pre><code class="language-bash">npx @runframe/mcp-server --setup
</code></pre>
<p>Works with Cursor, Claude Code, VS Code, and Claude Desktop.</p>
<p>Every incident management tool today assumes a human is clicking through every step. We built the MCP server for the workflows where that's no longer true: the agent does the coordination, and the engineer makes the calls.</p>

<h2 id="what39s-in-the-box">What's in the box</h2>
<p>Here's what we ship: 16 tools, grouped by what they cover.</p>
<p><strong>Incidents (9 tools):</strong></p>
<ul>
<li><code>list_incidents</code> — filter by status, severity, team</li>
<li><code>get_incident</code> — full details with timeline and participants</li>
<li><code>create_incident</code> — spin one up from an alert</li>
<li><code>update_incident</code> — change severity, assignment, description</li>
<li><code>change_incident_status</code> — move through the workflow (investigating → fixing → resolved)</li>
<li><code>acknowledge_incident</code> — ack it, auto-assign, track SLA</li>
<li><code>add_incident_event</code> — log findings to the timeline</li>
<li><code>escalate_incident</code> — escalate through the policy</li>
<li><code>page_someone</code> — page a responder via Slack or email</li>
</ul>
<p><strong>On-call (1 tool):</strong></p>
<ul>
<li><code>get_current_oncall</code> — who's on call right now, filterable by team</li>
</ul>
<p><strong>Services (2 tools):</strong></p>
<ul>
<li><code>list_services</code> — search across services</li>
<li><code>get_service</code> — details plus on-call instructions</li>
</ul>
<p><strong>Postmortems (2 tools):</strong></p>
<ul>
<li><code>create_postmortem</code> — draft with root cause and action items</li>
<li><code>get_postmortem</code> — pull up what happened</li>
</ul>
<p><strong>Teams (2 tools):</strong></p>
<ul>
<li><code>list_teams</code> — see all teams</li>
<li><code>get_escalation_policy</code> — who gets paged at each level</li>
</ul>

<h2 id="how-an-agent-runs-an-incident">How an agent runs an incident</h2>
<p>A Datadog alert fires for elevated API latency on the payments service.</p>
<p>First thing the agent does is call <code>get_incident</code>. SEV2, payments service, opened 3 minutes ago. The monitoring integration already logged the trigger on the timeline.</p>
<p>Then <code>get_current_oncall</code>, filtered to the payments team. Gets back the primary <a href="/blog/on-call-rotation-guide">on-call engineer</a>.</p>
<p><code>acknowledge_incident</code>. The incident moves to "investigating." SLA clock starts. The rest of the team can see someone's on it.</p>
<p>The agent pulls logs from Datadog (separate MCP server), checks recent commits in the codebase, and finds a deploy 20 minutes ago that changed the payment retry logic. It calls <code>add_incident_event</code> with what it found: "Likely caused by deploy #1847, payment retry logic change at 14:32 UTC. Error rate spiked 4 minutes after deploy."</p>
<p><code>page_someone</code>. The on-call engineer gets a Slack DM and email with the full context and the agent's findings. They don't start from zero.</p>
<p><code>change_incident_status</code> to "fixing." The timeline has the whole story. When the fix ships, the engineer resolves it, or the CI/CD pipeline does via the API.</p>
<p>Later, <code>create_postmortem</code> with the root cause, timeline, and suggested <a href="/blog/post-incident-review-template">action items</a>. The engineer reviews and edits instead of writing from scratch.</p>
<p>A handful of calls. The agent did the running around. The engineer decided what to actually do about it.</p>
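<p>Under the hood, each of those steps is a single MCP tool call with structured arguments. Here's a sketch of the <code>add_incident_event</code> call from this walkthrough — the argument names and the incident ID are illustrative, not the server's actual schema, so check the tool's input schema for the real field names:</p>
<pre><code class="language-json">{
  "name": "add_incident_event",
  "arguments": {
    "incident_id": "INC-1042",
    "message": "Likely caused by deploy #1847, payment retry logic change at 14:32 UTC. Error rate spiked 4 minutes after deploy."
  }
}
</code></pre>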

<h2 id="why-we-kept-the-tool-set-small">Why we kept the tool set small</h2>
<p>Most incident management MCP servers fall into two camps: auto-generated (every API endpoint becomes a tool, you end up with 70-100 in context) or hand-crafted but sprawling (30-70 tools covering every possible use case). Agents struggle with both.</p>
<p>Each tool definition costs 200-400 tokens (name, description, input schema). A server with 70+ tools burns tens of thousands of tokens before the agent even starts on your problem.</p>
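<p>The arithmetic is worth spelling out. Taking 300 tokens as a rough midpoint of that 200-400 range:</p>
<pre><code class="language-bash"># ~300 tokens per tool definition (midpoint of the 200-400 range above)
echo $(( 16 * 300 ))   # 4800 tokens for this server's 16 tools
echo $(( 70 * 300 ))   # 21000 tokens for a 70-tool server, before the agent reads your problem
</code></pre>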
<p>But the token cost is only part of it. The fewer tools an agent has to choose from, the more reliably it picks the right one. When there's one way to list incidents and one way to get an incident, the agent doesn't have to guess between <code>list_incidents</code>, <code>get_incidents</code>, <code>search_incidents</code>, and <code>query_incidents</code>.</p>
<p>We started with the workflow (what does an agent need to run an incident from alert to postmortem?) and worked backward to the tool set. No bulk operations. No user management. No webhook CRUD. No billing endpoints. If it doesn't help an agent run an incident, it stays out.</p>
<h2 id="mcp-works-when-you-design-for-agents">MCP works when you design for agents</h2>
<p>There's a growing chorus that MCP is overhyped. That agents can't reliably use tools. That the whole thing is a gimmick.</p>
<p>We think it comes down to design. MCP (Model Context Protocol) does exactly what it says: lets an agent call tools with structured inputs and get structured outputs back. When an MCP server has well-named, well-described tools scoped to a single workflow, agents use them reliably. We've tested it.</p>
<p>The trick is treating tool design the same way you'd treat API design. Clear names. Descriptions written for LLMs, not humans reading docs. Each tool answers one question an agent would actually ask.</p>

<h2 id="getting-started">Getting started</h2>
<p>Interactive setup (walks you through it):</p>
<pre><code class="language-bash">npx @runframe/mcp-server --setup
</code></pre>
<p>Claude Code:</p>
<pre><code class="language-bash">claude mcp add runframe -e RUNFRAME_API_KEY=rf_your_key -- npx -y @runframe/mcp-server
</code></pre>
<p>Cursor / VS Code, add to your MCP config:</p>
<pre><code class="language-json">{
  "mcpServers": {
    "runframe": {
      "command": "npx",
      "args": ["-y", "@runframe/mcp-server"],
      "env": { "RUNFRAME_API_KEY": "rf_your_key" }
    }
  }
}
</code></pre>
<p>Get your API key from Settings → API Keys after signing in. Keys have scoped permissions, so grant each key only what it needs.</p>
<p>Start a free 28-day trial at <a href="https://runframe.io" target="_blank" rel="noopener noreferrer">runframe.io</a>, no credit card required. MCP is included. MIT licensed, <a href="https://github.com/runframe/runframe-mcp-server" target="_blank" rel="noopener noreferrer">source on GitHub</a>.</p>

<h2 id="what39s-next">What's next</h2>
<p>We're going to be laser-focused on adding only what agents actually need. If a tool doesn't make an agent better at handling incidents, it doesn't ship.</p>
<p>On the short list:</p>
<ul>
<li>Slack channel tools (create incident channels, post updates)</li>
<li>Analytics (<a href="/blog/how-to-reduce-mttr">MTTR</a> trends, incident frequency by service)</li>
<li>Incident templates</li>
</ul>
<p>That's it for now. We'd rather have 20 tools that work than 70 that look good in a README.</p>

<h2 id="common-questions">Common questions</h2>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What about write safety?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Tools that send real notifications (like <code>page_someone</code> and <code>escalate_incident</code>) are clearly marked as destructive in their descriptions, so the agent knows to confirm before firing them. API keys are scoped, so you can give a key read-only access if you want.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Can I self-host it?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    The MCP server runs locally via stdio (default) or as an HTTP server you deploy yourself. There's a Dockerfile included. The server calls Runframe's API, so your data stays in Runframe. The MCP server doesn't store anything.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Is there an HTTP transport for CI/CD pipelines?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Yes. Run with <code>--transport http --port 3100</code>. It takes a bearer token for auth, supports multiple clients, and is stateless so you can load-balance it.
  </div>
</details>

<p>Your agent is already in the IDE. Now it has an incident management layer that keeps up.</p>
<p><a href="https://runframe.io" target="_blank" rel="noopener noreferrer">Get started →</a></p>
]]></content:encoded>
      <pubDate>Mon, 16 Mar 2026 00:00:00 GMT</pubDate>
      <dc:creator>Runframe Team</dc:creator>
      <category><![CDATA[mcp]]></category>
      <category><![CDATA[mcp-server]]></category>
      <category><![CDATA[ai-agents]]></category>
      <category><![CDATA[incident-management]]></category>
      <category><![CDATA[developer-tools]]></category>
      <category><![CDATA[cursor]]></category>
      <category><![CDATA[claude-code]]></category>
      <category><![CDATA[devops]]></category>
      <category><![CDATA[sre]]></category>
      <category><![CDATA[on-call]]></category>
      <category><![CDATA[postmortem]]></category>
    </item>
    <item>
      <title><![CDATA[Best OpsGenie Alternatives in 2026: What Teams Actually Switch To]]></title>
      <link>https://runframe.io/blog/best-opsgenie-alternatives</link>
      <guid>https://runframe.io/blog/best-opsgenie-alternatives</guid>
      <description><![CDATA[Most OpsGenie alternatives lists are out of date.
FireHydrant got acquired by Freshworks. Squadcast got acquired by SolarWinds. Grafana OnCall went maintenance-only. Three tools that showed up on ever...]]></description>
      <content:encoded><![CDATA[<p>Most OpsGenie alternatives lists are out of date.</p>
<p>FireHydrant got acquired by Freshworks. Squadcast got acquired by SolarWinds. Grafana OnCall went maintenance-only. Three tools that showed up on every comparison article either changed ownership or stopped shipping in the past year.</p>
<p>If you're migrating before the April 2027 shutdown, your options are different now than what most articles show. Here's what's actually available, what it costs once you add on-call, and how the teams we talked to made their decisions.</p>
<p><strong>Disclosure:</strong> Runframe is our product. It's included alongside other options. <em>Pricing last verified March 13, 2026.</em></p>

<h2 id="what-you39ll-learn">What You'll Learn</h2>
<ul>
<li><a href="#the-market-shifted">Three tools changed status since mid-2025</a></li>
<li><a href="#staying-on-atlassian">Staying on Atlassian: JSM vs Compass</a></li>
<li><a href="#what-it-actually-costs">What it actually costs, advertised vs real price with on-call</a></li>
<li><a href="#the-tools">The tools, grouped by what kind of team you are</a></li>
<li><a href="#how-to-decide">How to decide without a 6-month evaluation</a></li>
<li><a href="#common-questions">Common questions</a></li>
</ul>

<h2 id="the-market-shifted">The Market Shifted</h2>
<p>Three tools that used to appear on every comparison list changed status in the past year: FireHydrant was <a href="https://www.freshworks.com/press-releases/freshworks-to-deepen-its-it-service-and-operations-portfolio-with-acquisition-of-firehydrants-ai-native-incident-management-and-reliability-platform/" target="_blank" rel="noopener noreferrer">acquired by Freshworks</a> (December 2025), Squadcast was <a href="https://www.solarwinds.com/company/newsroom/press-releases/solarwinds-acquires-squadcast-unifying-observability-and-incident-response" target="_blank" rel="noopener noreferrer">acquired by SolarWinds</a> (March 2025), and open-source Grafana OnCall <a href="https://grafana.com/blog/2025/03/11/grafana-oncall-maintenance-mode/" target="_blank" rel="noopener noreferrer">entered maintenance mode</a> and gets archived March 24, 2026. If any of those were on your shortlist, factor in the ownership changes; we cover them in detail <a href="#tools-with-recent-acquisition-risk">later in this article</a>. The rest of this guide focuses on what's actively shipping and independent.</p>

<h2 id="staying-on-atlassian">Staying on Atlassian</h2>
<p>Before looking elsewhere, know what Atlassian is offering. You might not need to leave.</p>
<p><strong>JSM (Jira Service Management)</strong> is IT operations and ITSM: incident workflows, change management, service portals, asset management, knowledge base. If your team thinks in ITSM terms and you're already deep in Jira, this is the path.</p>
<p><strong>Compass</strong> is engineering-focused: alerting, on-call, software catalog. Less overhead than JSM. Better fit if you want on-call without the ITSM weight.</p>
<p>One thing to watch: after migrating to JSM, alert data retention drops. Free gets 1 month, Standard gets 1 year, Premium gets 3 years (<a href="https://support.atlassian.com/opsgenie/docs/support-coverage-after-migration/" target="_blank" rel="noopener noreferrer">source</a>). OpsGenie Enterprise had effectively unlimited retention.</p>
<p>Most teams we talked to didn't want to pick between JSM and Compass. They had one tool. Now Atlassian wants them to choose between two, figure out the feature overlap, or pay for both. That's what pushes people to look outside.</p>

<h2 id="what-it-actually-costs">What It Actually Costs</h2>
<p>This is where most comparison articles get it wrong.</p>
<p>OpsGenie bundled on-call and incident management in one price. Most alternatives don't. The headline price on a vendor's website is usually just incident response. On-call scheduling, the thing every OpsGenie team actually needs, is a separate line item.</p>
<table><caption class="sr-only">Tool | What they advertise | What you actually pay with on-call | 20-person team, annual</caption>
<thead>
<tr>
<th>Tool</th>
<th>What they advertise</th>
<th>What you actually pay with on-call</th>
<th>20-person team, annual</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Runframe</strong></td>
<td>$15/user/mo ($12 annual)</td>
<td><strong>$12-15/user/mo</strong>, on-call included</td>
<td><strong>$2,880-3,600</strong></td>
</tr>
<tr>
<td><strong>incident.io</strong></td>
<td>From $19/user/mo ($15 annual)</td>
<td><strong>$31-45/user/mo</strong> or $25-45 annual, on-call is a separate add-on</td>
<td><strong>$6,000-10,800</strong></td>
</tr>
<tr>
<td><strong>Rootly</strong></td>
<td>$20/user/mo per product</td>
<td>$40/user/mo for IR + on-call</td>
<td>~$9,600 (20 users)</td>
</tr>
<tr>
<td><strong>PagerDuty</strong></td>
<td>From $25/user/mo ($21 annual)</td>
<td><strong>$49+/user/mo ($41 annual)</strong>, many teams need Business tier plus add-ons</td>
<td><strong>$9,840-30,000+</strong></td>
</tr>
<tr>
<td><strong>Grafana Cloud IRM</strong></td>
<td>Free (3 users)</td>
<td>Billed per active IRM user/mo (first 3 free)</td>
<td>Varies by Grafana Cloud plan</td>
</tr>
<tr>
<td><strong>Better Stack</strong></td>
<td>Free tier</td>
<td>Varies by monitors and responders</td>
<td>Varies</td>
</tr>
<tr>
<td><strong>FireHydrant</strong></td>
<td>$9,600/yr (20 responders)</td>
<td>~$40/responder/mo, <strong>pre-acquisition, may change</strong></td>
<td><strong>$9,600+</strong></td>
</tr>
</tbody></table>
<p>The gap between advertised and actual price is bigger than you'd expect.</p>
<p>incident.io's Team tier is $19/user/month ($15 annual) for incident response. On-call scheduling is a separate add-on: +$12/user/month ($10 annual) on Team, +$20/user/month on Pro. So the real cost is $31/user/month or $25 annual (Team + on-call), up to $45/user/month (Pro + on-call). For a 20-person team on Team + on-call annual, that's $6,000/year, over double what you'd pay for tools that include on-call in the base price.</p>
<p>PagerDuty's Professional tier is $25/user/month ($21 annual). But many teams end up on Business at $49/user/month ($41 annual) once they need advanced escalation, analytics, and stakeholder notifications. Then there are add-ons: Status Pages ($89/month per 1,000 subscribers), AIOps ($699/month), PagerDuty Advance ($415/month). A 25-person team on Business with Status Pages alone is over $13,000/year.</p>
<p>Both are strong products. But if you're comparing on sticker price alone, the invoice will look different.</p>
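<p>A quick sanity check on those two figures — this is nothing more than the list prices quoted above multiplied out:</p>
<pre><code class="language-bash"># incident.io Team + on-call, annual billing: ($15 + $10)/user/mo, 20 users
echo $(( 20 * (15 + 10) * 12 ))     # 6000  -> the $6,000/year figure
# PagerDuty Business annual ($41/user/mo, 25 users) plus Status Pages ($89/mo)
echo $(( 25 * 41 * 12 + 89 * 12 ))  # 13368 -> "over $13,000/year"
</code></pre>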

<h2 id="the-tools">The Tools</h2>
<p>Instead of ranking 1 through 7, here's what makes sense depending on who you are.</p>
<h3 id="if-your-team-lives-in-slack">If your team lives in Slack</h3>
<p>Three tools are built Slack-native, meaning Slack is the primary interface, not a bolted-on integration.</p>
<p><strong>Runframe.</strong> Incident lifecycle + on-call in one tool, $12-15/user/month with everything included. Built for 10-200 engineers. Declare incidents, page on-call, update stakeholders, run postmortems, all from Slack. On-call scheduling with coverage visibility, escalation policies, SLA tracking, service catalog, RBAC, audit logs, Jira integration. Setup takes days, not months. No add-ons, no "contact sales." The price on the website is the price on the invoice. <a href="/pricing">See pricing</a>.</p>
<p>This is our product, so we're biased. But if you want the "everything in one price" experience OpsGenie used to offer, the concepts map over pretty directly:</p>
<ul>
<li>OpsGenie Teams → Runframe Teams</li>
<li>Schedules → Runframe On-Call Rotations (primary + backup)</li>
<li>Escalation Policies → Runframe Escalation Rules</li>
<li>Integrations → Runframe Webhooks (Datadog, Prometheus, CloudWatch)</li>
</ul>
<p>We haven't battle-tested Runframe at 500+ engineers or against heavy enterprise procurement requirements yet. <a href="/comparisons/runframe-vs-opsgenie">See our OpsGenie → Runframe migration page</a>.</p>
<p><strong>incident.io.</strong> Deep Slack integration with strong workflows and AI-assisted postmortems. 1,500+ teams including Netflix and Etsy. Genuinely good product, particularly for mid-market to enterprise (50-500+ engineers). Their free Basic plan includes single-team on-call, enough for very small teams getting started. Once you need multi-team scheduling and escalation chains, you're on Team + the on-call add-on. Team is $19/user/month ($15 annual) for incident response, on-call adds $12/user/month ($10 annual) on top, so $31/user/month or $25 annual for the full package. Pro runs $25 + $20 for on-call = $45/user/month. Worth it if you need the depth and have the budget. <a href="https://incident.io/pricing" target="_blank" rel="noopener noreferrer">Pricing source</a>. <a href="/comparisons/runframe-vs-incident-io">See our full comparison</a>.</p>
<p><strong>Rootly.</strong> Slack-native with incident response and on-call sold as separate products, each at $20/user/month (Essentials). Incident response covers Slack-based coordination, workflow automation, channel creation, role assignment, status updates, Jira ticket creation, retrospectives, and a status page. On-call covers paging, scheduling, escalation policies, alert grouping, live call routing, and a mobile app. If you need both, that's $40/user/month. Rootly's strength is workflow customization. You can build multi-step automation rules that trigger based on severity, service, or team. They also have an AI SRE product sold separately. Enterprise tier with custom pricing. <a href="https://rootly.com/pricing" target="_blank" rel="noopener noreferrer">Pricing source</a>. <a href="/comparisons/runframe-vs-rootly">See our full comparison</a>.</p>
<h3 id="if-you39re-already-on-grafana">If you're already on Grafana</h3>
<p><strong>Grafana Cloud IRM.</strong> Makes sense if you're already in the Grafana ecosystem. Good alert routing and escalation. Free tier includes 3 active IRM users. Paid plans are billed per active IRM user per month. Beyond that, pricing scales with your Grafana Cloud plan. The self-hosted OSS option (Grafana OnCall) is going away, archived March 24, 2026. If you're not already on Grafana, this isn't the place to start. <a href="https://grafana.com/pricing/" target="_blank" rel="noopener noreferrer">Pricing source</a>.</p>
<h3 id="if-you39re-enterprise-200-engineers">If you're enterprise (200+ engineers)</h3>
<p><strong>PagerDuty.</strong> Built this category. Strong compliance, deep integrations, the most mature feature set. If you have dedicated SRE teams and complex service dependencies, it's still hard to beat. Professional is $25/user/month ($21 annual), but many teams end up on Business at $49/user/month ($41 annual) for advanced escalation, analytics, and stakeholder workflows. Add-ons like Status Pages, AIOps, and PagerDuty Advance push the cost up from there. At scale, the depth justifies it. Below 100 engineers, you're probably paying for configuration options you won't touch. <a href="https://www.pagerduty.com/pricing/incident-management/" target="_blank" rel="noopener noreferrer">Pricing source</a>. <a href="/comparisons/runframe-vs-pagerduty">See our full PagerDuty comparison</a>.</p>
<h3 id="if-you-want-everything-in-one-place">If you want everything in one place</h3>
<p><strong>Better Stack.</strong> Monitoring, incidents, status pages, and on-call in one product. Free tier includes 10 monitors, a status page, 1 on-call responder, and Slack/email alerts. Paid plans are transparent and publicly listed.</p>
<p>If you're currently paying for OpsGenie plus a status page tool plus a monitoring tool, Better Stack could actually simplify things. You consolidate your monitoring and incident stack into one vendor instead of stitching together three.</p>
<p>It's broad rather than deep, though. If your main pain point during incidents is coordination (knowing who's doing what, keeping stakeholders updated, running postmortems that people actually read), Better Stack handles the alerting side well but doesn't go as far as Runframe, incident.io, or Rootly. If you need structured postmortem workflows, multi-team escalation chains, or real-time role assignment during incidents, you'll find those thinner here than in dedicated incident management tools. <a href="https://betterstack.com/pricing" target="_blank" rel="noopener noreferrer">Pricing source</a>.</p>
<h3 id="tools-with-recent-acquisition-risk">Tools with recent acquisition risk</h3>
<p><strong>FireHydrant.</strong> Good product for runbook automation and service dependencies. Freshworks announced the acquisition on December 15, 2025 (expected to close Q1 2026), and FireHydrant is being folded into the Freshworks ecosystem alongside Freshservice. If acquisition risk is part of why you're leaving OpsGenie, this should give you pause: Atlassian acquired OpsGenie in 2018, and eight years later they're shutting it down. The risk was never a price hike; it was the product losing its independent roadmap. Pricing hasn't changed yet ($9,600/year for up to 20 responders), but the long-term question is whether FireHydrant stays standalone or gets absorbed into Freshservice. <a href="/comparisons/runframe-vs-firehydrant">See our full comparison</a>.</p>
<p><strong>Squadcast.</strong> Solid mid-market option at $9-12/user/month (Pro), with a startup-friendly positioning. SolarWinds acquired it in March 2025, and it now sits inside an enterprise observability suite built for a very different customer. A year in, pricing has held, but SolarWinds serves enterprise IT teams, not the seed-to-Series C startups Squadcast was built for. The question is whether Squadcast's roadmap keeps serving that original audience. If you're evaluating it, check whether recent feature development still matches what you need.</p>

<h2 id="how-to-decide">How to Decide</h2>
<p>You don't need a 6-month evaluation. Most teams overthink this.</p>
<p>Three things actually matter for OpsGenie migrants:</p>
<p><strong>1. Does it include on-call in the base price?</strong> OpsGenie bundled everything. If your new tool charges separately for on-call, your real cost is higher than the price page suggests. Ask for the number that includes incidents + on-call + the features your team uses today. That's the number to compare.</p>
<p><strong>2. Where does your team coordinate during incidents?</strong> If the answer is Slack, and for most teams it is, pick a tool where Slack is the primary interface. Not a sidebar integration. The difference shows up every time you're in an incident. Tools built around Slack handle creation, paging, status updates, and postmortems without leaving the channel. Tools that bolt Slack on require bouncing between a web UI and Slack on every incident.</p>
<p><strong>3. Is the vendor independent?</strong> Two tools on this list got acquired in the past year. OpsGenie itself was acquired in 2018 and is being shut down 8 years later. If vendor stability matters to you, and it should given why you're reading this, factor in whether the tool you're evaluating could end up in the same situation.</p>
<p>Quick answer by team size:</p>
<ul>
<li><strong>Under 30 engineers:</strong> Runframe (free plan, Slack-native, everything included) or Better Stack (free tier, all-in-one)</li>
<li><strong>30-200 engineers:</strong> Runframe ($12-15/user/month), Rootly ($40/user/month for IR + on-call), or incident.io Team + on-call ($25/user/month annual)</li>
<li><strong>200+ engineers:</strong> incident.io Pro or PagerDuty Business</li>
<li><strong>Already on Grafana:</strong> Grafana Cloud IRM</li>
<li><strong>Want to stay on Atlassian:</strong> JSM or Compass</li>
</ul>
<p>For the full migration playbook (timelines, data export, parallel run strategy, cost breakdowns), read our <a href="/blog/opsgenie-migration-guide">complete OpsGenie migration guide</a>.</p>

<h2 id="the-short-version">The short version</h2>
<p>The OpsGenie alternatives market in 2026 is smaller than it looks. Remove acquired tools, sunset products, and options that need a separate on-call vendor, and the list gets short fast.</p>
<p>Figure out what your team actually needs: Slack-native or not, bundled on-call or modular, startup pricing or enterprise depth. Then check the real price, the one on the invoice with on-call included, not the one on the landing page.</p>
<p>We built <a href="/">Runframe</a> for teams who want what OpsGenie used to be: incidents and on-call in one tool, one price, no surprises. <a href="/pricing">Try it free</a>.</p>

<h2 id="common-questions">Common Questions</h2>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    When does OpsGenie shut down?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    April 5, 2027. New sales ended June 4, 2025 (<a href="https://www.atlassian.com/software/opsgenie/migration" target="_blank" rel="noopener noreferrer">source</a>). Most teams need 6-8 weeks to migrate, so starting now gives you room to test properly.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What is the best OpsGenie alternative in 2026?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    For 10-200 engineers who coordinate in Slack: Runframe ($12-15/user/month, on-call included, free plan). For 50-500+ engineers with bigger budgets: incident.io ($25-45/user/month with on-call add-on). For Grafana users: Grafana Cloud IRM. For 200+ engineers: PagerDuty.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Is FireHydrant still independent?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    No. Freshworks acquired FireHydrant in December 2025. Pricing hasn't changed yet, but the long-term question is whether it stays standalone or gets folded into Freshservice.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Is Squadcast still independent?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    No. SolarWinds acquired Squadcast on March 3, 2025. Pricing has held so far, but the product roadmap may shift toward SolarWinds' enterprise IT customer base.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    How much does it cost to replace OpsGenie?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    For a 20-person team with incidents + on-call (what OpsGenie bundled): Runframe is $2,880/year (annual) or $3,600/year (monthly). incident.io Team + on-call is $6,000/year (annual) or $7,440/year (monthly, $31/user/mo × 20 × 12). PagerDuty Business is $9,840/year (annual) before add-ons. Always ask for the price that includes on-call. See our <a href="/blog/opsgenie-migration-guide">migration guide</a> for full cost breakdowns.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Should I stay on Atlassian (JSM or Compass)?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    JSM if you need ITSM and are deep in Jira. Compass if you want on-call without ITSM overhead. Many teams we talked to preferred third-party tools for simpler setup, lower cost, or Slack-native workflows.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Can I export my OpsGenie data?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Yes. OpsGenie supports data export for alerts, schedules, escalation policies, and integrations via their API and admin console. Start your export before the April 2027 deadline. Don't wait until the last month. For step-by-step instructions including what to export first and what to watch out for, see our complete <a href="/blog/opsgenie-migration-guide">OpsGenie migration guide</a>.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What's the cheapest OpsGenie alternative with on-call included?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Runframe at $12/user/month (annual). incident.io's $15/user/month (annual) headline price doesn't include on-call. Add that and it's $25/user/month. PagerDuty Professional is $25/user/month ($21 annual) but many teams find they need Business at $49/user/month ($41 annual).
  </div>
</details>

<h3 id="related">Related</h3>
<ul>
<li><a href="/blog/opsgenie-migration-guide">OpsGenie Migration Guide: 30-Day Plan, Cost Breakdowns, Data Export</a></li>
<li><a href="/blog/best-pagerduty-alternatives">Best PagerDuty Alternatives in 2026</a></li>
<li><a href="/blog/how-to-reduce-mttr">How to Reduce MTTR: The Coordination Framework</a></li>
<li><a href="/blog/on-call-rotation-guide">On-Call Rotation Guide</a></li>
<li><a href="/tools/oncall-builder">Free On-Call Schedule Builder</a></li>
<li><a href="/tools/incident-severity-matrix-generator">Free Incident Severity Matrix Generator</a></li>
</ul>
<p><strong>Sources:</strong></p>
<ul>
<li>Conversations with 25+ engineering teams about incident management (3 actively using OpsGenie)</li>
<li>Pricing (checked 2026-03-13): <a href="https://incident.io/pricing" target="_blank" rel="noopener noreferrer">incident.io</a>, <a href="https://grafana.com/products/cloud/irm/" target="_blank" rel="noopener noreferrer">Grafana Cloud IRM</a>, <a href="https://www.pagerduty.com/pricing/" target="_blank" rel="noopener noreferrer">PagerDuty</a>, <a href="https://betterstack.com/pricing" target="_blank" rel="noopener noreferrer">Better Stack</a></li>
<li><a href="https://www.atlassian.com/software/opsgenie/migration" target="_blank" rel="noopener noreferrer">OpsGenie migration</a>, <a href="https://grafana.com/blog/2025/03/11/grafana-oncall-maintenance-mode/" target="_blank" rel="noopener noreferrer">Grafana OnCall maintenance mode</a></li>
</ul>


]]></content:encoded>
      <pubDate>Fri, 13 Mar 2026 00:00:00 GMT</pubDate>
      <author>Niketa Sharma</author>
      <category><![CDATA[opsgenie-alternatives]]></category>
      <category><![CDATA[opsgenie-migration]]></category>
      <category><![CDATA[opsgenie-shutdown]]></category>
      <category><![CDATA[incident-management]]></category>
      <category><![CDATA[on-call]]></category>
      <category><![CDATA[opsgenie]]></category>
      <category><![CDATA[atlassian]]></category>
      <category><![CDATA[sre]]></category>
      <category><![CDATA[devops]]></category>
      <category><![CDATA[engineering-leadership]]></category>
    </item>
    <item>
      <title><![CDATA[Build, Open Source, or Buy Incident Management in 2026]]></title>
      <link>https://runframe.io/blog/incident-management-build-or-buy</link>
      <guid>https://runframe.io/blog/incident-management-build-or-buy</guid>
      <description><![CDATA[Your senior engineer says "give me a weekend with Claude and Cursor, I'll have a working incident bot by Monday." Copilot writes the paging logic. The escalation state machine practically builds itsel...]]></description>
      <content:encoded><![CDATA[<p>Your senior engineer says "give me a weekend with Claude and Cursor, I'll have a working incident bot by Monday." Copilot writes the paging logic. The escalation state machine practically builds itself.</p>
<p>They're right about the first version. They're wrong about the next three years.</p>
<p>We did back-of-napkin math on three-year total cost of ownership for a 20-person engineering team:</p>
<p>Build from scratch: <strong>$233K-$395K</strong><br />Open source (self-host): <strong>$99K-$360K</strong> (mostly maintainer time)<br />Buy commercial: <strong>$11K-$83K</strong> (varies by vendor pricing model)</p>
<p>Sizing model (3-year):</p>
<ul>
<li><code>Build_TCO = MVP + (FTE x LoadedCost x 3) + (Infra x 3) + Rebuilds</code></li>
<li><code>OSS_TCO = (FTE x LoadedCost x 3) + (Infra x 3) + Migrations</code></li>
<li><code>Buy_TCO = (Subscription x 3) + Onboarding</code></li>
</ul>
<p>The bulk of this TCO is engineering time: opportunity cost, not vendor invoices. Building runs 3 to 8 times the cost of buying. Open source sits in the middle. Free to download, not free to run.</p>
<p>This article covers where the money actually goes, what AI tools change (and what they don't), and when building genuinely makes sense.</p>
<blockquote>
<p>"You're not spinning up a bot. You're signing up to maintain a system forever."</p>
</blockquote>
<p><strong>Disclosure:</strong> Runframe builds incident management software. We've included open source options and noted when building is the right call. Found an error? Email <a href="mailto:hello@runframe.io" target="_blank" rel="noopener noreferrer">hello@runframe.io</a>.</p>

<h2 id="60-second-version">60-Second Version</h2>
<p>Under 20 people, no enterprise customers? Structured Slack workflows or incident-bot will get you started. Switch when you hit the limits.</p>
<p>Between 20 and 200 and scaling? Default to buying or self-hosting open source. Only build from scratch if you have real regulatory constraints or incident management is literally your product.</p>
<p>Over 200? You've likely outgrown basic tooling already. This article is mostly aimed at smaller teams, but the cost ratios still hold.</p>
<p>When we say "incident management" here, we mean the full loop: detection, paging, coordination, comms, and post-incident review. Not just "something that wakes people up."</p>
<p>If you just want the checklist, jump to <a href="#decision-checklist-when-to-buy">When to Buy</a>.</p>

<h2 id="the-ai-build-what-changed-and-what-didn39t">The AI Build: What Changed and What Didn't</h2>
<p>Two years ago, a competent engineer needed 2-4 weeks to build a basic incident management system. Today, with AI coding tools, that's down to days. A weekend if you're scrappy.</p>
<p>AI is genuinely good at scaffolding. Slack bot setup that used to take days now takes hours. Status page templates, database schemas, escalation logic, API layers. The boilerplate disappears fast. No argument there.</p>
<p>But here's what AI doesn't change:</p>
<p>Slack retires APIs. It just does. <a href="https://docs.slack.dev/changelog/2024-04-a-better-way-to-upload-files-is-here-to-stay/" target="_blank" rel="noopener noreferrer">The legacy file upload method was sunset in Nov 2025</a>, forcing migrations to a newer upload flow. <a href="https://docs.slack.dev/changelog/2024-09-legacy-custom-bots-classic-apps-deprecation/" target="_blank" rel="noopener noreferrer">Legacy custom bots were discontinued in Mar 2025</a>, breaking older bot-based workflows. AI can help you migrate faster, but it can't stop the deprecations from happening.</p>
<p>Phone and SMS paging is an ops problem, not a code problem. Carriers filter aggressively, especially internationally. Routing and deliverability are their own discipline. No prompt is going to fix that.</p>
<p>The engineer who leaves is still the single biggest risk. AI may have written the code, but nobody else knows the architecture decisions, the production edge cases, or why that one Slack workaround exists.</p>
<p>SOC2 auditors don't care that Claude wrote your audit log. They care that it's complete, immutable, and retained for the right duration. Compliance is process work, not code work.</p>
<p>And your incident tool needs to work at 2 AM when your infrastructure is failing. AI can't architect around your own blast radius.</p>
<p>The net effect: AI reduced the initial build from ~$19K-$31K (2-4 weeks) to maybe $8K-$15K (1-2 weeks) in engineer time. That saves ~$10K-$15K of Year-1 cost on a $233K-$395K three-year total. The initial build was never the expensive part.</p>

<h2 id="more-code-more-incidents">More Code, More Incidents</h2>
<p>Before we get into the numbers: the problem you're solving isn't standing still.</p>
<p>AI-assisted development pushes change velocity up for most teams. Faster velocity usually means more incidents, unless review and testing discipline keeps pace. The blast radius gets bigger when AI-generated changes don't get the same scrutiny as hand-written code. More code shipped faster means more things that can break.</p>
<p>The incident management tool you need in year three will almost certainly be bigger than what you need today.</p>

<h2 id="the-build-illusion-why-it-seems-cheaper-than-it-is">The Build Illusion: Why It Seems Cheaper Than It Is</h2>
<p>With AI coding tools, a good engineer can stand up a basic incident system in days:</p>
<ul>
<li>Slack bot that creates channels</li>
<li>Basic status page</li>
<li>Escalation logic</li>
<li>Incident history in a database</li>
</ul>
<p>Looks straightforward. Here's what teams consistently forget.</p>
<h3 id="the-hidden-cost-dedicated-engineer">The Hidden Cost: Dedicated Engineer</h3>
<p>Someone needs to own this. Not as a side project. As actual job responsibility.</p>
<p><strong>Example (B2B SaaS running microservices on Kubernetes, ~120 engineers):</strong> A team assigned a senior engineer to their custom incident tool "for a quarter." Three years later, it's still a quarter of their time. The original Slack bot had grown to include custom escalation logic, a homegrown status page, and integrations with five internal tools nobody else knew how to maintain.</p>
<p>US high-cost-market senior engineer fully-loaded (salary + benefits + overhead) is often ~$250K-$400K/year. Adjust down ~30-50% for UK/EU typical comp.</p>
<p>Even at 25% allocation, that's <strong>$62K-$100K annually</strong> in opportunity cost. For one feature.</p>
<p><strong>Sensitivity check:</strong> If your maintainer is 0.1 FTE instead of 0.25 FTE, subtract ~$25K-$40K/year from the build cost below. But be honest: 0.1 FTE is rarely enough once the tool is in production.</p>
<h3 id="the-maintenance-tax">The Maintenance Tax</h3>
<p>SREs have a name for this: the forever-project. What started as a weekend hack becomes a quarter-long effort, then a year-long commitment, then something nobody wants to touch but everyone relies on.</p>
<p>The first three months are fine. The engineer builds it, it works, everyone's happy. Then edge cases start appearing around month four. Slack changes its permission model, or rate limits hit during a real incident, or a new hire asks "why does it work this way?" and nobody has a good answer. The original engineer spends increasing time on support.</p>
<p>Somewhere between month seven and month twelve, the engineer who built it leaves or changes roles. Nobody else understands the code. The team is afraid to touch it. By year two, the tool has real technical debt, nobody wants to work on it, but everyone depends on it.</p>
<h3 id="the-policy-surface-nobody-expects">The Policy Surface Nobody Expects</h3>
<p>Once you have an incident system, questions show up that you didn't plan for. Who can declare incidents? Who can close them? How long do you keep the records? Where's the data stored? Can you export it for an audit?</p>
<p>Every internal tool eventually becomes a policy surface. Building the first version is cheap. Keeping up with evolving RBAC, retention, and compliance requirements is where the real time goes.</p>
<p>One pattern we've seen across regulated teams: a 60-person fintech spent ~$80K of engineering time building an incident system. It worked for 18 months. Then Slack platform changes and internal security policy changes hit at the same time. The engineer who built it had left. They spent another ~$40K rewriting it. Six months later, compliance asked for audit trails and data residency. Another ~$30K.</p>
<h3 id="the-reliability-paradox">The Reliability Paradox</h3>
<p>During a P0, when the database is on fire, customers are angry, and your CEO is watching the Slack channel, your incident tool has to work. Without question.</p>
<p>But most teams host their custom incident tooling on the same infrastructure as their product. Product goes down, incident tool goes down with it. If your internal tool uses the company SSO, you're locked out of your response system the moment your identity provider is part of the outage.</p>
<p>Your incident tool needs different infrastructure than your app. It needs higher availability. It needs separate backups. It needs its own monitoring.</p>
<p>AI tools reduce initial build time. They don't fix the reliability paradox, the policy surface, or the engineer who leaves.</p>

<h2 id="if-you-build-anyway">If You Build Anyway</h2>
<p>A few things that catch teams off guard: Slack's permission model is more nuanced than it looks, and scoping channel access without granting overly broad permissions is tricky. Bulk operations during real incidents hit rate limits. Phone and SMS paging has deliverability issues that vendors spend years solving. And rebuilds break because nobody remembers why policy X was implemented that way. You either rebuild the wrong thing or spend weeks rediscovering context that left with the original engineer.</p>
<p>If you're going to build regardless, at minimum get these right:</p>
<ul>
<li>Separate hosting from production (different failure domain)</li>
<li>Paging + escalation state machine (including acknowledgements)</li>
<li>Timeline capture + export (for post-incident review and compliance)</li>
<li>Audit log of key actions (declare, assign, close)</li>
</ul>
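<p>To make the second bullet concrete: acknowledgement-aware paging is a small state machine at heart. A minimal sketch (illustrative only; every name is hypothetical, and real paging adds retries, overrides, and delivery tracking):</p>

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Escalation:
    """Minimal paging escalation: page each level in order until someone acks.
    Hypothetical sketch, not any vendor's API."""
    levels: list[str]                  # e.g. ["primary", "secondary", "manager"]
    notify: Callable[[str], None]      # hook into your SMS/push/phone delivery
    now: Callable[[], float]           # injectable clock, so timeouts are testable
    timeout_s: float = 300.0           # how long to wait for an ack before escalating
    current: int = 0
    acked_by: Optional[str] = None
    deadline: float = field(init=False, default=0.0)

    def start(self) -> None:
        self._page()

    def ack(self, who: str) -> None:
        self.acked_by = who            # recorded ack stops further escalation

    def tick(self) -> None:
        """Call periodically; escalates to the next level on ack timeout."""
        if self.acked_by or self.now() < self.deadline:
            return
        if self.current + 1 < len(self.levels):
            self.current += 1
            self._page()

    def _page(self) -> None:
        self.notify(self.levels[self.current])
        self.deadline = self.now() + self.timeout_s
```

The point isn't the code, it's that ack state, timeouts, and level ordering have to be explicit and testable, including the unhappy paths (nobody acks, the last level is exhausted) that only show up during a real incident.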

<h2 id="the-real-cost-comparison-20-person-company-3-year-tco">The Real Cost Comparison (20-Person Company, 3-Year TCO)</h2>
<p>Back-of-napkin estimates for a 20-person engineering team. Your specific numbers will differ, but the ratios are what matter.</p>
<p>Build from scratch: $233K-$395K. Self-host open source: $99K-$360K. Buy commercial: $11K-$83K.</p>
<p>Building typically runs 3 to 8x the cost of buying, depending on vendor tier and team size. Open source falls in between. No license fees, but the maintainer time adds up.</p>
<p>Where the numbers come from: <a href="https://www.levels.fyi/2025/" target="_blank" rel="noopener noreferrer">Levels.fyi's 2025 report</a> shows ~$312K median total compensation for "Senior Engineer" in the US (base + stock + bonus). We applied a standard 1.25-1.4x multiplier for employer-side costs (benefits, payroll taxes, overhead) to get the $250K-$400K fully-loaded range. Adjust down 30-50% for UK/EU. Infrastructure costs are based on AWS pricing for a 3-AZ highly available setup with separate monitoring. Rebuild risk is informed by the Slack deprecations mentioned above, plus typical security and compliance changes over a 3-year window. The ratio held across every scenario we sketched: build costs 3 to 8 times more than buying.</p>
<p>Assumptions: 1-2 weeks initial build time (AI-assisted), 0.25 FTE ongoing maintenance, separate infrastructure for reliability, and periodic rework every 18-24 months for API changes, compliance, and new features.</p>
<p><strong>Plug in your own numbers:</strong></p>
<pre><code>Inputs:
  EngCost     = Fully-loaded eng cost/year (default: $300K)
  BuildWeeks  = Initial build time in weeks (default: 1-2)
  FTE         = Maintainer allocation (default: 0.25)
  Vendor      = Vendor $/user/month (default: $15-100; depends on pricing model)
  Users       = On-call responders (default: 10-15; set to 20 if everyone is a responder)
  Infra       = Hosting/monitoring per year (default: $5K; set to $0 if N/A)
  Rebuild     = Migration/rewrite allowance over 3 years (default: $30K; set to $0 if none)
  Onboarding  = One-time setup/training (default: $5K; set to $0 if self-serve)

Formulas:
  Build cost       = (EngCost / 52) × BuildWeeks
  Run cost/year    = EngCost × FTE
  Buy cost/year    = Vendor × Users × 12
  Build 3-yr TCO   = Build cost + (Run cost/year × 3) + (Infra × 3) + Rebuild
  Buy 3-yr TCO     = (Buy cost/year × 3) + Onboarding
</code></pre>
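<p>As a worked example, the formulas above as a short script, using this article's defaults (these are the article's back-of-napkin estimates, not measurements; plug in your own):</p>

```python
def three_year_tco(
    eng_cost=300_000,       # fully-loaded engineer cost per year
    build_weeks=2,          # initial AI-assisted build time
    fte=0.25,               # maintainer allocation
    vendor_per_user=15,     # vendor $/user/month (responder-based pricing)
    users=12,               # on-call responders
    infra=5_000,            # hosting/monitoring per year
    rebuild=30_000,         # migration/rewrite allowance over 3 years
    onboarding=5_000,       # one-time setup/training
):
    """Returns (build_tco, buy_tco) over three years, per the formulas above."""
    build_cost = eng_cost / 52 * build_weeks
    run_per_year = eng_cost * fte
    buy_per_year = vendor_per_user * users * 12
    build_tco = build_cost + run_per_year * 3 + infra * 3 + rebuild
    buy_tco = buy_per_year * 3 + onboarding
    return round(build_tco), round(buy_tco)

build, buy = three_year_tco()
print(f"Build: ${build:,}  Buy: ${buy:,}")  # Build: $281,538  Buy: $11,480
```

Swap in enterprise per-seat numbers (roughly <code>users=20, vendor_per_user=40</code> and up) to see the buy side climb toward the higher range in the tables below.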
<p><strong>Build (3-year TCO):</strong></p>
<table><caption class="sr-only">Cost | Year 1 | Year 2 | Year 3 | Total</caption>
<thead>
<tr>
<th>Cost</th>
<th>Year 1</th>
<th>Year 2</th>
<th>Year 3</th>
<th>Total</th>
</tr>
</thead>
<tbody><tr>
<td>Initial build (AI-assisted, ~1-2 weeks)</td>
<td>$8K-$15K</td>
<td>$0</td>
<td>$0</td>
<td>$8K-$15K</td>
</tr>
<tr>
<td>Dedicated maintainer (25% time)</td>
<td>$62K-$100K</td>
<td>$62K-$100K</td>
<td>$62K-$100K</td>
<td>$186K-$300K</td>
</tr>
<tr>
<td>Infrastructure &amp; hosting*</td>
<td>$3K-$10K</td>
<td>$3K-$10K</td>
<td>$3K-$10K</td>
<td>$9K-$30K</td>
</tr>
<tr>
<td>Rebuilds &amp; migrations**</td>
<td>$0</td>
<td>$30K-$50K</td>
<td>$0</td>
<td>$30K-$50K</td>
</tr>
<tr>
<td><strong>Total</strong></td>
<td><strong>$73K-$125K</strong></td>
<td><strong>$95K-$160K</strong></td>
<td><strong>$65K-$110K</strong></td>
<td><strong>$233K-$395K</strong></td>
</tr>
</tbody></table>
<p>*Depends on HA requirements, pager/telephony, audit logging, retention, data residency.<br />**Triggered by Slack platform changes, org restructuring, compliance requirements, new escalation policies, or new integrations.</p>
<p><strong>Buy (3-year TCO example for a 20-person company):</strong></p>
<table><caption class="sr-only">Cost | Year 1 | Year 2 | Year 3 | Total</caption>
<thead>
<tr>
<th>Cost</th>
<th>Year 1</th>
<th>Year 2</th>
<th>Year 3</th>
<th>Total</th>
</tr>
</thead>
<tbody><tr>
<td>Responder-based pricing (10-15 users × $15-30/mo)***</td>
<td>$2K-$5K</td>
<td>$2K-$5K</td>
<td>$2K-$5K</td>
<td>$5K-$16K</td>
</tr>
<tr>
<td>Enterprise per-seat pricing (20 users × $40-100/mo)***</td>
<td>$10K-$24K</td>
<td>$10K-$24K</td>
<td>$10K-$24K</td>
<td>$29K-$72K</td>
</tr>
<tr>
<td>Onboarding &amp; setup</td>
<td>$3K-$8K</td>
<td>$0</td>
<td>$0</td>
<td>$3K-$8K</td>
</tr>
<tr>
<td><strong>Total (responder-based)</strong></td>
<td><strong>$5K-$13K</strong></td>
<td><strong>$2K-$5K</strong></td>
<td><strong>$2K-$5K</strong></td>
<td><strong>$11K-$27K</strong></td>
</tr>
<tr>
<td><strong>Total (enterprise per-seat)</strong></td>
<td><strong>$13K-$32K</strong></td>
<td><strong>$10K-$24K</strong></td>
<td><strong>$10K-$24K</strong></td>
<td><strong>$35K-$83K</strong></td>
</tr>
</tbody></table>
<p>***Vendor pricing varies widely. Responder-based tools (pricing per on-call user) are typical for startups and mid-size teams. Enterprise per-seat licensing (pricing per employee) is common with PagerDuty, OpsGenie, and similar tools at higher tiers.</p>
<p><strong>Open source / self-host (3-year TCO example for a 20-person company).</strong> Totals below show the same table under two maintainer assumptions (0.1 FTE optimistic vs 0.25 FTE typical):</p>
<table><caption class="sr-only">Cost | Year 1 | Year 2 | Year 3 | Total</caption>
<thead>
<tr>
<th>Cost</th>
<th>Year 1</th>
<th>Year 2</th>
<th>Year 3</th>
<th>Total</th>
</tr>
</thead>
<tbody><tr>
<td>Dedicated maintainer (0.1-0.25 FTE)</td>
<td>$25K-$100K</td>
<td>$25K-$100K</td>
<td>$25K-$100K</td>
<td>$75K-$300K</td>
</tr>
<tr>
<td>Infrastructure &amp; hosting*</td>
<td>$3K-$10K</td>
<td>$3K-$10K</td>
<td>$3K-$10K</td>
<td>$9K-$30K</td>
</tr>
<tr>
<td>Upgrades &amp; migrations**</td>
<td>$0</td>
<td>$15K-$30K</td>
<td>$0</td>
<td>$15K-$30K</td>
</tr>
<tr>
<td><strong>Total (0.1 FTE)</strong></td>
<td><strong>$28K-$50K</strong></td>
<td><strong>$43K-$80K</strong></td>
<td><strong>$28K-$50K</strong></td>
<td><strong>$99K-$180K</strong></td>
</tr>
<tr>
<td><strong>Total (0.25 FTE typical)</strong></td>
<td><strong>$65K-$110K</strong></td>
<td><strong>$80K-$140K</strong></td>
<td><strong>$65K-$110K</strong></td>
<td><strong>$210K-$360K</strong></td>
</tr>
</tbody></table>
<p>0.1 FTE is optimistic (works if you're deploying a mature tool with minimal customization). 0.25 FTE is typical once you're running it in production with Slack integrations and on-call routing.</p>
<p>*Depends on HA requirements, audit logging/retention, and whether you run paging/telephony yourself.<br />**Common triggers: Slack API changes, auth/security model changes, major version upgrades, or compliance asks (RBAC/audit/retention).</p>
<p><strong>Sensitivity check:</strong> Your numbers will differ based on location, team structure, and requirements. If your maintainer is 0.1 FTE instead of 0.25 FTE, subtract ~$25K-$40K/year from build costs. If you avoid rebuilds entirely, subtract another $30K-$50K. The gap narrows but rarely closes. Under typical assumptions, build costs 3-8x more than buy over three years.</p>
<p>The gap is wider than most teams think. And the build model assumes nothing catastrophic happens: no major rewrites, no security incidents, no key engineer departures.</p>

<h2 id="how-the-numbers-change">How the Numbers Change</h2>
<p>Most arguments about build vs buy come down to two variables: how much time the maintainer actually spends, and how the vendor prices seats.</p>
<p>If you're optimistic and assume 0.1 FTE with no rebuilds, build drops to ~$92K-$165K over 3 years. That narrows the gap with buying considerably. But 0.1 FTE rarely holds once the tool is in production and people start requesting features.</p>
<p>Under typical assumptions (0.25 FTE, one rebuild or migration event, normal Slack and compliance churn), build and self-host run 3-8x the buy-side cost.</p>
<p>The one scenario where buying looks less attractive: if your vendor prices per employee rather than per responder, and you're forced into a higher enterprise tier. In that case, self-hosting can be rational, but only if you can name an owner and accept the upgrade burden.</p>

<h2 id="the-open-source-path">The Open Source Path</h2>
<p>Open source is a legitimate option if you want to avoid both building from zero and paying license fees. But the options shrank considerably in 2025.</p>
<p>Netflix <a href="https://github.com/Netflix/dispatch" target="_blank" rel="noopener noreferrer">archived Dispatch</a> in September 2025. It was the most production-ready self-hosted option for years. It's read-only forever now. Netflix had hundreds of engineers maintaining it and still walked away.</p>
<p>Grafana closed-sourced OnCall. The OSS version <a href="https://grafana.com/blog/grafana-oncall-maintenance-mode/" target="_blank" rel="noopener noreferrer">entered maintenance mode in March 2025</a> and is scheduled to be fully archived on March 24, 2026. Cloud connection, SMS, phone, and push notifications all stop working after that date. Grafana consolidated everything into a closed-source Cloud IRM product.</p>
<p>Two of the biggest names in open source incident management either archived or closed-sourced their tools in the same twelve-month window. That's the context for what follows.</p>
<h3 id="what39s-actually-left">What's Actually Left</h3>
<p><a href="https://github.com/incidentalhq/incidental" target="_blank" rel="noopener noreferrer">Incidental</a> has Slack integration and status pages, with a hosted option at incidental.dev. It's the most capable truly open source option remaining, though it's still early-stage (v0.1.0).</p>
<p><a href="https://github.com/incidentbot/incidentbot" target="_blank" rel="noopener noreferrer">incident-bot</a> (<a href="https://docs.incidentbot.io" target="_blank" rel="noopener noreferrer">docs</a>) is Slack-based, self-hostable, Python/PostgreSQL. Integrates with PagerDuty, Jira, Confluence, Statuspage, GitLab, and Zoom. Smaller project, limited on compliance and RBAC out of the box.</p>
<p>Both are MIT licensed. Both are small projects compared to what Dispatch and Grafana OnCall were.</p>
<p>Also worth knowing: <a href="https://github.com/incidentfox/incidentfox" target="_blank" rel="noopener noreferrer">IncidentFox</a> is an AI-powered SRE platform. The core is Apache 2.0, but the production security layer (sandbox isolation, credential injection) is BSL 1.1, meaning production use of those components requires a commercial license. Read the LICENSING.md before deploying.</p>
<p>The tradeoff with open source is straightforward. You eliminate licensing cost but not maintenance cost. Someone still owns upgrades, security patches, Slack API changes, and the 2 AM call when it breaks. Budget 0.1-0.25 FTE and treat it like a vendor relationship, not a one-time install.</p>

<h2 id="the-hybrid-approach">The Hybrid Approach</h2>
<p>In practice, few teams go fully build or fully buy. What works best for most is buying or self-hosting the core workflow (alerting, escalation, timeline) and building custom integrations on top. That gets you 80% of the value at 20% of the maintenance burden. This is where AI coding tools genuinely earn their keep: writing glue code between your incident tool and internal systems, not building the core tool itself.</p>
<p>If you go the self-host route with Incidental or incident-bot, treat it like a vendor relationship. Dedicate an owner, budget for regular upgrades, plan for Slack API changes. "It's free" doesn't mean "it's free of work."</p>
<p>And if you're small enough that none of this feels urgent yet, start with a structured Slack workflow and switch when you hit the triggers in the checklist below. Don't prematurely optimize, and don't wait until you're drowning.</p>

<h2 id="four-questions-to-answer-honestly">Four Questions to Answer Honestly</h2>
<p>Before you commit either way, answer these honestly:</p>
<p>Can you name the person who will own this for the next two years? Not "the team" or "we'll rotate it." A specific person with time allocated. If the answer is "we'll figure it out," you should buy.</p>
<p>What happens when that person leaves? If the code is well-documented, tested, and multiple people understand it, you're probably fine. If it's one person's project that nobody else has touched, you're building a liability.</p>
<p>Is your incident tool on separate infrastructure from your product? Because if it shares the same database, the same deploy pipeline, the same SSO, it goes down when your product goes down. Most teams that build in-house make this mistake, and it only becomes obvious during a real P0.</p>
<p>What else could your engineers be working on? A senior engineer spending 25% of their time on an internal incident tool is a senior engineer not spending 25% of their time on your product. At $62K-$100K/year in opportunity cost, that's a real number.</p>
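<p>The opportunity-cost figure above is easy to sanity-check. A quick sketch of the arithmetic, using the $250K-$400K fully-loaded band and the 25% allocation from the paragraph above:</p>

```python
# Opportunity cost of a senior engineer spending part of their
# time maintaining an internal incident tool.
def opportunity_cost(fully_loaded_salary: float, allocation: float = 0.25) -> float:
    """Annual cost of the engineering time not spent on product work."""
    return fully_loaded_salary * allocation

low = opportunity_cost(250_000)   # bottom of the fully-loaded band
high = opportunity_cost(400_000)  # top of the band
print(f"${low:,.0f}-${high:,.0f}/year")  # $62,500-$100,000/year
```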

<h2 id="decision-checklist-when-to-buy">Decision Checklist: When to Buy</h2>
<p>Triggers that suggest you're ready for a dedicated incident management platform:</p>
<ul>
<li>On-call rotation involves ≥8 people</li>
<li>You're handling ≥4 incidents per month</li>
<li>≥3 teams are regularly involved in incident response</li>
<li>You have customer-facing SLAs or enterprise customers asking about incident processes</li>
<li>Compliance requirements exist (audit logs, retention, RBAC)</li>
<li>You need stakeholder updates within 10-15 minutes, reliably</li>
<li>Your current ad-hoc system failed during a real incident</li>
</ul>
<p>If 3+ apply, you're in buy territory.</p>

<h2 id="when-building-actually-makes-sense">When Building Actually Makes Sense</h2>
<p>I want to be fair here. There are teams where building is genuinely the right call.</p>
<p>If you have regulatory constraints that no vendor can meet (specific data residency requirements, mandated audit log formats, custom approval workflows tied to proprietary systems), building makes sense. If you're 500+ engineers with complex multi-team incident processes, off-the-shelf tools may not fit, though at that scale you have a team dedicated to internal tooling anyway.</p>
<p>Sometimes building is educational, and that's fine too. Just be honest that it's a learning project, not a production system, and budget for the eventual rewrite.</p>
<h3 id="when-it-actually-works">When it actually works</h3>
<p>An 80-person fintech we've talked to had to build because they needed EU data residency for specific customers (specific region, specific provider), custom approval workflows for production access tied to their fraud detection system, audit log formats mandated by regulators that weren't standard JSON, and integrations with internal systems no vendor supported.</p>
<p>Three years later, it's still maintained by 0.3 FTE of an SRE. Total cost was ~$250K-$300K over 3 years, versus maybe $200K-$270K if they'd bought and built all the custom integrations on top. They'd build again, because their requirements stayed genuinely unique.</p>
<p>The key word is "genuinely." Their requirements were regulatory constraints, not preferences. Most teams think they're unique. Few actually are.</p>

<h2 id="why-most-teams-should-buy">Why Most Teams Should Buy</h2>
<p>For teams between 20 and 200, buying is almost always the better move. Not because building can't be done (it clearly can) but because the economics don't justify it.</p>
<p>Your custom tool doesn't evolve unless you invest in it. Paid tools ship new features based on what hundreds of teams need. When Slack changes its API, vendors ship updates within weeks because it's their business. You don't own the maintenance, the security patches, or the upgrade cycles.</p>
<p>There's also the exit option. If you build something custom and hate it, you're stuck with it. If you buy and it doesn't work out, you switch. That flexibility is worth more than most teams realize.</p>
<p>And the reliability argument is simple: dedicated incident management vendors have higher uptime requirements than your startup does. Their whole business is being available when your stuff is broken.</p>

<h2 id="what-to-buy-first">What to Buy First</h2>
<p>Don't try to solve everything at once. Start with paging and escalation that work reliably over phone and SMS, timeline capture so you have a record of what happened, and comms templates for stakeholder updates. That's day one.</p>
<p>Within six months, add a status page, basic analytics (MTTR, incident frequency), and a post-incident review workflow. Everything else (advanced reporting, custom integrations, SLA tracking) can wait until you know what you actually need.</p>

<h2 id="where-ai-actually-helps">Where AI Actually Helps</h2>
<p>The highest-value use of AI in incident management isn't building the tool itself. It's features <em>within</em> the tool: auto-generated postmortem drafts, smart alert grouping, runbook suggestions. Apply AI where it saves time during and after incidents, not on maintaining the infrastructure underneath. For a real example, see how <a href="/blog/your-agent-can-manage-incidents-now">AI agents can manage incidents via MCP</a>.</p>

<h2 id="migration-what-actually-breaks">Migration: What Actually Breaks</h2>
<p>If you're migrating from a custom build to a commercial tool, expect three kinds of friction.</p>
<p>Incident ID schemes don't map cleanly. Your custom tool used <code>INC-2024-001</code>, the new tool uses <code>#1234</code>, and now every cross-reference in Jira, docs, and Slack is broken. Team habits reset too. Muscle memory around commands, templates, and workflows takes 2-4 weeks to retrain, and the first few weeks feel slower, not faster. And historical metrics become discontinuous when you switch tools mid-year, which makes year-over-year MTTR comparisons messy.</p>
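<p>One way to soften the broken cross-reference problem is a redirect table exported once during migration. A minimal sketch; the mapping values here are hypothetical, and the ID formats mirror the examples above:</p>

```python
import re

# Hypothetical export from the old tool: legacy ID -> new tool's ID.
LEGACY_TO_NEW = {
    "INC-2024-001": "#1234",
    "INC-2024-002": "#1235",
}

LEGACY_ID = re.compile(r"INC-\d{4}-\d{3}")

def rewrite_refs(text: str) -> str:
    """Swap legacy incident IDs for new ones; leave unknown IDs
    untouched so nothing silently breaks."""
    return LEGACY_ID.sub(lambda m: LEGACY_TO_NEW.get(m.group(), m.group()), text)

print(rewrite_refs("Root cause doc: see INC-2024-001 and INC-2024-099."))
# Root cause doc: see #1234 and INC-2024-099.
```

Run it once over docs and ticket bodies during the cutover window, and the old references keep resolving.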
<p>None of these are dealbreakers. But budget 2-4 weeks for the transition and expect a productivity dip.</p>

<h2 id="the-bottom-line">The Bottom Line</h2>
<p>Building has never been easier. That's exactly the trap.</p>
<p>AI tools compress the initial build from weeks to days. But the initial build was never where the money went. Maintenance, reliability, compliance, and the person who owns it. That's the real cost, and AI doesn't touch any of it.</p>
<p>The question worth asking isn't "can we build this?" It's "do we want to own this for the next three years?"</p>
<p>If incident management is core to your business and you have dedicated ownership and separate infrastructure, build. If you want genuinely open source, Incidental and incident-bot are MIT licensed and real options, though you're trading licensing cost for maintenance cost. If you're a 20-200 person team that wants something that works without dedicating engineering time to maintain it, buy. The market is moving toward Slack-first coordination and responder-based pricing; PagerDuty still wins in mature enterprises but is often <a href="/blog/best-pagerduty-alternatives">overkill for teams under 200</a>.</p>
<p>Most teams end up somewhere in between: buy or self-host the core, build the custom parts with AI. That's usually the right answer.</p>

<h2 id="common-questions">Common Questions</h2>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Should we build incident management in-house?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Only if you have a named owner with dedicated time, separate infrastructure from your product, and regulatory or workflow requirements that existing tools genuinely can't handle. For most teams of 20-200 people, the three-year cost of building is 3-8x higher than buying.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What does it actually cost to maintain a custom incident tool?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    At 0.25 FTE of a senior engineer ($250K-$400K fully-loaded), you're looking at $62K-$100K/year in maintenance alone. Add infrastructure ($3K-$10K/year) and a rebuild every 18-24 months ($30K-$50K). Over three years, that's $233K-$395K. Most of it is opportunity cost, not infrastructure spend.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    When does buying make more sense than building?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    When 3+ of these are true: your on-call rotation has 8 or more people, you're running 4+ incidents per month, 3+ teams are involved in response, you have customer-facing SLAs, compliance requirements exist, or your current ad-hoc system already failed during a real incident.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    How much does it cost to build an incident management system from scratch?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Initial build with AI tools runs $8K-$15K (1-2 weeks of engineer time). Ongoing maintenance and infrastructure add $65K-$110K/year. Over three years including one rebuild cycle, that totals $233K-$395K. The initial build is 3-6% of the three-year number.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Does AI change the build-vs-buy math?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    AI cut the initial build from weeks to days. That saves roughly $10K-$15K in Year 1. But the initial build was never the expensive part. Maintenance, Slack API changes, carrier routing, compliance work, and the bus factor when your builder leaves are all unchanged. AI made the cheapest part cheaper.
  </div>
</details>
]]></content:encoded>
      <pubDate>Tue, 10 Mar 2026 00:00:00 GMT</pubDate>
      <author>Niketa Sharma</author>
      <category><![CDATA[incident-management]]></category>
      <category><![CDATA[build-vs-buy]]></category>
      <category><![CDATA[incident-response]]></category>
      <category><![CDATA[sre]]></category>
      <category><![CDATA[devops]]></category>
      <category><![CDATA[engineering-leadership]]></category>
      <category><![CDATA[engineering-productivity]]></category>
      <category><![CDATA[incident-management-platform]]></category>
    </item>
    <item>
      <title><![CDATA[Slack Incident Management: What Works and What Breaks]]></title>
      <link>https://runframe.io/blog/slack-incident-management</link>
      <guid>https://runframe.io/blog/slack-incident-management</guid>
      <description><![CDATA[Every engineering team starts incident management the same way. Someone posts in #engineering: "prod is down." Three people reply, two investigate the same thing, and the one person who actually knows...]]></description>
      <content:encoded><![CDATA[<p>Every engineering team starts incident management the same way. Someone posts in #engineering: "prod is down." Three people reply, two investigate the same thing, and the one person who actually knows the affected service is asleep.</p>
<p>This works at 10 engineers. Everyone knows who owns what, the blast radius is small, and you can still hold the whole system in your head.</p>
<p>By 25 engineers, you're running incidents across five different Slack channels with no idea who's actually on-call. A new engineer asks "which channel?" and nobody answers because everyone assumes someone else will. The CEO finds out from a customer tweet.</p>
<p>This is a guide for teams that run incidents in Slack. Not the theoretical version from SRE textbooks. The real version, including where Slack helps, where it breaks, and when you need something more.</p>

<h2 id="how-teams-actually-run-incidents-in-slack">How Teams Actually Run Incidents in Slack</h2>
<p>There are three approaches, and most teams use some messy combination of all three.</p>
<h3 id="approach-1-the-manual-channel">Approach 1: The Manual Channel</h3>
<p>Someone declares an incident by creating a Slack channel. Usually <code>#inc-</code> or <code>#incident-</code> followed by whatever seemed descriptive at the time. People get invited manually. Updates happen in the channel. When it's resolved, someone posts a message and everyone forgets about the channel.</p>
<p>This is where every team starts. It's fine for rare incidents. It falls apart when:</p>
<ul>
<li>Two incidents happen at once and people end up in the wrong channel</li>
<li>Nobody remembers to invite the on-call person</li>
<li>The resolution message gets buried in a thread</li>
<li>Three months later, nobody can find what happened during that outage in February</li>
</ul>
<p>The biggest problem isn't the process. It's that everything depends on one person remembering eight steps in the right order while production is on fire.</p>
<h3 id="approach-2-the-homegrown-bot">Approach 2: The Homegrown Bot</h3>
<p>At some point, someone builds a Slack bot. Usually a Python script that listens for <code>/incident</code> and auto-creates a channel with a standard naming convention. Maybe it pings the on-call rotation from a spreadsheet. Maybe it posts a template message.</p>
<p>This is a real upgrade. Channel names become consistent. The initial response message always includes severity and a link to the dashboard. On-call gets notified automatically.</p>
<p>Then the engineer who built it changes teams. Slack APIs, permissions, and platform behavior change. The bot starts creating duplicate channels or missing edge cases, and nobody wants to touch the 400 lines of callback spaghetti with hardcoded credentials on a forgotten EC2 instance.</p>
<p>The bot works great for a while, then slowly rots. If you've worked at more than two startups, you've seen this movie.</p>
<h3 id="approach-3-dedicated-tooling">Approach 3: Dedicated Tooling</h3>
<p>PagerDuty, incident.io, Rootly, FireHydrant, Runframe. Tools that handle the entire incident lifecycle through Slack: creation, assignment, severity, escalation, timeline capture, and post-incident review.</p>
<p>The upside is obvious. Consistent process. Automatic audit trail. On-call routing that actually works. No bot maintenance.</p>
<p>The downside is real too. You're adding a dependency. Setup takes time. Every team member needs to learn the commands. And you're paying for it.</p>
<p>Most teams resist this transition longer than they should, not because of cost but because of setup fatigue. They've been burned by tools that promise "5-minute setup" and turn into two weeks of configuration and permissions wrangling.</p>

<h2 id="where-slack-actually-works-for-incidents">Where Slack Actually Works for Incidents</h2>
<p>Slack is good at real-time coordination. That's genuinely valuable during incidents.</p>
<p><strong>Dedicated channels create focus.</strong> A single channel per incident means everyone involved sees the same information. No cross-talk from other conversations. No "did you see my message in #engineering?" The channel IS the incident.</p>
<p><strong>Slash commands reduce friction.</strong> <code>/inc create database-outage</code> is faster than opening a dashboard, clicking through a form, and filling in six fields. Engineers are already in Slack. Meeting them there removes a context switch at the worst possible moment.</p>
<p><strong>Message history becomes the timeline.</strong> Every message in the incident channel is a timestamped record of what happened. Who said what, when. What was tried. What failed. This is the raw material for your post-incident review, and Slack captures it automatically.</p>
<p><strong>Reactions and threads handle the small stuff.</strong> Eyes emoji to signal "I'm looking at this." White check mark for "done." Threads keep debugging details and log dumps out of the main channel. These are small things, but during a fast-moving incident, keeping the main channel clean for critical updates and using reactions instead of status messages reduces noise.</p>

<h2 id="where-slack-breaks-for-incidents">Where Slack Breaks for Incidents</h2>
<p>Slack was built for team messaging. It was not built for incident management. The gaps show up fast.</p>
<h3 id="there39s-no-canonical-status">There's no canonical status</h3>
<p>Slack is a stream of text. It has no concept of "the current state of this incident." No severity field. No status tracker. No assignment. No single place that answers "what's happening right now?"</p>
<p>The current status is whatever the last person typed. Scroll up to find it. Hope it's still accurate. "What's the current status?" becomes the most-asked question in every incident channel. Three people stop investigating to type the same answer.</p>
<p>Threads make it worse. Someone posts a root cause finding in a thread. Half the responders don't see it because they're watching the main channel. Thread replies don't surface unless someone checks "Also send to channel." Most people forget. Critical information ends up buried two clicks deep.</p>
<h3 id="notifications-fail-when-they-matter-most">Notifications fail when they matter most</h3>
<p>The 2 AM page needs to wake someone up. Slack notifications are unreliable for this. Do Not Disturb overrides them. Phone notifications get grouped and silenced. Push delivery depends on Apple's and Google's notification infrastructure, which has no SLA.</p>
<p>For paging, you need phone calls or SMS with carrier-level delivery. Slack is the coordination layer, not the alerting layer. Teams that confuse the two miss pages.</p>
<h3 id="audit-trail-gaps">Audit trail gaps</h3>
<p>Slack messages can be edited and deleted. On lower-tier plans, retention limits and search restrictions mean you might not be able to find what happened during last quarter's outage.</p>
<p>If you need to demonstrate to auditors that you followed your incident process, Slack alone isn't enough. You need something that captures the timeline immutably, outside of Slack's retention rules.</p>
<h3 id="on-call-routing-doesn39t-exist">On-call routing doesn't exist</h3>
<p>Slack doesn't know who's on-call. There's no rotation concept. No escalation policy. If the primary doesn't respond in 5 minutes, Slack can't automatically page the backup.</p>
<p>This is why most teams layer an on-call tool on top. Slack handles coordination. The on-call tool handles routing. The problem is now you're context-switching between two systems during a live incident.</p>

<h2 id="the-inflection-points">The Inflection Points</h2>
<p>You don't need to formalize your incident process on day one. But there are clear moments when the informal approach stops working.</p>
<h3 id="when-you39re-handling-more-than-one-incident-at-a-time">When you're handling more than one incident at a time</h3>
<p>Two concurrent incidents in the same #incidents channel is chaos. People talking past each other. Updates for incident A getting mixed with questions about incident B. This is usually the first sign you need dedicated channels per incident.</p>
<h3 id="when-a-new-engineer-gets-paged-and-freezes">When a new engineer gets paged and freezes</h3>
<p>Your new hire gets their first page at 11 PM. They open Slack. There's no runbook pinned anywhere. They don't know if this is a SEV1 or a SEV3. They post in #engineering: "I think something's wrong with payments?" Nobody responds for 12 minutes because the people who would know are in a different timezone. By the time someone helps, the customer has already tweeted about it.</p>
<p>That's not a documentation problem. It's a process problem. If your incident response depends on context that lives in three people's heads, every new on-call rotation is a coin flip.</p>
<h3 id="when-incidents-aren39t-getting-reviewed">When incidents aren't getting reviewed</h3>
<p>If your post-incident process is "someone writes a Google Doc when they feel like it," you're not learning from incidents. The information exists in the Slack channel, but extracting it into a useful review is manual, tedious work. So it doesn't happen.</p>
<h3 id="when-you-pass-20-25-people">When you pass 20-25 people</h3>
<p>Above 20-25 engineers, teams are specialized enough that "whoever's around" on-call stops working. You need formal rotations, clear escalation paths, and a process that doesn't depend on tribal knowledge.</p>
<h3 id="when-compliance-enters-the-picture">When compliance enters the picture</h3>
<p>SOC2 (or ISO 27001) auditors want to see that you have an incident management process, that you follow it, and that you can prove it. Slack screenshots don't cut it. You need structured records: when the incident was declared, who responded, what the severity was, when it was resolved, and what the follow-up actions were.</p>

<h2 id="setting-up-slack-incident-management-that-works">Setting Up Slack Incident Management That Works</h2>
<p>If you're formalizing your process, here's what to get right regardless of whether you use a tool or build it yourself.</p>
<h3 id="1-one-channel-per-incident-auto-created">1. One channel per incident, auto-created</h3>
<p>Naming convention matters. <code>inc-042-payment-api-timeout</code> tells you the incident number, what it is, and makes it searchable later. Manual channel creation is the first thing to automate because it's the first bottleneck during an incident.</p>
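<p>The naming step is a one-liner to automate. A sketch that produces names like the example above; the slug rules are a simplifying assumption (Slack also allows underscores):</p>

```python
import re

def incident_channel_name(number: int, title: str) -> str:
    """Build a channel name like 'inc-042-payment-api-timeout'.
    Assumes lowercase letters, digits, and hyphens only, capped at
    Slack's 80-character channel-name limit."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"inc-{number:03d}-{slug}"[:80]

print(incident_channel_name(42, "Payment API timeout"))
# inc-042-payment-api-timeout
```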
<h3 id="2-severity-in-the-channel-topic">2. Severity in the channel topic</h3>
<p>Set the channel topic to include severity, status, and incident commander. <code>/topic SEV1 | Investigating | IC: @alice</code> gives anyone who joins the channel immediate context without asking.</p>
<h3 id="3-a-single-command-to-declare">3. A single command to declare</h3>
<p>Whether it's <code>/inc create</code> or a custom bot command, the declaration should do everything: create the channel, set the severity, notify the on-call person, and post the initial context. One command, not five manual steps.</p>
<h3 id="4-automatic-on-call-notification">4. Automatic on-call notification</h3>
<p>The right responder should be notified automatically based on the affected service, ownership map, and escalation policy. This is where most DIY setups fail. Maintaining an accurate on-call schedule in a spreadsheet or JSON file is a losing battle.</p>
<h3 id="5-timeline-capture-that-doesn39t-depend-on-humans">5. Timeline capture that doesn't depend on humans</h3>
<p>Every message in the incident channel should be captured as a timeline entry. Automatically. Not "someone remembers to take notes." The automatic transcript is what makes post-incident reviews actually happen, because the raw material already exists.</p>
<h3 id="6-status-updates-on-a-cadence">6. Status updates on a cadence</h3>
<p>For SEV1 and above, post a status update every 15-30 minutes. Not when someone asks. On a schedule. This reduces repeated status requests and keeps stakeholders informed without them joining the channel and adding noise.</p>
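<p>The cadence check is simple enough to automate rather than trust to memory. A sketch, assuming a 15-minute SEV1 interval:</p>

```python
from datetime import datetime, timedelta

UPDATE_INTERVAL = timedelta(minutes=15)  # SEV1 cadence from the step above

def update_due(last_update: datetime, now: datetime) -> bool:
    """True when the next scheduled SEV1 status update is due."""
    return now - last_update >= UPDATE_INTERVAL

# 20 minutes since the last update: post one now.
print(update_due(datetime(2026, 3, 10, 2, 0), datetime(2026, 3, 10, 2, 20)))  # True
```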
<h3 id="7-clear-escalation-path">7. Clear escalation path</h3>
<p>When the primary on-call can't resolve it, what happens? If the answer is "ping someone in Slack and hope they see it," you'll miss escalations. Define the path: primary to backup to team lead to engineering manager. Automate it if you can.</p>
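<p>Defining the path as data makes it automatable and auditable. A minimal sketch; the timeouts are illustrative, not a recommendation:</p>

```python
# Escalation policy as data: (minutes without acknowledgement, who to page).
ESCALATION_POLICY = [
    (0, "primary on-call"),
    (5, "backup on-call"),
    (15, "team lead"),
    (30, "engineering manager"),
]

def who_to_page(minutes_unacked: int) -> str:
    """Return the deepest escalation step reached for an unacknowledged page."""
    target = ESCALATION_POLICY[0][1]
    for threshold, name in ESCALATION_POLICY:
        if minutes_unacked >= threshold:
            target = name
    return target

print(who_to_page(7))   # backup on-call
print(who_to_page(40))  # engineering manager
```

The point isn't this exact chain; it's that the policy lives in one reviewable place instead of in someone's head.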

<h2 id="tools-vs-diy-the-real-tradeoff">Tools vs. DIY: The Real Tradeoff</h2>
<p>Building a Slack bot for incident management is straightforward. The initial bot takes a weekend. Creating channels, posting templates, pinging on-call from a schedule. That part isn't hard.</p>
<p>The hard part is everything after:</p>
<ul>
<li>Slack APIs, permissions, and platform behavior change regularly. Internal bots that nobody actively maintains break in small but painful ways.</li>
<li>On-call schedules change weekly. Someone has to update the source of truth.</li>
<li>Escalation logic has edge cases. What if the primary is in a different timezone? What if the backup is also on PTO?</li>
<li>Phone and SMS paging is an ops problem, not a code problem. Carrier routing, international delivery, deliverability filtering.</li>
<li>Audit logging for compliance needs to be immutable and retained for the right duration.</li>
<li>The engineer who built the bot leaves. Nobody else understands the code.</li>
</ul>
<p>The question isn't "can we build this?" It's "do we want to maintain this for three years?" For most teams above 20-25 people, the answer is no. The total cost of ownership of a homegrown solution is <a href="/blog/incident-management-build-or-buy">higher than most teams expect</a>.</p>
<p>The best Slack-native incident tools don't pull engineers out of Slack for the critical path. They keep declaration, coordination, escalation, status updates, and timeline capture inside the channel while giving you structured incident records outside Slack. The bar isn't "does it have a Slack integration." It's "does it remove process overhead during a live incident?" We built <a href="/">Runframe</a> to clear that bar.</p>

<h2 id="what-good-looks-like">What Good Looks Like</h2>
<p>It's 2:14 AM. Your monitoring fires a SEV1 alert. The on-call engineer's phone rings. She picks up, half awake, opens Slack. The incident channel already exists. The channel topic says <code>SEV1 | Payment processing failure | IC: @alice</code>. Alert context is pinned: which service, which region, when it started, link to the dashboard. The escalation policy already notified the payments team lead.</p>
<p>She types <code>/inc update investigating connection pool exhaustion in payments-api-east</code> and the status is captured. Stakeholders see the update without interrupting. Nobody asks "what's the current status?" because it's right there, updated automatically.</p>
<p>Forty minutes later, the fix is deployed. She runs <code>/inc resolve connection pool limit increased, root cause was config drift after Tuesday deploy</code>. The timeline is already written. Tomorrow's post-incident review starts from that transcript, not a blank page.</p>
<p>Compare that to the alternative: her phone buzzes with a Slack notification she almost sleeps through. She scrolls through #engineering trying to find the alert. Creates a channel, can't remember the naming convention. Manually pings three people. One is on vacation. Twenty minutes in, someone asks "is this a SEV1 or SEV2?" and the actual debugging hasn't started.</p>
<p>The difference isn't heroics or talent. It's whether your process works when the person running it is half asleep and stressed.</p>
<p>Slack is excellent for coordination. It is not, by itself, an incident management system. Once you need to page the right person, track severity, prove to auditors what happened, and make sure the same process runs at 2 AM as it does at 2 PM, chat alone stops being enough.</p>

<h2 id="common-questions">Common Questions</h2>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What's the difference between Slack incident management and using PagerDuty with Slack?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    PagerDuty handles alerting and on-call routing. Slack handles coordination. Most teams use both because PagerDuty's Slack integration lets you acknowledge and escalate from Slack. The limitation is that you're still managing two systems. Tools like <a href="/">Runframe</a> combine on-call scheduling with Slack-native paging and incident coordination, so teams don't need a separate alerting tool.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Can I run incidents in Slack without any tools?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Yes. Create a dedicated channel, invite responders, and use a pinned message for status updates. It works for small teams with infrequent incidents. It breaks down when you're handling multiple incidents, need on-call routing, or have compliance requirements.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    How do I set up on-call rotations in Slack?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Slack doesn't have native on-call support. You need either a dedicated on-call tool (PagerDuty, Runframe, OpsGenie) or a bot that reads from a schedule. The minimum: a rotation that auto-notifies the right person when an incident is declared. Build your rotation with our <a href="/tools/oncall-builder">free on-call builder</a>.
  </div>
</details>
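<p>The schedule-reading bot is less work than it sounds: the core is a date lookup. A minimal sketch of a weekly round-robin, with placeholder names and a hypothetical rotation start date:</p>

```python
from datetime import date

def on_call(engineers: list[str], rotation_start: date, today: date,
            shift_days: int = 7) -> str:
    """Return who is on call today for a simple round-robin rotation."""
    elapsed = (today - rotation_start).days
    if elapsed < 0:
        raise ValueError("rotation has not started yet")
    index = (elapsed // shift_days) % len(engineers)
    return engineers[index]

team = ["amara", "bo", "chen"]   # placeholder names
start = date(2026, 1, 5)         # a Monday; shifts flip weekly
print(on_call(team, start, date(2026, 1, 6)))   # amara (week 0)
print(on_call(team, start, date(2026, 1, 14)))  # bo (week 1)
print(on_call(team, start, date(2026, 1, 26)))  # amara again (week 3 wraps)
```

<p>What this sketch deliberately omits is the hard part: overrides, vacations, follow-the-sun handoffs, and escalation when the primary doesn't acknowledge. That's what the dedicated tools are for.</p>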

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What Slack channel naming convention should I use for incidents?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Use a consistent prefix with an incident number: <code>inc-042-brief-description</code>. The number makes incidents sortable and referenceable. The description makes them searchable. Keep the whole name under 80 characters, Slack's hard limit for channel names.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    How do I handle incident post-mortems from Slack?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Capture the full message timeline from the incident channel automatically. Use that as the raw material for your post-incident review, not a blank Google Doc. The timeline already contains what happened, when, and who was involved. Your review adds the "why" and the action items. See our <a href="/blog/post-incident-review-template">post-incident review templates</a> for ready-to-use formats.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    When should I move from a DIY Slack setup to a dedicated tool?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Three signals: you're handling multiple concurrent incidents, new engineers can't figure out the process without asking, and post-incident reviews aren't happening because reconstructing the timeline is too painful. For most teams, this happens above 20-25 engineers.
  </div>
</details>
]]></content:encoded>
      <pubDate>Sun, 08 Mar 2026 00:00:00 GMT</pubDate>
      <author>Niketa Sharma</author>
      <category><![CDATA[slack-incident-management]]></category>
      <category><![CDATA[incident-management]]></category>
      <category><![CDATA[slack]]></category>
      <category><![CDATA[chatops]]></category>
      <category><![CDATA[incident-response]]></category>
      <category><![CDATA[on-call-rotation]]></category>
      <category><![CDATA[sre]]></category>
      <category><![CDATA[devops]]></category>
    </item>
    <item>
      <title><![CDATA[PagerDuty Alternatives 2026: Pricing and Features Compared]]></title>
      <link>https://runframe.io/blog/best-pagerduty-alternatives</link>
      <guid>https://runframe.io/blog/best-pagerduty-alternatives</guid>
      <description><![CDATA[Nobody switches incident management tools for fun.
You migrate escalation policies. You retrain engineers. You pray nothing breaks during cutover. Most teams put it off for months.
So when teams do sw...]]></description>
      <content:encoded><![CDATA[<p>Nobody switches incident management tools for fun.</p>
<p>You migrate escalation policies. You retrain engineers. You pray nothing breaks during cutover. Most teams put it off for months.</p>
<p>So when teams do switch away from PagerDuty, it's worth asking why. We spent the last few weeks reading what engineers are saying on Reddit and Hacker News, in G2 reviews, and in direct conversations.</p>
<p>Six PagerDuty alternatives worth evaluating in 2026 are Runframe, incident.io, Rootly, Grafana Cloud IRM, Better Stack, and FireHydrant. Each fits a different team size and budget. Here's how to pick the right one.</p>
<p><strong>Disclosure:</strong> Runframe is our product. It's included alongside other options. The rest of this list is based on public pricing, community sentiment, and published vendor information. Pricing checked March 2026.</p>

<h2 id="what-you39ll-learn">What You'll Learn</h2>
<ul>
<li><a href="#the-opsgenie-factor">The OpsGenie shutdown and why it matters now</a></li>
<li><a href="#why-teams-are-looking">Why teams are evaluating PagerDuty alternatives</a></li>
<li><a href="#quick-picks">Quick picks by use case</a></li>
<li><a href="#comparison-table">Full comparison table</a></li>
<li><a href="#alternatives-worth-evaluating">6 alternatives worth evaluating (and a 7th you might not expect)</a></li>
<li><a href="#how-to-pick">How to pick the right tool for your team size</a></li>
<li><a href="#migration-checklist">Migration checklist</a></li>
</ul>

<h2 id="the-opsgenie-factor">The OpsGenie Factor</h2>
<p>Before we get into PagerDuty alternatives, there's a catalyst reshaping this market right now.</p>
<p>Atlassian is shutting down OpsGenie. New sales ended June 4, 2025. Full shutdown hits April 5, 2027, about 13 months out as of this writing. Thousands of teams need to migrate (<a href="https://www.atlassian.com/software/opsgenie/migration" target="_blank" rel="noopener noreferrer">source</a>).</p>
<p>Atlassian is directing users to <a href="https://www.atlassian.com/software/jira/service-management" target="_blank" rel="noopener noreferrer">Jira Service Management</a> or <a href="https://www.atlassian.com/software/compass" target="_blank" rel="noopener noreferrer">Compass</a>. After migrating to JSM, alert data is subject to plan-based retention: Free gets 1 month, Standard gets 1 year, Premium gets 3 years (<a href="https://support.atlassian.com/opsgenie/docs/support-coverage-after-migration/" target="_blank" rel="noopener noreferrer">source</a>). OpsGenie Enterprise supported effectively indefinite alert retention. Many teams are using this as a chance to evaluate the full market, not just move to another Atlassian product.</p>
<p>Even if you're on PagerDuty, this matters. Thousands of teams evaluating tools at the same time means alternatives are competing harder on pricing and features. It's a good time to be a buyer.</p>
<p>We wrote a full <a href="/blog/opsgenie-migration-guide">OpsGenie migration guide</a> and an <a href="/blog/best-opsgenie-alternatives">OpsGenie alternatives comparison</a> if that's your situation.</p>

<h2 id="why-teams-are-looking">Why Teams Are Looking</h2>
<p>PagerDuty built a category. It solved a real problem in 2009: reliable alert delivery. For large organizations with 100+ services and dedicated SRE teams, it's still a strong choice.</p>
<p>But the bottleneck shifted. Alert delivery isn't the hard part anymore. Coordinating the response, keeping stakeholders updated, running postmortems that people actually read—<a href="/blog/state-of-incident-management-2025">that's where teams lose time now</a>.</p>
<p>Three patterns keep coming up:</p>
<h3 id="the-pricing-math-changed">The pricing math changed</h3>
<p>PagerDuty does not offer a free tier. List prices (before discounts) are $21/user/month (Professional) and $41/user/month (Business). Most teams need add-ons. Status Pages list at $89 per 1,000 subscribers/month (<a href="https://www.pagerduty.com/pricing/incident-management/" target="_blank" rel="noopener noreferrer">source</a>). AIOps starts at $699/month (<a href="https://www.pagerduty.com/pricing/aiops/" target="_blank" rel="noopener noreferrer">source</a>). PagerDuty Advance is $415/month on an annual plan (<a href="https://www.pagerduty.com/pricing/aiops/" target="_blank" rel="noopener noreferrer">source</a>).</p>
<p><strong>Example:</strong> 25 people on Business = ~$12,300/year list. Add Status Pages + AIOps + Advance and you can exceed $30,000/year. Enterprise contracts vary, so these list prices are a starting point, not the final number.</p>
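<p>If you want to sanity-check a quote against these list prices, the arithmetic is easy to script. A small sketch, using the list prices above and the minimum add-on tiers (larger Status Pages or AIOps tiers are what push the total past $30,000):</p>

```python
def annual_list_cost(seats: int, per_user_month: float,
                     flat_addons_month: float = 0.0) -> float:
    """Annual list price: per-seat licences plus flat monthly add-ons."""
    return 12 * (seats * per_user_month + flat_addons_month)

# 25 seats on Business at $41/user/month
base = annual_list_cost(25, 41)
# Minimum add-on tiers: Status Pages $89 + AIOps $699 + Advance $415 per month
with_addons = annual_list_cost(25, 41, flat_addons_month=89 + 699 + 415)
print(base)         # 12300.0
print(with_addons)  # 26736.0
```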
<p>Pricing comes up frequently in recent public reviews. Many reviewers mention paying for features their team doesn't actively use.</p>
<h3 id="the-feature-set-outgrew-smaller-teams">The feature set outgrew smaller teams</h3>
<p>PagerDuty has an enormous feature set. For teams running complex service dependencies with dedicated SRE, that depth matters.</p>
<p>For teams at 10-80 engineers who need on-call rotation, escalation, and coordination, it can be more than they'll ever configure. Scheduling, holiday management, and overrides are common friction points. New hires find the setup overwhelming when all they need is to know who's on call.</p>
<p>This isn't a knock on PagerDuty. It's a fit question. A tool built for 500-person orgs works differently than one built for 30-person teams.</p>
<h3 id="incident-work-moved-to-slack">Incident work moved to Slack</h3>
<p>Alert fires at 3 AM. The on-call engineer gets paged, then opens Slack. Creates a channel. Pulls in teammates. Status updates, decisions, postmortem discussions: all in Slack.</p>
<p>This creates a context-switching loop. PagerDuty's web UI handles alert management. Slack handles the actual coordination. You bounce between the two on every incident.</p>
<p>PagerDuty has been improving its Slack integration, but tools like incident.io, Rootly, and <a href="/slack">Runframe</a> were designed with Slack as the primary interface from day one. That's a different starting point, and it shows up in the daily workflow.</p>

<h2 id="quick-picks">Quick Picks</h2>
<table><caption class="sr-only">If you need | Look at</caption>
<thead>
<tr>
<th>If you need</th>
<th>Look at</th>
</tr>
</thead>
<tbody><tr>
<td>Slack-native incident management</td>
<td>incident.io, Rootly, Runframe</td>
</tr>
<tr>
<td>All-in-one monitoring + paging + status page</td>
<td>Better Stack</td>
</tr>
<tr>
<td>Already on Grafana</td>
<td>Grafana Cloud IRM</td>
</tr>
<tr>
<td>Guided PagerDuty migration</td>
<td>FireHydrant</td>
</tr>
<tr>
<td>Startup-friendly pricing (10-200 engineers)</td>
<td>Runframe</td>
</tr>
<tr>
<td>Enterprise scale + Slack-native workflows</td>
<td>incident.io</td>
</tr>
</tbody></table>

<h2 id="comparison-table">Comparison Table</h2>
<table><caption class="sr-only">Tool | Starting price | Best for | Slack-native | Free tier</caption>
<thead>
<tr>
<th>Tool</th>
<th>Starting price</th>
<th>Best for</th>
<th>Slack-native</th>
<th>Free tier</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Runframe</strong></td>
<td>$15/user/month ($12 annual)</td>
<td>10-200 engineers, startups</td>
<td>Yes</td>
<td>Yes</td>
</tr>
<tr>
<td><strong>incident.io</strong></td>
<td>$19/user/month ($15 annual) + on-call</td>
<td>50-500+ engineers, enterprise</td>
<td>Yes</td>
<td>Yes (Basic)</td>
</tr>
<tr>
<td><strong>Rootly</strong></td>
<td>Usage-based</td>
<td>Teams focused on coordination</td>
<td>Yes</td>
<td>No</td>
</tr>
<tr>
<td><strong>Grafana Cloud IRM</strong></td>
<td>Free: 3 users. Pro: $19/mo + $20/active user above 3</td>
<td>Grafana ecosystem teams</td>
<td>No</td>
<td>Yes (3 users)</td>
</tr>
<tr>
<td><strong>Better Stack</strong></td>
<td>Free tier available</td>
<td>Small teams wanting all-in-one</td>
<td>No</td>
<td>Yes</td>
</tr>
<tr>
<td><strong>FireHydrant</strong></td>
<td>$9,600/year (20 responders)</td>
<td>Teams wanting runbook automation</td>
<td>No</td>
<td>No</td>
</tr>
</tbody></table>

<h2 id="alternatives-worth-evaluating">Alternatives Worth Evaluating</h2>
<p>Not every PagerDuty alternative is worth your time. Here are six that are, plus a seventh option you might not expect.</p>
<h3 id="1-runframe">1. Runframe</h3>
<p><strong>For:</strong> Engineering teams with 10-200 engineers who've outgrown scripts and spreadsheets but don't want to pay enterprise prices for features they'll never use.</p>
<p>This is what we build. So we'll be direct about what it does and where it falls short.</p>
<p>Runframe gives you the full incident lifecycle in one tool: on-call scheduling with coverage gap analysis, incident coordination with war rooms, escalation policies, SLA tracking, a service catalog, AI-powered postmortems, RBAC, audit logs, and Jira integration. Monitoring comes in via Datadog, Prometheus, and AWS CloudWatch webhooks. Everything runs through Slack. Declare incidents, page on-call, update stakeholders, all without leaving the channel.</p>
<p>Setup takes days, not quarters. No dedicated admin required.</p>
<p><strong>Pricing:</strong> Free plan. $15/user/month, or $12 annually. No add-ons. No "contact sales." <a href="/pricing">See pricing</a>.</p>
<p><strong>Not the right fit if:</strong> You're operating at enterprise scale with hundreds of services, complex dependency management, or strict compliance/procurement requirements. In those cases, PagerDuty or incident.io may be a better fit. <a href="/comparisons/runframe-vs-pagerduty">Full comparison</a>.</p>
<h3 id="2-incidentio">2. incident.io</h3>
<p><strong>For:</strong> Mid-market to enterprise teams (50-500+ engineers) with budget for a premium tool.</p>
<p>Deep Slack integration. Strong workflows. AI-assisted postmortems. 1,500+ teams including Netflix and Etsy. Raised $62M Series B in 2025 for AI incident resolution.</p>
<p><strong>Pricing (<a href="https://incident.io/pricing" target="_blank" rel="noopener noreferrer">source</a>):</strong> Basic free (single-team on-call). Team: $19/user/month ($15 on annual) + $10/user/month on-call add-on. Pro: $25/user/month + $20/user/month on-call. Enterprise: custom.</p>
<p><strong>Not the right fit if:</strong> You're a small team (under 30 engineers) looking for something lightweight. The full stack runs $25-45/user/month, which can be more tool than you need at that size.</p>
<h3 id="3-rootly">3. Rootly</h3>
<p><strong>For:</strong> Teams that want strong incident coordination with transparent pricing.</p>
<p>Rootly is Slack-native and focused on the coordination side of incidents: automated workflows, role assignment, status updates, and retrospectives. Transparent, usage-based pricing. No hidden upsells. Good automation for repetitive incident tasks like creating channels, paging responders, and posting status updates.</p>
<p><strong>Pricing:</strong> Usage-based, publicly listed on their website.</p>
<p><strong>Not the right fit if:</strong> You need alerting and paging in the same tool. Rootly focuses on coordination. You'll likely still need a separate paging solution for on-call.</p>
<h3 id="4-grafana-cloud-irm-oncall-incident">4. Grafana Cloud IRM (OnCall / Incident)</h3>
<p><strong>For:</strong> Teams already using Grafana for dashboards.</p>
<p>Natural fit if you're in the Grafana ecosystem. Good alert routing and escalation.</p>
<p><strong>Pricing (<a href="https://grafana.com/pricing/" target="_blank" rel="noopener noreferrer">source</a>):</strong> Free tier includes 3 active IRM users. Pro: $19/month platform fee (includes 3 active IRM users) + $20/month per additional active IRM user. An active IRM user is anyone in on-call schedules, escalation chains, or who takes incident actions during the billing month.</p>
<p><strong>Not the right fit if:</strong> You're not already on Grafana. The open-source Grafana OnCall entered maintenance mode March 11, 2025 (<a href="https://grafana.com/blog/2025/03/11/grafana-oncall-maintenance-mode/" target="_blank" rel="noopener noreferrer">source</a>). New feature development is focused on Grafana Cloud IRM. The OSS version is maintenance-only and certain services stop working after archival.</p>
<h3 id="5-better-stack">5. Better Stack</h3>
<p><strong>For:</strong> Small teams that want monitoring + incidents + status pages in one place.</p>
<p>All-in-one approach. Replaces your monitoring, paging, and status page with a single product. Free tier with up to 10 monitors, a status page, 1 on-call responder, and Slack/email alerts (<a href="https://betterstack.com/pricing" target="_blank" rel="noopener noreferrer">source</a>).</p>
<p><strong>Pricing:</strong> Free tier. Paid plans are transparent.</p>
<p><strong>Not the right fit if:</strong> You need deep incident coordination or postmortem workflows. Better Stack does many things, but none as deep as a specialized tool.</p>
<h3 id="6-firehydrant">6. FireHydrant</h3>
<p><strong>For:</strong> Teams who want incident management with service dependencies and runbook automation built in.</p>
<p>Dedicated PagerDuty migration path. Service dependencies, runbook automation, and change management included, not add-ons.</p>
<p><strong>Pricing:</strong> Platform Pro is $9,600/year for up to 20 responders (<a href="https://firehydrant.com/pricing/" target="_blank" rel="noopener noreferrer">source</a>). Enterprise: custom.</p>
<p><strong>Not the right fit if:</strong> You're a very small team (under 15 engineers). More features than you'll need at that size.</p>

<h3 id="7-build-your-own">7. Build Your Own</h3>
<p>There's a seventh option nobody lists in comparison posts: build it yourself.</p>
<p>With Claude, Cursor, and Copilot, a good engineer can spin up a Slack bot that creates incident channels, pages on-call, and logs a timeline in a weekend. It'll work great for three months.</p>
<p>Then Slack changes their permissions model. Or your paging script hits carrier rate limits at 2 AM. Or the engineer who built it takes a new job and nobody understands the state machine.</p>
<p>We wrote a full <a href="/blog/incident-management-build-or-buy">build vs buy analysis</a> with real TCO numbers. The short version:</p>
<p><strong>Building costs $246K to $413K over three years</strong> for a 20-person company. <strong>Buying costs $33K to $83K.</strong> That's 4-8x. And the build number assumes nothing goes wrong: no security incidents, no major API rewrites, no key engineer leaving.</p>
<p>AI made the <em>initial build</em> faster. It didn't change the maintenance math.</p>
<p>The hard parts aren't writing the first version:</p>
<ul>
<li><strong>Reliability under failure.</strong> Your incident tool must work when everything else is down. Most teams host theirs on the same infrastructure as their product. When production fails, the tool they need to coordinate the response fails with it.</li>
<li><strong>Policy surface creep.</strong> Within 12 months you'll need RBAC, audit logs, data retention, compliance exports. Nobody budgets for this.</li>
<li><strong>Ownership after the builder leaves.</strong> Median engineer tenure at a startup is around two years. Your custom incident bot will outlast its creator.</li>
</ul>
<blockquote>
<p>"You're not building a bot. You're adopting a forever-system."</p>
</blockquote>
<p><strong>When building makes sense:</strong> Unusual regulatory constraints, incident management is literally your product, or you have a dedicated engineer with explicit time allocation and a succession plan.</p>
<p>For everyone else, the math favors buying.</p>
<p><strong>Also considered:</strong> <a href="https://www.squadcast.com/" target="_blank" rel="noopener noreferrer">Squadcast</a> (mid-market pricing/feature balance), <a href="https://www.splunk.com/en_us/products/on-call.html" target="_blank" rel="noopener noreferrer">Splunk On-Call</a> (formerly VictorOps, best if you're already in Splunk Observability), and staying on PagerDuty itself for large enterprise setups. We didn't include these in the main six because this post prioritizes Slack-native coordination tools and simpler self-serve setups. If you're migrating from OpsGenie specifically, our <a href="/blog/opsgenie-migration-guide">OpsGenie migration guide</a> covers all of these in detail.</p>

<h2 id="how-to-pick">How to Pick</h2>
<p>Start with your team size and what hurts.</p>
<p><strong>Under 10 engineers:</strong> You probably don't need a dedicated tool. Structured Slack workflows + simple paging will carry you. If you buy, pick something with a free tier. Better Stack or <a href="/">Runframe's free plan</a>.</p>
<p><strong>10-80 engineers:</strong> You've outgrown scripts and spreadsheets. Enterprise tools will bury you in configuration. You need something that works in Slack, sets up in a day, and doesn't require a dedicated admin. Runframe, Rootly, or FireHydrant.</p>
<p><a href="/pricing">Start with Runframe's free plan</a>. Setup takes less than a day.</p>
<p><strong>80-200 engineers:</strong> You need real workflows. Automated escalation. Stakeholder notifications. Compliance-friendly postmortems. incident.io or Rootly at this scale. Runframe if you want to grow into something rather than scale down from something.</p>
<p><strong>200+ engineers:</strong> You're enterprise. PagerDuty is often the right call. Or incident.io. At this scale, you have the team to manage complexity.</p>
<p><strong>Four questions that matter more than feature lists:</strong></p>
<ol>
<li>Where does your team coordinate during incidents? If Slack, pick a Slack-native tool.</li>
<li>How many people need to be involved in incident setup? If more than one, your tool is too complex.</li>
<li>What's your budget per engineer per month? Be honest. Include add-ons.</li>
<li>How long can you afford for onboarding? If the answer is "a week," eliminate anything that takes longer.</li>
</ol>

<h2 id="migration-checklist">Migration Checklist</h2>
<p>Switching from PagerDuty (or any incident tool)? Here's what to cover:</p>
<ul>
<li> <strong>Audit your current setup.</strong> List all escalation policies, on-call schedules, integrations, and routing rules. Export before you start.</li>
<li> <strong>Pick 2-3 tools to trial.</strong> Test with real scenarios, not demos.</li>
<li> <strong>Migrate on-call schedules first.</strong> This is the hardest part. CSV exports rarely import cleanly. Budget time to rebuild manually.</li>
<li> <strong>Rewire integrations one at a time.</strong> Start with critical monitoring (Datadog, Prometheus, CloudWatch). Test alert routing end-to-end.</li>
<li> <strong>Run parallel for 1-2 weeks.</strong> Keep the old tool active while you validate the new one. Roll back if something breaks.</li>
<li> <strong>Train the team.</strong> Run a mock incident. 2 hours per engineer saves weeks of confusion.</li>
<li> <strong>Cut over and decommission.</strong> Route 100% of alerts, keep the old tool as read-only backup for one more week, then shut it down.</li>
</ul>
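<p>For the "test alert routing end-to-end" step, the simplest approach is to send a clearly labeled synthetic alert through each integration and confirm it reaches the right channel and pager. A sketch using only the Python standard library; the webhook URL and payload shape are placeholders, since every tool defines its own:</p>

```python
import json
from urllib import request

def build_test_alert(source: str) -> dict:
    """A synthetic alert, labeled so nobody pages the whole team over it."""
    return {
        "title": f"[TEST] routing check from {source}",
        "severity": "info",
        "source": source,
        "test": True,
    }

def send_alert(webhook_url: str, alert: dict) -> None:
    """POST the alert as JSON to the new tool's inbound webhook."""
    req = request.Request(
        webhook_url,
        data=json.dumps(alert).encode(),
        headers={"Content-Type": "application/json"},
    )
    request.urlopen(req)  # raises on non-2xx, so failures are loud

alert = build_test_alert("datadog-staging")
print(alert["title"])  # [TEST] routing check from datadog-staging
# send_alert("https://example.invalid/webhook", alert)  # wire up per tool
```

<p>Run one of these per monitoring source, per severity level, and confirm each lands where you expect before cutting over real traffic.</p>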
<p><strong>Typical timeline:</strong> 3-10 days for teams under 50 engineers. 2-6 weeks for larger orgs, depending on integrations and schedule complexity.</p>
<p>For a detailed migration plan with timelines, see our <a href="/blog/opsgenie-migration-guide">OpsGenie migration guide</a>. The process is similar regardless of which tool you're leaving.</p>

<h2 id="the-bottom-line">The Bottom Line</h2>
<p>PagerDuty built the incident management category and it's still a strong product for large enterprises with dedicated SRE teams.</p>
<p>But the market has more options now. Incident coordination moved to Slack. Pricing got more transparent. Simpler tools proved you don't need 200 features to run good incident response.</p>
<p><strong>If you're evaluating, three things to check:</strong></p>
<ul>
<li>Is your team paying for features it doesn't use?</li>
<li>Does your team coordinate in Slack but manage incidents in a separate UI?</li>
<li>Did setup take weeks instead of days?</li>
</ul>
<p>If the answer to any of these is yes, it's worth looking at what else is out there.</p>
<p>We built <a href="/">Runframe</a> because we kept hearing the same thing from engineering teams:</p>
<blockquote>
<p>"I just want the Heroku of incident management. Just make it work."</p>
</blockquote>
<p>That's what Runframe is built to be. <a href="/pricing">Try it free →</a></p>

<h2 id="common-questions">Common Questions</h2>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Is PagerDuty worth it for small teams?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    For teams under 50 engineers, PagerDuty may be more tool than you need. Runframe ($15/user/month with free tier), Better Stack (free tier), and Grafana Cloud IRM ($19/month base) are built for smaller teams and cost less.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What's the cheapest PagerDuty alternative?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Grafana Cloud IRM (free tier: 3 users), Better Stack (free tier), and Runframe (free plan) all have low-cost entry points. Most alternatives cost less than PagerDuty when you include add-on pricing.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Can I migrate from PagerDuty without downtime?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Yes. Run both tools in parallel. Keep PagerDuty active while setting up the new tool, migrate escalation policies, then cut over. Plan for 1-2 weeks of parallel operation. See our <a href="#migration-checklist">migration checklist</a>.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What about PagerDuty's AI features?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    PagerDuty has invested in AIOps for alert correlation and noise reduction. Works well at scale (100+ services). For smaller teams, the AI features may not justify the added cost. incident.io and Rootly are building comparable AI capabilities.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    PagerDuty vs incident.io?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    PagerDuty is stronger for large enterprises with complex service dependencies and dedicated SRE teams. incident.io is often a better fit for teams that want Slack-native incident management with modern workflows. incident.io's full stack (incidents + on-call) runs $25-45/user/month at list price.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    PagerDuty vs Rootly?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Rootly focuses on incident coordination with transparent, usage-based pricing. PagerDuty offers broader enterprise features. Rootly is often a better fit if coordination is your primary pain point and you want lower cost.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    PagerDuty vs Better Stack?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Better Stack bundles monitoring, incidents, and status pages in one product with a free tier. PagerDuty offers deeper incident management but requires separate monitoring. Better Stack is often a better fit for small teams that want one tool.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Is OpsGenie shutting down?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Yes. Atlassian ended new OpsGenie sales on June 4, 2025, and the full shutdown is scheduled for April 5, 2027. Start evaluating alternatives now. See our <a href="/blog/opsgenie-migration-guide">OpsGenie migration guide</a>.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What are the best OpsGenie alternatives in 2026?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Runframe, incident.io, Rootly, Grafana Cloud IRM, Better Stack, and FireHydrant are all viable destinations for teams migrating from OpsGenie. See our <a href="/blog/best-opsgenie-alternatives">OpsGenie alternatives comparison</a> for what changed in 2026, or the <a href="/blog/opsgenie-migration-guide">migration guide</a> for timelines and export instructions.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    How does Runframe compare to PagerDuty?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Runframe is built for startups and growing teams (10-200 engineers). It's Slack-native and includes on-call scheduling, incident coordination, AI postmortems, SLA tracking, RBAC, and audit logs at $15/user/month with no add-ons. PagerDuty offers more enterprise features at higher cost and complexity. The right choice depends on your team size and needs. <a href="/pricing">Try Runframe free</a>.
  </div>
</details>
]]></content:encoded>
      <pubDate>Thu, 05 Mar 2026 00:00:00 GMT</pubDate>
      <dc:creator>Niketa Sharma</dc:creator>
      <category><![CDATA[pagerduty-alternatives]]></category>
      <category><![CDATA[incident-management]]></category>
      <category><![CDATA[on-call]]></category>
      <category><![CDATA[sre]]></category>
      <category><![CDATA[devops]]></category>
      <category><![CDATA[engineering-leadership]]></category>
      <category><![CDATA[pagerduty]]></category>
      <category><![CDATA[incident-management-platform]]></category>
      <category><![CDATA[opsgenie-alternatives]]></category>
    </item>
    <item>
      <title><![CDATA[Incident Communication Templates: 8 Free Examples [Copy-Paste]]]></title>
      <link>https://runframe.io/blog/incident-stakeholder-communication-templates</link>
      <guid>https://runframe.io/blog/incident-stakeholder-communication-templates</guid>
      <description><![CDATA[During a SEV0, everyone wants answers at once.

Executives want a timeline and business impact.
Support wants a script to calm customers down.
Sales/CSMs want something they can forward to key account...]]></description>
      <content:encoded><![CDATA[<p>During a SEV0, everyone wants answers at once.</p>
<ul>
<li>Executives want a timeline and business impact.</li>
<li>Support wants a script to calm customers down.</li>
<li>Sales/CSMs want something they can forward to key accounts.</li>
<li>Someone on social asks "are you aware?"</li>
<li>The person fixing the database keeps getting interrupted.</li>
</ul>
<p>The technical fix might take 45 minutes. The communication mess can take 2 hours. <a href="/tools/incident-severity-matrix-generator">Build your matrix → Free Severity Matrix Generator</a></p>
<p>This guide gives you <strong>copy-paste templates</strong> and a simple operating rule: <strong>one owner, one source of truth, consistent cadence</strong>.</p>

<h2 id="the-only-framework-you-need">The only framework you need</h2>
<p>In incidents: <strong>status is the truth. Everything else points to it.</strong></p>
<ol>
<li><strong>One owner</strong>: the Incident Commander (IC) owns outbound updates.</li>
<li><strong>One source of truth</strong>: pick one place where updates live (customer email thread, status page, or a single internal update doc). Everything else should point to it.</li>
<li><strong>One cadence</strong>: predictable updates beat "big updates when we feel like it."</li>
<li><strong>Impact over internals</strong>: describe symptoms and scope, not system trivia.</li>
<li><strong>Honest uncertainty</strong>: "unknown at this time" beats fake ETAs.</li>
</ol>

<h2 id="frequently-asked-questions">Frequently asked questions</h2>
<h3 id="who-should-send-incident-updates">Who should send incident updates?</h3>
<p>The Incident Commander. The person debugging should not also be writing customer updates. For more on the IC role, see <a href="/blog/incident-response-playbook">our incident response playbook</a>.</p>
<h3 id="how-often-should-we-update-during-a-sev0">How often should we update during a SEV0?</h3>
<p>Every 15 minutes on your canonical source (status page or a customer email thread); if you don't have either, use a single internal update doc. Send executives an update every 15–30 minutes as well. Always include the next update time.</p>
<h3 id="what-if-we-don39t-know-the-eta">What if we don't know the ETA?</h3>
<p>Say "unknown at this time" and commit to the next update time. Fake ETAs destroy trust.</p>

<h2 id="template-index-jump-to-what-you-need">Template index (jump to what you need)</h2>
<ul>
<li><a href="#1-status-page-incident-communication-templates">Status page incident communication templates</a></li>
<li><a href="#2-customer-outage-email-templates-only-when-needed">Customer outage email templates</a></li>
<li><a href="#3-executive-incident-update-templates-forwardable">Executive incident update templates</a></li>
<li><a href="#4-support-incident-communication-kit-paste-into-slack-pin">Support incident communication kit</a></li>
<li><a href="#5-salescsm-key-account-note-forwardable-low-drama">Sales / CSM forwardable note</a></li>
<li><a href="#6-internal-engineering-update-context-without-noise">Internal engineering update</a></li>
<li><a href="#7-social-incident-response-templates-xlinkedin">Social incident response templates</a></li>
<li><a href="#8-post-incident-customer-summary-short-trust-building">Post-incident customer summary</a></li>
</ul>

<h2 id="who-needs-updates-and-what-they-actually-want">Who needs updates and what they actually want</h2>
<p><strong>Customers</strong></p>
<ul>
<li>Want: are we impacted, what changed, what's the workaround, when's next update.</li>
<li>Don't want: your root cause guesses.</li>
</ul>
<p><strong>Executives</strong></p>
<ul>
<li>Want: customer impact, revenue risk (or "unknown"), timeline, mitigations, next update.</li>
</ul>
<p><strong>Support</strong></p>
<ul>
<li>Want: a script + how to handle tickets + what not to promise.</li>
</ul>
<p><strong>Sales/CSMs</strong></p>
<ul>
<li>Want: a forwardable note for key accounts + status link + what to say on renewals.</li>
</ul>
<p><strong>Engineering</strong></p>
<ul>
<li>Want: what's broken, who owns it, what's next, where to coordinate.</li>
</ul>
<p><strong>Public/social</strong></p>
<ul>
<li>Want: acknowledgment + status link. Nothing else.</li>
</ul>

<h2 id="cadence-how-often-to-update">Cadence: how often to update</h2>
<p>If you only remember one line: <strong>set the next update time in every message</strong>.</p>
<p>Recommended cadence (adjust for your business, but keep it consistent):</p>
<table><caption class="sr-only">Severity | Customer/status page | Exec | Support | Social</caption>
<thead>
<tr>
<th>Severity</th>
<th>Customer/status page</th>
<th>Exec</th>
<th>Support</th>
<th>Social</th>
</tr>
</thead>
<tbody><tr>
<td>SEV0 (outage)</td>
<td>every 15 min</td>
<td>every 15–30 min</td>
<td>push when status changes + at least every 30 min</td>
<td>acknowledge once, then link</td>
</tr>
<tr>
<td>SEV1 (degraded)</td>
<td>every 30–60 min</td>
<td>every 30–60 min</td>
<td>push when status changes</td>
<td>usually link only</td>
</tr>
<tr>
<td>SEV2 (minor)</td>
<td>every 60–120 min</td>
<td>on request</td>
<td>push when status changes</td>
<td>none</td>
</tr>
</tbody></table>
<p><strong>Cadence (plain text):</strong></p>
<ul>
<li><strong>SEV0 (outage):</strong> customer/canonical every <strong>15 min</strong> · exec every <strong>15–30 min</strong> · support on change + at least every <strong>30 min</strong> · social: acknowledge once, then link</li>
<li><strong>SEV1 (degraded):</strong> customer/canonical every <strong>30–60 min</strong> · exec every <strong>30–60 min</strong> · support on change · social: usually link only</li>
<li><strong>SEV2 (minor):</strong> customer/canonical every <strong>60–120 min</strong> · exec on request · support on change · social: none</li>
</ul>
<p>Middle of the night does not change expectations. The IC might change; the cadence should not.</p>
<p>For more on severity levels, see <a href="/blog/incident-severity-levels">our SEV0-SEV4 framework</a>.</p>
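<p>The cadence rule is mechanical enough to automate. As a rough sketch (the severity names and intervals follow the cadence table above; everything else here is illustrative, not from any particular tool), a reminder script or Slack bot could compute when the next canonical-source update is due:</p>
<pre><code>from datetime import datetime, timedelta, timezone

# Customer/canonical-source cadence from the table above, in minutes.
# SEV1 and SEV2 use the lower bound of their recommended ranges.
CUSTOMER_CADENCE_MIN = {"SEV0": 15, "SEV1": 30, "SEV2": 60}

def next_update_time(severity, last_update):
    """Return when the next canonical-source update is due."""
    return last_update + timedelta(minutes=CUSTOMER_CADENCE_MIN[severity])

last = datetime(2026, 3, 5, 20, 0, tzinfo=timezone.utc)
print(next_update_time("SEV0", last).strftime("Next update: %H:%M UTC"))
# Next update: 20:15 UTC
</code></pre>
<p>A bot that fires when this time passes gives you the nudge-the-IC pattern: updates go out on schedule instead of relying on someone's memory mid-incident.</p>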

<h2 id="the-master-message-map">The master message map</h2>
<p>To avoid fragmented comms, decide where each message type lives:</p>
<p><strong>First: pick your canonical source</strong></p>
<p>The "status page" in this article means whatever your canonical source is:</p>
<ul>
<li><strong>Status page</strong> (status.yourcompany.com): most common, public</li>
<li><strong>Social</strong>: pointer only (rarely canonical — use it to link to your status page or email)</li>
<li><strong>Customer email</strong>: B2B companies often skip public status pages entirely</li>
<li><strong>Internal only</strong>: early-stage or regulated industries</li>
</ul>
<p>The rule: <strong>one source, everything points to it.</strong> Don't let Slack say one thing and email say another.</p>
<p><strong>Message destinations:</strong></p>
<ul>
<li><strong>Canonical source</strong>: status page, customer email, or a single internal update doc — your timeline lives here.</li>
<li><strong>Internal Slack channel</strong>: operational coordination + internal updates.</li>
<li><strong>Support channel</strong>: the "support kit" pinned and updated.</li>
<li><strong>Exec email/Slack</strong>: business impact + timeline + next update.</li>
<li><strong>Social (if not your canonical source)</strong>: acknowledgment + link.</li>
</ul>
<p>Rule: <strong>if your canonical source says "Investigating," no other channel is allowed to say "Resolved in 10 minutes."</strong></p>

<h2 id="template-quick-picker">Template quick picker</h2>
<p>Don't search during a SEV0. Find what you need instantly.</p>
<table><caption class="sr-only">Scenario | Use template</caption>
<thead>
<tr>
<th>Scenario</th>
<th>Use template</th>
</tr>
</thead>
<tbody><tr>
<td>SEV0 declared, first 5 minutes</td>
<td>Status page: Initial</td>
</tr>
<tr>
<td>SEV0, 15 min later, no fix yet</td>
<td>Status page: Update (identified)</td>
</tr>
<tr>
<td>SEV0, fix implemented, monitoring</td>
<td>Status page: Update (mitigation in progress)</td>
</tr>
<tr>
<td>SEV0, resolved</td>
<td>Status page: Resolved</td>
</tr>
<tr>
<td>SEV0 lasting &gt; 30 min, enterprise customers</td>
<td>Customer email: Initial notification</td>
</tr>
<tr>
<td>Executive asks "what's the impact?"</td>
<td>Executive update: Initial</td>
</tr>
<tr>
<td>Support getting slammed with tickets</td>
<td>Support kit: Initial</td>
</tr>
<tr>
<td>Key account at renewal risk, incident active</td>
<td>Sales/CSM note</td>
</tr>
<tr>
<td>Internal engineers asking "what's broken?"</td>
<td>Internal engineering update</td>
</tr>
<tr>
<td>Social media asking "are you aware?"</td>
<td>Social: Acknowledgment</td>
</tr>
</tbody></table>
<p><strong>Template quick picker (plain text):</strong></p>
<ol>
<li><strong>SEV0 declared (first 5 minutes)</strong> → Status page: Initial</li>
<li><strong>SEV0, 15 min later, no fix yet</strong> → Status page: Update (identified)</li>
<li><strong>SEV0, fix implemented, monitoring</strong> → Status page: Update (mitigation in progress)</li>
<li><strong>SEV0, resolved</strong> → Status page: Resolved</li>
<li><strong>SEV0 &gt; 30 min (enterprise customers)</strong> → Customer email: Initial notification</li>
<li><strong>Exec asks "what's the impact?"</strong> → Executive update: Initial</li>
<li><strong>Support getting slammed</strong> → Support kit: Initial</li>
<li><strong>Key account at renewal risk</strong> → Sales/CSM note</li>
<li><strong>Engineers asking "what's broken?"</strong> → Internal engineering update</li>
<li><strong>Social asks "are you aware?"</strong> → Social: Acknowledgment</li>
</ol>

<h2 id="one-filled-example-sev0-checkout-outage">One filled example (SEV0 checkout outage)</h2>
<p>Scenario: Checkout is failing with "Unable to process payment" for most customers.</p>
<p><strong>Status page (initial):</strong></p>
<pre><code>We're experiencing an outage affecting checkout. Customers may see "Unable to process payment" errors. We're investigating.

Next update: 20:15 UTC
</code></pre>
<p><strong>Exec update (initial):</strong></p>
<pre><code>We're investigating a SEV0 incident affecting checkout.

Impact:
- Customer checkout failing for most traffic (scope still being confirmed)
- Revenue impact: unknown at this time

Timeline:
- Started: 20:00 UTC
- Status: Investigating
- ETA: unknown at this time

Next update: 20:30 UTC
</code></pre>
<p><strong>Support kit (initial):</strong></p>
<pre><code>What to tell customers:
"We're aware of an outage affecting checkout. We're investigating and posting updates here: [status link]. Next update by 20:15 UTC."

Do NOT promise:
- Resolution times
- Credits
- Root cause guesses
</code></pre>

<h2 id="good-vs-bad-why-wording-matters">Good vs bad: why wording matters</h2>
<p>Most incident communication fails because it talks about internals instead of impact.</p>
<p><strong>❌ Bad update:</strong></p>
<blockquote>
<p>"We're experiencing database replication lag on shard 3. The GC pause caused a cascading failure in the payment microservice. We're restarting the pods and investigating the root cause. Our SRE team is looking into query optimization."</p>
</blockquote>
<p><strong>Why it's bad:</strong></p>
<ul>
<li>Customers don't know what "shard 3" or "GC pause" means</li>
<li>"Microservice" and "pods" are internal jargon</li>
<li>No clear next update time</li>
<li>Doesn't say whether they can use your product</li>
</ul>
<p><strong>✅ Good update (using this template):</strong></p>
<blockquote>
<p>"We're experiencing an outage affecting checkout. Customers may see 'Unable to process payment' errors. We're investigating.</p>
<p>Next update: 3:15 PM ET"</p>
</blockquote>
<p><strong>Why it works:</strong></p>
<ul>
<li>Clear impact: "checkout" is down, "payment" errors</li>
<li>Specific symptom: customers know what to expect</li>
<li>Next update time: sets expectations</li>
<li>No technical jargon: describes what customers see, not what's broken internally</li>
</ul>
<p><strong>The pattern:</strong> Describe symptoms, not systems. Customers care about "can I check out," not "your database shard."</p>

<h2 id="copy-paste-templates">Copy-paste templates</h2>
<h2 id="1-status-page-incident-communication-templates">1) Status page incident communication templates</h2>
<h3 id="sev0-complete-outage">SEV0: complete outage</h3>
<p><strong>Initial (send within 5 minutes of declaring incident):</strong></p>
<pre><code>We're experiencing an outage affecting [service].
Customers may see [symptom]. We're investigating.

Next update: [HH:MM TZ] (in 15 minutes)
Status: Investigating
</code></pre>
<p><strong>Update (identified, working on fix):</strong></p>
<pre><code>We've identified the issue and are working on a fix.
Customers may continue to see [symptom].

Next update: [HH:MM TZ]
</code></pre>
<p><strong>Update (mitigation in progress / partial recovery):</strong></p>
<pre><code>We've applied a mitigation and are monitoring recovery.
Some customers may still see [symptom] while systems stabilize.

Next update: [HH:MM TZ]
</code></pre>
<p><strong>Resolved:</strong></p>
<pre><code>This incident is resolved. [Service] is operating normally.

We'll share a brief post-incident summary within [24–48 hours].
</code></pre>
<h3 id="sev1-degraded-performance">SEV1: degraded performance</h3>
<pre><code>We're seeing degraded performance affecting [service].
Some customers may see [symptom]. We're investigating.

Next update: [HH:MM TZ] (in 30–60 minutes)
</code></pre>
<h3 id="sev2-minor-impact-limited-scope">SEV2: minor impact / limited scope</h3>
<pre><code>Some customers may be experiencing [symptom].
This affects [region / tier / % of users]. We're investigating.
</code></pre>
<h3 id="status-page-quotwhat-not-to-doquot">Status page "what not to do"</h3>
<ul>
<li>Don't post internal jargon ("shards," "rebalance," "GC pause").</li>
<li>Don't promise resolution times you can't keep. Promise the next update time instead.</li>
<li>Don't write 200-word paragraphs. Keep it under ~100 words.</li>
</ul>

<h2 id="2-customer-outage-email-templates-only-when-needed">2) Customer outage email templates (only when needed)</h2>
<p>Use customer email when:</p>
<ul>
<li>SEV0 lasts &gt; 30–60 minutes, or</li>
<li>regulated / high-trust domain requires it, or</li>
<li>you have contractual comms obligations.</li>
</ul>
<h3 id="customer-email-initial-notification">Customer email: initial notification</h3>
<p><strong>Subject:</strong> Service disruption affecting [Product/Feature]</p>
<pre><code>We're currently experiencing an issue impacting [Product/Feature].

What you may see:

[Symptom 1]

[Symptom 2] (optional)

Current status: Investigating
Latest updates: [Status page URL]

Next update by: [HH:MM TZ]

We're sorry for the disruption.
[Company] Team
</code></pre>
<h3 id="customer-email-recovery-in-progress">Customer email: recovery in progress</h3>
<p><strong>Subject:</strong> Update: [Product/Feature] disruption (recovery in progress)</p>
<pre><code>We've identified the cause and are implementing a fix.

Current impact:

[Symptom] (if changed, say what changed)

Latest updates: [Status page URL]
Next update by: [HH:MM TZ]

[Company] Team
</code></pre>
<h3 id="customer-email-resolution-next-steps">Customer email: resolution + next steps</h3>
<p><strong>Subject:</strong> Resolved: [Product/Feature] disruption</p>
<pre><code>The issue affecting [Product/Feature] is resolved.

Duration: [X minutes/hours]
Impact: [brief, customer-facing impact]

We'll publish a short post-incident summary within [24–48 hours] here:
[Link to summary or status page incident post]

[Company] Team
</code></pre>

<h2 id="3-executive-incident-update-templates-forwardable">3) Executive incident update templates (forwardable)</h2>
<p>Executives want business impact + timeline + next update.</p>
<h3 id="exec-update-initial">Exec update: initial</h3>
<p><strong>Subject:</strong> Incident Update: [Service] — SEV0 — [HH:MM TZ]</p>
<pre><code>We're investigating a SEV0 incident affecting [service].

Impact:

Customers affected: [X / % / unknown]

Customer symptoms: [checkout failing / login errors / etc.]

Revenue/contract risk: [known estimate / unknown at this time]

Timeline:

Started: [HH:MM TZ]

Current status: Investigating

ETA: [honest estimate or "unknown at this time"]

Next update: [HH:MM TZ] (in 15–30 minutes) or sooner if status changes.

[Name], Incident Commander
</code></pre>
<h3 id="exec-update-follow-up-delta-based">Exec update: follow-up (delta-based)</h3>
<p><strong>Subject:</strong> Update: [Service] incident — [Status]</p>
<pre><code>What changed since last update:

[1–3 bullets]

Current status: [Investigating / Fix in progress / Monitoring / Resolved]
Revised ETA: [if known / unchanged / unknown]

Next update: [HH:MM TZ]
</code></pre>
<h3 id="exec-summary-post-incident-within-24-hours">Exec summary: post-incident (within 24 hours)</h3>
<p><strong>Subject:</strong> Post-Incident Summary: [Service] — [Date]</p>
<pre><code>The incident affecting [service] is resolved.

What happened (high level):

[1–2 sentences]

Business impact:

Duration: [X]

Customers affected: [X / %]

Revenue impact: [known / unknown]

Root cause (high level):

[1–2 sentences]

What we're doing to prevent recurrence:

[Action + owner + due date]

[Action + owner + due date]

[Action + owner + due date]

Postmortem: [link] (due [date])
</code></pre>

<h2 id="4-support-quotincident-communication-kitquot-paste-into-slack-pin">4) Support "incident communication kit" (paste into Slack + pin)</h2>
<p>Support needs a script and clear boundaries.</p>
<h3 id="support-kit-initial">Support kit: initial</h3>
<pre><code>🚨 INCIDENT COMMUNICATION KIT

Incident: [Service] is [down / degraded]
Severity: SEV0/SEV1/SEV2

Customer impact:

[What customers are experiencing]

Status page:

[URL]

What to tell customers (copy/paste):
"We're experiencing an issue affecting [service]. Our team is investigating.
We're posting updates here: [URL]. Next update by [HH:MM TZ]."

Do NOT promise:

Resolution times

Credits/compensation

Root cause guesses

ETA:

[honest estimate / unknown at this time]

Next support update:

[HH:MM TZ]

Owner:

[Incident Commander] in #[incident-channel]
</code></pre>
<h3 id="support-kit-update-only-when-context-changes">Support kit: update (only when context changes)</h3>
<pre><code>🚨 INCIDENT UPDATE — [HH:MM TZ]

What changed:

[1–3 bullets]

Updated customer script:

[only if needed; otherwise "same as above"]

Next support update:

[HH:MM TZ]
</code></pre>
<p>For more on incident coordination, see <a href="/blog/engineering-productivity-incident-management">our guide on reducing context switching during incidents</a>.</p>

<h2 id="5-salescsm-quotkey-account-notequot-forwardable-low-drama">5) Sales/CSM "key account note" (forwardable, low drama)</h2>
<p>Use this when:</p>
<ul>
<li>customers are enterprise/high-touch, or</li>
<li>you have renewal risk, or</li>
<li>accounts are likely to escalate.</li>
</ul>
<p><strong>Subject:</strong> Update: [Service] disruption — status + next update</p>
<pre><code>Sharing a quick update on an incident affecting [service].

Current customer impact:

[One sentence]

Latest updates:

[Status page URL]

Next update by:

[HH:MM TZ]

If your customer asks for details:

Keep it to impact + status link. Avoid root-cause speculation.
</code></pre>

<h2 id="6-internal-engineering-update-context-without-noise">6) Internal engineering update (context without noise)</h2>
<p>This is for broad awareness, not incident-room debugging.</p>
<pre><code>FYI: SEV0/SEV1 incident in progress for [service].

Customer impact:

[One sentence]

Incident channel:

#[channel]

IC:

[Name]

Status page:

[URL]

Next update:

[HH:MM TZ]
</code></pre>
<p>For more on incident roles, see <a href="/blog/incident-response-playbook">our incident response playbook with roles and escalation rules</a>.</p>

<h2 id="7-social-incident-response-templates-xlinkedin">7) Social incident response templates (X/LinkedIn)</h2>
<p>Goal: acknowledge + link to status page. Nothing else.</p>
<p><strong>Acknowledgment (within 5–10 minutes of public awareness):</strong></p>
<pre><code>We're aware of an issue affecting [service] and are investigating.
Updates: [canonical source URL]
</code></pre>
<p><strong>If issue persists &gt; 1 hour:</strong></p>
<pre><code>Still working on the [service] issue. Latest updates:
[canonical source URL]
</code></pre>
<p><strong>After resolution:</strong></p>
<pre><code>The [service] issue is resolved. Thanks for your patience.
We'll share a post-incident summary within [24–48 hours].
</code></pre>

<h2 id="8-post-incident-customer-summary-short-trust-building">8) Post-incident customer summary (short, trust-building)</h2>
<p>This is not the engineering postmortem. It's a customer-facing closure.</p>
<pre><code>Post-incident summary (customer-facing)

Incident: [1 sentence]

Duration: [X]

Customer impact: [1 sentence]

What we changed: [1–2 bullets]

How we'll prevent recurrence: [1–3 bullets]
</code></pre>
<p>For postmortem templates, see <a href="/blog/post-incident-review-template">our post-incident review templates with 3 ready-to-use formats</a>.</p>

<h2 id="common-communication-failures-and-how-to-prevent-them">Common communication failures (and how to prevent them)</h2>
<h3 id="1-the-debugger-is-also-the-communicator">1) The debugger is also the communicator</h3>
<p>Fix: separate roles. IC owns comms; engineers fix.</p>
<h3 id="2-quotwe39ll-be-back-in-10-minutesquot">2) "We'll be back in 10 minutes"</h3>
<p>Fix: next update time, not resolution time.</p>
<h3 id="3-explaining-internals-instead-of-impact">3) Explaining internals instead of impact</h3>
<p>Fix: describe symptoms, scope, workarounds.</p>
<h3 id="4-fragmented-messaging">4) Fragmented messaging</h3>
<p>Fix: pick one canonical source (status page, customer email, or a single internal update doc) and make everything point to it.</p>
<h3 id="5-radio-silence-after-resolution">5) Radio silence after resolution</h3>
<p>Fix: close the loop with a short summary within 24 hours.</p>

<h2 id="how-runframe-bakes-this-into-your-slack-incident-workflow">How Runframe bakes this into your Slack incident workflow</h2>
<p>Most teams don't fail at templates—they fail at consistency. The hard part is enforcing: one owner, one canonical source, and a predictable cadence when everyone is stressed.</p>
<p>Runframe operationalizes the exact rules above inside Slack:</p>
<ul>
<li><strong>Role assignment:</strong> the Incident Commander owns outbound updates. The scribe and responders stay focused on the fix.</li>
<li><strong>Canonical-source discipline:</strong> Runframe treats your chosen source (status page or customer email) as the timeline and makes every other update point to it.</li>
<li><strong>Cadence prompts:</strong> if a SEV0 is active and the next update time passes, Runframe nudges the IC to post the next update (no more "we forgot for 45 minutes" gaps).</li>
<li><strong>Channel-specific templates:</strong> the IC can post a customer-safe update, an exec update, or a support kit update without rewriting from scratch.</li>
</ul>
<p>Two concrete examples:</p>
<ol>
<li><p><strong>SEV0 declared:</strong> IC posts "Status page: Initial" (copy-paste), then instantly posts the support kit template in #support with the status link.</p>
</li>
<li><p><strong>Update time reached:</strong> Runframe prompts the IC with the exact "Update (identified)" block so the next update goes out on time, with no fake ETA.</p>
</li>
</ol>

<h2 id="the-bottom-line">The bottom line</h2>
<p>Incident communication is a system, not a talent.</p>
<ul>
<li>Assign one owner (IC).</li>
<li>Keep one source of truth (your canonical source).</li>
<li>Use predictable cadence.</li>
<li>Talk in impact, not internals.</li>
<li>Say "unknown" when it's unknown.</li>
</ul>
<p>Templates make this easy. They also make you look calm under pressure.</p>

<p><strong>Read more:</strong></p>
<ul>
<li><a href="/blog/incident-response-playbook">Incident Response Playbook: Roles, Scripts &amp; Templates</a></li>
<li><a href="/blog/engineering-productivity-incident-management">Reducing Context Switching During Incidents</a></li>
<li><a href="/blog/post-incident-review-template">Post-Incident Review Templates: 3 Ready-to-Use Formats</a></li>
<li><a href="/blog/on-call-rotation-guide">On-Call Rotation: Handoffs, Escalation, and Schedules</a></li>
<li><a href="/blog/incident-severity-levels">Incident Severity Levels: The Framework That Actually Works</a></li>
</ul>


]]></content:encoded>
      <pubDate>Sun, 01 Feb 2026 00:00:00 GMT</pubDate>
      <dc:creator>Niketa Sharma</dc:creator>
      <category><![CDATA[incident-management]]></category>
      <category><![CDATA[incident-response]]></category>
      <category><![CDATA[stakeholder-communication]]></category>
      <category><![CDATA[customer-communication]]></category>
      <category><![CDATA[incident-templates]]></category>
      <category><![CDATA[status-page]]></category>
      <category><![CDATA[crisis-communication]]></category>
    </item>
    <item>
      <title><![CDATA[SLA vs. SLO vs. SLI: What Actually Matters (With Templates)]]></title>
      <link>https://runframe.io/blog/sla-vs-slo-vs-sli</link>
      <guid>https://runframe.io/blog/sla-vs-slo-vs-sli</guid>
      <description><![CDATA[You've seen the sales deck: "99.9% uptime guaranteed."
Ask the engineering team what that means. What happens when you miss it? Who decides what counts as downtime?
Often, nobody can answer quickly.
S...]]></description>
      <content:encoded><![CDATA[<p>You've seen the sales deck: "99.9% uptime guaranteed."</p>
<p>Ask the engineering team what that means. What happens when you miss it? Who decides what counts as downtime?</p>
<p>Often, nobody can answer quickly.</p>
<p>SLA, SLO, and SLI get used interchangeably. Teams set arbitrary targets ("let's do 99.9% because everyone else does"), then wonder why customers are angry when "nothing technically broke."</p>
<p>These aren't synonyms. They serve completely different purposes.</p>
<p>Here's what each one actually means and how to use them without creating busywork.</p>

<h2 id="what-you39ll-learn">What You'll Learn</h2>
<ul>
<li>What SLI, SLO, and SLA actually mean (and why the order matters)</li>
<li>How to pick SLIs that customers care about (not just what's easy to measure)</li>
<li>How to set realistic SLO targets (not copy-paste 99.9%)</li>
<li>Error budgets: the framework that stops "is this urgent?" arguments</li>
<li>Copy-paste SLO template (30-minute setup)</li>
<li>Common mistakes and how to avoid them</li>
</ul>

<h2 id="sli-what-you-measure">SLI: What You Measure</h2>
<p>Service Level Indicator. The actual metric you track.</p>
<p>SLI is the measurement. SLO is the target. SLA is the promise.</p>
<p><strong>Good SLIs:</strong> Error rate, latency (p95, p99), availability. Things customers notice.</p>
<p><strong>Bad SLIs:</strong> CPU utilization, memory usage, disk space. Things ops teams notice but users don't.</p>
<p>The trap: picking SLIs because they're easy to measure, not because they matter.</p>
<p>Track CPU as your SLI and you'll spend months optimizing it. Meanwhile, API latency spikes to 5 seconds and customers can't log in. Your dashboard looks perfect. Customers are furious.</p>
<p><strong>The rule:</strong> If a user wouldn't notice it breaking, it's not an SLI. It's just a metric.</p>
<h3 id="common-slis-by-service-type">Common SLIs by Service Type</h3>
<table><caption class="sr-only">Service Type | Good SLI | Why It Matters</caption>
<thead>
<tr>
<th>Service Type</th>
<th>Good SLI</th>
<th>Why It Matters</th>
</tr>
</thead>
<tbody><tr>
<td>API</td>
<td>Success rate (2xx/total requests)</td>
<td>Users see errors directly</td>
</tr>
<tr>
<td>API</td>
<td>Latency (p95 &lt; 500ms)</td>
<td>Slow = broken for users</td>
</tr>
<tr>
<td>Database</td>
<td>Query success rate</td>
<td>Failed queries = broken features</td>
</tr>
<tr>
<td>Frontend</td>
<td>Time to interactive</td>
<td>Users abandon slow pages</td>
</tr>
<tr>
<td>Background jobs</td>
<td>Processing time per job</td>
<td>Delayed jobs = broken workflows</td>
</tr>
</tbody></table>
<p>Pick 1-2 SLIs per service. More than that and you're tracking everything, optimizing nothing.</p>
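<p>To make those SLI definitions concrete, here's a minimal sketch of how the two API SLIs could be computed from request logs. The <code>(status, latency_ms)</code> record shape and the nearest-rank p95 method are illustrative assumptions, not a prescribed format:</p>

```python
# Sketch: computing the two API SLIs from a sample of request records.
# The (status, latency_ms) tuple format is hypothetical - adapt to your logs.
import math

def success_rate(requests):
    """Fraction of requests returning a 2xx status."""
    ok = sum(1 for status, _ in requests if 200 <= status < 300)
    return ok / len(requests)

def p95_latency(requests):
    """95th-percentile latency in milliseconds (nearest-rank method)."""
    latencies = sorted(ms for _, ms in requests)
    rank = math.ceil(0.95 * len(latencies)) - 1
    return latencies[rank]

requests = [(200, 120), (200, 80), (503, 450), (200, 95), (200, 2100)]
print(f"success rate: {success_rate(requests):.1%}")  # 80.0%
print(f"p95 latency:  {p95_latency(requests)} ms")    # 2100 ms
```

<p>Note what the sample shows: a dashboard averaging latency would look fine here, but the p95 catches the 2.1-second outlier your slowest users actually feel.</p>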

<h2 id="slo-your-internal-target">SLO: Your Internal Target</h2>
<p>Service Level Objective. The number you're aiming for.</p>
<p>SLOs are internal targets. SLAs are external promises.</p>
<p><strong>Example:</strong> "99.5% of API requests succeed within 500ms."</p>
<ul>
<li>SLI = request success rate + latency</li>
<li>SLO = 99.5% threshold</li>
</ul>
<p>SLOs are <strong>internal</strong>. You don't publish them to customers. They're how engineering defines "good enough" and aligns with <a href="/blog/incident-response-playbook">incident response playbooks</a>.</p>
<h3 id="how-to-pick-an-slo-don39t-copy-paste-999">How to Pick an SLO (Don't Copy-Paste 99.9%)</h3>
<p><strong>Step 1: Look at your last 30 days</strong></p>
<p>What are you actually delivering right now?</p>
<p>If you're at 99.3%, don't set a target of 99.9%. You'll miss it immediately and the number becomes meaningless.</p>
<p><strong>Step 2: Set the target slightly below current reality</strong></p>
<p>Give yourself room for bad days.</p>
<ul>
<li>Current performance: 99.7%</li>
<li>Target SLO: 99.5%</li>
<li>Buffer: 0.2% for unexpected issues</li>
</ul>
<p><strong>Step 3: Validate it maps to user experience</strong></p>
<p>Ask: "If we hit 99.5%, will customers be happy?"</p>
<p>If the answer is no, your SLI is wrong (not your target).</p>
<h3 id="monthly-vs-weekly-slos">Monthly vs Weekly SLOs</h3>
<p>Most teams use <strong>monthly SLOs</strong> because:</p>
<ul>
<li>SLAs (contracts) are typically monthly</li>
<li>Industry standard for reporting</li>
<li>Easier to absorb bad days</li>
</ul>
<p>But track <strong>weekly burn rate</strong> to avoid surprises:</p>
<ul>
<li>Monthly SLO: 99.5% = 216 minutes allowed downtime</li>
<li>Weekly burn rate: 216 ÷ 4.33 ≈ 50 minutes/week</li>
<li>If you burn 200 minutes in week 1, you're in trouble</li>
</ul>
<p><strong>Policy example:</strong></p>
<ul>
<li>Track monthly SLO (99.5%)</li>
<li>Review weekly burn rate</li>
<li>Trigger escalation at 50% of monthly budget burned</li>
</ul>
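<p>The burn-rate policy above fits in a few lines. This is an illustrative calculation, not a monitoring tool; it assumes a 30-day month (43,200 minutes) and the 4.33 weeks/month divisor used throughout this post:</p>

```python
# Sketch of the weekly burn-rate policy. Assumes a 30-day month
# (43,200 minutes) and 4.33 weeks per month.

def monthly_budget_minutes(slo_pct, month_minutes=43_200):
    """Allowed downtime per month for a given SLO percentage."""
    return (100 - slo_pct) / 100 * month_minutes

def weekly_burn_allowance(slo_pct):
    """Roughly how many minutes you can afford to burn per week."""
    return monthly_budget_minutes(slo_pct) / 4.33

def should_escalate(minutes_burned, slo_pct, threshold=0.5):
    """Escalate once 50% of the monthly budget is gone."""
    return minutes_burned >= threshold * monthly_budget_minutes(slo_pct)

print(round(monthly_budget_minutes(99.5)))  # 216
print(round(weekly_burn_allowance(99.5)))   # 50
print(should_escalate(200, 99.5))           # True: 200 of 216 gone in week 1
```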
<h3 id="the-cost-of-nines">The Cost of Nines</h3>
<p>Each additional "9" typically costs an order of magnitude more effort and money, depending on architecture and organizational maturity.</p>
<table><caption class="sr-only">Uptime Target | Downtime/Year | Downtime/Month | What It Takes</caption>
<thead>
<tr>
<th>Uptime Target</th>
<th>Downtime/Year</th>
<th>Downtime/Month</th>
<th>What It Takes</th>
</tr>
</thead>
<tbody><tr>
<td>99%</td>
<td>3.65 days</td>
<td>~7.2 hours</td>
<td>Basic monitoring, manual responses</td>
</tr>
<tr>
<td>99.5%</td>
<td>1.83 days</td>
<td>~3.6 hours</td>
<td>Automated alerts, on-call rotation</td>
</tr>
<tr>
<td>99.9%</td>
<td>8.77 hours</td>
<td>~43 minutes</td>
<td>Redundancy, automated failover</td>
</tr>
<tr>
<td>99.99%</td>
<td>52 minutes</td>
<td>~4 minutes</td>
<td>Multi-region, chaos engineering</td>
</tr>
</tbody></table>
<p>Promise 99.99% to win a deal and you might spend $50k/month on infrastructure for a $5k/month customer.</p>
<p>Sales shouldn't set SLOs without engineering sign-off.</p>

<h2 id="sla-your-external-promise">SLA: Your External Promise</h2>
<p>Service Level Agreement. The contract with consequences.</p>
<p>SLAs are <strong>external</strong>. They define what happens when you miss your target.</p>
<p><strong>Example:</strong> "We commit to 99.5% monthly uptime. If we fall below, you get a 10% service credit."</p>
<h3 id="who-needs-an-sla">Who Needs an SLA?</h3>
<p><strong>Yes:</strong></p>
<ul>
<li>B2B selling to enterprises</li>
<li>Contracts with procurement teams</li>
<li>Customers who require guaranteed uptime</li>
</ul>
<p><strong>No:</strong></p>
<ul>
<li>Early-stage startups (under 50 customers)</li>
<li>Internal tools</li>
<li>Self-serve products with monthly billing</li>
</ul>
<p>A 20-person startup calculating SLA credits for $50/month customers is creating accounting busywork without meaningful upside.</p>
<h3 id="smart-buffer-internal-slo-gt-external-sla">Smart Buffer: Internal SLO &gt; External SLA</h3>
<p>Don't promise externally what you barely deliver internally.</p>
<p><strong>Example setup:</strong></p>
<ul>
<li>Internal SLO: 99.7% (what engineering targets)</li>
<li>External SLA: 99.5% (what customers get promised)</li>
<li>Buffer: 0.2% for unexpected issues</li>
</ul>
<p>This gives you room to have a bad week without breaching customer contracts.</p>

<h2 id="error-budget-what-makes-this-actually-useful">Error Budget: What Makes This Actually Useful</h2>
<p>The error budget is how teams decide whether to ship features or pay down reliability debt.</p>
<p>SLOs without error budgets are just numbers on a dashboard.</p>
<p>Error budgets turn SLOs into a <a href="/blog/how-to-reduce-mttr">prioritization framework</a>.</p>
<h3 id="the-math">The Math</h3>
<p><strong>Error budget = 100% - SLO target</strong></p>
<p>If your SLO is 99.5%, your error budget is 0.5%.</p>
<table><caption class="sr-only">SLO Target | Error Budget/Month | Weekly Burn Rate Estimate</caption>
<thead>
<tr>
<th>SLO Target</th>
<th>Error Budget/Month</th>
<th>Weekly Burn Rate Estimate</th>
</tr>
</thead>
<tbody><tr>
<td>99.9%</td>
<td>~43 minutes</td>
<td>~10 minutes</td>
</tr>
<tr>
<td>99.5%</td>
<td>~3.6 hours</td>
<td>~50 minutes</td>
</tr>
<tr>
<td>99%</td>
<td>~7.2 hours</td>
<td>~1.7 hours</td>
</tr>
</tbody></table>
<p><em>Weekly burn rate = monthly budget ÷ 4.33 weeks. Track weekly to avoid burning entire monthly budget early.</em></p>
<h3 id="how-teams-use-error-budgets">How Teams Use Error Budgets</h3>
<p><strong>The rule:</strong> If you have budget left, ship features. If you're burning budget, stop shipping and fix reliability.</p>
<p><strong>Example policy:</strong></p>
<ul>
<li>Weekly error budget drops below 50%? → Triage. Identify root cause.</li>
<li>Weekly error budget drops below 20%? → Feature freeze. Reliability becomes priority #1.</li>
<li>Error budget refills weekly. Start fresh every Monday.</li>
</ul>
<p>No more arguments about "is this urgent?"</p>
<p>Burning error budget = urgent. Not burning = queue it.</p>
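<p>The example policy boils down to a single decision function. A hypothetical sketch using the 50% and 20% thresholds from the policy above:</p>

```python
# Minimal sketch of the example policy. The only input is how much of
# this week's error budget remains; the output is what to do about it.

def priority(budget_remaining_fraction):
    """Map remaining weekly error budget to the team's priority."""
    if budget_remaining_fraction < 0.20:
        return "feature freeze: reliability is priority #1"
    if budget_remaining_fraction < 0.50:
        return "triage: identify root cause"
    return "ship features"

# 50-minute weekly budget (99.5% SLO), 45 minutes already burned:
print(priority((50 - 45) / 50))  # feature freeze: reliability is priority #1
```

<p>That's the whole point: the answer to "is this urgent?" becomes a lookup, not a debate.</p>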

<h2 id="how-to-set-your-first-slo-in-30-minutes">How to Set Your First SLO in 30 Minutes</h2>
<p>Here's the step-by-step process.</p>
<h3 id="step-1-pick-your-most-important-service-5-minutes">Step 1: Pick Your Most Important Service (5 minutes)</h3>
<p>Start with one service. The one customers complain about when it breaks.</p>
<p>API? Database? Frontend?</p>
<h3 id="step-2-choose-1-2-slis-10-minutes">Step 2: Choose 1-2 SLIs (10 minutes)</h3>
<p>Ask: "What do users notice when this breaks?"</p>
<p><strong>For an API:</strong></p>
<ul>
<li>Success rate (requests returning 2xx / total requests)</li>
<li>Latency (p95 response time)</li>
</ul>
<p><strong>For a database:</strong></p>
<ul>
<li>Query success rate</li>
<li>Query latency (p99)</li>
</ul>
<p><strong>For a frontend:</strong></p>
<ul>
<li>Page load time (p95)</li>
<li>Time to interactive</li>
</ul>
<p>Pick the one that matters most. Don't track everything.</p>
<h3 id="step-3-measure-current-performance-10-minutes">Step 3: Measure Current Performance (10 minutes)</h3>
<p>Pull the last 30 days of data.</p>
<p>What's your actual success rate? 99.2%? 99.7%? 98.5%?</p>
<p>Be honest. No aspirational numbers.</p>
<h3 id="step-4-set-target-slightly-below-reality-5-minutes">Step 4: Set Target Slightly Below Reality (5 minutes)</h3>
<ul>
<li>Current: 99.7%</li>
<li>Target SLO: 99.5%</li>
</ul>
<p>Give yourself buffer.</p>
<h3 id="done-you-have-an-slo">Done. You Have an SLO.</h3>
<p>Now track it weekly. When you burn error budget, investigate. When you have budget, ship features.</p>

<h2 id="slo-template-copy-paste">SLO Template (Copy-Paste)</h2>
<p>Use this to document your first SLO.</p>
<pre><code class="language-markdown">## SLO: [Service Name]

**Service:** [e.g., Payment API]
**Owner:** [Team name]
**Last updated:** [Date]

### SLI (What We Measure)
- Metric: [e.g., Request success rate]
- Definition: [e.g., HTTP 2xx responses / total requests]
- Measurement window: [e.g., Monthly, evaluated weekly]

### SLO (Our Target)
- Target: [e.g., 99.5% success rate]
- Current performance (last 30 days): [e.g., 99.7%]
- Error budget: [e.g., 0.5% = 216 minutes/month or ~50 minutes/week burn rate]

### SLA (External Promise) - Optional
- Customer promise: [e.g., 99.5% monthly uptime]
- Consequence: [e.g., 10% service credit if breached]
- Measurement period: [e.g., Monthly]

### Escalation Policy
- Error budget &lt; 50%: Triage, identify root cause
- Error budget &lt; 20%: Feature freeze, fix reliability
- Error budget refills: Weekly (every Monday)

### How We Measure
- Dashboard: [Link to dashboard]
- Alert: [Link to alert config]
- On-call: [Link to on-call schedule]
</code></pre>
<p>Copy this. Fill in the blanks. Combine it with <a href="/blog/incident-severity-levels">incident severity levels</a> to align response urgency.</p>

<h2 id="real-examples-what-this-looks-like-in-practice">Real Examples (What This Looks Like in Practice)</h2>
<p>Here are common patterns.</p>
<h3 id="example-1-api-service-b2b-saas">Example 1: API Service (B2B SaaS)</h3>
<p><strong>Service:</strong> User authentication API<br /><strong>SLI:</strong> Request success rate<br /><strong>Internal SLO:</strong> 99.7% weekly<br /><strong>External SLA:</strong> 99.5% monthly<br /><strong>Error budget:</strong> ~30 min/week (internal), ~3.6 hours/month (external)</p>
<p><strong>How they use it:</strong></p>
<ul>
<li>Daily dashboard shows weekly SLO burn rate</li>
<li>If the weekly success rate drops below 99.5%, all-hands triage</li>
<li>Sales can't promise below 99.5% without engineering sign-off</li>
<li>If error budget hits 20%, feature work pauses</li>
</ul>
<p><strong>Why it works:</strong> Clear line between "we're fine" and "drop everything."</p>
<h3 id="example-2-background-job-processing">Example 2: Background Job Processing</h3>
<p><strong>Service:</strong> Email sending queue<br /><strong>SLI:</strong> Processing time per job<br /><strong>Internal SLO:</strong> 95% of jobs processed within 5 minutes<br /><strong>External SLA:</strong> None (internal tool)<br /><strong>Error budget:</strong> 5% of jobs can exceed 5 minutes</p>
<p><strong>How they use it:</strong></p>
<ul>
<li>Jobs taking &gt; 5 minutes get logged</li>
<li>If more than 5% exceed threshold in a day, investigate</li>
<li>No external SLA because it's internal tooling</li>
</ul>
<p><strong>Why it works:</strong> Simple threshold, no customer promises needed.</p>
<h3 id="example-3-the-team-that-set-9999-and-regretted-it">Example 3: The Team That Set 99.99% and Regretted It</h3>
<p>A startup promised 99.99% uptime to land an enterprise deal.</p>
<p>The contract was $10k/month. The infrastructure to deliver 99.99%? $30k/month in redundancy, multi-region failover, and 24/7 on-call.</p>
<p><a href="/tools/oncall-builder">Build your schedule → Free On-Call Builder</a></p>
<p>Six months in, they renegotiated down to 99.5%. The customer didn't care (they never checked the SLA). Engineering stopped hemorrhaging budget.</p>
<p><strong>The lesson:</strong> Don't promise nines you can't afford.</p>

<h2 id="what-teams-get-wrong">What Teams Get Wrong</h2>
<h3 id="mistake-1-copying-999-without-doing-the-math">Mistake 1: Copying 99.9% Without Doing the Math</h3>
<p>99.9% uptime = ~8.7 hours/year downtime allowed<br />99.99% uptime = ~52 minutes/year downtime allowed</p>
<p>Closing that gap is often an order of magnitude more expensive.</p>
<p>Chase 99.99% because a competitor claimed it and you'll discover they measured it differently.</p>
<h3 id="mistake-2-setting-slos-you-can39t-measure">Mistake 2: Setting SLOs You Can't Measure</h3>
<p>Team sets 99.9% uptime but doesn't have:</p>
<ul>
<li>Automated monitoring</li>
<li>Clear definition of what counts as "down"</li>
<li>Alerting when they're out of SLO</li>
</ul>
<p>Your SLO is 99.9%. Someone asks "how did we do last month?" and the answer is "we haven't set that up yet."</p>
<p>That's not an SLO. That's a goal written on a napkin.</p>
<h3 id="mistake-3-no-buffer-between-internal-and-external">Mistake 3: No Buffer Between Internal and External</h3>
<p>Team sets:</p>
<ul>
<li>Internal SLO: 99.5%</li>
<li>External SLA: 99.5%</li>
</ul>
<p>First bad week? Immediate SLA breach. Customer credits. Angry emails.</p>
<p><strong>Better:</strong></p>
<ul>
<li>Internal SLO: 99.7%</li>
<li>External SLA: 99.5%</li>
<li>Buffer: 0.2% wiggle room</li>
</ul>
<p>That buffer gives you space to have a bad week without breaching contracts.</p>
<h3 id="mistake-4-too-many-slos">Mistake 4: Too Many SLOs</h3>
<p>Team tracks 15 SLOs across 3 services.</p>
<p>Result: Everything's yellow. Nothing's a priority. Analysis paralysis.</p>
<p><strong>Better:</strong> 1-2 SLOs per service. Track what matters. Ignore the rest.</p>
<h3 id="mistake-5-slos-nobody-checks">Mistake 5: SLOs Nobody Checks</h3>
<p>Team sets SLOs in a wiki. Nobody looks at them until a customer complains.</p>
<p><strong>Better:</strong> Daily dashboard. Weekly review. Automated alerts when burning error budget.</p>
<p>If nobody's checking your SLO, you don't have an SLO.</p>

<h2 id="error-budget-calculator">Error Budget Calculator</h2>
<p>Use this to calculate your error budget.</p>
<p><strong>Formula:</strong></p>
<pre><code>Error budget (minutes/month) = (100% - SLO%) × 43,200 minutes
where 43,200 = minutes in a 30-day month (30 × 24 × 60)
</code></pre>
<p><strong>Examples:</strong></p>
<table><caption class="sr-only">SLO | Calculation | Error Budget/Month</caption>
<thead>
<tr>
<th>SLO</th>
<th>Calculation</th>
<th>Error Budget/Month</th>
</tr>
</thead>
<tbody><tr>
<td>99.9%</td>
<td>(100% - 99.9%) × 43,200</td>
<td>43.2 minutes</td>
</tr>
<tr>
<td>99.5%</td>
<td>(100% - 99.5%) × 43,200</td>
<td>216 minutes (3.6 hours)</td>
</tr>
<tr>
<td>99%</td>
<td>(100% - 99%) × 43,200</td>
<td>432 minutes (7.2 hours)</td>
</tr>
<tr>
<td>95%</td>
<td>(100% - 95%) × 43,200</td>
<td>2,160 minutes (36 hours)</td>
</tr>
</tbody></table>
<p><strong>Weekly estimate (from a monthly SLO):</strong><br />Divide the monthly minutes by 4.33 (weeks per month)</p>
<p>99.5% monthly SLO = ~50 minutes/week error budget</p>
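<p>If you'd rather compute than look up, the formula above is a one-liner. A sketch that reproduces the table rows, assuming a 30-day month:</p>

```python
# The error budget formula as code, checked against the table above.
# 43,200 = minutes in a 30-day month (30 × 24 × 60).

def error_budget_minutes(slo_pct):
    """Allowed downtime minutes per 30-day month for a given SLO."""
    return (100 - slo_pct) / 100 * 43_200

for slo in (99.9, 99.5, 99.0, 95.0):
    monthly = error_budget_minutes(slo)
    # Weekly estimate: monthly budget ÷ 4.33 weeks per month
    print(f"{slo}% -> {monthly:g} min/month, ~{monthly / 4.33:.0f} min/week")
```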

<h2 id="quick-reference">Quick Reference</h2>
<table><caption class="sr-only">Term | What It Is | Who Sets It | Example | Public?</caption>
<thead>
<tr>
<th>Term</th>
<th>What It Is</th>
<th>Who Sets It</th>
<th>Example</th>
<th>Public?</th>
</tr>
</thead>
<tbody><tr>
<td><strong>SLI</strong></td>
<td>The metric you track</td>
<td>Engineering</td>
<td>Error rate, latency</td>
<td>No</td>
</tr>
<tr>
<td><strong>SLO</strong></td>
<td>Your internal target</td>
<td>Engineering</td>
<td>99.5% success rate</td>
<td>No</td>
</tr>
<tr>
<td><strong>SLA</strong></td>
<td>Your external promise</td>
<td>Business/Legal</td>
<td>"99.5% uptime or 10% credit"</td>
<td>Yes</td>
</tr>
</tbody></table>
<p><strong>Key insight:</strong> SLIs and SLOs are for engineering. SLAs are for customers and contracts.</p>

<h2 id="the-bottom-line">The Bottom Line</h2>
<ul>
<li><strong>SLI</strong> = what you measure (pick what users notice, not what's easy)</li>
<li><strong>SLO</strong> = your internal target (set it below current reality, not aspirational)</li>
<li><strong>SLA</strong> = your external promise (only if selling to enterprises)</li>
</ul>
<p>Use error budgets to drive prioritization. Stop arguing about "is this urgent?" Let your error budget decide.</p>
<p>Start with 1 service, 1-2 SLIs, 1 SLO. Add complexity only when needed.</p>
<p>If you're setting SLOs based on competitor claims, you'll end up optimizing the wrong thing. Set them based on what you can actually deliver, then improve.</p>

<h2 id="common-questions">Common Questions</h2>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What's the difference between SLO and SLA?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    SLO = internal target (what engineering aims for). SLA = external promise with contractual consequences (what customers get). Your SLO should be stricter than your SLA to give yourself buffer.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What SLO should I set?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Look at your last 30 days of actual performance. Set the target slightly below that (0.2-0.5% buffer). Don't copy-paste 99.9% because it sounds good.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Do I need an SLA?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Only if you're selling to enterprises that require contractual guarantees. Most startups don't need SLAs until Series B+. Internal tools never need SLAs.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    How many SLOs should I have?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Start with 1-2 per service. More than that and you're tracking everything, prioritizing nothing. Focus beats coverage.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What if we miss our SLO?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Nothing happens contractually (that's what SLAs are for). But if you miss consistently, either (1) you have a reliability problem, or (2) your target is wrong. Investigate which.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    How do I calculate error budget?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Error budget = 100% - SLO target. For 99.5% SLO, error budget is 0.5%. In a 30-day month (43,200 minutes), that's 216 minutes or 3.6 hours of allowed downtime.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What's a realistic SLO for a startup?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    99% to 99.5% is realistic for most startups. 99.9% requires significant investment. 99.99% is overkill unless you're in fintech, healthcare, or selling to enterprises with hard requirements.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Should internal tools have SLOs?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Only if they're critical. Your deployment pipeline? Maybe. Your internal wiki? Probably not. Don't create SLO overhead for tools that don't need it.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    How often should I review SLOs?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Weekly for error budget burn. Quarterly for target adjustments. If your performance drifts significantly (up or down), update the SLO target.
  </div>
</details>

<h2 id="next-reads">Next Reads</h2>
<ul>
<li><a href="/blog/incident-severity-levels">Incident Severity Levels: The Framework That Actually Works</a></li>
<li><a href="/blog/how-to-reduce-mttr">How to Reduce MTTR in 2026: The Coordination Framework</a></li>
<li><a href="/blog/scaling-incident-management">Incident Management at Scale: Research from 25+ Teams</a></li>
</ul>

]]></content:encoded>
      <pubDate>Mon, 26 Jan 2026 00:00:00 GMT</pubDate>
      <author>Niketa Sharma</author>
      <category><![CDATA[sla]]></category>
      <category><![CDATA[slo]]></category>
      <category><![CDATA[sli]]></category>
      <category><![CDATA[sre]]></category>
      <category><![CDATA[reliability]]></category>
      <category><![CDATA[metrics]]></category>
      <category><![CDATA[incident-management]]></category>
      <category><![CDATA[error-budget]]></category>
      <category><![CDATA[uptime]]></category>
    </item>
    <item>
      <title><![CDATA[Runbook vs Playbook: The Difference That Confuses Everyone]]></title>
      <link>https://runframe.io/blog/runbook-vs-playbook</link>
      <guid>https://runframe.io/blog/runbook-vs-playbook</guid>
      <description><![CDATA[Recently, an engineering lead asked us a question that keeps coming up:

"What's the difference between a runbook and a playbook? I feel like everyone uses them interchangeably."

He wasn't wrong. We'...]]></description>
      <content:encoded><![CDATA[<p>Recently, an engineering lead asked us a question that keeps coming up:</p>
<blockquote>
<p>"What's the difference between a runbook and a playbook? I feel like everyone uses them interchangeably."</p>
</blockquote>
<p>He wasn't wrong. We've seen plenty of teams with a "runbook" that's actually a playbook, and vice versa. The confusion isn't just semantics; it causes real problems.</p>
<p>Your incident responder grabs the "runbook" looking for who to notify, but finds 50 pages of Linux commands instead.</p>
<p>Or your engineer opens the "playbook" expecting step-by-step instructions for restarting Kafka, but gets a vague "coordinate with stakeholders" paragraph instead.</p>
<p>This pattern shows up repeatedly once teams start running real on-call: runbooks and playbooks serve completely different purposes, and conflating them wastes time during outages.</p>
<p><a href="/tools/oncall-builder">Build your schedule → Free On-Call Builder</a></p>
<p>Here's the difference.</p>
<p>During incidents, runbooks help you execute fixes; playbooks help you coordinate people.</p>

<h2 id="what-you39ll-learn">What You'll Learn</h2>
<ul>
<li>What a runbook actually is (and what it's for)</li>
<li>What a playbook actually is (and what it's for)</li>
<li>The runbook vs playbook difference in one comparison table</li>
<li>Copy-paste templates for both (15-minute playbook, 30-minute runbook)</li>
<li>When to create each (and why most teams need both)</li>
<li>A few real-world failure modes (what breaks when you mix them up)</li>
</ul>
<p><img src="/images/articles/runbook-vs-playbook/runbook-vs-playbook.png" alt="Runbook vs Playbook comparison: technical commands and scripts vs team coordination, roles, escalation rules, and communication" /></p>

<h2 id="what-is-a-runbook">What is a Runbook?</h2>
<p>A runbook is <strong>operational documentation</strong>. It's the step-by-step instructions for performing a specific technical task.</p>
<p>Think: "How do I restart the database cluster?" or "What's the exact command to flush the Redis cache?"</p>
<p>Runbooks are written for <strong>automation or precise human execution</strong>. They assume the reader knows <em>what</em> to do; they just need to know <em>how</em>.</p>
<p><strong>A runbook looks like this:</strong></p>
<pre><code class="language-bash"># Flush the current Redis database
redis-cli FLUSHDB

# Verify the flush
redis-cli DBSIZE
# Expected output: (integer) 0

# If the flush fails, check replication status
redis-cli INFO replication
</code></pre>
<p>Notice what's missing: no discussion of who to notify, no decision trees, no "if this happens, page that person." That's not what a runbook is for.</p>
<p>One engineer described it as: "Our runbooks are basically scripts in plain English. They're the cheat sheet I wish I had when I joined."</p>
<p><strong>Runbooks work best for:</strong></p>
<ul>
<li>Repetitive operational tasks (deployments, restarts, backups)</li>
<li>Complex command sequences ("always run X before Y")</li>
<li>Reducing human error in high-stress situations</li>
<li>Onboarding (new engineers can follow the steps safely)</li>
</ul>
<p><em>See also: <a href="/learn">Runbook definition in the DevOps &amp; SRE glossary</a></em></p>

<h2 id="what-is-a-playbook">What is a Playbook?</h2>
<p>A playbook is <strong>coordination documentation</strong>. It's the who, what, and when of incident response, not the technical how.</p>
<p>Think: "Who declares an incident?" "When do we page the VP?" "What do we tell customers?"</p>
<p>Playbooks are written for <strong>humans making decisions under pressure</strong>. They assume the reader knows <em>how</em> to fix the technical problem; they need to know <em>who</em> should do what.</p>
<p><strong>A playbook looks like this:</strong></p>
<pre><code class="language-markdown">## SEV-2 Incident Declaration

Who can declare: Any engineer
Where: #incidents
What to include:
- Severity level (SEV-0/1/2/3)
- Service affected
- Customer impact (Yes/No)
- Current status (Investigating / Identified / Monitoring / Resolved)

Within 5 minutes:
- @ mention Incident Commander in #incidents
- IC assigns roles (Communications Lead, Scribe)
- If customer-impacting: Customer Support notified within 10 min

Escalation:
- 30 min unresolved → IC pages Engineering Manager
- 60 min unresolved → EM pages VP Engineering
</code></pre>
<p>Notice the difference: no bash commands, no technical implementation details. The playbook is about <em>people and process</em>, not <em>machines</em>.</p>
<p><strong>Playbooks work best for:</strong></p>
<ul>
<li>Incident response (who does what, when)</li>
<li>Communication templates (what to say to customers)</li>
<li>Escalation rules (when to page whom)</li>
<li>Role clarity (who's in charge of what)</li>
</ul>
<p><em>See also: <a href="/learn">Playbook definition in the DevOps &amp; SRE glossary</a></em></p>

<h2 id="the-key-differences-quick-reference">The Key Differences (Quick Reference)</h2>
<table><caption class="sr-only">Aspect | Runbook | Playbook</caption>
<thead>
<tr>
<th>Aspect</th>
<th>Runbook</th>
<th>Playbook</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Purpose</strong></td>
<td>Technical execution</td>
<td>Team coordination</td>
</tr>
<tr>
<td><strong>Written for</strong></td>
<td>Automation or precise human steps</td>
<td>Humans making decisions</td>
</tr>
<tr>
<td><strong>Answers</strong></td>
<td>"How do I do X?"</td>
<td>"Who handles X?"</td>
</tr>
<tr>
<td><strong>Content</strong></td>
<td>Commands, scripts, technical steps</td>
<td>Roles, communication, escalation</td>
</tr>
<tr>
<td><strong>Usage</strong></td>
<td>During investigation &amp; fix</td>
<td>During entire incident lifecycle</td>
</tr>
<tr>
<td><strong>Updates</strong></td>
<td>When infrastructure changes</td>
<td>When process or team changes</td>
</tr>
<tr>
<td><strong>Example</strong></td>
<td>"How to flush Redis cache"</td>
<td>"Who declares a SEV-2 incident"</td>
</tr>
</tbody></table>
<p>This is the framework most teams settle on after a few painful incidents.</p>

<h2 id="which-do-you-need">Which Do You Need?</h2>
<p>The answer is almost always: <strong>both</strong>.</p>
<p>Here's why:</p>
<p><strong>Runbooks without playbooks:</strong> Your engineers know exactly how to restart the database. But nobody knows who's supposed to communicate with customers, or when to escalate to the VP. You resolve the technical incident quickly, but the <em>coordination</em> incident drags on for hours.</p>
<p><strong>Playbooks without runbooks:</strong> Everyone knows their role. The Incident Commander is assigned, and the Communications Lead is drafting customer emails. But the person investigating has to fumble through Stack Overflow because nobody documented how to restart your custom service. The incident takes longer than necessary.</p>
<p>A common failure mode: the IC knows the process, but the fixer is still guessing the commands. That's when teams end up writing both.</p>
<p><strong>The sweet spot:</strong> Start with playbooks. They're higher leverage. Then build runbooks for your most common failure modes (database issues, cache problems, third-party API failures).</p>

<h2 id="how-to-build-your-first-playbook-15-minute-template">How to Build Your First Playbook (15-Minute Template)</h2>
<p>Start here. Copy this template into your incident management system.</p>
<h3 id="basic-incident-playbook-template">Basic Incident Playbook Template</h3>
<p><strong>Severity Levels:</strong></p>
<ul>
<li>SEV-0: Critical (revenue stopped, security breach)</li>
<li>SEV-1: High (major feature down, large customer impact)</li>
<li>SEV-2: Medium (degraded performance, some users affected)</li>
<li>SEV-3: Low (minor issue, workaround available)</li>
</ul>
<p><strong>Who Declares Incidents:</strong><br />Anyone on the engineering team</p>
<p><strong>Where:</strong><br />#incidents Slack channel</p>
<p><strong>Incident Commander Role:</strong></p>
<ul>
<li>Assigns roles (Communications Lead, Scribe)</li>
<li>Makes decisions</li>
<li>Calls incident resolved</li>
</ul>
<p><strong>Escalation Rules:</strong></p>
<ul>
<li>SEV-0/1: Page on-call lead immediately</li>
<li>30 min unresolved → Page Engineering Manager</li>
<li>60 min unresolved → Page VP Engineering</li>
</ul>
<p><strong>Customer Communication:</strong></p>
<ul>
<li>Customer-impacting? → Notify Support within 10 min</li>
<li>Communications Lead drafts status page update</li>
<li>IC approves before publishing</li>
</ul>
<p>That's it. You just built a playbook.</p>

<h2 id="how-to-build-your-first-runbook-30-minute-template">How to Build Your First Runbook (30-Minute Template)</h2>
<p>Pick your most common incident. Document it.</p>
<h3 id="basic-runbook-template">Basic Runbook Template</h3>
<p><strong>Title:</strong> How to Restart the API Service</p>
<p><strong>When to use this:</strong></p>
<ul>
<li>API health check failing</li>
<li>5xx errors above 5%</li>
<li>Customer reports "can't log in"</li>
</ul>
<p><strong>Prerequisites:</strong></p>
<ul>
<li>SSH access to production</li>
<li>kubectl access to k8s cluster</li>
</ul>
<p><strong>Steps:</strong></p>
<ol>
<li>Check current status</li>
</ol>
<pre><code class="language-bash">kubectl get pods -n production | grep api
</code></pre>
<p>Expected: 3/3 pods running</p>
<ol start="2">
<li>Identify failing pod</li>
</ol>
<pre><code class="language-bash">kubectl describe pod api-xxx -n production
</code></pre>
<p>Look for: CrashLoopBackOff or OOMKilled</p>
<ol start="3">
<li>Restart the service</li>
</ol>
<pre><code class="language-bash">kubectl rollout restart deployment/api -n production
</code></pre>
<ol start="4">
<li>Verify restart</li>
</ol>
<pre><code class="language-bash">kubectl rollout status deployment/api -n production
</code></pre>
<p>Expected: "successfully rolled out"</p>
<ol start="5">
<li>Confirm health</li>
</ol>
<pre><code class="language-bash">curl https://api.yourcompany.com/health
</code></pre>
<p>Expected: 200 OK</p>
<p><strong>If this doesn't work:</strong></p>
<ul>
<li>Check database connectivity</li>
<li>Review recent deployments</li>
<li>Page database on-call</li>
</ul>
<p><strong>Last updated:</strong> 2026-01-24<br /><strong>Owner:</strong> Platform team</p>
<p>Done. You just built a runbook.</p>

<h2 id="real-world-scenarios-composite-examples">Real-World Scenarios (Composite Examples)</h2>
<p>These are composites of patterns teams hit; details are anonymized.</p>
<h3 id="the-team-that-learned-the-hard-way">The Team That Learned the Hard Way</h3>
<p>A Series B infrastructure team had extensive runbooks. Pages of documented commands for every service.</p>
<p>But during a SEV-1, nobody knew who was supposed to talk to the CEO. The Incident Commander thought the VP would handle it. The VP thought the IC would handle it. The CEO found out from a customer tweet.</p>
<p>Their fix: a simple playbook with a "Who communicates with executives?" section. They still have the runbooks; they just added the coordination layer on top.</p>
<h3 id="the-team-that-kept-it-simple">The Team That Kept It Simple</h3>
<p>A 20-person startup didn't have bandwidth for extensive documentation. They started with a one-page playbook:</p>
<ul>
<li>Who declares incidents (anyone)</li>
<li>Where they're declared (#incidents)</li>
<li>Three severity levels (SEV-0/1/2)</li>
<li>When to page whom</li>
</ul>
<p>That's it. No runbooks initially. When incidents happened, they added runbook sections for the specific things that kept breaking. Six months later, they had a lightweight but complete system.</p>
<p>Their approach was simple: playbook first, runbooks as incidents repeat.</p>
<h3 id="the-team-that-automated">The Team That Automated</h3>
<p>A 50-person company took it a step further. Their runbooks were literally executable scripts. When an incident hit, the engineer on call could either:</p>
<ol>
<li>Follow the runbook manually (step-by-step commands)</li>
<li>Run the automated script that <em>was</em> the runbook</li>
</ol>
<p>Their playbook sat on top, describing who should run which script and when to escalate if the script failed.</p>
<p>This is the ideal state: runbooks become executable, playbooks stay human-readable.</p>
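<p>A minimal sketch of that pattern, reusing the restart runbook from earlier. The namespace, deployment name, and health URL are illustrative assumptions, and the <code>DRY_RUN</code> flag defaults to on so the script can be rehearsed safely outside an incident:</p>

```shell
#!/usr/bin/env bash
# Executable-runbook sketch: the manual restart steps, wrapped in a script.
# Deployment, namespace, and health URL are placeholders, not a real environment.
set -euo pipefail

NAMESPACE="${NAMESPACE:-production}"
DEPLOYMENT="${DEPLOYMENT:-api}"
HEALTH_URL="${HEALTH_URL:-https://api.yourcompany.com/health}"
DRY_RUN="${DRY_RUN:-1}"   # default: print each command instead of running it

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"           # rehearsal mode: show what would run
  else
    "$@"
  fi
}

run kubectl get pods -n "$NAMESPACE"                                  # check status
run kubectl rollout restart "deployment/$DEPLOYMENT" -n "$NAMESPACE"  # restart
run kubectl rollout status "deployment/$DEPLOYMENT" -n "$NAMESPACE"   # wait for rollout
run curl -fsS "$HEALTH_URL"                                           # confirm health
```

<p>Run with <code>DRY_RUN=0</code> during a real incident. The playbook layer still decides who runs it and when to escalate if it fails.</p>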
<h3 id="the-team-that-wasted-2-hours">The Team That Wasted 2 Hours</h3>
<p>A 30-person startup had a great playbook. Everyone knew their roles. Incident Commander was clear, Communications Lead handled customer updates.</p>
<p>But when their Postgres database locked up, the on-call engineer spent 2 hours Googling "how to kill postgres connections safely." They'd had this incident before. Three times. Nobody had documented the fix.</p>
<p>After that incident, they created a simple runbook: "How to Kill Postgres Connections Without Downtime." Took 20 minutes to write. Saved 2 hours on the next incident.</p>
<p>The lesson: Runbooks don't need to be comprehensive. Document the thing that keeps breaking.</p>

<h2 id="the-bottom-line">The Bottom Line</h2>
<ul>
<li><strong>Runbooks are for execution</strong>. They answer "how do I do this technically?"</li>
<li><strong>Playbooks are for coordination</strong>. They answer "who handles this, and when?"</li>
<li><strong>Most teams need both.</strong> Start with playbooks (higher leverage), add runbooks for common failures</li>
<li><strong>Don't conflate them.</strong> A runbook that's trying to be a playbook does neither well</li>
<li><strong>Keep them separate.</strong> Runbooks go in your code repo or docs. Playbooks live in your incident response system</li>
</ul>
<p>One fixes the tech. The other coordinates the humans.</p>
<p>Most teams end up with both, playbook first, runbooks for repeat failures.</p>

<h2 id="common-questions">Common Questions</h2>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Which should I build first?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Playbooks. They solve the coordination tax that slows down every incident. Runbooks are useful, but optional for small teams.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Can a single document be both?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Technically yes, but it's usually a mess. Keep them separate. Runbooks in your technical docs, playbooks in your incident management system.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    How detailed should runbooks be?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Detailed enough that a new engineer can follow them without guessing. Vague runbooks ("check the logs") are worse than no runbooks.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Do playbooks need to be complicated?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    No. A one-page document with severity levels, roles, and escalation rules works for most teams under 100 people.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What if we're too small for this?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Start with a one-page playbook. That's it. You can skip runbooks entirely until you hit scale.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What tools should I use for runbooks?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Keep it simple. Git repo, Markdown files in your docs, or a wiki (Notion, Confluence). The best tool is the one your team actually uses. We've seen teams use everything from Google Docs to specialized runbook software. The format matters less than the content.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What tools should I use for playbooks?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Your incident management system is the best place. If you're using <a href="/blog/incident-response-playbook">Slack for incident management</a>, pin the playbook to your #incidents channel. If you're using a dedicated tool, store it there. The key: make it visible during incidents, not buried in a wiki nobody checks.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    How often should I update runbooks?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
Update them when your infrastructure changes. Deployed a new service? Update the runbook. Changed your Redis configuration? Update the runbook. A stale runbook is worse than no runbook: someone will follow it and make things worse.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    How often should I update playbooks?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Update them when your team or process changes. New escalation path? Update the playbook. Added a customer support team? Update who gets notified. Playbooks have a longer shelf life than runbooks, but they still need refreshing every few months.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What's the difference between a runbook and a runbook in incident response?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
Same thing, different context. "Runbook" is the general term for step-by-step technical documentation. An "incident response runbook" is a runbook you use during an incident. The structure is identical: commands, expected outputs, what to do if it fails.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Do I need an incident response runbook if I have a playbook?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Yes. Your playbook tells you <em>who</em> does what. Your incident response runbook tells you <em>how</em> to fix the specific technical problem. They work together.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Can I automate runbooks?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Yes, and you should. Many teams convert their runbooks into executable scripts over time. Start with human-readable commands, then automate as you gain confidence. The playbook describes when to run the automated script and what to do if it fails.
  </div>
</details>

<h2 id="next-reads">Next Reads</h2>
<ul>
<li><a href="/blog/incident-severity-levels">Incident Severity Levels: The Framework That Actually Works</a></li>
<li><a href="/blog/on-call-rotation-guide">On-Call Rotation: Primary + Backup Schedule, Escalation Rules, and Handoffs</a></li>
<li><a href="/blog/post-incident-review-template">Post-Incident Review Templates: What Works (3 Ready-to-Use)</a></li>
</ul>

]]></content:encoded>
      <pubDate>Sat, 24 Jan 2026 00:00:00 GMT</pubDate>
      <author>Niketa Sharma</author>
      <category><![CDATA[runbook]]></category>
      <category><![CDATA[playbook]]></category>
      <category><![CDATA[incident-management]]></category>
      <category><![CDATA[sre]]></category>
      <category><![CDATA[devops]]></category>
      <category><![CDATA[incident-response]]></category>
      <category><![CDATA[documentation]]></category>
      <category><![CDATA[engineering-productivity]]></category>
    </item>
    <item>
      <title><![CDATA[OpsGenie Shutdown 2027: The Complete Migration Guide]]></title>
      <link>https://runframe.io/blog/opsgenie-migration-guide</link>
      <guid>https://runframe.io/blog/opsgenie-migration-guide</guid>
      <description><![CDATA[OpsGenie support ends April 5, 2027. That date might feel distant.
Teams who already migrated will tell you otherwise. It takes longer than expected.
We interviewed 25 engineering teams about incident...]]></description>
      <content:encoded><![CDATA[<p>OpsGenie support ends April 5, 2027. That date might feel distant.</p>
<p>Teams who already migrated will tell you otherwise. It takes longer than expected.</p>
<p>We interviewed 25 engineering teams about incident management. Three were using OpsGenie and shared their migration experiences. Most knew the shutdown was coming but hadn't started planning. They were waiting.</p>
<p>Here's what those 3 teams learned, the mistakes they made, and what works when migrating off OpsGenie. If you're still deciding which tool to migrate to, start with our <a href="/blog/best-opsgenie-alternatives">OpsGenie alternatives comparison</a> — it covers what changed in the market since mid-2025.</p>
<p>You're not just swapping tools. Atlassian is pushing everyone to Jira Service Management or Compass. Both handle alerting and on-call. Several teams we talked to considered leaving Atlassian rather than choosing between JSM and Compass. <a href="/tools/oncall-builder">Build your schedule → Free On-Call Builder</a></p>
<h2 id="opsgenie-end-of-life-timeline">OpsGenie End of Life Timeline</h2>
<p><strong>Key dates</strong> (<a href="https://www.atlassian.com/software/opsgenie/migration" target="_blank" rel="noopener noreferrer">source</a>):</p>
<table><caption class="sr-only">Date | What Happens | Impact</caption>
<thead>
<tr>
<th>Date</th>
<th>What Happens</th>
<th>Impact</th>
</tr>
</thead>
<tbody><tr>
<td>June 4, 2025</td>
<td>New sales stopped</td>
<td>Complete</td>
</tr>
<tr>
<td>April 5, 2027</td>
<td>End of support</td>
<td>Everyone must migrate</td>
</tr>
</tbody></table>
<h3 id="what-atlassian-is-doing">What Atlassian is doing</h3>
<p>Atlassian is moving OpsGenie users to Jira Service Management (IT ops + incident workflows) or Compass (alerting + on-call + software catalog).</p>
<p>The problem? Most teams had one tool. Now Atlassian wants them to pick between two. Or pay for both. That's why some teams consider third-party tools instead of choosing between JSM and Compass.</p>
<h3 id="why-teams-migrate-early">Why teams migrate early</h3>
<p>From our interviews, teams who waited regretted it. Migration takes 4-8 weeks for basic setups. Complex setups with many integrations took 8-16 weeks. Rushed migrations cause incidents during cutover.</p>
<p>Teams who migrated successfully started early, tested thoroughly, and ran both tools in parallel before switching.</p>
<h2 id="what-3-teams-told-us-about-migrating-from-opsgenie">What 3 Teams Told Us About Migrating from OpsGenie</h2>
<p>We talked to 25 teams about incident management. Three were using OpsGenie and shared migration stories. Here's what happened.</p>
<h3 id="most-teams-were-waiting">Most teams were waiting</h3>
<p>All 3 knew about April 2027. But none were being proactive. They knew it was coming. They weren't doing much about it.</p>
<p>Teams who migrated successfully started planning months ahead and ran parallel systems before cutover. Starting late increases incident risk during migration.</p>
<h3 id="timeline-reality-check">Timeline reality check</h3>
<p><strong>What teams expected:</strong> "2 weeks to migrate."</p>
<p><strong>What actually happened:</strong> 6-8 weeks for simple setups. 8-16 weeks for complex ones.</p>
<p>Everyone underestimated the timeline by 2-3x. Just migrating <a href="/blog/on-call-rotation-guide">on-call schedules</a> took 1-2 weeks for teams with complex rotations.</p>
<h3 id="what-teams-struggled-with">What teams struggled with</h3>
<p><strong>Timeline.</strong> Everyone thought 2 weeks. Reality was 6-8 weeks minimum. Start earlier than you think.</p>
<p><strong>On-call schedules.</strong> CSV exports don't import cleanly into other tools. Most teams rebuilt schedules manually. Took 1-2 weeks.</p>
<p><strong>Integrations.</strong> One team had 18 integrations. Five didn't have replacements in the new tool. Budget time to rebuild from scratch.</p>
<p><strong>Coordination.</strong> Switching tools didn't fix coordination problems. If your issue is context switching during incidents, a new tool alone won't solve it unless it's designed for coordination.</p>
<p><strong>Buyer remorse.</strong> One team picked the cheapest option and regretted it at scale. Three months later, they migrated again. If you're weighing building your own, our <a href="/blog/incident-management-build-or-buy">build vs buy breakdown</a> covers the real 3-year TCO.</p>
<h3 id="common-regrets">Common regrets</h3>
<p>Every team had at least one:</p>
<ol>
<li><strong>Not auditing integrations first.</strong> Some have no direct replacements.</li>
<li><strong>Underestimating schedule migration time.</strong> CSV exports rarely import cleanly.</li>
<li><strong>Focusing on alerting features instead of coordination workflows.</strong></li>
<li><strong>Not testing with real incidents before cutover.</strong> Teams we spoke to who skipped this were more likely to hit cutover issues.</li>
<li><strong>Choosing on price alone.</strong> Led to re-migration later.</li>
</ol>
<h2 id="staying-on-atlassian-jsm-vs-compass">Staying on Atlassian: JSM vs Compass</h2>
<p>Before looking at OpsGenie alternatives, understand what Atlassian offers. You're not losing incident management. You're moving to a different Atlassian product.</p>
<h3 id="the-two-atlassian-options">The two Atlassian options</h3>
<p><strong>Jira Service Management (JSM)</strong></p>
<p>JSM is positioned as IT operations and service management. Beyond alerting and on-call, JSM includes incident management workflows, change and problem management, service request portals, asset management and knowledge base, plus Jira integration.</p>
<p>JSM works for teams with compliance requirements but feels complex for Slack-native startups. Built for ITIL and ITSM teams who need full service management.</p>
<p><strong>Jira Compass</strong></p>
<p>Compass targets engineering teams with alerting, on-call, and a software catalog. Key features: alerting and on-call scheduling, escalation policies, software catalog for services and dependencies. Less ITSM overhead than JSM.</p>
<p>Compass is for engineering teams who want incident response without ITSM complexity.</p>
<h3 id="reality-check">Reality check</h3>
<p>Most teams we talked to didn't want to navigate this choice. They had one tool (OpsGenie). They didn't want to figure out JSM vs Compass. Or pay for both.</p>
<p>That's why some teams in our research considered third-party tools.</p>
<h2 id="opsgenie-data-export-and-parallel-run">OpsGenie Data Export and Parallel Run</h2>
<p>Can you run OpsGenie in parallel with your new tool? How long do you have to export data?</p>
<h3 id="data-export-window">Data export window</h3>
<p>OpsGenie access ends April 5, 2027, and unmigrated data will be deleted after that date. Export well before then (e.g., by March 2027) to avoid last-minute risk.</p>
<p><strong>What you can export:</strong></p>
<ul>
<li>On-call schedules (API or CSV)</li>
<li>User lists and roles</li>
<li>Integration configurations</li>
<li>Escalation policies and routing rules</li>
<li>Incident history and alert logs</li>
</ul>
<p><strong>Warning:</strong> Teams report CSV exports don't import cleanly. Budget time to rebuild schedules manually.</p>
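<p>A hedged sketch of the export step. It assumes the OpsGenie REST API's <code>GET /v2/schedules</code> endpoint with <code>GenieKey</code> auth (verify against Atlassian's API docs before relying on it); the sample JSON below stands in for a real export so the extraction can be rehearsed offline:</p>

```shell
#!/usr/bin/env bash
# Sketch: list schedule names from an OpsGenie schedules export.
# The real export would be roughly:
#   curl -H "Authorization: GenieKey $OPSGENIE_API_KEY" \
#     https://api.opsgenie.com/v2/schedules > schedules.json
# (endpoint assumed from the OpsGenie v2 API; confirm before relying on it)
set -euo pipefail

# Stand-in for a real export file:
cat > schedules.json <<'EOF'
{"data":[{"id":"1","name":"platform-oncall","timezone":"UTC"},
         {"id":"2","name":"db-oncall","timezone":"UTC"}]}
EOF

# Pull out the "name" fields (with jq installed you'd use: jq -r '.data[].name')
grep -o '"name":"[^"]*"' schedules.json | sed 's/"name":"\(.*\)"/\1/'
```

<p>Use the extracted names as a checklist when rebuilding rotations in the new tool, rather than expecting the raw export to import cleanly.</p>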
<h3 id="running-parallel-systems">Running parallel systems</h3>
<p>You can and should run both tools during migration. After migration, you'll have up to 120 days before OpsGenie is permanently shut down (you can turn it off sooner). Plan your parallel run inside that window.</p>
<p><strong>Recommended parallel schedule:</strong></p>
<ul>
<li>Week 1-2: OpsGenie active, new tool testing</li>
<li>Week 3-4: Route 25-50% alerts to new tool</li>
<li>Week 5-6: Route 100% alerts to new tool, keep OpsGenie as backup</li>
<li>Week 7-8: Decommission OpsGenie</li>
</ul>
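<p>One way teams implement the 25-50% split is deterministic bucketing in whatever relay sits between monitoring and the two tools. A sketch, with a hypothetical <code>route_alert</code> helper (the only requirement is a hash that's stable per alert, so retries land on the same side):</p>

```shell
#!/usr/bin/env bash
# Sketch: route a stable percentage of alerts to the new tool during a
# parallel run. route_alert is hypothetical glue, not a vendor feature.
set -euo pipefail

route_alert() {
  local alert_id="$1" percent="$2" bucket
  # cksum gives a stable CRC, so the same alert always lands in the same bucket
  bucket=$(( $(printf '%s' "$alert_id" | cksum | cut -d' ' -f1) % 100 ))
  if [ "$bucket" -lt "$percent" ]; then
    echo "new-tool"
  else
    echo "opsgenie"
  fi
}

route_alert "alert-4711" 25    # weeks 3-4: 25% of alerts to the new tool
route_alert "alert-4711" 100   # weeks 5-6: everything to the new tool
```

<p>Because the bucket is derived from the alert ID rather than random, turning the percentage up is a one-line config change and the same alert never flip-flops between tools mid-incident.</p>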
<p><strong>Why parallel matters:</strong> You can roll back immediately if something breaks. Teams we spoke to who cut over without a parallel run were more likely to hit incidents during migration.</p>
<p><strong>Cost consideration:</strong> Yes, you pay for both tools temporarily. An incident during a rushed migration costs more than a few weeks of duplicate subscriptions.</p>
<h2 id="opsgenie-alternatives-7-tools-teams-actually-chose">OpsGenie Alternatives: 7 Tools Teams Actually Chose</h2>
<p>We interviewed teams who migrated from OpsGenie. These are the tools they picked and why.</p>
<p>Disclosure: Runframe is our product; it's included alongside other options for completeness.</p>
<p>Pricing note (checked 2026-03-05): prices below are vendor-published list prices where available. Quote-based vendors vary by contract; always verify on the vendor pricing page before purchase.</p>
<h3 id="1-runframe">1. Runframe</h3>
<p>Runframe is Slack-native incident management + on-call built for coordination during incidents (not just alerting).</p>
<p><strong>Best fit if:</strong></p>
<ul>
<li>Incidents live in Slack and you want incident + on-call in one workflow</li>
<li>You want simple primary+backup escalation and clean handoffs</li>
<li>You care about audit-friendly timelines and post-incident reviews</li>
<li>You want self-serve setup measured in days, not quarters</li>
</ul>
<p><strong>Not a fit if:</strong></p>
<ul>
<li>You need full ITSM (requests/change/asset) inside Jira</li>
<li>You require complex enterprise telephony/global routing on day 1</li>
</ul>
<p><strong>Pricing:</strong> Free plan. $15/user/month, or $12/user/month billed annually. No add-ons. No "contact sales." <a href="/pricing">See pricing</a>.</p>
<p><strong>Setup time:</strong> 2-3 days self-serve.</p>
<p><a href="/auth?mode=signup">Start with Runframe</a></p>
<p><strong>OpsGenie → Runframe mapping (10-minute mental model):</strong></p>
<ul>
<li>OpsGenie Teams → Runframe Teams</li>
<li>Schedules / Rotations → Runframe On-call Rotations (primary + backup)</li>
<li>Escalation Policies → Runframe Escalation Rules (time-based steps)</li>
<li>Integrations → Runframe Integrations / Webhooks</li>
<li>Routing Rules → Runframe Routing Rules (service + severity aware)</li>
</ul>
<p>If you're migrating, start by recreating rotations + escalation rules first. Then rewire integrations.</p>
<h3 id="2-incidentio">2. incident.io</h3>
<p>Incident management platform with on-call scheduling and Slack integration.</p>
<p>incident.io focuses on incident management and on-call with Slack integration. The product includes incident workflows, status pages, and postmortem templates.</p>
<p><strong>Pricing:</strong> (from incident.io pricing page)</p>
<ul>
<li>Basic: Free (includes single-team on-call)</li>
<li>Team: $15/user/month (annual) or $19/user/month (monthly) for incident response</li>
<li>Team on-call add-on: +$10/user/month (annual) or +$12/user/month (monthly)</li>
<li>Pro: $25/user/month for incident response + $20/user/month for on-call</li>
<li>Enterprise: Custom</li>
</ul>
<p><strong>Setup time:</strong> 1-2 weeks</p>
<h3 id="3-grafana-oncall">3. Grafana OnCall</h3>
<p>Open-source alerting and on-call, now part of Grafana Cloud IRM.</p>
<p>Grafana OnCall started as open-source with full control via self-hosting. The OSS version entered maintenance mode on March 11, 2025 and will be archived on March 24, 2026. Grafana Cloud IRM (managed) continues development.</p>
<p><strong>Pricing:</strong> (Grafana Cloud IRM)</p>
<ul>
<li>Free: 3 active IRM users included</li>
<li>Pro: $19/month platform fee (includes 3 active IRM users) + $20/month per additional active IRM user</li>
<li>Enterprise: Custom (minimum annual commit applies)</li>
<li>OSS self-hosted: Free (maintenance mode; will be archived March 24, 2026)</li>
</ul>
<p><strong>Setup time:</strong> 1-2 weeks (more technical for self-hosted)</p>
<h3 id="4-pagerduty">4. PagerDuty</h3>
<p>Enterprise incident management with comprehensive features and complex workflows.</p>
<p>PagerDuty is the established enterprise player. Comprehensive feature set, strong compliance, extensive integrations. Configuration can be complex. Pricing scales quickly with add-ons.</p>
<p><strong>Pricing:</strong> (list prices; check billing terms on vendor site)</p>
<ul>
<li>Free: Up to 5 users</li>
<li>Professional: $21/user/month</li>
<li>Business: $41/user/month</li>
<li>Enterprise: Custom</li>
</ul>
<p><strong>Setup time:</strong> Weeks to months depending on complexity</p>
<h3 id="5-squadcast">5. Squadcast</h3>
<p>Mid-market incident management with balanced features and complexity.</p>
<p>Squadcast positions between simple tools and enterprise platforms. Good feature coverage without overwhelming configuration. Competitive pricing for mid-sized teams.</p>
<p><strong>Pricing:</strong></p>
<ul>
<li>Free: Up to 5 users</li>
<li>Pro: $9/user/month (annual) or $12/user/month (monthly)</li>
<li>Premium: $16/user/month (annual) or $19/user/month (monthly)</li>
<li>Enterprise: $21/user/month (annual) or $26/user/month (monthly)</li>
</ul>
<p><strong>Setup time:</strong> 1-2 weeks</p>
<h3 id="6-splunk-on-call">6. Splunk On-Call</h3>
<p>Enterprise incident management (formerly VictorOps) in the Splunk ecosystem.</p>
<p>Splunk On-Call brings incident management into Splunk observability. Strong for teams already using Splunk. Enterprise workflows and complex escalation rules.</p>
<p><strong>Pricing:</strong> Varies by package and contract (contact for quote)</p>
<p><strong>Setup time:</strong> Weeks</p>
<h3 id="7-firehydrant">7. FireHydrant</h3>
<p>Reliability-focused incident management with premium positioning.</p>
<p>FireHydrant positions as "upgrading, not replacing" incident management. Focus on reliability engineering, incident learning, and <a href="/blog/post-incident-review-template">post-incident review</a> processes.</p>
<p><strong>Pricing:</strong> (from FireHydrant pricing page)</p>
<ul>
<li>Platform Pro: $9,600 per year (up to 20 responders)</li>
<li>Enterprise: Custom</li>
</ul>
<p><strong>Setup time:</strong> Weeks</p>
<h2 id="opsgenie-vs-pagerduty-vs-incidentio-migration-cost-comparison">OpsGenie vs PagerDuty vs incident.io: Migration Cost Comparison</h2>
<p>What does it actually cost to migrate from OpsGenie? Here's real math for a 20-person engineering team.</p>
<h3 id="total-migration-costs">Total migration costs</h3>
<p><strong>One-time migration costs:</strong></p>
<ul>
<li>Schedule rebuilding: 20-40 engineering hours ($4,000-8,000 at $200/hr loaded cost)</li>
<li>Integration rewiring: 10-20 hours ($2,000-4,000)</li>
<li>Testing and training: 10-15 hours ($2,000-3,000)</li>
<li><strong>Total one-time: $8,000-15,000</strong></li>
</ul>
<p><strong>Monthly subscription costs (20 users):</strong></p>
<ul>
<li>Runframe: $240/month (annual) or $300/month (monthly) ($12-15 per user/month). Free plan available.</li>
<li>incident.io Team + on-call: $500/month (annual) or $620/month (monthly) ($25-31 per user/month)</li>
<li>PagerDuty Professional: ~$420/month ($21 per user)</li>
<li>Squadcast Pro: $180-240/month ($9-12 per user)</li>
<li>Squadcast Premium: $320-380/month ($16-19 per user)</li>
</ul>
<p><strong>Annualized costs (20 users):</strong></p>
<ul>
<li>Runframe: $2,880/year (annual) or $3,600/year (monthly)</li>
<li>incident.io: $6,000/year (annual billing) or ~$7,440/year (monthly billing)</li>
<li>PagerDuty Professional: $5,040/year</li>
<li>PagerDuty Business: $9,840/year</li>
<li>Squadcast Pro: $2,160-2,880/year</li>
<li>Squadcast Premium: $3,840-4,560/year</li>
</ul>
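<p>The annualized figures above are simple to reproduce. Here's a minimal sketch of the math, using the list prices quoted in this post (verify current rates on each vendor's pricing page before budgeting):</p>

```python
# Reproduces the annualized-cost math above for a 20-person team.
# Rates are the list prices quoted in this post; verify on vendor pages.
TEAM_SIZE = 20

# (annual-billing rate, monthly-billing rate) in $/user/month
RATES = {
    "Runframe": (12, 15),
    "PagerDuty Professional": (21, 21),
    "Squadcast Pro": (9, 12),
}

for tool, (annual_rate, monthly_rate) in RATES.items():
    print(f"{tool}: ${annual_rate * TEAM_SIZE * 12:,}/yr (annual) "
          f"or ${monthly_rate * TEAM_SIZE * 12:,}/yr (monthly)")
```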
<h3 id="hidden-costs-teams-missed">Hidden costs teams missed</h3>
<p>From our interviews, teams underestimated these:</p>
<p><strong>Integration gaps.</strong> Teams reported significant integration rebuild costs when direct replacements didn't exist (often 5–15 engineer-days total, depending on complexity).</p>
<p><strong>Training time.</strong> Some teams reported 2-3 incidents in the first month after skipping training. Training investment: 2 hours per engineer ($8,000 for 20 people at the $200/hr loaded cost used above).</p>
<p><strong>Parallel run period.</strong> Running both tools for 4-8 weeks costs one extra month of subscription. For incident.io Team + on-call (monthly billing), that's ~$620; for PagerDuty Professional, ~$420. Worth it to avoid incidents.</p>
<p><strong>Re-migration.</strong> One team chose the cheapest tool and re-migrated 3 months later. Double all costs above.</p>
<h3 id="what-successful-teams-did">What successful teams did</h3>
<p>Teams who migrated well budgeted 2-3x their initial estimate. They included training time, parallel run costs, and buffer for integration gaps.</p>
<p>Teams reported wide variance in total costs depending on approach: those who planned thoroughly and ran parallel systems spent significantly less than teams who rushed migration and had to re-migrate.</p>
<h2 id="how-to-migrate-from-opsgenie-30-day-plan">How to Migrate from OpsGenie: 30-Day Plan</h2>
<p>Three teams in our research migrated from OpsGenie. Here's a realistic timeline based on what worked.</p>
<p>Simple setups: 4-8 weeks. Complex setups (20+ integrations, layered rotations): 8-16 weeks. This 30-day plan gets you started and reduces risk.</p>
<h3 id="week-1-audit-and-export">Week 1: Audit and export</h3>
<p><strong>Days 1-2: Complete inventory</strong></p>
<p>List everything:</p>
<ul>
<li>All integrations (teams had 5-30)</li>
<li><a href="/blog/incident-response-playbook">Escalation policies</a> - document logic, not just rules</li>
<li><a href="/blog/on-call-rotation-guide">On-call rotations</a> including primary, backup, layers</li>
<li>Custom routing rules</li>
<li>Users and roles</li>
<li>Notification preferences (SMS, email, Slack)</li>
</ul>
<p><strong>Days 3-5: Export everything</strong></p>
<p>Export:</p>
<ul>
<li>On-call schedules (CSV or API)</li>
<li>User list and roles</li>
<li>Integration configurations</li>
<li>Escalation paths and policies</li>
<li>Custom alert routing rules</li>
</ul>
<p>Teams warned us: CSV exports don't import cleanly. Budget 1-2 weeks to rebuild schedules manually.</p>
<p><strong>Days 6-7: Choose replacement</strong></p>
<p>Start trials with 2-3 tools. Test with real scenarios, not demos. Look at alternatives above and evaluate based on actual needs.</p>
<h3 id="week-2-setup-and-configure">Week 2: Setup and configure</h3>
<p><strong>Days 8-10: Recreate core structure</strong></p>
<p>Set up:</p>
<ul>
<li>Users and roles</li>
<li>On-call schedules (hardest part per interviews)</li>
<li>Escalation policies</li>
</ul>
<p><strong>Days 11-14: Rewire integrations</strong></p>
<p>Start with critical integrations. Test alert routing. Verify Slack, email, SMS delivery.</p>
<p><strong>Tip from teams:</strong> Some integrations won't have direct replacements. Budget time to rebuild from scratch.</p>
<h3 id="week-3-test-and-train">Week 3: Test and train</h3>
<p><strong>Days 15-17: Run parallel</strong></p>
<p>Keep OpsGenie active. Route test alerts to new tool. Verify all paths work. Don't assume. Test.</p>
<p><strong>Days 18-21: Team training</strong></p>
<p>Run mock incidents. Train on <a href="/blog/on-call-rotation-guide">on-call handoffs</a>. Document new processes. Get feedback from on-call engineers.</p>
<p>Teams we spoke to who skipped this were more likely to have incidents during cutover.</p>
<h3 id="week-4-cutover">Week 4: Cutover</h3>
<p><strong>Days 22-25: Soft launch</strong></p>
<p>Route 50% of alerts to new tool. Monitor for issues. Be ready to roll back.</p>
<p><strong>Days 26-28: Full cutover</strong></p>
<p>Route 100% of alerts. Keep OpsGenie active 1 week as safety net.</p>
<p><strong>Days 29-30: Decommission</strong></p>
<p>Verify all integrations switched. Cancel OpsGenie access. Archive old data if needed.</p>
<h3 id="what-worked-for-successful-teams">What worked for successful teams</h3>
<p>From interviews, teams who succeeded did this:</p>
<ol>
<li><strong>Test with real incidents before full cutover.</strong> Teams we spoke to who skipped this were more likely to have issues during cutover.</li>
<li><strong>Don't underestimate schedule migration.</strong> Top complaint from interviews.</li>
<li><strong>Run parallel for at least 1 week.</strong> Teams we spoke to who cut over immediately were more likely to encounter incidents.</li>
<li><strong>Document everything as you go.</strong> You'll forget why you set up rules certain ways.</li>
</ol>
<h2 id="additional-considerations-coordination-vs-alerting">Additional Considerations: Coordination vs Alerting</h2>
<p>This framework reflects how some teams evaluate alternatives beyond feature checklists.</p>
<h2 id="why-coordination-beats-alerting-in-incident-management">Why Coordination Beats Alerting in Incident Management</h2>
<p>Most tools above handle alerting well; the differentiator is how they help teams coordinate during incidents.</p>
<p>The real problem is coordination. In our interviews and the analysis behind our <a href="/blog/how-to-reduce-mttr">MTTR research</a> with 25+ engineering teams, teams wasted 40+ minutes per incident on coordination overhead.</p>
<h3 id="the-coordination-problem">The coordination problem</h3>
<p>Most teams migrated to reduce MTTR. But switching tools didn't help because the problem wasn't alerting. It was coordination.</p>
<p><strong>Coordination means:</strong></p>
<ul>
<li>Knowing who's doing what in real time</li>
<li>Status updates without bugging on-call engineers</li>
<li>Stakeholder comms that don't interrupt response</li>
<li>Context in one place, not scattered across tools</li>
</ul>
<p><strong>Alerting means:</strong></p>
<ul>
<li>Phone rings</li>
<li>Someone acknowledges</li>
<li>Incident created</li>
</ul>
<p>Every tool does alerting. Not every tool does coordination.</p>
<h3 id="context-switching-kills-mttr">Context switching kills MTTR</h3>
<p>Teams with lowest MTTR in our research had one thing in common: minimal context switching during incidents.</p>
<p>If your incident tool lives outside Slack, you're context switching. If status updates require bugging on-call engineers, you're creating friction. If stakeholders can't self-serve status, you're creating noise.</p>
<h3 id="what-to-look-for-when-evaluating-opsgenie-alternatives">What to look for when evaluating OpsGenie alternatives</h3>
<p>Ask these questions:</p>
<ol>
<li>Does it unify incident context in one place? Not scattered across tools.</li>
<li>Is Slack integration native or bolted on? Big difference.</li>
<li>Can stakeholders see status without bugging on-call engineers?</li>
<li>Does it reduce context switching or add more tools?</li>
</ol>
<p>The tool that answers these questions well is the one that actually reduces MTTR.</p>
<p>Read our <a href="/blog/how-to-reduce-mttr">coordination framework</a> for complete data and <a href="/blog/incident-severity-levels">incident severity level</a> guidelines.</p>
<h2 id="faq-opsgenie-migration">FAQ: OpsGenie Migration</h2>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    When is OpsGenie shutting down?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    OpsGenie fully shuts down April 5, 2027. New sales stopped June 4, 2025. Many teams are migrating in 2025-2026 to avoid a last-minute rush.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Can I export on-call schedules from OpsGenie?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Yes, but it's painful. Export via API or CSV, but the format doesn't import cleanly into most tools. Most teams rebuilt schedules manually (1-2 weeks for complex rotations).
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What's replacing OpsGenie at Atlassian?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Atlassian offers two paths: Jira Service Management (JSM) for IT operations, incident, and change management, or Compass for alerting and on-call plus a software catalog. Some teams choose third-party alternatives rather than navigating this choice.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    How long does OpsGenie migration take?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Based on interviews: 4-8 weeks for simple setups (under 10 integrations, basic schedules). Complex setups (20+ integrations, layered rotations) took 8-16 weeks. Everyone underestimated timeline.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    OpsGenie vs PagerDuty: which is better for migration?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Depends on team size. For teams under 50 engineers, smaller tools (Runframe, incident.io, Squadcast) offer better simplicity and pricing. For 100+ engineers with enterprise requirements, PagerDuty complexity may be justified.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What's the best free OpsGenie alternative?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Grafana OnCall self-hosted was the best free option but entered maintenance mode March 11, 2025 and will be archived on March 24, 2026. Grafana Cloud IRM includes 3 active IRM users free; Pro adds a $19/month platform fee + $20/month per additional active IRM user. incident.io offers a free Basic tier with single-team on-call. Runframe offers a free plan with no user limit on core features. For production use, most tools require paid plans.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    How much does it cost to replace OpsGenie?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Most alternatives cost $10-55 per user/month for full incident + on-call. Runframe: $12-15/user/month (free plan available). incident.io Team + on-call: $25/user/month (annual) or $31/user/month (monthly). Mid-market tools like Squadcast: ~$12-26/user/month. Enterprise options like PagerDuty: ~$21+/user/month (plan-dependent). For 20 people: $200-600/month for mid-market, $500-1,200+/month for enterprise.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Should I migrate to JSM or Compass instead of third-party tools?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Choose JSM if you need ITSM workflows (change management, service portals, asset tracking) and are already invested in Jira. Choose Compass if you want alerting and on-call without ITSM overhead and value a software catalog. Some teams in our research chose third-party alternatives for simpler tooling, lower cost, or Slack-native workflows.
  </div>
</details>
<h2 id="start-your-opsgenie-migration">Start Your OpsGenie Migration</h2>
<p>OpsGenie support ends April 5, 2027. Teams who migrate successfully start planning early. They choose based on coordination needs, not just alerting features. They budget 2-3x longer than expected. They test thoroughly before cutover. They run parallel systems before switching.</p>
<p>Starting early with audit + parallel run reduces cutover incidents.</p>
<h3 id="more-incident-management-resources">More incident management resources</h3>
<ul>
<li><a href="/blog/best-opsgenie-alternatives">Best OpsGenie Alternatives in 2026: What Teams Actually Switch To</a></li>
<li><a href="/blog/best-pagerduty-alternatives">Best PagerDuty Alternatives in 2026: The Honest Guide</a></li>
<li><a href="/blog/scaling-incident-management">Scaling Incident Management: Research from 25+ Teams</a></li>
<li><a href="/blog/how-to-reduce-mttr">How to Reduce MTTR: The Coordination Framework</a></li>
<li><a href="/blog/state-of-incident-management-2025">State of Incident Management 2025: The AI Paradox</a></li>
<li><a href="/blog/incident-severity-levels">Incident Severity Levels Framework</a></li>
<li><a href="/blog/on-call-rotation-guide">On-Call Rotation Guide: Primary, Backup, Escalation</a></li>
<li><a href="/blog/post-incident-review-template">Post-Incident Review Templates</a></li>
<li><a href="/blog/incident-response-playbook">Incident Response Playbook: Scripts and Roles</a></li>
</ul>
<p><strong>Research sources:</strong></p>
<ul>
<li>Interviews with 25+ engineering teams (3 actively using OpsGenie)</li>
<li>Pricing sources (checked 2026-03-05): <a href="https://incident.io/pricing" target="_blank" rel="noopener noreferrer">incident.io</a>, <a href="https://grafana.com/products/cloud/irm/" target="_blank" rel="noopener noreferrer">Grafana Cloud IRM</a>, <a href="https://www.pagerduty.com/pricing/" target="_blank" rel="noopener noreferrer">PagerDuty</a>, <a href="https://www.squadcast.com/pricing" target="_blank" rel="noopener noreferrer">Squadcast</a>, <a href="https://firehydrant.com/pricing/" target="_blank" rel="noopener noreferrer">FireHydrant</a> (quote-based vendors still vary by contract)</li>
<li>Official announcements: <a href="https://www.atlassian.com/software/opsgenie/migration" target="_blank" rel="noopener noreferrer">OpsGenie migration</a>, <a href="https://grafana.com/blog/grafana-oncall-maintenance-mode/" target="_blank" rel="noopener noreferrer">Grafana OnCall maintenance</a></li>
</ul>

]]></content:encoded>
      <pubDate>Fri, 23 Jan 2026 00:00:00 GMT</pubDate>
      <author>Niketa Sharma</author>
      <category><![CDATA[opsgenie]]></category>
      <category><![CDATA[opsgenie-alternatives]]></category>
      <category><![CDATA[opsgenie-migration]]></category>
      <category><![CDATA[opsgenie-shutdown]]></category>
      <category><![CDATA[incident-management]]></category>
      <category><![CDATA[on-call]]></category>
      <category><![CDATA[pagerduty]]></category>
      <category><![CDATA[incident-io]]></category>
      <category><![CDATA[squadcast]]></category>
      <category><![CDATA[firehydrant]]></category>
      <category><![CDATA[opsgenie-vs-pagerduty]]></category>
      <category><![CDATA[migrate-from-opsgenie]]></category>
    </item>
    <item>
      <title><![CDATA[How to Reduce MTTR in 2026: The Coordination Framework]]></title>
      <link>https://runframe.io/blog/how-to-reduce-mttr</link>
      <guid>https://runframe.io/blog/how-to-reduce-mttr</guid>
      <description><![CDATA[Every engineering leader has been there. Phone rings at 2 AM. Something's down.
The question running through your head: How long until we're back?
Not "What's broken?" Not "Who's on-call?"
"How long i...]]></description>
      <content:encoded><![CDATA[<p>Every engineering leader has been there. Phone rings at 2 AM. Something's down.</p>
<p>The question running through your head: <strong>How long until we're back?</strong></p>
<p>Not "What's broken?" Not "Who's <a href="/blog/on-call-rotation-guide" target="_blank">on-call</a>?"</p>
<p><strong>"How long is this going to hurt?"</strong></p>
<p>Teams that can answer that question with confidence? They sleep better.</p>
<p>Teams that can't? They're guessing. And guessing is stressful.</p>
<p>MTTR isn't a vanity metric. <strong>It's what lets you answer the 2 AM question without guessing.</strong></p>
<p>Here's what most teams get wrong: they focus on debugging faster, but the biggest wins come from detecting incidents sooner and coordinating cleaner.</p>

<h2 id="why-this-isn39t-another-quot10-tips-to-reduce-mttrquot-article">Why This Isn't Another "10 Tips to Reduce MTTR" Article</h2>
<p>Googling "how to reduce MTTR" gives you hundreds of articles with the same generic advice:</p>
<ul>
<li>"Improve your monitoring"</li>
<li>"Have runbooks"</li>
<li>"Assign clear roles"</li>
<li>"Learn from incidents"</li>
</ul>
<p>This advice isn't wrong. <strong>It's just incomplete without context.</strong></p>
<p>Generic advice assumes every team is at the same stage. But a 15-person startup doesn't need the same thing as an 80-person scale-up.</p>
<p>This article isn't 10 generic tips. It's about which problems actually matter at YOUR stage, and which ones you can ignore.</p>

<h2 id="the-three-types-of-teams-and-which-one-you-want-to-be">The Three Types of Teams (And Which One You Want to Be)</h2>
<p>Based on our conversations with 25+ engineering teams, we see the same three patterns over and over.</p>
<h3 id="type-a-quotwe39re-too-small-to-track-metricsquot">Type A: "We're Too Small to Track Metrics"</h3>
<p><strong>What they say:</strong></p>
<blockquote>
<p>"We're 20 people. We have like 3 incidents a month. Why do I need another metric to track? I know when things are broken."</p>
</blockquote>
<p><strong>What actually happens:</strong></p>
<ul>
<li>Incident happens at 11 PM on a Friday</li>
<li>No idea if this is normal or "really bad"</li>
<li>Customer asking "when will this be fixed?" and you're guessing</li>
<li>Post-incident, someone asks "how long was that?" and nobody knows for sure</li>
</ul>
<p><strong>The problem:</strong> You're flying blind. Every incident feels like a crisis because you have no baseline.</p>
<p><strong>What we tell them:</strong> You don't measure MTTR to impress your board. You measure it so that when things break at 2 AM, you can say "We'll be back in ~45–60 minutes" and actually mean it.</p>
<p>A common effect: once teams know their baseline, incidents feel less like panic and more like routine execution.</p>
<h3 id="type-b-the-quotyeah-like-2-hoursquot-crew">Type B: The "Yeah, Like 2 Hours?" Crew</h3>
<p><strong>What they say:</strong></p>
<blockquote>
<p>"We track incidents. I mean, we know roughly how long things take."</p>
</blockquote>
<p><strong>What actually happens:</strong></p>
<ul>
<li>Someone asks "What was MTTR last month?"</li>
<li>Response: "Uh, like 2 hours? Maybe?"</li>
<li>Or someone spending hours calculating it from logs and tickets</li>
</ul>
<p><strong>The problem:</strong> If you need a person to calculate MTTR, you don't have MTTR; you have manual reporting.</p>
<h3 id="type-c-the-quotour-process-is-making-everyone-miserablequot-trap">Type C: The "Our Process Is Making Everyone Miserable" Trap</h3>
<p><strong>What they say:</strong></p>
<blockquote>
<p>"We have a mature incident process. MTTR is part of our quarterly goals."</p>
</blockquote>
<p><strong>What actually happens:</strong></p>
<ul>
<li>12-field incident forms that nobody fills out properly</li>
<li>Incident review meetings where people justify why something took 4 hours instead of 3</li>
<li>Teams stop declaring incidents to avoid "hurting the metrics"</li>
</ul>
<p><strong>The problem:</strong> If your incident process adds more work than it removes, engineers will route around it (and your data becomes fiction).</p>
<p><strong>What we tell them:</strong> Your MTTR process should be invisible. If engineers are thinking "ugh, now I have to do the incident paperwork," you've failed.</p>
<h2 id="so-what-actually-works">So What Actually Works?</h2>
<p>Fast teams do these three things:</p>
<h3 id="1-measure-mttr-from-day-one-even-if-you39re-small">1. Measure MTTR From Day One (Even If You're Small)</h3>
<p><strong>Why:</strong> Confidence, not metrics</p>
<p>When you're 15 people and having 3 incidents a month, knowing your average MTTR means:</p>
<ul>
<li>New incident happens → You know if this is normal or "oh shit, this is bad"</li>
<li>Customers ask "when will this be fixed?" → You can give a real answer, not a guess</li>
<li>Post-incident review → You have data, not feelings</li>
</ul>
<p><strong>How simple can it be?</strong></p>
<pre><code>Incident #23: API outage
Declared: 2:34 PM
Resolved: 3:19 PM
MTTR: 45 minutes
</code></pre>
<p>That's it. You don't need a dashboard. You need a spreadsheet to start.</p>
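<p>If you'd rather script it than use a spreadsheet, the entire calculation is one timestamp subtraction. A hypothetical sketch (not any particular tool's API):</p>

```python
from datetime import datetime

def mttr_minutes(declared: str, resolved: str) -> float:
    """Minutes from incident declaration to resolution."""
    fmt = "%Y-%m-%d %H:%M"
    delta = datetime.strptime(resolved, fmt) - datetime.strptime(declared, fmt)
    return delta.total_seconds() / 60

# Incident #23 above: declared 2:34 PM, resolved 3:19 PM
print(mttr_minutes("2026-01-23 14:34", "2026-01-23 15:19"))  # → 45.0
```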
<h3 id="2-make-it-automatic-no-manual-work-allowed">2. Make It Automatic (No Manual Work Allowed)</h3>
<p><strong>The rule:</strong> If an engineer has to manually enter data to track MTTR, your process is too expensive.</p>
<p><strong>What works:</strong></p>
<ul>
<li>Incident declared → Timestamp auto-recorded</li>
<li>Incident resolved → Timestamp auto-recorded</li>
<li>MTTR = Calculated automatically</li>
</ul>
<h3 id="3-keep-the-process-lightweight">3. Keep the Process Lightweight</h3>
<p><strong>The trap:</strong> You start with good intentions ("let's track some useful data") and end up with a 12-field incident form.</p>
<p><strong>Minimal required fields:</strong></p>
<ul>
<li>Incident title</li>
<li>Severity (P0/P1/P2)</li>
<li>Assigned to</li>
<li>Status (Investigating / Identified / Monitoring / Resolved)</li>
</ul>
<p><strong>Everything else is optional.</strong></p>
<p>If you make 12 things required, engineers will either hate you or put garbage in half the fields. Keep the required fields tiny. Collect the rest later if needed.</p>

<h2 id="the-mttr-math-nobody-talks-about">The MTTR Math Nobody Talks About</h2>
<p>MTTR isn't one thing. It's three:</p>
<p><strong>Time to Detect:</strong> Incident happens → You notice (also called <a href="/learn/mttd" target="_blank">MTTD</a>)<br /><strong>Time to Coordinate:</strong> You notice → Right people working on it<br /><strong>Time to Fix:</strong> Start debugging → Service restored</p>
<p><strong>Total MTTR = Detection + Coordination + Fixing</strong></p>
<p><img src="/images/articles/how-to-reduce-mttr/incident-lifecycle-mttr-timeline.png" alt="The Anatomy of an Incident: Incident Lifecycle Timeline" /></p>
<h3 id="stop-the-spreadsheet-toil">Stop the Spreadsheet Toil</h3>
<p>Don't calculate these metrics by hand. Use our <a href="/tools/mttr-calculator" target="_blank">Free MTTR &amp; Reliability Calculator</a> to get your P50 and P95 benchmarks instantly.</p>
<h3 id="here39s-the-insight-most-teams-miss">Here's the insight most teams miss:</h3>
<p>Most teams optimize "Time to Fix" (better debugging, faster deploys).</p>
<p>But the fastest teams? They optimize <strong>Detection</strong> and <strong>Coordination</strong> first.</p>
<h3 id="why">Why:</h3>
<ul>
<li>Better alerting (detect 10 min faster) = 10 min saved</li>
<li>Clear roles + dedicated channel (coordinate 8 min faster) = 8 min saved</li>
<li>Faster debugging (fix 5 min faster) = 5 min saved</li>
</ul>
<p><strong>The math:</strong> Improve detection + coordination = 18 minutes saved per incident. Improve debugging = 5 minutes saved.</p>

<h2 id="how-teams-actually-reduce-mttr">How Teams Actually Reduce MTTR</h2>
<table>
  <caption>Comparison of MTTR reduction approaches showing time saved, effort required, and recommended priority</caption>
  <thead>
    <tr>
      <th>Approach</th>
      <th>Time Saved</th>
      <th>Effort</th>
      <th>When to Do It</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Faster Detection</strong></td>
      <td>10-20 min/incident</td>
      <td>Low</td>
      <td>Do first - biggest ROI</td>
    </tr>
    <tr>
      <td><strong>Better Coordination</strong></td>
      <td>8-15 min/incident</td>
      <td>Low</td>
      <td>Do second - cheap wins</td>
    </tr>
    <tr>
      <td><strong>Faster Debugging</strong></td>
      <td>5-10 min/incident</td>
      <td>High</td>
      <td>Do last - hardest to improve</td>
    </tr>
    <tr>
      <td>Add more tooling</td>
      <td>-5 min (slower!)</td>
      <td>Medium</td>
      <td>Avoid - adds coordination tax</td>
    </tr>
  </tbody>
</table>

<p>Teams that optimize detection + coordination see 20-30% MTTR reduction in 3 months with minimal engineering effort.</p>

<h2 id="the-mttr-trap-why-quotlower-is-betterquot-can-be-a-lie">The MTTR Trap: Why "Lower is Better" Can Be a Lie</h2>
<p>If your MTTR is dropping but your customer churn is rising, you have a measurement problem.</p>
<h3 id="the-flaw-aggregating-sev3-minor-and-sev0-catastrophic-incidents">The Flaw: Aggregating SEV3 (minor) and SEV0 (catastrophic) incidents</h3>
<p>When you lump all incidents together, you're averaging apples and oranges. A 2-hour SEV3 (minor feature broken) is completely different from a 2-hour SEV0 (payment processing down).</p>
<p><strong>What happens:</strong> Your overall MTTR looks great because you're closing lots of quick SEV3s. But your SEV0 MTTR could be getting worse, and those are the incidents that actually matter.</p>
<h3 id="the-fix-segment-your-mttr-by-severity">The Fix: Segment your MTTR by Severity</h3>
<p>A 4-hour SEV3 is fine; a 4-hour SEV0 is a business-ending event.</p>
<p>Track these separately:</p>
<ul>
<li><strong>P0 MTTR:</strong> Customer-facing outages (this is what keeps you up at night)</li>
<li><strong>P1 MTTR:</strong> Degraded service (important but not critical)</li>
<li><strong>P2 MTTR:</strong> Minor issues (nice to track, but don't stress about it)</li>
</ul>
<p><strong>The teams that sleep soundly at night?</strong> They know their P0 MTTR is 45 minutes. They don't care that their P2 MTTR is 4 hours.</p>

<h2 id="practical-guide-mttr-by-company-stage">Practical Guide: MTTR by Company Stage</h2>
<h3 id="if-you39re-under-20-people">If You're Under 20 People</h3>
<p><strong>Do this:</strong></p>
<ol>
<li>Start a spreadsheet (yes, really)</li>
<li>Track: Incident #, title, severity, declared time, resolved time, MTTR</li>
<li>Review monthly: "Are we getting faster or slower?"</li>
<li>Track P0 incidents (customer-facing); skip P2s (too much noise)</li>
</ol>
<p>Start with P0 only if you want it even simpler.</p>
<p><strong>Don't do this:</strong></p>
<ul>
<li>Build fancy dashboards</li>
<li>Set MTTR goals (you don't have enough data yet)</li>
</ul>
<p><strong>Goal:</strong> Get enough data to know your baseline. After 20-30 incidents, you'll see patterns.</p>

<h3 id="if-you39re-20-80-people">If You're 20-80 People</h3>
<p><strong>Do this:</strong></p>
<ol>
<li>Move from spreadsheet to an actual tool</li>
<li>Make MTTR tracking automatic (no manual work)</li>
<li>Track by severity: P0 MTTR, P1 MTTR</li>
<li>Look for outliers: "Why did this P0 take 4 hours when average is 45 minutes?"</li>
</ol>
<p><strong>Don't do this:</strong></p>
<ul>
<li>Make engineers fill out 12-field forms</li>
<li>Set arbitrary MTTR reduction goals ("reduce by 20%!")</li>
<li>Game the system by not declaring incidents</li>
</ul>
<p><strong>Goal:</strong> Understand what's driving your MTTR. Is it detection time? Fix time? Coordination issues?</p>

<h3 id="if-you39re-80-people">If You're 80+ People</h3>
<p><strong>Do this:</strong></p>
<ol>
<li>Track MTTR by service (is API slower than frontend?)</li>
<li>Track by time of day (are 2 AM incidents slower?)</li>
<li>Track by incident commander (is everyone getting faster, or just a few people?)</li>
<li>Use MTTR to identify systematic issues, not blame individuals</li>
</ol>
<p><strong>Goal:</strong> MTTR is one input among many. Don't optimize it at the cost of everything else.</p>

<h2 id="what-actually-reduces-mttr-besides-metrics">What Actually Reduces MTTR (Besides Metrics)</h2>
<p>Tracking MTTR doesn't reduce it. <strong>Actions reduce MTTR.</strong></p>
<h3 id="1-faster-detection-not-faster-fixing">1. Faster Detection (Not Faster Fixing)</h3>
<p>Most teams focus on "how do we fix incidents faster?"</p>
<p>But the teams with the best MTTR? They focus on <strong>detecting incidents faster.</strong></p>
<p>A common pattern: the biggest wins come from faster detection and cleaner handoffs, not shaving minutes off debugging.</p>
<p>Without clear severity classification, you can't prioritize detection efforts. Use our <a href="/blog/incident-severity-levels" target="_blank">Incident Severity Matrix</a> to standardize how your team classifies incidents.</p>
<p><strong>What to do:</strong></p>
<ul>
<li>Better alerting (not more alerts, better alerts)</li>
<li><a href="/learn/runbook" target="_blank">Runbooks</a> that say "if this alert fires, check X first"</li>
<li>On-call coverage that's explicit (and tested)</li>
</ul>
<h3 id="2-reduce-coordination-overhead">2. Reduce Coordination Overhead</h3>
<p>You know what kills MTTR? Not the technical fix. The coordination.</p>
<p>The worst incidents aren't the hardest technical problems. They're the ones where three people are debugging the same thing, nobody knows who's doing what, and stakeholders are emailing every 10 minutes asking for updates.</p>
<p>Coordination overhead isn't just an MTTR problem, it's an engineering productivity killer. Read our <a href="/blog/engineering-productivity-incident-management" target="_blank">Engineering Productivity Framework</a> to see how top teams minimize context-switching during incidents.</p>
<p><strong>What to do:</strong></p>
<ul>
<li>Declare incidents properly (create a dedicated channel)</li>
<li>Assign roles (<a href="/learn/incident-commander" target="_blank">incident commander</a>, scribe, technical lead)</li>
<li>Status updates every 30 minutes (even if "still working on it")</li>
<li>One place for updates (not scattered across tools — <a href="/slack">manage incidents directly in Slack</a>)</li>
</ul>

<h3 id="3-have-runbooks-even-simple-ones">3. Have Runbooks (Even Simple Ones)</h3>
<p>Teams with runbooks fix incidents faster.</p>
<p><strong>What to do:</strong></p>
<ul>
<li>Document your top 5 recurring incidents</li>
<li>For each: What to check first, what to check second, who to escalate to</li>
<li>Keep them simple (one page or less)</li>
<li>Update them after incidents (if the runbook was wrong, fix it)</li>
</ul>
<h3 id="4-learn-from-every-incident">4. Learn from Every Incident</h3>
<p>The fastest teams aren't just fixing incidents faster, they're learning from each one to prevent the next.</p>
<p>After the dust settles, run a <a href="/blog/post-incident-review-template" target="_blank">post-incident review</a> to capture what went wrong and what to change. Teams that do this see their MTTR drop 20-30% over 6 months, not because they're debugging faster, but because they're having fewer incidents.</p>

<h2 id="mttr-benchmarks-what39s-typical">MTTR Benchmarks: What's Typical</h2>
<p>Everyone wants to know "what's a good MTTR?"</p>
<p>Based on our <a href="/blog/state-of-incident-management-2025" target="_blank">conversations with 25+ teams</a> (20-180 people, mostly SaaS/fintech), here's what we see directionally:</p>
<table>
  <caption>Typical P0 MTTR ranges by company size based on industry data</caption>
  <thead>
    <tr>
      <th>Company Size</th>
      <th>Typical P0 MTTR Range</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Under 20 people</td>
      <td>30-60 min</td>
    </tr>
    <tr>
      <td>20-80 people</td>
      <td>35-75 min</td>
    </tr>
    <tr>
      <td>80+ people</td>
      <td>40-120 min</td>
    </tr>
  </tbody>
</table>

<p>Source: conversations with 25+ engineering teams (20-180 people, SaaS/fintech).</p>
<h3 id="what-this-means">What this means:</h3>
<ul>
<li>If your P0 MTTR is 90 minutes, you're not "failing"; you might simply have complex systems</li>
<li>If your P0 MTTR is 15 minutes, you're not necessarily "winning"; you might be under-declaring incidents</li>
<li>Use these as sanity checks, not targets</li>
</ul>
<p><strong>The goal isn't to beat benchmarks. The goal is to know YOUR baseline and improve from there.</strong></p>

<h2 id="the-anti-pattern-how-teams-game-mttr">The Anti-Pattern: How Teams Game MTTR</h2>
<p>We've seen teams do things to "improve MTTR" that actually make things worse.</p>
<table>
  <caption>Common ways teams game their MTTR metrics and the negative consequences</caption>
  <thead>
    <tr>
      <th>Gaming the System</th>
      <th>What Happens</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Don't declare P0s to avoid hurting metrics</td>
      <td>Your "improved" MTTR is fake; you're actually slower at real incidents</td>
    </tr>
    <tr>
      <td>Declare incidents as "resolved" when you've just band-aided the fix</td>
      <td>MTTR looks great; recurrence rate explodes</td>
    </tr>
    <tr>
      <td>Exclude "hard" incidents from MTTR calc ("that was an outlier")</td>
      <td>You're lying to yourself about how fast you actually are</td>
    </tr>
    <tr>
      <td>Set impossible MTTR goals ("all P0s must be fixed in 30 min")</td>
      <td>Engineers stop taking incidents seriously because the goals are a joke</td>
    </tr>
  </tbody>
</table>

<p><strong>Do this instead:</strong></p>
<ul>
<li>Track MTTR honestly (include the ugly incidents)</li>
<li>Look at trends, not absolute numbers</li>
<li>Ask "why did this take 4 hours?" not "how do we hit an arbitrary target?"</li>
</ul>

<h2 id="what-good-mttr-tracking-looks-like">What Good MTTR Tracking Looks Like</h2>
<p>Based on teams that do this well, here's the pattern:</p>
<h3 id="automatic-not-manual">Automatic, not manual:</h3>
<ul>
<li>Incident declared → Timestamp auto-recorded</li>
<li>Incident resolved → Timestamp auto-recorded</li>
<li>MTTR calculated → No spreadsheets, no guessing</li>
</ul>
<h3 id="lightweight-process">Lightweight process:</h3>
<ul>
<li>Required fields: Title, severity, owner, status (that's it)</li>
<li>Everything else optional</li>
<li>Engineers actually use it because it's not painful</li>
</ul>
<h3 id="multi-dimensional-analysis">Multi-dimensional analysis:</h3>
<ul>
<li>By service (which systems are slowest?)</li>
<li>By severity (P0 vs P1 vs P2)</li>
<li>By time of day (2 AM vs 2 PM incidents)</li>
</ul>
<p>If your current tool makes engineers hate the process, find a better one.</p>
<p><em>(Disclosure: we're building Runframe. The principles above apply regardless of tool.)</em></p>

<h2 id="what-you-should-do-this-week">What You Should Do This Week</h2>
<h3 id="if-you39re-not-tracking-mttr-at-all">If You're Not Tracking MTTR At All</h3>
<p><strong>Today (15 minutes):</strong></p>
<ol>
<li>Open Google Sheets</li>
<li>Columns: Incident #, Title, Severity, Declared Time, Resolved Time, MTTR</li>
<li>Fill in your last 3 incidents from memory</li>
</ol>
<p><strong>This week:</strong></p>
<ul>
<li>Track the next 5 incidents as they happen</li>
<li>After 5: Look for patterns ("Getting longer? Shorter? All at 2 AM?")</li>
</ul>
<p><strong>This month:</strong></p>
<ul>
<li>After 20 incidents: Calculate median P0 MTTR</li>
<li>That's your baseline</li>
</ul>
<p><strong>Goal:</strong> Stop flying blind.</p>

<h3 id="if-you39re-guessing-or-doing-manual-work">If You're Guessing or Doing Manual Work</h3>
<p><strong>This week:</strong></p>
<ol>
<li>Ask your team: "How much time do we spend calculating MTTR?"</li>
<li>If answer is &gt;30 mins/week → Too expensive</li>
<li>Write a simple script OR evaluate tools</li>
</ol>
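<p>If you go the script route, a sketch like this is often all it takes. It assumes a CSV export from your tracking sheet; the column names (<code>severity</code>, <code>declared_time</code>, <code>resolved_time</code>) are illustrative:</p>

```python
import csv
import statistics
from datetime import datetime

def median_mttr_minutes(path, severity="P0"):
    """Median MTTR in minutes for one severity level.

    Assumes a CSV with columns: title,severity,declared_time,resolved_time
    where the timestamps are ISO 8601 strings (column names are illustrative).
    """
    durations = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row["severity"] != severity:
                continue
            declared = datetime.fromisoformat(row["declared_time"])
            resolved = datetime.fromisoformat(row["resolved_time"])
            durations.append((resolved - declared).total_seconds() / 60)
    return statistics.median(durations) if durations else None
```

<p>Using the median (not the mean) keeps one 6-hour outlier from distorting your baseline.</p>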
<p><strong>Next week:</strong></p>
<ul>
<li>Implement automated tracking</li>
<li>Stop doing manual work</li>
</ul>
<p><strong>Goal:</strong> Free up time to reduce MTTR instead of calculating it.</p>

<h3 id="if-your-process-is-making-everyone-miserable">If Your Process Is Making Everyone Miserable</h3>
<p><strong>Today:</strong></p>
<ol>
<li>Ask engineers: "What's the most annoying part of our incident process?"</li>
<li>List the top 3 annoyances</li>
</ol>
<p><strong>This week:</strong></p>
<ul>
<li>Remove 1 required field from incident form</li>
<li>Or: Cut incident review meeting from 60 min → 30 min</li>
<li>Or: Stop asking "why was this 4 hours instead of 3?"</li>
</ul>
<p><strong>This month:</strong></p>
<ul>
<li>Simplify until engineers stop complaining</li>
</ul>
<p><strong>Goal:</strong> Make MTTR tracking invisible, not painful.</p>

<h2 id="faq">FAQ</h2>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Q: What's a "good" MTTR?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    A: It depends on your company size, tech stack, and incident maturity. Based on our conversations with 25+ teams, typical P0 MTTR ranges from 30-60 minutes (smaller teams) to 40-120 minutes (larger teams). But the goal isn't to hit a benchmark, it's to know YOUR baseline and improve from there.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Q: Should I tie MTTR to performance reviews?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    A: No. This incentivizes gaming the system. Use MTTR as a team metric, not an individual one.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Q: What if our MTTR is really high?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    A: First, make sure you're measuring it honestly. Are you including all incidents, or excluding the "bad" ones? Second, figure out what's driving it: is it slow detection? Slow fix time? Coordination issues? Fix the underlying problem, not the number.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Q: Our MTTR varies wildly (20 min to 6 hours). Is that normal?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    A: Yes. MTTR will have outliers. Database corruption taking 6 hours while most incidents take 45 min is expected. Don't optimize for the average. Look at the median and understand the outliers. Ask "why did this take 6 hours?" to learn, not to blame.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Q: Should we track MTTA (Mean Time to Acknowledge) separately?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    A: Only if you have acknowledgment problems. If incidents sit for 10+ minutes before anyone responds, track MTTA. Otherwise, focus on MTTR first.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Q: What's the difference between MTTR and MTBF?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    A: MTTR (Mean Time to Recovery) measures how long it takes to fix incidents. MTBF (Mean Time Between Failures) measures how often incidents happen. Both matter, but MTTR is what customers feel: they don't care how rare outages are if each one lasts 6 hours.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Q: Should we aim for zero downtime or faster recovery?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    A: Both. But if you have to choose: faster recovery. Getting from 4 hours to 45 minutes MTTR is more valuable than reducing monthly incidents from 3 to 2. Customers forgive occasional 45-minute outages. They don't forgive 4-hour ones.
  </div>
</details>

<h2 id="next-steps">Next Steps</h2>
<p><strong>Want a lightweight MTTR template? Reply or DM and I'll share what we use.</strong></p>
<p><em>Runframe is modern incident management for teams that hate enterprise bloat. <a href="https://runframe.io/auth?mode=signup" target="_blank" rel="noopener noreferrer">Get started free</a>.</em></p>
]]></content:encoded>
      <pubDate>Mon, 19 Jan 2026 00:00:00 GMT</pubDate>
      <dc:creator>Niketa Sharma</dc:creator>
      <category><![CDATA[mttr]]></category>
      <category><![CDATA[mean-time-to-recovery]]></category>
      <category><![CDATA[incident-management]]></category>
      <category><![CDATA[startup-growth]]></category>
      <category><![CDATA[sre]]></category>
      <category><![CDATA[devops]]></category>
      <category><![CDATA[mttd]]></category>
      <category><![CDATA[incident-response]]></category>
      <category><![CDATA[engineering-productivity]]></category>
      <category><![CDATA[reduce-mttr]]></category>
      <category><![CDATA[mttr-benchmarks]]></category>
      <category><![CDATA[coordination]]></category>
    </item>
    <item>
      <title><![CDATA[Incident Severity Levels: SEV0–SEV4 Matrix [Free Template]]]></title>
      <link>https://runframe.io/blog/incident-severity-levels</link>
      <guid>https://runframe.io/blog/incident-severity-levels</guid>
      <description><![CDATA[A team told us someone paged the entire org at 3 AM because a dashboard was loading 200ms slower than usual. Meanwhile, actual customer-impacting outages got ignored because "everything is a SEV1."
Wh...]]></description>
      <content:encoded><![CDATA[<p>A team told us someone paged the entire org at 3 AM because a dashboard was loading 200ms slower than usual. Meanwhile, actual customer-impacting outages got ignored because "everything is a SEV1."</p>
<p>When you're scaling from 20 to 200 people, it's tough to get severity levels right the first time. Without clear definitions, every incident feels like a crisis and on-call burns out. Here's what we've seen work across dozens of teams at your stage.</p>
<p>Without clear severity levels, you can't prioritize response. Teams often confuse incident response (fixing fast) with incident management (preventing recurrence). <a href="/blog/incident-management-vs-incident-response">Read our incident management vs incident response guide to see why MTTR alone isn't enough</a>.</p>

<h2 id="tldr">TL;DR</h2>
<ul>
<li>We recommend SEV0-SEV4 (clearer than SEV1-SEV5, but start with what works for you)</li>
<li>SEV0 = catastrophic, SEV1 = core service down, SEV2 = degraded with workaround, SEV3 = minor, SEV4 = proactive</li>
<li>Classify in 30 seconds using: "Is revenue/users impacted? Is there a workaround?"</li>
<li>Consider adding SEV4 for proactive work (teams report it prevents 80% of incidents)</li>
<li>Severity ≠ Priority (severity = impact, priority = fix order)</li>
</ul>

<p><img src="/images/articles/incident-severity-levels/incident-severity-levels-og.webp" alt="Incident Severity Matrix" /></p>

<h2 id="sev0-sev4-the-framework">SEV0-SEV4: The Framework</h2>
<p>We recommend starting at zero, not one. SEV0 = zero room for error—it's more intuitive than SEV1 being your worst case.</p>
<p>That said, if your team is under 50 people, you might start with just 3 levels (SEV1-SEV3) and add SEV0 and SEV4 as you scale. Here's the full framework:</p>
<table>
  <caption>Complete SEV0-SEV4 framework showing impact description, response target time, and who responds for each severity level</caption>
  <thead>
    <tr>
      <th>Severity</th>
      <th>Impact</th>
      <th>Response</th>
      <th>Who</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>SEV0</strong></td>
      <td>Catastrophic. Data loss, security breach, total outage, or critical revenue-impacting failure</td>
      <td>Ack target: 15 min</td>
      <td>War room (IC + core responders; exec notification depends on your org)</td>
    </tr>
    <tr>
      <td><strong>SEV1</strong></td>
      <td>Critical. Core service down for everyone</td>
      <td>Ack target: 30 min</td>
      <td>On-call + backup</td>
    </tr>
    <tr>
      <td><strong>SEV2</strong></td>
      <td>Major. Significant degradation, workaround exists</td>
      <td>Ack target: 1 hour</td>
      <td>On-call</td>
    </tr>
    <tr>
      <td><strong>SEV3</strong></td>
      <td>Minor. Limited impact, business hours fix</td>
      <td>Business hours</td>
      <td>Don't page</td>
    </tr>
    <tr>
      <td><strong>SEV4</strong></td>
      <td>Pre-emptive. Could break, proactive fix</td>
      <td>Backlog</td>
      <td>Owner + due window</td>
    </tr>
  </tbody>
</table>

<p>The difference between SEV1 and SEV2? One question: <strong>Is there a workaround?</strong></p>
<p>Checkout completely broken = SEV1 (no workaround). Search down but category browsing works = SEV2 (workaround exists).</p>
<p>Simple.</p>

<p><strong>What teams at your stage say:</strong></p>
<blockquote>
<p><em>"Start with 3 levels. Don't over-engineer day one. You can always add SEV0 and SEV4 later."</em><br />— CTO, 40-person startup</p>
</blockquote>
<blockquote>
<p><em>"We added SEV4 when we hit 80 people. Prevented 38 out of 47 potential incidents in 6 months."</em><br />— Engineering Manager, Series B SaaS</p>
</blockquote>

<h2 id="why-sev4-matters-and-when-to-add-it">Why SEV4 Matters (And When to Add It)</h2>
<p>Many teams start without SEV4—it can feel like overhead when you're just trying to survive incidents.</p>
<p>"If nothing's broken, why track it?"</p>
<p>Fair question. Here's when it becomes valuable:</p>
<p><strong>If you're under 50 people:</strong> You probably don't need SEV4 yet. Focus on responding to actual incidents first.</p>
<p><strong>When you hit 75-100 people:</strong> This is when SEV4 becomes valuable. You have enough operational maturity to track "could break" work systematically.</p>
<p><strong>What happens without SEV4 at scale:</strong></p>
<p>→ Disk space hits 100% at 2 AM (could have been SEV4 at 80%)<br />→ SSL cert expires, users see security warnings (could have been SEV4 at 30 days)<br />→ Database query gets 10x slower overnight (could have been SEV4 when it hit 2x)</p>
<p>Without SEV4, you're always reacting. Never preventing.</p>
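<p>SEV4 checks can be trivially small. Here's a hedged sketch of two early-warning checks matching the examples above; the thresholds are illustrative, not recommendations:</p>

```python
import shutil

# Illustrative SEV4 thresholds (tune for your environment).
DISK_USAGE_THRESHOLD = 0.80   # flag at 80% full, well before 100%
LATENCY_DEGRADATION = 2.0     # flag when a query hits 2x its baseline

def disk_needs_sev4(path="/"):
    """True when disk usage crosses the early-warning threshold."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total >= DISK_USAGE_THRESHOLD

def query_needs_sev4(baseline_ms, current_ms):
    """True when a query has degraded past the early-warning multiple."""
    return current_ms >= LATENCY_DEGRADATION * baseline_ms
```

<p>Run checks like these on a schedule and file a ticket (owner + due window) when one fires: that's the whole SEV4 workflow.</p>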

<h2 id="what-each-level-means">What Each Level Means</h2>
<h3 id="sev0-the-building-is-on-fire">SEV0: The Building Is On Fire</h3>
<p>Complete outage. Data loss. Security breach. Critical revenue-impacting failure.</p>
<p>Database corrupted? Multi-region outage? Authentication completely broken? Payment processing down?</p>
<p>That's SEV0. Wake everyone. War room. You have 15 minutes.</p>
<p><strong>Real examples:</strong></p>
<ul>
<li>Database corruption with data loss (can't recover from backup)</li>
<li>AWS us-east-1 down AND your backup region failed</li>
<li>Security breach exposing customer data</li>
<li>Authentication completely broken (nobody can log in)</li>
<li>Payment processing down (revenue loss &gt;$10K/hour)</li>
</ul>
<h3 id="sev1-core-service-down">SEV1: Core Service Down</h3>
<p>Major impact but not catastrophic. Core service unavailable for most/all customers, with no workaround.</p>
<p>API totally down. Checkout completely broken. Search gone (if search is a core workflow for your product). Auth intermittent for a meaningful subset of users.</p>
<p>Page on-call immediately. All hands on deck during business hours. 30-minute target.</p>
<p><strong>Real examples:</strong></p>
<ul>
<li>Total API outage (all endpoints returning 500)</li>
<li>Checkout flow completely broken (can't process payments)</li>
<li>Search functionality down (core feature for your product)</li>
<li>Authentication intermittent (meaningful subset of users can't log in)</li>
<li>Severe performance degradation (APIs effectively unusable, not just slow)</li>
</ul>
<h3 id="sev2-significant-but-workaround-exists">SEV2: Significant but Workaround Exists</h3>
<p>Broken but usable. Meaningful subset of customers affected, or core functionality degraded but usable.</p>
<p>Checkout failing for some users? File uploads broken? API materially degraded but responding?</p>
<p>Primary on-call handles it. Don't wake backup. 1-hour target.</p>
<p><strong>Real examples:</strong></p>
<ul>
<li>Checkout failing for some users (payment gateway issue for some cards)</li>
<li>File uploads completely broken (users can't upload, but can use existing files)</li>
<li>API materially degraded but usable (users can still complete key workflows, possibly slower)</li>
<li>Dashboard not loading (users can still use core product)</li>
<li>Single region degradation (multi-region setup, one region struggling)</li>
</ul>
<h3 id="sev3-minor">SEV3: Minor</h3>
<p>Partial failure. Limited impact. Not urgent.</p>
<p>Profile pictures broken. Intermittent errors that auto-recover. Reporting delayed.</p>
<p>Fix during business hours. Don't page on-call. Can wait until morning.</p>
<p><strong>Real examples:</strong></p>
<ul>
<li>Minor feature broken (user profile pictures not displaying)</li>
<li>Intermittent errors that auto-recover (happens a few times/hour, clears itself)</li>
<li>Reporting delay (analytics data not real-time, updates hourly)</li>
<li>Non-critical integration failing (<a href="/slack">Slack incident notifications</a> delayed, email works)</li>
<li>UI polish issues (button misaligned, font wrong)</li>
</ul>
<h3 id="sev4-pre-emptive">SEV4: Pre-emptive</h3>
<p>Nothing broken yet. But something could.</p>
<p>Disk at 80%. SSL expiring soon. Query slowing down. Dependency vulnerability. Monitoring gap.</p>
<p>Create a ticket with an owner + due window (e.g., "this sprint" / "within 30 days"). No page needed.</p>
<p><strong>Real examples:</strong></p>
<ul>
<li>Disk space at 80% (not critical yet, but will be in 2 weeks)</li>
<li>SSL certificate expiring in 30 days</li>
<li>Database query degrading (taking 2x longer, not failed yet)</li>
<li>Dependency vulnerability (CVE in a library, not exploited)</li>
<li>Monitoring gap discovered (no alerting for a critical service)</li>
</ul>

<h2 id="classify-fast-don39t-debate">Classify Fast. Don't Debate.</h2>
<p><strong>Target: 30 seconds to classify.</strong></p>
<p>When you're in the middle of an incident, speed matters more than perfection. If you're debating SEV1 vs SEV2 for 5 minutes while customers wait, just pick one and move on.</p>
<p>Pro tip: Default higher when uncertain. It's easier to downgrade a SEV1 to SEV2 later than explain why you under-classified and delayed response.</p>
<p><strong>Is this catastrophic</strong> (data loss, security breach, total outage)? → SEV0</p>
<p><strong>Is a core workflow blocked for most users?</strong></p>
<ul>
<li>No workaround → SEV1</li>
<li>Workaround exists → SEV2</li>
</ul>
<p><strong>Otherwise:</strong> limited impact → SEV3; not broken yet → SEV4</p>
<p><strong>Tie-breaker:</strong> pick higher, note why, downgrade later.</p>
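<p>The decision flow above fits in a few lines. A hypothetical sketch, with the four questions as boolean inputs:</p>

```python
def classify_severity(catastrophic, core_blocked, workaround_exists, any_impact):
    """Hypothetical 30-second triage mirroring the questions above.

    The tie-breaker (pick higher when unsure, note why, downgrade later)
    is a human judgment call and isn't encoded here.
    """
    if catastrophic:                      # data loss, breach, total outage
        return "SEV0"
    if core_blocked:                      # core workflow blocked for most users
        return "SEV2" if workaround_exists else "SEV1"
    return "SEV3" if any_impact else "SEV4"
```

<p>The point of writing it down this way: if your severity definitions can't be expressed as a handful of yes/no questions, they're too vague to use at 2 AM.</p>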

<h2 id="common-questions-what-we39ve-learned-from-teams-at-your-stage">Common Questions (What We've Learned from Teams at Your Stage)</h2>
<h3 id="quotit39s-2-am-and-i39m-not-sure-if-this-is-sev1-or-sev2quot">"It's 2 AM and I'm not sure if this is SEV1 or SEV2"</h3>
<p>Default SEV1. Assess the situation. Page backup only if blocked or primary hasn't responded within your escalation window.</p>
<p>You can downgrade in the morning. You can't un-break customer trust.</p>
<h3 id="quotonly-5-of-users-are-affected-but-they39re-our-biggest-customersquot">"Only 5% of users are affected, but they're our biggest customers"</h3>
<p>Use your "materially impacted" definition. If those 5% represent 40% of revenue, it's material.</p>
<p>SEV1.</p>
<h3 id="quotthe-bug-is-cosmetic-but-our-ceo-is-freaking-outquot">"The bug is cosmetic but our CEO is freaking out"</h3>
<p>Still SEV3. Severity = customer impact, not internal panic.</p>
<p>But maybe add "Executive visibility" as a separate flag. Some teams use:</p>
<ul>
<li>Severity: SEV3 (minor)</li>
<li>Priority: P1 (fix today)</li>
<li>Visibility: High (CEO watching)</li>
</ul>
<p>This way you fix it fast without training on-call to page for non-issues.</p>
<h3 id="quotwe-fixed-it-in-5-minutes-do-we-still-call-it-sev1quot">"We fixed it in 5 minutes, do we still call it SEV1?"</h3>
<p>Yes. Severity is based on potential impact, not duration.</p>
<p>If the database was completely down (even for 5 minutes), that's SEV1.</p>
<p>Duration doesn't change severity. It goes in MTTR metrics.</p>

<h2 id="what-makes-severity-levels-actually-work">What Makes Severity Levels Actually Work</h2>
<p>The key is specificity.</p>
<p><strong>Vague (doesn't help at 3 AM):</strong> "SEV1 is when something important is broken."</p>
<p><strong>Specific (makes decisions instant):</strong> "SEV1 is when a core service is down for all customers, with no workaround."</p>

<h2 id="frameworks-that-actually-work-choose-based-on-your-size">Frameworks That Actually Work (Choose Based on Your Size)</h2>
<h3 id="startup-starter-20-50-people">Startup Starter (20-50 people)</h3>
<p>Start simple with 3 levels. Add more as you scale.</p>
<table>
  <caption>Starter severity framework for startups with 20-50 people</caption>
  <thead>
    <tr>
      <th>Severity</th>
      <th>Impact</th>
      <th>Response</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>SEV1</td>
      <td>Core service down</td>
      <td>Page everyone</td>
    </tr>
    <tr>
      <td>SEV2</td>
      <td>Degraded but usable</td>
      <td>Page on-call</td>
    </tr>
    <tr>
      <td>SEV3</td>
      <td>Minor, can wait</td>
      <td>Business hours</td>
    </tr>
  </tbody>
</table>

<h3 id="scaling-company-50-150-people">Scaling Company (50-150 people)</h3>
<p>Add SEV0 when catastrophic incidents become possible.</p>
<table>
  <caption>Severity framework for scaling companies of 50-150 people with acknowledgment SLAs</caption>
  <thead>
    <tr>
      <th>Severity</th>
      <th>Impact</th>
      <th>Page Who?</th>
      <th>Ack SLA</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>SEV0</td>
      <td>Catastrophic</td>
      <td>War room</td>
      <td>15 min</td>
    </tr>
    <tr>
      <td>SEV1</td>
      <td>Core service down</td>
      <td>On-call + backup</td>
      <td>30 min</td>
    </tr>
    <tr>
      <td>SEV2</td>
      <td>Significant degradation</td>
      <td>On-call</td>
      <td>1 hour</td>
    </tr>
    <tr>
      <td>SEV3</td>
      <td>Minor issues</td>
      <td>Business hours</td>
      <td>1 day</td>
    </tr>
    <tr>
      <td>SEV4</td>
      <td>Proactive work</td>
      <td>Backlog</td>
      <td>None</td>
    </tr>
  </tbody>
</table>

<h3 id="enterprise-bound-150-people">Enterprise-Bound (150+ people)</h3>
<p>Full framework with war rooms and executive escalation.</p>
<table>
  <caption>Enterprise severity framework for 150+ person organizations with SLAs and escalation</caption>
  <thead>
    <tr>
      <th>Severity</th>
      <th>Impact</th>
      <th>Page Who?</th>
      <th>Ack SLA</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>SEV0</td>
      <td>Catastrophic</td>
      <td>War room</td>
      <td>15 min</td>
    </tr>
    <tr>
      <td>SEV1</td>
      <td>Core service down</td>
      <td>On-call + backup</td>
      <td>30 min</td>
    </tr>
    <tr>
      <td>SEV2</td>
      <td>Significant degradation</td>
      <td>On-call</td>
      <td>1 hour</td>
    </tr>
    <tr>
      <td>SEV3</td>
      <td>Minor issues</td>
      <td>Business hours</td>
      <td>1 day</td>
    </tr>
    <tr>
      <td>SEV4</td>
      <td>Proactive work</td>
      <td>Backlog</td>
      <td>None</td>
    </tr>
  </tbody>
</table>


<h2 id="how-to-evolve-your-severity-levels-as-you-scale">How to Evolve Your Severity Levels as You Scale</h2>
<h3 id="starting-with-sev1-vs-sev0">Starting with SEV1 vs SEV0</h3>
<p><strong>If you're under 50 people:</strong> Starting with SEV1-SEV3 is totally fine. Many teams do this.</p>
<p><strong>As you grow past 100 people:</strong> Consider adding SEV0 for truly catastrophic incidents (data loss, security breaches). "Zero" = zero room for error, which makes the hierarchy more intuitive.</p>
<p><strong>Why it matters:</strong> As your maximum possible blast radius grows, you need a tier above "critical outage" for existential threats.</p>
<h3 id="when-to-add-sev4-proactive-work">When to Add SEV4 (Proactive Work)</h3>
<p><strong>If you're under 50 people:</strong> You probably don't need SEV4 yet. Focus on responding to actual incidents first.</p>
<p><strong>When you hit 75-100 people:</strong> This is when SEV4 becomes valuable. You have enough operational maturity to track "could break" work systematically.</p>
<p><strong>What changes:</strong> Instead of jumping from "everything's fine" to "everything's on fire," you can track warning signs (disk at 80%, SSL expiring soon, query degrading) and fix them before they page someone at 3 AM.</p>
<p>One team added SEV4 at 80 people and prevented 80% of potential incidents over 6 months.</p>
<h3 id="ignoring-business-impact">Ignoring Business Impact</h3>
<p><strong>The problem:</strong> Technical severity ≠ business severity. A "minor" pricing page typo can be catastrophic if it causes chargebacks.</p>
<p><strong>The fix:</strong> Define severity in terms of customer impact and revenue, not technical complexity.</p>

<h2 id="severity-vs-priority">Severity vs Priority</h2>
<p>Teams confuse these constantly.</p>
<p><strong>Severity</strong> = Business impact (doesn't change)<br /><strong>Priority</strong> = Fix order (changes based on context)</p>
<p><strong>Example:</strong></p>
<p>Footer has a typo: "Contact <a href="mailto:sales@compnay.com" target="_blank" rel="noopener noreferrer">sales@compnay.com</a>"</p>
<ul>
<li>Severity: SEV3 (minor impact, users can still email <a href="mailto:sales@company.com" target="_blank" rel="noopener noreferrer">sales@company.com</a> directly)</li>
<li>Priority: P3 (fix this week)</li>
</ul>
<p>BUT: Legal says the wrong email violates our <a href="/blog/sla-vs-slo-vs-sli">contract SLA</a>.</p>
<ul>
<li>Severity: Still SEV3 (customer experience unchanged)</li>
<li>Priority: Now P1 (fix today, legal risk)</li>
</ul>
<p>Severity didn't change. Priority did.</p>
<p><strong>Another example:</strong></p>
<p>Database completely down.</p>
<ul>
<li>Severity: SEV0 (catastrophic)</li>
<li>Priority: P1 (obviously)</li>
</ul>
<p>But your lead DBA is on vacation and backup doesn't know the system.</p>
<ul>
<li>Severity: Still SEV0 (impact unchanged)</li>
<li>Priority: Still P1, but now you escalate to vendor support</li>
</ul>
<p>Severity = "how bad is it?"<br />Priority = "when/how do we fix it?"</p>
<p>Don't conflate them.</p>
<blockquote>
<p><em>"Severity is 'how bad is it?' Priority is 'when do we fix it?' Don't conflate them."</em><br />— Engineering Manager, Series B Healthcare SaaS</p>
</blockquote>
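<p>One way to keep the two from drifting together is to model them as separate fields on the incident record. A minimal sketch with hypothetical field names, replaying the footer-typo example:</p>

```python
from dataclasses import dataclass

@dataclass
class Incident:
    title: str
    severity: str   # customer impact -- set once, rarely changes
    priority: str   # fix order -- changes with context
    visibility: str = "normal"  # e.g. "high" when execs are watching

# Footer typo: minor customer impact, ordinary fix-this-week priority.
typo = Incident("Wrong sales email in footer", severity="SEV3", priority="P3")

# Legal flags a contract SLA risk: priority jumps, severity does not.
typo.priority = "P1"
assert typo.severity == "SEV3"  # impact on customers is unchanged
```

<p>The design point: changing <code>priority</code> never touches <code>severity</code>, so your impact data stays clean for later trend analysis.</p>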

<h2 id="make-it-work-rollout-plan">Make It Work: Rollout Plan</h2>
<h3 id="week-1-start-simple">Week 1: Start Simple</h3>
<p><strong>If you're 20-50 people:</strong> Copy the 3-level version (SEV1-SEV3) and customize examples to your product.</p>
<p><strong>If you're 50-150 people:</strong> Use the 4-level version (SEV0-SEV3 or SEV1-SEV4).</p>
<p><strong>If you're 150+ people:</strong> Go with the full 5-level framework (SEV0-SEV4).</p>
<p>The key is customizing examples to YOUR business. B2B looks different than B2C. Enterprise SaaS looks different than consumer apps.</p>
<h3 id="week-1-get-buy-in">Week 1: Get Buy-In</h3>
<p>Share in Slack. Review in standup.</p>
<p><strong>Most importantly:</strong> Get agreement from the people who'll be woken up at 3 AM.</p>
<p>If on-call hates it, they won't use it.</p>
<blockquote>
<p><em>"The best severity framework is the one your team actually uses. If on-call hates it, they'll ignore it."</em><br />— SRE Manager, 180-person infrastructure company</p>
</blockquote>
<h3 id="weeks-2-5-use-it">Weeks 2-5: Use It</h3>
<p>Classify every incident. Track how it goes.</p>
<h3 id="week-6-iterate">Week 6: Iterate</h3>
<p>After 30 days, ask:</p>
<ul>
<li>Classification debates? → Clarify definitions</li>
<li>SEV3s waking people? → Make "don't page" explicit</li>
<li>SEV4s actually getting fixed? → It's working</li>
</ul>
<p>Expect to adjust 2-3 times in the first 6 months. That's normal.</p>

<h2 id="quick-reference-during-an-incident">Quick Reference: During an Incident</h2>
<p>Q: "Is this SEV1 or SEV2?"<br />A: Can customers work around it? Yes = SEV2. No = SEV1.</p>
<p>Q: "Only 10% of users affected. Still SEV1?"<br />A: Is that 10% material to your business? (Check your definition)</p>
<p>Q: "We fixed it fast. Was it really SEV1?"<br />A: Severity = potential impact, not duration. Yes, still SEV1.</p>
<p>Q: "CEO is panicking but customer impact is minor"<br />A: Severity = customer impact. This is SEV3. (But maybe Priority P1)</p>
<p>Q: "Not sure. What do I do?"<br />A: Default higher. Downgrade later if needed.</p>

<h2 id="faq">FAQ</h2>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Q: SEV0-SEV4 or SEV1-SEV5?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    A: SEV0-SEV4. "Zero" means no room for error. Mature teams use this.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Q: Can't tell if SEV1 or SEV2?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    A: Default higher (SEV1). Easier to downgrade than explain under-classification.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Q: How many levels?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    A: Start with 3-4. Most end up at 5 (SEV0-SEV4).
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Q: Does severity change during an incident?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    A: No. Based on initial impact. If things change dramatically, document it in the postmortem.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Q: Who decides?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    A: Incident commander or first responder. Disagreement? Default higher, resolve in postmortem.
  </div>
</details>

<h2 id="generate-your-framework-in-2-minutes">Generate Your Framework in 2 Minutes</h2>
<p>If you want a copy/paste template, use the <a href="/tools/incident-severity-matrix-generator">severity matrix generator</a>.</p>
<p>Or copy the table from this article and adapt it. Either way, have something defined before your next incident.</p>

<h2 id="next-reads">Next Reads</h2>
<ul>
<li><a href="/blog/sla-vs-slo-vs-sli">SLA vs. SLO vs. SLI: What Actually Matters (With Templates)</a></li>
<li><a href="/blog/incident-response-playbook">Incident Response Playbook: Scripts, Roles &amp; Templates</a></li>
<li><a href="/blog/how-to-reduce-mttr">How to Reduce MTTR in 2026: The Coordination Framework</a></li>
</ul>

]]></content:encoded>
      <pubDate>Sat, 17 Jan 2026 00:00:00 GMT</pubDate>
      <author>Niketa Sharma</author>
      <category><![CDATA[incident-severity]]></category>
      <category><![CDATA[sev0]]></category>
      <category><![CDATA[sev1]]></category>
      <category><![CDATA[sev2]]></category>
      <category><![CDATA[incident-classification]]></category>
      <category><![CDATA[incident-management]]></category>
      <category><![CDATA[sre]]></category>
      <category><![CDATA[devops]]></category>
      <category><![CDATA[severity-levels]]></category>
      <category><![CDATA[sla]]></category>
      <category><![CDATA[incident-response]]></category>
    </item>
    <item>
      <title><![CDATA[Incident Management vs Incident Response: What's the Difference?]]></title>
      <link>https://runframe.io/blog/incident-management-vs-incident-response</link>
      <guid>https://runframe.io/blog/incident-management-vs-incident-response</guid>
      <description><![CDATA[A VP of Engineering at a Series B startup said something that stuck:

"We're pretty good at incident response. Our MTTR is solid, people know what to do when things break. But incident management? Tha...]]></description>
      <content:encoded><![CDATA[<p>A VP of Engineering at a Series B startup said something that stuck:</p>
<blockquote>
<p>"We're pretty good at incident response. Our MTTR is solid, people know what to do when things break. But incident management? That's a mess. We have the same postmortem discussion every month, nothing changes, and I can't tell you the last time we updated our runbook."</p>
</blockquote>
<p><a href="/tools/mttr-calculator">Calculate your MTTR → Free MTTR Calculator</a></p>

<p><strong>Definition: Incident response</strong></p>
<p>One-time, time-bound work during an active incident: declare, coordinate, restore service, and communicate.</p>
<p><strong>Definition: Incident management</strong></p>
<p>Ongoing work across the incident lifecycle: preparedness, runbooks, training, <a href="/learn/post-incident-review">postmortems</a>, and trend analysis to reduce recurrence.</p>

<p>He was describing something that tends to show up as teams scale: <strong>confusing two very different things.</strong></p>
<p>Many teams are fast at fixing things but slow at learning. The same database outage happens every quarter. The runbook is 8 months out of date. Nobody reviews incident trends.</p>
<p>This article explains the difference, why it matters, and how to fix the imbalance in your incident management process.</p>

<p><strong>Contents:</strong></p>
<ul>
<li>The Difference</li>
<li>Why teams confuse them</li>
<li>Failure modes</li>
<li>How to build both</li>
<li>What to focus on first</li>
<li>FAQ</li>
</ul>

<h2 id="the-difference">The Difference</h2>
<table>
  <caption>Side-by-side comparison of incident response versus incident management across key dimensions</caption>
  <thead>
    <tr>
      <th></th>
      <th>Incident Response</th>
      <th>Incident Management</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>What it is</strong></td>
      <td>Tactical execution during an incident</td>
      <td>Strategic oversight of the entire incident lifecycle</td>
    </tr>
    <tr>
      <td><strong>Timeframe</strong></td>
      <td>Minutes to hours (while incident is active)</td>
      <td>Ongoing, always (between incidents too)</td>
    </tr>
    <tr>
      <td><strong>Goal</strong></td>
      <td>Restore service fast</td>
      <td>Reduce incident frequency and severity over time</td>
    </tr>
    <tr>
      <td><strong>Mindset</strong></td>
      <td>Urgent, reactive</td>
      <td>Deliberate, proactive</td>
    </tr>
    <tr>
      <td><strong>Key activities</strong></td>
      <td>Declare, coordinate, fix, communicate</td>
      <td>Postmortems, runbooks, on-call, training, trend analysis</td>
    </tr>
    <tr>
      <td><strong>Success metric</strong></td>
      <td><a href="/learn/mttr">MTTR</a> (Mean Time To Restore)</td>
      <td>Incident frequency, repeat incident rate, <a href="/learn/mttd">MTTD</a> (mean time to detect), action completion rate</td>
    </tr>
    <tr>
      <td><strong>Who owns it</strong></td>
      <td>Incident Lead (temporary role during incident)</td>
      <td>Engineering team (ongoing responsibility)</td>
    </tr>
    <tr>
      <td><strong>Skills required</strong></td>
      <td>Debugging, communication, decisions under pressure</td>
      <td>Process design, facilitation, data analysis, coaching</td>
    </tr>
  </tbody>
</table>

<p>Incident response is what you do during the outage. Incident management is what you do between outages.</p>

<p><strong>Key takeaways:</strong></p>
<ul>
<li>Incident response restores service; incident management prevents recurrence</li>
<li>MTTR can improve while reliability worsens: if recurrence stays high, you're just getting faster at fixing the same problems</li>
<li>Friction kills follow-through. Make updates, runbooks, and action items easy if you want them to actually happen</li>
<li>The best teams treat incidents as a system to improve over time, not a series of one-off emergencies</li>
</ul>

<h2 id="if-you-do-nothing-else-this-week">If You Do Nothing Else This Week</h2>
<p>Define severity (SEV0–SEV3) and response roles (Incident Lead, Comms, Fixer). Everyone should know what SEV0 means and who does what when it happens.</p>
<p>Set update cadence (every 15–30 minutes) and a single source of truth. Not DMs, not email threads. Just one place where everyone can see what's happening.</p>
<p>Require postmortems for SEV0/1 and "new failure modes." If you've seen this incident 10 times before, you don't need another postmortem. You need to finally execute on the previous one's action items.</p>
<p>Track three metrics: repeat-incident rate, action-item closure rate, and mean time to detect (MTTD). MTTR matters, but repeat rate tells you if you're actually improving.</p>
<p>Do a 30-minute monthly incident review with one owner. Someone looks at the data and asks "what patterns do we see?" That's it. No marathon session, no slides, just pattern recognition.</p>
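<p>The three metrics above are cheap to compute from a flat incident log. A hedged sketch, assuming hypothetical record fields rather than any standard schema:</p>

```python
from datetime import datetime

incidents = [  # illustrative records; adapt field names to your tooling
    {"cause": "redis-pool",  "started": "2026-01-10T03:00", "detected": "2026-01-10T03:25"},
    {"cause": "redis-pool",  "started": "2026-02-14T11:00", "detected": "2026-02-14T11:05"},
    {"cause": "cert-expiry", "started": "2026-03-01T09:00", "detected": "2026-03-01T09:10"},
]
actions = [{"closed": True}, {"closed": False}, {"closed": True}]

# Repeat-incident rate: share of incidents whose root cause was seen before.
seen, repeats = set(), 0
for inc in incidents:
    if inc["cause"] in seen:
        repeats += 1
    seen.add(inc["cause"])
repeat_rate = repeats / len(incidents)

# Action-item closure rate: fraction of postmortem actions actually done.
closure_rate = sum(a["closed"] for a in actions) / len(actions)

# MTTD: mean minutes from impact start to detection.
mttd = sum(
    (datetime.fromisoformat(i["detected"])
     - datetime.fromisoformat(i["started"])).total_seconds() / 60
    for i in incidents
) / len(incidents)

print(f"repeat={repeat_rate:.0%} closure={closure_rate:.0%} mttd={mttd:.1f}m")
```

<p>Something this size is enough for the monthly review: it surfaces "redis-pool keeps recurring" without any dashboards.</p>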

<h2 id="why-teams-keep-confusing-them">Why Teams Keep Confusing Them</h2>
<blockquote>
<p>"Our MTTR is under an hour. We handle SEV0/1 incidents."</p>
</blockquote>
<p>That was Sarah, an EM at a 60-person fintech company. Their MTTR was 42 minutes, solid. But underneath that, the runbook was last updated in March. They'd had the same connection pool exhaustion issue three times in six months. Postmortems were "whenever we get to it" (often never). No one looked at incident trends or patterns. On-call was "whoever's around." <a href="/tools/oncall-builder">Build your schedule → Free On-Call Builder</a></p>
<p>They were confusing fast response with good management.</p>
<p>Then there's the friction problem.</p>
<p>Postmortems feel like homework because you're writing in a Google Doc, then copying to Confluence, then making a Jira ticket, then posting in Slack. Runbooks don't get updated because editing them is a pain. Trend analysis doesn't happen because you're exporting CSVs and making charts in spreadsheets.</p>
<p>One team put it this way: "We have 40-page runbooks that no one has opened in 6 months. I can't blame them. Editing them is terrible."</p>
<p>They're not undisciplined. They're working against friction.</p>
<p>Both teams treat incident management as an extension of incident response. But they're different disciplines. Response is tactical, urgent, and short-term: fix the problem, focus on execution, ask "how do we fix this?" Management is strategic, deliberate, and long-term: fix the system, design for prevention, ask "how do we prevent this?"</p>
<p>A 15-minute MTTR means nothing if the same outage happens every quarter.</p>

<h2 id="what-happens-when-you-focus-on-only-one">What Happens When You Focus on Only One</h2>
<h3 id="strong-response-weak-management">Strong Response, Weak Management</h3>
<p>Great MTTR but the same incidents keep happening. Postmortems are written but nothing changes. Runbooks exist but are outdated. No one knows if things are getting better. A 50-person B2B SaaS company had a database outage in January 2024, wrote a postmortem, then had the same outage in March, May, and again in December.</p>
<blockquote>
<p>"I realized we'd never actually done anything the postmortem recommended. We just filed it away and waited for the next incident."</p>
</blockquote>
<p>Fast at fixing, slow at learning. Stuck in reactive mode, never getting ahead of incidents. Great MTTR looks good on a dashboard, but if the same database outage happens every quarter, you're not actually improving. You're optimizing for speed while ignoring recurrence.</p>
<h3 id="strong-management-weak-response">Strong Management, Weak Response</h3>
<p>Detailed processes and runbooks nobody has read. Quarterly incident reviews but chaos during actual incidents. Great analysis culture but slow execution when things break. Roles unclear during incidents.</p>
<p>One Series A team shared their 40-page incident response handbook. It had been meticulously written by their former Head of Infrastructure. When asked who'd read it, the room went quiet. During their last SEV0, no one could find the escalation tree. The incident took 3 hours to resolve. It should have taken 45 minutes.</p>
<p>Great plans that fall apart in the heat of the moment. Great postmortems don't matter if customers wait hours for a fix that should take minutes. You're optimizing for learning while ignoring execution.</p>

<h2 id="how-to-build-both">How to Build Both</h2>
<p>Here's what good looks like, with specific examples.</p>
<h3 id="incident-response-fast-coordinated-consistent">Incident Response: Fast, Coordinated, Consistent</h3>
<p>Good incident response isn't just fast fixing. It's <strong>coordinated</strong> fixing.</p>
<p>Bad response looks like: 15 people debugging the same thing, nobody coordinating, DMs scattered across Slack, nobody knows who's working on what.</p>
<p>Good response looks like: One person declares. One Incident Lead coordinates. One Assigned Engineer fixes. Updates in one place. Everyone knows who's doing what.</p>
<p>Clear roles are essential. The Incident Lead coordinates while the Assigned Engineer fixes. Split the work. Declare fast, say "This is SEV2" in 30 seconds instead of debating for 10. Keep updates in one place where everyone can see them, not scattered across DMs or email threads. If there's no response in 10 minutes, page backup immediately. And stabilize first: rollback beats fix-forward when customers are waiting.</p>
<p>This is tactical execution. It's what you do in the heat of the moment.</p>
<h3 id="incident-management-continuous-improvement-not-theater">Incident Management: Continuous Improvement, Not Theater</h3>
<p>Good incident management means reducing friction everywhere. When the right thing to do is also the easy thing to do, teams actually do it.</p>
<p>For postmortems, one team assigned action items IN the postmortem doc, not a separate Jira ticket. Teams with separate tickets struggle to close them, while inline assignments get done. They set deadlines 2 weeks out, not "Q2." A vague timeline means the work never happens.</p>
<p>For runbooks, update them when things change, not 8 months later. Make them easy to edit. One team updated runbooks inline during postmortems: the facilitator types changes directly into the doc while everyone reviews. No separate Google Doc, no copy-paste to Confluence later.</p>
<p>For on-call, clear rotations. Not "whoever's around." Make handoffs frictionless. One team used a simple Slack bot that auto-assigned the next person in rotation. When the person who wrote the Slack script left, the rotation broke. Build for sustainability.</p>
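<p>A rotation picker like that Slack bot needs very little code. A minimal sketch, with an invented roster and epoch date; the design choice worth copying is deriving the assignment from the calendar instead of mutable state, so the rotation survives its author leaving:</p>

```python
from datetime import date

ROTATION = ["aisha", "ben", "chen", "dara"]  # hypothetical roster
EPOCH = date(2026, 1, 5)                     # a Monday: week 0 of the rotation

def on_call(today: date) -> tuple[str, str]:
    """Return (primary, backup) for the week containing `today`.

    Stateless: anyone can recompute the schedule for any date,
    past or future, with no database to keep in sync.
    """
    week = (today - EPOCH).days // 7
    primary = ROTATION[week % len(ROTATION)]
    backup = ROTATION[(week + 1) % len(ROTATION)]
    return primary, backup

print(on_call(date(2026, 1, 12)))  # week 1 -> ('ben', 'chen')
```

<p>Pairing each primary with the next person as backup also gives the 10-minute escalation rule a concrete target.</p>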
<p>For trend analysis, someone reviews incident data monthly. Ask "what patterns do we see?" Make the data visible. One team set up an auto-generated CSV that posts to Slack every Monday. No manual exports, no spreadsheets.</p>
<p>For training, new engineers know the process before their first SEV0. Make learning accessible. One team does quarterly "game days" where they practice a simulated incident. No production stress, just learning.</p>
<p>The pattern: <strong>reduce friction everywhere.</strong> When postmortems are easy to write, runbooks are easy to update, and incident data is easy to see, teams actually do the work.</p>

<h2 id="which-should-you-focus-on-first">Which Should You Focus On First?</h2>
<table>
  <caption>Guidance for which to focus on first (response vs management) based on your team's situation</caption>
  <thead>
    <tr>
      <th>Your situation</th>
      <th>Focus on this first</th>
      <th>Why</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>New team, first real incidents</td>
      <td><strong>Response</strong></td>
      <td>Don't even think about management until you've handled 10+ incidents. You can't design a system you haven't experienced.</td>
    </tr>
    <tr>
      <td>MTTR solid but same fires recur</td>
      <td><strong>Management</strong></td>
      <td>Pick ONE recurring incident and fix it completely before building process. Process without a win feels like bureaucracy.</td>
    </tr>
    <tr>
      <td>Incidents chaotic and slow</td>
      <td><strong>Response</strong></td>
      <td>Fix execution before you optimize for learning. Coordination breakdowns kill response speed.</td>
    </tr>
    <tr>
      <td>Postmortems never lead to changes</td>
      <td><strong>Management</strong></td>
      <td>You have the response process. Now build the learning loop. Friction is the enemy. Make action items trackable in the postmortem doc itself.</td>
    </tr>
    <tr>
      <td>On-call burnout high</td>
      <td><strong>Both</strong></td>
      <td>Response needs less chaos (coordination). Management needs better rotations (sustainability).</td>
    </tr>
  </tbody>
</table>

<p><strong>Quick wins by situation:</strong></p>
<ul>
<li><strong>New team:</strong> Define SEV0/1, declare in Slack, assign one Incident Lead</li>
<li><strong>Same fires recurring:</strong> Close ONE recurring incident's action items completely</li>
<li><strong>Chaotic incidents:</strong> Use one Slack channel, one Incident Lead, updates every 15 min</li>
<li><strong>Postmortems don't lead to change:</strong> Assign action items IN the postmortem doc with 2-week deadlines</li>
<li><strong>On-call burnout:</strong> Set primary+backup rotation, use escalation rules</li>
</ul>

<h2 id="the-bottom-line">The Bottom Line</h2>
<p>In practice, teams hit a ceiling when they treat these as the same thing or invest in only one.</p>
<p>Both matter. Strong response with weak management means the same fires every month, reactive forever. Strong management with weak response means great plans that fall apart when things break.</p>
<p>The best teams are fast at fixing things AND systematic about learning.</p>
<p>Don't be the team with 40-page runbooks no one reads. Don't be the team fighting the same database outage every quarter. Build both.</p>

<h2 id="faq">FAQ</h2>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Our MTTR is great but we keep having the same outages. What are we missing?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    You're strong on incident response (fixing fast) but weak on incident management (learning and preventing). Great MTTR means nothing if the same database outage happens every quarter. You need to invest in the management layer: postmortems that drive action, runbooks that get updated, and trend analysis that catches patterns.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What metrics matter besides MTTR?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Repeat-incident rate (are the same fires happening?), action-item closure rate (do postmortems lead to change?), and mean time to detect or MTTD (how long before we notice?). MTTR matters, but repeat rate tells you if you're actually improving.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What should a lightweight postmortem include?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Keep it short: what happened, why did it happen, what are we doing to prevent it, and who owns that action. No blame hunts, no 10-page documents. One team completes postmortems in 30 minutes; the key is having clear owners and deadlines.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    When should we actually write a postmortem vs just fix and move on?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Write a postmortem for any SEV0, SEV1, or SEV2 that reveals a new failure mode. If you've seen this incident 10 times before, you don't need another postmortem. You need to finally execute on the previous one's action items. The purpose of postmortems is learning, not theater.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    How do I convince my team to actually update runbooks?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Make updating them the path of least resistance. One team updates runbooks inline during postmortems: the facilitator types the runbook changes directly into the doc while everyone reviews. No separate Google Doc, no copy-paste to Confluence later. When runbook updates happen during the postmortem, they actually get done.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What's the difference between Incident Lead and incident management?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Incident Lead is a temporary role during an incident: the person coordinating the response. You fill this role for an hour, then you're done. Incident management (owned by the engineering org) is an ongoing responsibility across the incident lifecycle: postmortems, runbooks, on-call, trend analysis. One is a role; the other is a responsibility.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Why do we keep fighting the same fires every month?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Because you're optimizing for response speed (MTTR) while ignoring recurrence. Fast response is good. Fast learning is better. The teams that break this cycle invest in the management layer: they track action items from postmortems, they update runbooks when things change, and someone reviews incident trends monthly to ask "what patterns do we see?"
  </div>
</details>

<p><strong>Mini glossary:</strong></p>
<p><strong><a href="/learn/mttr">MTTR</a></strong>: Mean time to restore service</p>
<p><strong><a href="/learn/mttd">MTTD</a></strong>: Mean time to detect (the average time from when an issue occurs to when an alert fires)</p>
<p><strong><a href="/learn/post-incident-review">PIR</a></strong>: Post-incident review or postmortem</p>
<p><strong><a href="/learn/incident-commander">Incident Lead</a></strong>: The person coordinating the response during an incident</p>
<p><strong><a href="/learn/severity-0">SEV0–SEV3</a></strong>: Severity levels (define yours: SEV0 is critical, SEV3 is minor)</p>

<p><strong>Related guides (if you want templates):</strong></p>
<ul>
<li><a href="/blog/incident-response-playbook">Incident Response Playbook: Scripts, Roles &amp; Templates</a> - Tactical execution during incidents</li>
<li><a href="/learn/post-incident-review">Post-Incident Review Templates</a> - Strategic learning after incidents</li>
<li><a href="/learn/on-call-rotation">On-Call Rotation Guide</a> - Building sustainable on-call</li>
<li><a href="/blog/scaling-incident-management">Scaling Incident Management</a> - How teams evolve as they grow</li>
</ul>

]]></content:encoded>
      <pubDate>Thu, 15 Jan 2026 00:00:00 GMT</pubDate>
      <author>Niketa Sharma</author>
      <category><![CDATA[incident-management]]></category>
      <category><![CDATA[incident-response]]></category>
      <category><![CDATA[definitions]]></category>
      <category><![CDATA[incident-commander]]></category>
      <category><![CDATA[sre]]></category>
      <category><![CDATA[devops]]></category>
      <category><![CDATA[incident-lifecycle]]></category>
      <category><![CDATA[postmortem]]></category>
      <category><![CDATA[mttr]]></category>
    </item>
    <item>
      <title><![CDATA[State of Incident Management 2026: Toil Rose to 30% Despite AI]]></title>
      <link>https://runframe.io/blog/state-of-incident-management-2025</link>
      <guid>https://runframe.io/blog/state-of-incident-management-2025</guid>
      <description><![CDATA[TL;DR
We expected AI to reduce toil. Every report, every vendor, every conference deck said the same thing. But when we looked at the data from 20+ industry reports and spoke to 25+ engineering teams,...]]></description>
      <content:encoded><![CDATA[<h2 id="tldr">TL;DR</h2>
<p>We expected AI to reduce toil. Every report, every vendor, every conference deck said the same thing. But when we looked at the data from 20+ industry reports and spoke to 25+ engineering teams, we found something different.</p>
<p><strong>Toil rose to 30% (from 25%), the first increase in five years.</strong></p>
<p>Here's what's actually happening in incident management right now:</p>
<ol>
<li><p><strong>AI isn't delivering (yet):</strong> Many organizations are investing $1M+ in AI initiatives (51% deployed, 86% expect to by 2027), yet operational toil rose from 25% to 30%. The first rise in five years.</p>
</li>
<li><p><strong>People are burning out:</strong> 78% of developers spend ≥30% of their time on manual toil. 73% of organizations experienced outages linked to ignored alerts (<a href="https://www.splunk.com/en_us/blog/observability/state-of-observability-2025.html" target="_blank" rel="noopener noreferrer">Splunk</a>, n=1,855). This isn't sustainable.</p>
</li>
<li><p><strong>The market is consolidating fast:</strong> OpsGenie is scheduled to shut down in 2027. Freshworks acquired FireHydrant. SolarWinds acquired Squadcast. Organizations are moving from "best-of-breed" stacks to unified platforms because they can't manage 7+ tools anymore.</p>
</li>
</ol>
<p>65% of organizations now say observability directly impacts revenue (<a href="https://www.splunk.com/en_us/blog/observability/state-of-observability-2025.html" target="_blank" rel="noopener noreferrer">Splunk</a>). Incident management has to keep pace.</p>
<p>And here's the part nobody wants to hear: while executives expect 171% ROI from AI investments, the reality is more complexity, not less. Developer toil can cost ~$9.4M/year per 250 engineers (simplified model). The "AI revolution" has paradoxically increased the blast radius of bad deployments for 92% of teams.</p>
<p>And it's getting more expensive to get it wrong. High-impact IT outages now cost ~$2M/hour (<a href="https://newrelic.com/resources/report/observability-forecast/2025" target="_blank" rel="noopener noreferrer">New Relic Observability Forecast 2025</a>, n=1,700). Organizations lose a median of ~$76M annually from unplanned downtime (<a href="https://newrelic.com/sites/default/files/2025-09/new-relic-2025-observability-forecast-report.pdf" target="_blank" rel="noopener noreferrer">New Relic Observability Forecast 2025</a>).</p>
<p>This report synthesizes 20+ industry reports and surveys published in 2025.</p>
<p><strong>Scope:</strong> This report focuses on SRE/engineering incident response and operational toil, not security operations (SOC).</p>
<h2 id="the-2025-incident-index">The 2025 Incident Index</h2>
<table>
  <caption>Key 2025 incident management statistics and findings from industry reports</caption>
  <thead>
    <tr>
      <th>Finding</th>
      <th>Statistic</th>
      <th>Source</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>AI agents deployed</td>
      <td>51%</td>
      <td>PagerDuty, 2025</td>
    </tr>
    <tr>
      <td>Expect AI agents by 2027</td>
      <td>86%</td>
      <td>PagerDuty, 2025</td>
    </tr>
    <tr>
      <td>Expected ROI from AI</td>
      <td>171% avg</td>
      <td>PagerDuty, 2025</td>
    </tr>
    <tr>
      <td>AI increases blast radius</td>
      <td>92%</td>
      <td>Harness, 2025</td>
    </tr>
    <tr>
      <td>Toil percentage (up from 25%)</td>
      <td>30%</td>
      <td>Catchpoint, 2025</td>
    </tr>
    <tr>
      <td>Devs spend ≥30% on toil</td>
      <td>78%</td>
      <td>Harness, 2025</td>
    </tr>
    <tr>
      <td>Outages from ignored alerts</td>
      <td>73%</td>
      <td>Splunk, 2025</td>
    </tr>
    <tr>
      <td>Developers work &gt;40 hours/week</td>
      <td>88%</td>
      <td>Harness, 2025</td>
    </tr>
    <tr>
      <td>Observability impacts revenue</td>
      <td>65%</td>
      <td>Splunk, 2025</td>
    </tr>
    <tr>
      <td>High performers ROI advantage</td>
      <td>+53%</td>
      <td>Splunk, 2025</td>
    </tr>
    <tr>
      <td>High-impact outage cost per hour</td>
      <td>$2M</td>
      <td>New Relic, 2025</td>
    </tr>
    <tr>
      <td>Annual outage cost (median)</td>
      <td>~$76M</td>
      <td><a href="https://newrelic.com/sites/default/files/2025-09/new-relic-2025-observability-forecast-report.pdf" target="_blank" rel="noopener noreferrer">New Relic Observability Forecast 2025</a></td>
    </tr>
    <tr>
      <td>CrowdStrike global impact</td>
      <td>~8.5M devices, &gt;~$5B economic impact</td>
      <td><a href="https://www.parametrixinsurance.com/reports-white-papers/crowdstrikes-impact-on-the-fortune-500" target="_blank" rel="noopener noreferrer">Parametrix, Reuters</a>, 2024</td>
    </tr>
  </tbody>
</table>


<h2 id="about-this-research">About This Research</h2>
<p><strong>Methodology:</strong></p>
<ul>
<li>20+ industry reports analyzed</li>
<li>25+ engineering team interviews conducted July - December 2025 (Series A to enterprise, 30-60 minute structured interviews)</li>
<li>Major incident analysis (CrowdStrike, AWS, OpenAI)</li>
<li>Published: January 2026</li>
</ul>
<p><strong>Why we wrote this:</strong></p>
<p>We're building Runframe after talking to 25+ engineering teams about their incident management pain. The conversations kept surfacing the same themes: AI isn't delivering, alert fatigue is crushing teams, tooling is too complex.</p>
<p>This report synthesizes what we heard from across the industry. <em>Disclosure: we're building Runframe. We've aimed to keep the analysis vendor-neutral.</em></p>
<p><strong>Who should read this:</strong></p>
<ul>
<li>Engineering leaders evaluating incident management tools</li>
<li>SREs dealing with alert fatigue and burnout</li>
<li>CTOs planning 2026 tooling strategy</li>
<li>Anyone migrating away from OpsGenie</li>
</ul>

<h2 id="1-the-ai-trust-gap-why-toil-rose-to-30-from-25">1. The AI Trust Gap: Why Toil Rose to 30% (From 25%)</h2>
<h3 id="what-executives-are-betting-on">What executives are betting on</h3>
<ul>
<li>51% of companies have already deployed AI agents (<a href="https://www.pagerduty.com/newsroom/agentic-ai-survey-2025/" target="_blank" rel="noopener noreferrer">PagerDuty Agentic AI Survey 2025</a>, n=1,000)</li>
<li>86% expect to be operational with AI agents by 2027</li>
<li>75% of organizations are investing $1M+ in AI</li>
<li>62% expect more than 100% ROI, with an average expected return of 171%</li>
<li>100% of organizations are now using AI in some capacity, and AI capabilities are now the #1 criterion for selecting observability tools (<a href="https://www.dynatrace.com/news/blog/ai-observability-business-impact-2025/" target="_blank" rel="noopener noreferrer">Dynatrace</a>, n=842)</li>
</ul>
<p>The hype is real. Executives are all-in.</p>
<img src="/images/articles/state-of-incident-management-2025/ai_expectation_reality_gap.png" alt="State of Incident Management 2025: AI Operational Toil Expectation vs Reality Gap Graph" class="w-full max-w-xl mx-auto rounded-lg shadow-md my-8 border border-border/50" />

<img src="/images/articles/state-of-incident-management-2025/operational_toil_trend.png" alt="State of Incident Management 2025: Global Operational Toil Trend 2021-2025 Statistics" class="w-full max-w-xl mx-auto rounded-lg shadow-md my-8 border border-border/50" />

<h3 id="what39s-actually-happening">What's actually happening</h3>
<ul>
<li><strong>Operational toil rose to 30% from 25%</strong>, the first rise in five years (<a href="https://www.catchpoint.com/press-releases/the-sre-report-2025-highlighting-critical-trends-in-site-reliability-engineering" target="_blank" rel="noopener noreferrer">Catchpoint SRE Report 2025</a>, n=301)</li>
<li>Enterprise incidents increased <strong>16% YoY</strong> (<a href="https://www.pagerduty.com/blog/news-announcements/2024-state-of-digital-operations/" target="_blank" rel="noopener noreferrer">PagerDuty State of Digital Operations 2024</a>)</li>
<li><strong>92%</strong> of developers say AI tools increase the "blast radius" from bad deployments (<a href="https://www.harness.io/state-of-software-delivery" target="_blank" rel="noopener noreferrer">Harness State of Software Delivery 2025</a>, n=500)</li>
</ul>
<p>The first wave of AI deployments has added new layers of complexity: new tools to monitor, new alerts to triage, new skills to learn, and more code to review.</p>

<blockquote>
<p><em>"What was most eye opening from our report findings this year was that, for most teams, it seems the burden of operational tasks has grown for the first time in five years. The expectation was that AI would reduce toil, not exacerbate it."</em></p>
<p><em>-- Catchpoint SRE Report 2025</em></p>
</blockquote>

<h3 id="the-implementation-gap-not-a-tech-failure">The implementation gap (not a tech failure)</h3>
<ul>
<li><strong>69%</strong> of AI-powered decisions are still verified by humans (<a href="https://www.dynatrace.com/news/blog/ai-observability-business-impact-2025/" target="_blank" rel="noopener noreferrer">Dynatrace</a>)</li>
<li><strong>25%</strong> of leaders believe improving trust in AI should be a top priority</li>
</ul>
<p>The technology isn't failing. Our implementation strategy is.</p>
<p>We're living through the awkward adolescence of AI. These are probably the worst versions of these models we'll ever use. Powerful, but prone to hallucinations, so humans still verify almost every action.</p>
<p>The rise in toil to 30% isn't because AI is bad. It's because we've added a "verification tax" on top of existing workloads without removing anything yet. Not fully autonomous, but no longer purely manual. The messy middle.</p>
<h3 id="the-rise-of-agentic-ai-in-sre">The rise of agentic AI in SRE</h3>
<p>Multi-agent systems are now being deployed for complex incident resolution. AWS and others are shipping "agent" concepts aimed at reducing time-to-triage and time-to-mitigate (early-stage; outcomes vary). Platforms like Rootly, Harness, and PagerDuty are shipping AI-powered runbook execution and autonomous triage capabilities.</p>
<p>The future of AI in incident management is human-in-the-loop, not fully autonomous. AI suggests, humans approve.</p>

<p><strong>Takeaway:</strong> Organizations invested heavily in AI expecting reduced toil. Instead, toil rose to 30% (the first rise in five years). The AI correction phase is coming in 2026.</p>

<h2 id="2-the-burnout-tax-the-94m-cost-of-silence">2. The Burnout Tax: The $9.4M Cost of Silence</h2>
<h3 id="the-94m-annual-waste-nobody-talks-about-simplified-model">The $9.4M annual waste nobody talks about (Simplified Model)</h3>
<ul>
<li><strong>78%</strong> of developers spend at least 30% of their time on manual, repetitive tasks (<a href="https://www.harness.io/state-of-software-delivery" target="_blank" rel="noopener noreferrer">Harness</a>)</li>
<li>Average software engineer salary: <strong>$125,000</strong> (<a href="https://www.indeed.com/career-software-engineer/salaries" target="_blank" rel="noopener noreferrer">Indeed</a>, <a href="https://www.glassdoor.com/Salaries/united-states-software-engineer-salary-SRCH_IL.0,13_IN1_KO14,31.htm" target="_blank" rel="noopener noreferrer">Glassdoor</a>, <a href="https://www.ziprecruiter.com/Salaries/Software-Engineer-Salary" target="_blank" rel="noopener noreferrer">ZipRecruiter</a>) <em>(varies widely by market/level; treat ranges as directional)</em></li>
<li>30% toil × $125,000 = <strong>$37,500 of wasted investment per engineer annually</strong></li>
<li>For organizations with 250+ engineers: <strong>~$9.4M in lost productivity annually</strong> <em>(simplified model: assumes $125k avg salary, 30% time on toil; actual costs vary by geography, role mix, and toil type)</em>. See our <a href="/blog/incident-management-build-or-buy">build vs buy analysis</a> for how these costs compare when building custom tooling.</li>
</ul>
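<p><em>The simplified model above is just two multiplications; a minimal sketch (the salary, toil share, and headcount are the assumed inputs from the list, and all figures are directional):</em></p>

```python
# Simplified toil-cost model (directional only): the share of salary
# spent on manual, repetitive work, scaled to an organization's headcount.
AVG_SALARY = 125_000   # assumed average software engineer salary (USD)
TOIL_FRACTION = 0.30   # share of time spent on toil (Catchpoint 2025)
HEADCOUNT = 250        # example organization size

waste_per_engineer = TOIL_FRACTION * AVG_SALARY
org_waste = waste_per_engineer * HEADCOUNT

print(f"Per engineer: ${waste_per_engineer:,.0f}/year")        # $37,500/year
print(f"Org of {HEADCOUNT}: ${org_waste:,.0f}/year")           # $9,375,000/year
```

<p><em>Swap in your own salary bands and measured toil percentage; the point is that even small reductions in the toil fraction compound across headcount.</em></p>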
<p>In our interviews, developers said the same things: frequent overtime leads to burnout, steals time from family, and eventually pushes them to leave.</p>
<p><em>For more on sustainable on-call rotations, see our <a href="/blog/on-call-rotation-guide" target="_blank" rel="noopener noreferrer">On-Call Rotation Guide</a>.</em></p>
<h3 id="a-hreflearnalert-fatiguealert-fatiguea-increases-the-chance-of-missed-signals"><a href="/learn/alert-fatigue">Alert fatigue</a> increases the chance of missed signals</h3>
<ul>
<li><strong>73%</strong> of organizations experienced outages linked to ignored or suppressed alerts (<a href="https://www.splunk.com/en_us/blog/observability/state-of-observability-2025.html" target="_blank" rel="noopener noreferrer">Splunk State of Observability 2025</a>, n=1,855)</li>
<li>Industry analyses suggest <strong>as many as 67% of alerts are ignored daily</strong> (<a href="https://incident.io/blog/alert-fatigue-solutions-for-dev-ops-teams-in-2025-what-works" target="_blank" rel="noopener noreferrer">incident.io blog</a>; underlying primary dataset not published)</li>
<li><strong>Customer-impacting incidents increased 43%</strong>, each costing nearly <strong>$800,000</strong> (<a href="https://www.pagerduty.com/newsroom/study-cost-of-incidents/" target="_blank" rel="noopener noreferrer">PagerDuty Cost of Incidents study</a>)</li>
</ul>
<img src="/images/articles/state-of-incident-management-2025/alerts_ignored_67.png" alt="State of Incident Management 2025: Industry reports suggest ~67% of alerts are ignored daily (incident.io, 2025)" class="w-full max-w-xl mx-auto rounded-lg shadow-md my-8 border border-border/50" />


<p>This is what we heard over and over in our interviews: teams are drowning in alerts. They've learned to ignore them. Then real incidents happen and nobody responds.</p>

<blockquote>
<p><em>"Our on-call engineers get 200+ pages per week. Maybe 5 are real. The rest? Threshold noise, flapping alerts, things that auto-resolved. We've trained our team to ignore alerts, which is terrifying."</em></p>
<p><em>-- VP Engineering, Healthcare SaaS (160 engineers)</em></p>
</blockquote>

<h3 id="on-call-burnout-is-at-crisis-levels">On-call burnout is at crisis levels</h3>
<ul>
<li><strong>Unstable organizational priorities</strong> lead to meaningful decreases in productivity and substantial increases in burnout (<a href="https://services.google.com/fh/files/misc/2024_final_dora_report.pdf" target="_blank" rel="noopener noreferrer">DORA 2024 Report</a>)</li>
</ul>
<h3 id="the-firefighting-trap">The firefighting trap</h3>
<ul>
<li><strong>20%</strong> say they often or always start a "war room" with members of many teams until an issue is resolved, and <strong>43%</strong> spend too much time responding to alerts (<a href="https://www.splunk.com/en_us/blog/observability/state-of-observability-2025.html" target="_blank" rel="noopener noreferrer">Splunk State of Observability 2025</a>, n=1,855)</li>
<li>Teams are missing real signals in the noise. The ones that break out of this cycle prioritize alert hygiene: automated noise reduction, correlation, and routing alerts to the right person instead of everyone.</li>
</ul>

<p><strong>What this means:</strong> Alert fatigue increases the chance of missed signals. ~$9.4M/year lost per 250 engineers (simplified model). Burnout is at crisis levels. The 30-day rule: delete alerts nobody acts on.</p>
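<p><em>The 30-day rule can be sketched as a simple pruning pass over alert definitions. This is a hypothetical data shape, not any vendor's API; adapt it to whatever your alerting tool exports:</em></p>

```python
from datetime import datetime, timedelta

def prune_stale_alerts(alerts, now, window_days=30):
    """Split alerts into (keep, delete): any alert nobody has acted on
    in the last `window_days` is a deletion candidate, not a tuning one."""
    cutoff = now - timedelta(days=window_days)
    keep, delete = [], []
    for alert in alerts:
        # `last_acted_on` is a hypothetical field: the most recent time
        # a human acknowledged or acted on a firing of this alert.
        acted = alert.get("last_acted_on")
        (keep if acted is not None and acted >= cutoff else delete).append(alert)
    return keep, delete

# Example: three alert definitions with illustrative last-action timestamps.
now = datetime(2026, 1, 1)
alerts = [
    {"name": "api-error-rate", "last_acted_on": datetime(2025, 12, 20)},
    {"name": "cpu-threshold",  "last_acted_on": datetime(2025, 10, 1)},
    {"name": "disk-flapping",  "last_acted_on": None},
]
keep, delete = prune_stale_alerts(alerts, now)
```

<p><em>Run a pass like this monthly: anything in the delete bucket either gets a justification or gets removed.</em></p>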

<h2 id="3-the-great-consolidation-why-best-of-breed-is-dead">3. The great consolidation: why best-of-breed is dead</h2>
<h3 id="three-acquisitions-in-12-months">Three acquisitions in 12 months</h3>
<h4 id="opsgenie-shutdown-june-2025-april-2027">OpsGenie Shutdown (June 2025 - April 2027)</h4>
<ul>
<li><strong>June 4, 2025</strong>: No new OpsGenie accounts can be created</li>
<li><strong>April 5, 2027</strong>: Complete service shutdown</li>
<li>Forcing thousands of organizations to evaluate alternatives</li>
<li><a href="https://www.atlassian.com/software/opsgenie/migration" target="_blank" rel="noopener noreferrer">Official Atlassian announcement</a> | <a href="/blog/opsgenie-migration-guide">Read our migration guide</a></li>
</ul>
<h4 id="solarwinds-acquires-squadcast-march-2025">SolarWinds Acquires Squadcast (March 2025)</h4>
<ul>
<li>Announced March 3, 2025</li>
<li>Unifying observability and incident response</li>
<li><a href="https://www.solarwinds.com/company/newsroom/press-releases/solarwinds-acquires-squadcast-unifying-observability-and-incident-response" target="_blank" rel="noopener noreferrer">Press release</a></li>
</ul>
<h4 id="freshworks-acquires-firehydrant-december-2025">Freshworks Acquires FireHydrant (December 2025)</h4>
<ul>
<li>Freshworks acquiring FireHydrant's incident management platform</li>
<li>Folding it into their IT service and operations portfolio</li>
<li><a href="https://www.freshworks.com/press-releases/freshworks-to-deepen-its-it-service-and-operations-portfolio-with-acquisition-of-firehydrants-ai-native-incident-management-and-reliability-platform/" target="_blank" rel="noopener noreferrer">Press release</a></li>
</ul>
<h3 id="why-this-is-happening">Why this is happening</h3>
<p>Nobody wants to manage 7 tools anymore. The integration points break, the licensing costs add up, and every new hire spends their first week learning logins. Vendors with unified data also have a real advantage building AI features, since they can correlate across the full incident lifecycle.</p>
<p>Teams are actively <a href="/blog/best-pagerduty-alternatives">comparing incident.io vs. FireHydrant vs. PagerDuty</a>. The OpsGenie shutdown deadline is accelerating migrations.</p>

<p><strong>What this means:</strong> Three major acquisitions/shutdowns in 12 months. Teams are moving from 7-tool stacks to unified platforms because they have to.</p>

<h2 id="major-incidents-2024-2025-why-incident-response-mattered">Major incidents (2024-2025): why incident response mattered</h2>
<p><em>Learn how to run incidents with clear roles and escalation in our <a href="/blog/incident-response-playbook" target="_blank" rel="noopener noreferrer">Incident Response Playbook</a>.</em></p>
<h3 id="july-2024-crowdstrike-global-outage-the-5b-wake-up-call">July 2024: CrowdStrike global outage, the $5B wake-up call</h3>
<p><strong>The Incident:</strong></p>
<ul>
<li><strong>Impact</strong>: ~8.5 million Windows devices crashed globally (<a href="https://www.reuters.com/technology/microsoft-says-about-85-million-its-devices-affected-by-crowdstrike-related-2024-07-20/" target="_blank" rel="noopener noreferrer">Reuters, citing Microsoft</a>)</li>
<li><strong>Duration</strong>: Some businesses recovered in hours; others took days</li>
<li><strong>Business impact</strong>: Airlines grounded, hospitals disrupted, financial services halted; economic impact estimates exceed ~$5B (e.g., <a href="https://www.parametrixinsurance.com/reports-white-papers/crowdstrikes-impact-on-the-fortune-500" target="_blank" rel="noopener noreferrer">Parametrix analysis</a>; methodologies vary)</li>
</ul>
<p><strong>Why Incident Response Was the Difference:</strong></p>
<p>Organizations with established incident response processes recovered significantly faster. The difference wasn't technical architecture. It was whether anyone knew who was supposed to do what:</p>
<ul>
<li>Companies with <strong>pre-defined escalation paths</strong> knew who could authorize system-wide changes</li>
<li>Teams with <strong>customer communication templates</strong> kept stakeholders informed instead of scrambling</li>
<li>Organizations with <strong>incident command structures</strong> avoided decision paralysis</li>
</ul>
<blockquote>
<p><em>"The difference between a 2-hour outage and a 2-day outage wasn't the bug. It was how quickly teams could coordinate remediation, communicate with customers, and execute rollback procedures."</em></p>
</blockquote>
<h3 id="october-2025-aws-us-east-1-outage-coordination-chaos">October 2025: AWS US-East-1 outage, coordination chaos</h3>
<p><strong>The Incident:</strong></p>
<ul>
<li><strong>Duration</strong>: ~15 hours (<a href="https://www.thousandeyes.com/blog/aws-outage-analysis-october-20-2025" target="_blank" rel="noopener noreferrer">ThousandEyes</a>)</li>
<li><strong>Impact</strong>: Services across multiple industries affected</li>
<li><strong>Business impact</strong>: Widespread service disruption; direct revenue impact varied by company</li>
</ul>
<p><strong>What Went Wrong:</strong></p>
<p>For many organizations impacted by the outage, the breakdown wasn't infrastructure. It was <strong>incident response</strong>:</p>
<ul>
<li><strong>Unclear ownership</strong>: Teams spent critical hours determining who was responsible for what</li>
<li><strong>Missing communication loops</strong>: Stakeholders learned about outages from social media, not internal updates</li>
<li><strong>No pre-defined response</strong>: Organizations improvised instead of executing established playbooks</li>
</ul>
<p><strong>The Lesson:</strong></p>
<p>Multi-region strategies help, but they're useless without <strong>incident management discipline</strong>. Some industry analyses claim organizations with documented runbooks and clear roles reduced their MTTR by up to 60% compared to those improvising (<a href="https://www.xurrent.com/incident-management-response" target="_blank" rel="noopener noreferrer">Xurrent</a>; <em>treat as directional</em>). <a href="/tools/mttr-calculator">Calculate your MTTR → Free MTTR Calculator</a></p>
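<p><em>If you'd rather compute MTTR yourself than use a calculator, it's just total restore time divided by incident count; a minimal sketch with illustrative timestamps:</em></p>

```python
from datetime import datetime

def mttr_minutes(incidents):
    """Mean time to restore: average of (resolved - detected), in minutes,
    over a list of (detected, resolved) datetime pairs."""
    durations = [(resolved - detected).total_seconds() / 60
                 for detected, resolved in incidents]
    return sum(durations) / len(durations)

# Two example incidents: one restored in 30 minutes, one in 90.
incidents = [
    (datetime(2025, 3, 1, 14, 0), datetime(2025, 3, 1, 14, 30)),
    (datetime(2025, 4, 2, 9, 0),  datetime(2025, 4, 2, 10, 30)),
]
avg = mttr_minutes(incidents)   # 60.0 minutes
```

<p><em>Track this per severity level; a single SEV3 marathon can otherwise mask improvement on the incidents that matter.</em></p>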
<h3 id="december-2024-openai-chatgpt-outage-the-recovery-challenge">December 2024: OpenAI ChatGPT outage, the recovery challenge</h3>
<p><strong>The Incident:</strong></p>
<ul>
<li><strong>Duration</strong>: ~4 hours of global service disruption</li>
<li><strong>Impact</strong>: Millions of users unable to access ChatGPT, API, and developer tools</li>
<li><strong>Root cause</strong>: A new telemetry service deployment created Kubernetes circular dependencies (<a href="https://status.openai.com/incidents/01JMYB483C404VMPCW726E8MET" target="_blank" rel="noopener noreferrer">OpenAI status page</a>)</li>
</ul>
<p><strong>The Hidden Story:</strong></p>
<p>While OpenAI's official postmortem focused on the technical root cause, the incident illustrates a broader <strong>incident response challenge</strong>:</p>
<ul>
<li><strong>Recovery complexity</strong>: When systems have circular dependencies, recovery requires coordinated decision-making across multiple teams</li>
<li><strong>Status communication</strong>: With millions of users affected, timely updates become critical, yet challenging without established communication protocols</li>
<li><strong>Break-glass dilemma</strong>: OpenAI noted they're implementing "break-glass mechanisms" for future incidents, highlighting that manual recovery procedures must be defined in advance, not improvised during an outage</li>
</ul>
<p><strong>The Lesson:</strong></p>
<p>When complex infrastructure fails, the difference between a 2-hour outage and a 4-hour outage often comes down to <strong>incident response discipline</strong>: pre-defined recovery procedures, clear escalation paths, and established communication channels. Technical root causes will happen; response processes determine how long they impact your business.</p>
<h3 id="the-pattern-alert-fatigue-causes-real-outages">The pattern: alert fatigue causes real outages</h3>
<p>Multiple 2025 incidents shared a common contributing factor: <strong>real alerts were ignored because teams were drowning in noise</strong>.</p>
<ul>
<li>In our interviews, financial services teams reported outages extended by hours when preceding alerts were dismissed as noise</li>
<li>Healthcare SaaS teams told us incidents were delayed 20-30 minutes due to "is this real?" debate. That's time that matters when patient care is at stake</li>
<li>73% of organizations report outages caused by ignored or suppressed alerts</li>
</ul>
<p>Alert noise isn't a monitoring problem. It's an incident management problem. Without proper routing, noise reduction, and escalation, teams train themselves to ignore notifications. Then real incidents happen.</p>
<blockquote>
<p><em>"We've built an incident management system that cries wolf. Actual humans are paying the price when real incidents occur."</em></p>
</blockquote>

<h2 id="what-we-heard-firsthand">What we heard firsthand</h2>
<p>We interviewed 25+ engineering teams while building Runframe, from Series A startups to Fortune 500 enterprises. Here's what they told us.</p>
<h3 id="on-ai-adoption">On AI adoption</h3>
<blockquote>
<p><em>"We deployed Copilot company-wide expecting a 30% productivity boost. Six months in, we're spending more time reviewing AI-generated code than we saved writing it. The junior engineers are the most affected. They're accepting suggestions they don't fully understand."</em><br />-- <strong>Engineering Manager, Series C Fintech (150 engineers)</strong></p>
</blockquote>
<blockquote>
<p><em>"The AI tools are great for boilerplate. But for incident response? We tried an AI runbook assistant and it confidently gave wrong commands during a P1. We turned it off that night."</em><br />-- <strong>SRE Lead, E-commerce Platform (80 engineers)</strong></p>
</blockquote>
<h3 id="on-alert-fatigue">On alert fatigue</h3>
<blockquote>
<p><em>"Our on-call engineers get 200+ pages per week. Maybe 5 are real. The rest? Threshold noise, flapping alerts, things that auto-resolved. We've trained our team to ignore alerts, which is terrifying."</em><br />-- <strong>VP Engineering, Healthcare SaaS (160 engineers)</strong></p>
</blockquote>
<h3 id="on-devops-burnout">On DevOps burnout</h3>
<blockquote>
<p><em>"We lost three senior SREs in six months. All cited on-call burden. These are people with 10+ years of experience who could work anywhere. We couldn't retain them."</em><br />-- <strong>CTO, Infrastructure Startup (60 engineers)</strong></p>
</blockquote>
<blockquote>
<p><em>"I asked my team what would make their lives better. Number one answer: 'Fewer tools.' We use 7 different systems to manage incidents. Seven."</em><br />-- <strong>Director of Platform, Media Company (120 engineers)</strong></p>
</blockquote>
<h3 id="on-what39s-actually-working">On what's actually working</h3>
<blockquote>
<p><em>"The single biggest improvement we made was deleting 80% of our alerts. Not tuning them — deleting. If nobody acts on an alert for 30 days, it's gone. Our MTTA dropped by 40%."</em><br />-- <strong>SRE Manager, Gaming Company (90 engineers)</strong></p>
</blockquote>
<blockquote>
<p><em>"We stopped doing weekly on-call rotations. Moved to follow-the-sun with 3 regional teams. Burnout complaints dropped to almost zero."</em><br />-- <strong>Head of Reliability, Global SaaS (175 engineers)</strong></p>
</blockquote>
<h3 id="on-market-consolidation">On market consolidation</h3>
<blockquote>
<p><em>"With OpsGenie shutting down, we had to migrate 200+ users. We chose a Slack-native alternative that meant no context switching. Our MTTR dropped 25% in the first month."</em><br />-- <strong>DevOps Lead, Series B SaaS (75 engineers)</strong></p>
</blockquote>

<h2 id="what-this-means-for-2026">What this means for 2026</h2>
<p>The data is sobering. But the market is correcting fast, and the problems are finally measurable enough that leadership is paying attention.</p>
<h3 id="1-ai-tools-will-actually-work-finally">1. AI tools will actually work (finally)</h3>
<p>The first wave of AI tools shipped features. The second wave needs to ship outcomes.</p>
<p>The metrics that matter will change. Not "lines of code generated" or "suggestions accepted," but "did operational toil go down?" Human-in-the-loop approval for high-impact changes will become standard because nobody wants an AI deleting production databases unsupervised. And instead of one monolithic "AI assistant," we'll see specialized agents: one for triage, one for RCA, one for remediation, one for comms. Each doing one thing well.</p>
<p>The ~$9.4M/year toil cost (simplified model) is too expensive to ignore. The organizations that win here will be the ones whose AI reduces complexity rather than adding to it.</p>
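<p>The arithmetic behind that number is worth seeing. Here is one way the simplified model can pencil out; the 250-person team and the 30% toil share come from the report, while the $125K fully-loaded cost per engineer is an illustrative assumption:</p>
<pre><code class="language-python"># Simplified toil-cost model (illustrative; swap in your own numbers).
engineers = 250               # team size used in the report's example
fully_loaded_cost = 125_000   # assumed average annual cost per engineer (USD)
toil_fraction = 0.30          # share of time lost to manual toil

annual_toil_cost = engineers * fully_loaded_cost * toil_fraction
print(f"${annual_toil_cost:,.0f}/year")  # $9,375,000/year, i.e. ~$9.4M
</code></pre>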
<p><strong>Prediction (Confidence: Medium):</strong> Q2-Q3 2026. The first wave of AI that actually reduces toil ships.</p>
<h3 id="2-alert-fatigue-gets-solved-it-has-to">2. Alert fatigue gets solved (it has to)</h3>
<p>73% of organizations experienced outages because real alerts got lost in the noise. The tooling to fix this exists. Most organizations just haven't deployed it.</p>
<p>AI-powered alert correlation is shipping from Splunk, Dynatrace, and newer players. 200 alerts become 3 actionable incidents. Context-aware routing sends alerts to the right person based on who's on-call, who owns the service, who fixed it last time. Self-healing loops handle known issues (connection pool exhaustion, cache miss storms) automatically and only page humans when remediation fails.</p>
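<p>"200 alerts become 3 incidents" sounds like magic, but the core mechanic is just grouping. A toy sketch (field names are hypothetical; production correlators layer topology, dependency graphs, and ML on top of this):</p>
<pre><code class="language-python">from collections import defaultdict

# Toy alert correlation: alerts for the same service that fire inside the
# same 10-minute window collapse into one candidate incident.
def correlate(alerts, window_seconds=600):
    groups = defaultdict(list)
    for alert in alerts:
        bucket = alert["timestamp"] // window_seconds
        groups[(alert["service"], bucket)].append(alert)
    return list(groups.values())

raw = [
    {"service": "checkout", "timestamp": 1000},
    {"service": "checkout", "timestamp": 1030},
    {"service": "search", "timestamp": 1100},
]
print(len(correlate(raw)))  # 2 candidate incidents from 3 raw alerts
</code></pre>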
<p>At the org level, more teams will adopt the "30-day rule": if nobody acts on an alert for 30 days, delete it. Not tune it. Delete it. We've seen teams cut MTTA by 40%+ doing this alone.</p>
<p>The cost of ignoring alerts is now measurable. Leadership cares. Budget will follow.</p>
<p><strong>Prediction:</strong> H1 2026. Alert fatigue becomes a board-level discussion.</p>
<h3 id="3-consolidation-creates-better-tools-not-worse">3. Consolidation creates better tools (not worse)</h3>
<p>The "best-of-breed" stack era created integration hell. Seven tools, seven logins, seven contexts to switch between. Consolidation forces the industry to fix that.</p>
<p>What replaces it: platforms that handle the full incident lifecycle without context switching, that work where your team already works (Slack, Teams), and that have open APIs instead of walled gardens. Not "one tool for everything" but fewer tools that actually talk to each other.</p>
<p>The OpsGenie shutdown is forcing thousands of teams to re-evaluate their entire stack, not just find a drop-in replacement. That's a chance to fix 5+ years of accumulated tool sprawl.</p>
<p><strong>Prediction:</strong> Throughout 2026. The "great migration" happens.</p>
<h3 id="4-incident-response-becomes-a-discipline-not-just-firefighting">4. Incident response becomes a discipline (not just firefighting)</h3>
<p>Incident management has been "whoever's around figures it out" for most teams. That's changing because the cost of improvising is now visible.</p>
<p>Incident Commander is becoming a trained role, not just "whoever got paged." Runbooks are evolving from static docs into interactive decision trees ("Is the database responding? No -&gt; Try this. Yes -&gt; Check this."). And <a href="/learn/slo">SLOs</a> are going operational: 50% of organizations are investigating or implementing them (<a href="https://grafana.com/observability-survey/2025/" target="_blank" rel="noopener noreferrer">Grafana Observability Survey 2025</a>).</p>
<p>CrowdStrike and AWS showed the gap clearly. Companies that recovered in hours had playbooks. Companies that took days didn't.</p>
<p><strong>Prediction:</strong> 2026-2027. Industry-wide shift from reactive to proactive.</p>
<h3 id="5-agentic-ai-gets-real-with-guardrails">5. Agentic AI gets real (with guardrails)</h3>
<p>The "autonomous agents" hype will settle into something practical: constrained automation for known scenarios, with human escalation for everything else.</p>
<p>What that looks like: AI can restart a service. It can't delete a database without someone approving it. Triage agent, RCA agent, remediation agent, each with clear scope and boundaries.</p>
<p>In practice:</p>
<blockquote>
<p>Incident declared. Triage agent analyzes symptoms, suggests root cause. RCA agent pulls relevant logs, identifies the failing deployment. Remediation agent proposes: "Rollback to v2.3.1?" Human approves. Agent executes. Communication agent posts update to status page.</p>
</blockquote>
<p>That's 20+ minutes of coordination saved. The technology exists. The models have gotten dramatically better. 2026 is when the tooling catches up.</p>
<p><strong>Prediction:</strong> Late 2026. First production-ready agentic incident systems ship.</p>
<h2 id="the-bottom-line">The bottom line</h2>
<p>2025 was hard. Toil went up. Burnout is real. Alert fatigue is crushing teams.</p>
<p>But for the first time, the problems are measurable. And what gets measured gets fixed.</p>
<ul>
<li>~$9.4M/year in developer toil (simplified model). CFOs care now.</li>
<li>73% had outages from ignored alerts. Boards care now.</li>
<li>88% of developers work &gt;40 hours/week. Retention is threatened (<a href="https://www.harness.io/state-of-software-delivery" target="_blank" rel="noopener noreferrer">Harness, 2025</a>).</li>
</ul>
<p><strong>Prediction (Confidence: Medium):</strong> Toil drops back toward 25%. Alert noise decreases 50%+. First incident response platforms that actually reduce complexity ship in 2026.</p>

<h2 id="what-engineering-teams-should-do-in-2026">What engineering teams should do in 2026</h2>
<h3 id="if-you39re-drowning-in-alert-noise">If you're drowning in alert noise</h3>
<ol>
<li>Implement the 30-day rule: delete alerts nobody acts on for 30 days</li>
<li>Deploy correlation tools (Splunk, Dynatrace, or alternatives)</li>
<li>Measure your noise ratio. Target &lt;20%</li>
</ol>
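<p>Both the 30-day rule and the noise ratio are easy to automate if you can export per-alert metadata. A sketch, assuming each alert rule carries a last-acted-on timestamp (the field names and numbers below are made up):</p>
<pre><code class="language-python">from datetime import datetime, timedelta

# 30-day rule: if nobody has acted on an alert rule in 30 days, delete it.
now = datetime(2026, 1, 10)
last_acted_on = {
    "disk-80pct": datetime(2025, 10, 1),    # untouched for months
    "checkout-5xx": datetime(2026, 1, 9),
}
cutoff = now - timedelta(days=30)
to_delete = [name for name, last in last_acted_on.items() if last < cutoff]
print(to_delete)  # ['disk-80pct']

# Noise ratio: pages that led to no action, over all pages. Target: below 0.20.
pages, acted_on = 200, 12
print(round(1 - acted_on / pages, 2))  # 0.94 -- far above the 20% target
</code></pre>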
<h3 id="if-your-team-is-burning-out">If your team is burning out</h3>
<ol>
<li>Audit on-call rotation: are people working &gt;40 hours + on-call?</li>
<li>Implement recovery time: paged at 2 AM? Start late the next day</li>
<li>Consider compensation: $200-400/week on-call pay, or time off in lieu (TOIL)</li>
</ol>
<h3 id="if-you39re-managing-5-incident-tools">If you're managing 5+ incident tools</h3>
<ol>
<li>List everything you use for monitoring, alerting, incident response, postmortems, on-call, status pages, and chat ops</li>
<li>Calculate total cost (licenses + engineering time maintaining integrations)</li>
<li>Evaluate unified platforms. The savings are usually bigger than expected</li>
</ol>
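<p>For step 2, actually run the numbers instead of guessing. A back-of-the-envelope sketch (every figure below is a placeholder):</p>
<pre><code class="language-python"># Back-of-the-envelope incident-stack cost (all numbers are placeholders).
annual_licenses = {
    "paging": 21_000,
    "status_page": 9_000,
    "postmortems": 12_000,
    "monitoring": 60_000,
    "chatops": 15_000,
}
integration_hours_per_month = 40   # glue code, broken webhooks, upgrades
engineer_hourly_cost = 90          # assumed fully-loaded rate (USD)

annual_cost = sum(annual_licenses.values()) + (
    integration_hours_per_month * 12 * engineer_hourly_cost)
print(f"${annual_cost:,}/year")  # $160,200/year -- before the context-switching tax
</code></pre>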
<h3 id="if-you39re-migrating-from-opsgenie">If you're migrating from OpsGenie</h3>
<ul>
<li>Timeline: June 2025 = no new accounts, April 2027 = shutdown</li>
<li>Key vendors to consider: PagerDuty, incident.io, and emerging platforms</li>
<li>Prioritize Slack-native workflows, alert correlation, unified platform</li>
<li>Read our complete <a href="/blog/opsgenie-migration-guide">OpsGenie Migration Guide</a> for timelines, pricing, and step-by-step plans</li>
</ul>
<h3 id="if-you39re-investing-in-ai">If you're investing in AI</h3>
<ol>
<li>Measure toil before and after deployment</li>
<li>Implement human-in-the-loop for high-impact changes</li>
<li>Track whether operational toil actually decreased, not vanity metrics like "lines of code generated"</li>
</ol>
<p><strong>Need help?</strong> <a href="https://runframe.io/auth?mode=signup" target="_blank" rel="noopener noreferrer">Get started free</a> | <a href="/blog" target="_blank" rel="noopener noreferrer">Read our blog</a></p>


<h2 id="sources">Sources</h2>
<h3 id="industry-research-reports">Industry Research Reports</h3>
<ol>
<li><a href="https://www.splunk.com/en_us/blog/observability/state-of-observability-2025.html" target="_blank" rel="noopener noreferrer">Splunk State of Observability 2025</a> — n=1,855 professionals</li>
<li><a href="https://www.dynatrace.com/news/blog/ai-observability-business-impact-2025/" target="_blank" rel="noopener noreferrer">Dynatrace State of Observability 2025</a> — n=842 senior leaders</li>
<li><a href="https://www.pagerduty.com/newsroom/agentic-ai-survey-2025/" target="_blank" rel="noopener noreferrer">PagerDuty Agentic AI Survey 2025</a> — n=1,000 executives</li>
<li><a href="https://www.harness.io/state-of-software-delivery" target="_blank" rel="noopener noreferrer">Harness State of Software Delivery 2025</a> — n=500 practitioners</li>
<li><a href="https://www.catchpoint.com/press-releases/the-sre-report-2025-highlighting-critical-trends-in-site-reliability-engineering" target="_blank" rel="noopener noreferrer">Catchpoint SRE Report 2025</a> — n=301 professionals</li>
<li><a href="https://newrelic.com/sites/default/files/2025-09/new-relic-2025-observability-forecast-report.pdf" target="_blank" rel="noopener noreferrer">New Relic Observability Forecast 2025</a></li>
<li><a href="https://services.google.com/fh/files/misc/2024_final_dora_report.pdf" target="_blank" rel="noopener noreferrer">DORA Report 2024</a> — Google Cloud</li>
</ol>
<h3 id="additional-sources">Additional Sources</h3>
<ol>
<li><a href="https://www.atlassian.com/incident-management/2024-state-of-incident-management" target="_blank" rel="noopener noreferrer">Atlassian State of Incident Management 2024</a> — n=500+ practitioners</li>
<li><a href="https://www.pagerduty.com/blog/news-announcements/2024-state-of-digital-operations/" target="_blank" rel="noopener noreferrer">PagerDuty State of Digital Operations 2024</a></li>
<li><a href="https://www.pagerduty.com/newsroom/study-cost-of-incidents/" target="_blank" rel="noopener noreferrer">PagerDuty Cost of Incidents Study</a></li>
<li><a href="https://devops.com/survey-surfaces-high-devops-burnout-rates-despite-ai-advances/" target="_blank" rel="noopener noreferrer">DevOps.com Burnout Survey 2024</a></li>
</ol>
<h3 id="major-incidents-amp-case-studies">Major Incidents &amp; Case Studies</h3>
<ol>
<li><a href="https://www.reuters.com/technology/microsoft-says-about-85-million-its-devices-affected-by-crowdstrike-related-2024-07-20/" target="_blank" rel="noopener noreferrer">CrowdStrike Global Outage — Microsoft estimate (Reuters)</a> — July 2024</li>
<li><a href="https://www.thousandeyes.com/blog/aws-outage-analysis-october-20-2025" target="_blank" rel="noopener noreferrer">AWS US-East-1 Outage Analysis (ThousandEyes)</a> — October 2025</li>
<li><a href="https://status.openai.com/incidents/01JMYB483C404VMPCW726E8MET" target="_blank" rel="noopener noreferrer">OpenAI Outage Postmortem (OpenAI status)</a> — December 2024</li>
</ol>
<h3 id="market-news">Market News</h3>
<ol>
<li><a href="https://www.atlassian.com/software/opsgenie/migration" target="_blank" rel="noopener noreferrer">OpsGenie Shutdown - Official Atlassian Announcement</a></li>
<li><a href="https://www.solarwinds.com/company/newsroom/press-releases/solarwinds-acquires-squadcast-unifying-observability-and-incident-response" target="_blank" rel="noopener noreferrer">SolarWinds Acquires Squadcast</a></li>
<li><a href="https://www.freshworks.com/press-releases/freshworks-to-deepen-its-it-service-and-operations-portfolio-with-acquisition-of-firehydrants-ai-native-incident-management-and-reliability-platform/" target="_blank" rel="noopener noreferrer">Freshworks Acquires FireHydrant</a></li>
</ol>

<h2 id="report-highlights">Report Highlights</h2>
<blockquote>
<p>75% of organizations invest $1M+ in AI expecting 171% ROI. Operational toil rose for the first time in five years.</p>
</blockquote>
<blockquote>
<p>78% of developers spend 30%+ of their time on manual toil. For a 250-person team, that's ~$9.4M/year (simplified model).</p>
</blockquote>
<blockquote>
<p>73% of organizations had outages linked to ignored alerts (<a href="https://www.splunk.com/en_us/blog/observability/state-of-observability-2025.html" target="_blank" rel="noopener noreferrer">Splunk</a>, n=1,855). ~67% of alerts may be ignored daily (<a href="https://incident.io/blog/alert-fatigue-solutions-for-dev-ops-teams-in-2025-what-works" target="_blank" rel="noopener noreferrer">incident.io blog</a>; underlying dataset not published).</p>
</blockquote>
<blockquote>
<p>High-impact IT outages cost ~$2 million per hour. Organizations lose a median of ~$76 million annually from unplanned downtime.</p>
</blockquote>

<h2 id="about-this-report">About This Report</h2>
<p>This research was compiled by the <a href="https://runframe.io" target="_blank" rel="noopener noreferrer">Runframe</a> team. Published January 2026.</p>
<p>We're building Runframe because the problems in this report are real. If your team is dealing with alert fatigue, tool sprawl, or burnout, <a href="https://runframe.io/auth?mode=signup" target="_blank" rel="noopener noreferrer">get started free at runframe.io</a>.</p>
]]></content:encoded>
      <pubDate>Sat, 10 Jan 2026 00:00:00 GMT</pubDate>
<dc:creator>Niketa Sharma</dc:creator>
      <category><![CDATA[incident-management]]></category>
      <category><![CDATA[ai]]></category>
      <category><![CDATA[agentic-ai]]></category>
      <category><![CDATA[burnout]]></category>
      <category><![CDATA[alert-fatigue]]></category>
      <category><![CDATA[toil]]></category>
      <category><![CDATA[sre]]></category>
      <category><![CDATA[devops]]></category>
      <category><![CDATA[market-consolidation]]></category>
      <category><![CDATA[mttr]]></category>
    </item>
    <item>
      <title><![CDATA[Slack Incident Response Playbook: Roles, Scripts & Templates]]></title>
      <link>https://runframe.io/blog/incident-response-playbook</link>
      <guid>https://runframe.io/blog/incident-response-playbook</guid>
      <description><![CDATA[Your phone lights up at 2:47 AM. PagerDuty, Slack, maybe a phone call from your CEO. Something's broken. People are noticing. And you're the one who has to fix it.
We've talked to dozens of engineerin...]]></description>
      <content:encoded><![CDATA[<p>Your phone lights up at 2:47 AM. PagerDuty, Slack, maybe a phone call from your CEO. Something's broken. People are noticing. And you're the one who has to fix it.</p>
<p>We've talked to dozens of engineering teams about incidents. The thing that comes up over and over: the debugging isn't the hard part. The coordination is. See our <a href="/blog/engineering-productivity-incident-management">incident coordination guide on reducing context switching across tools and improving MTTR</a> for more on why coordination matters more than speed. <a href="/tools/mttr-calculator">Calculate your MTTR → Free MTTR Calculator</a></p>
<p>Who's in charge? What do we tell customers? Why are 15 people asking for updates in DMs? Should we call a Zoom? Is this SEV1 or SEV2? <a href="/tools/incident-severity-matrix-generator">Build your matrix → Free Severity Matrix Generator</a></p>
<p>The outage is the easy part. The chaos is what makes incidents last 3 hours instead of 30 minutes.</p>

<h2 id="what-is-incident-response">What Is Incident Response?</h2>
<p>Incident response isn't debugging. Debugging happens after.</p>
<p>Incident response is what happens the second after the alert fires:</p>
<ul>
<li><strong>Declaration</strong>: Announcing the incident and severity</li>
<li><strong>Coordination</strong>: Assigning roles (Incident Lead, Assigned Engineer)</li>
<li><strong>Investigation</strong>: Finding and fixing the root cause</li>
<li><strong>Communication</strong>: Keeping stakeholders and customers informed</li>
<li><strong>Resolution</strong>: Confirming the fix and documenting what happened</li>
</ul>
<p>Goal: restore service fast, then prevent recurrence.</p>

<h2 id="most-teams-get-this-wrong">Most Teams Get This Wrong</h2>
<p>We talked to a 40-person B2B SaaS company that got hit with a SEV0 at 3 AM. Database went down. Checkout completely broken.</p>
<p>Want to know what went wrong?</p>
<p><strong>No one declared it.</strong> People started debugging in DMs. 45 minutes in, the CEO joined Slack and asked "is anyone working on this?"</p>
<p><strong>The person debugging was also trying to coordinate.</strong> They were updating support, fielding questions from leadership, AND trying to debug. Both suffered.</p>
<p><strong>They kept saying "fixed in 5 minutes."</strong> Every 10 minutes, for 2 hours. Trust evaporated.</p>
<p>The incident dragged on not because the engineering problem was hard, but because the <em>coordination</em> was broken.</p>
<p>Same team, next SEV0? They used a clear playbook. Resolved in 52 minutes. Same engineers, different process.</p>

<h2 id="incident-response-approaches-compared">Incident Response Approaches Compared</h2>
<table>
  <caption>Comparison of incident response approaches showing speed, coordination quality, team size fit, and failure conditions</caption>
  <thead>
    <tr>
      <th>Approach</th>
      <th>Speed</th>
      <th>Coordination</th>
      <th>Works For</th>
      <th>Breaks When</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>No playbook</td>
      <td>Slow</td>
      <td>Chaotic</td>
      <td>&lt;10 people</td>
      <td>Any serious incident</td>
    </tr>
    <tr>
      <td>Ad-hoc responses</td>
      <td>Variable</td>
      <td>Inconsistent</td>
      <td>&lt;30 people</td>
      <td>Multiple concurrent incidents</td>
    </tr>
    <tr>
      <td><strong>Clear playbook (this approach)</strong></td>
      <td><strong>Fast</strong></td>
      <td><strong>Structured</strong></td>
      <td><strong>20-200 people</strong></td>
      <td><strong>Nobody follows it</strong></td>
    </tr>
    <tr>
      <td>Enterprise ITSM</td>
      <td>Slow</td>
      <td>Heavy process</td>
      <td>200+ people</td>
      <td>Too much overhead for smaller teams</td>
    </tr>
  </tbody>
</table>

<p>In our interviews, teams with clear playbooks resolved incidents 40-60% faster than teams responding ad hoc.</p>

<h2 id="what-you39ll-learn">What You'll Learn</h2>
<ul>
<li><a href="#the-first-5-minutes">The First 5 Minutes</a> - Declare, assign roles, stabilize</li>
<li><a href="#incident-response-roles-who-does-what">Incident Roles</a> - Who does what (Incident Lead, Engineer, Comms)</li>
<li><a href="#step-4-set-severity-start-the-response-timer">Severity Levels &amp; Escalation</a> - When to page, when to wait</li>
<li><a href="#incident-update-cadence-by-severity">Update Cadence</a> - How often to post updates by severity</li>
<li><a href="#customer-amp-support-communication-during-incidents">Customer Communication</a> - Support scripts and status pages</li>
<li><a href="#closing-the-incident-resolution-postmortem">Closing the Incident</a> - Resolution summary and postmortem assignment</li>
<li><a href="#incident-response-anti-patterns-to-avoid">Common Anti-Patterns</a> - What to avoid</li>
<li><a href="#quick-reference-checklist">Quick Reference</a> - Checklists and decision trees</li>
</ul>

<h2 id="what-actually-works">What Actually Works</h2>
<p>In our conversations with engineering teams, the fast ones are consistent about seven things:</p>
<table>
  <caption>Key behavioral differences between slow and fast incident response teams and their impact</caption>
  <thead>
    <tr>
      <th>Slow Teams Do</th>
      <th>Fast Teams Do</th>
      <th>Impact</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Debate severity for 10+ minutes</td>
<td>Declare in 30 seconds: "Starting at SEV2, adjust later if needed"</td>
      <td>Cuts coordination delay</td>
    </tr>
    <tr>
      <td>One person tries to coordinate + debug</td>
      <td>Split roles: Lead coordinates, Engineer fixes</td>
      <td>Lower MTTR</td>
    </tr>
    <tr>
      <td>Updates via DM or "hop on a call"</td>
      <td>Updates in channel, pinned on severity cadence</td>
      <td>Stops "any update?" pings</td>
    </tr>
    <tr>
      <td>"Should be fixed in 5 min" (repeated)</td>
      <td>"ETA unknown, investigating" then actual ETA when known</td>
      <td>Trust maintained</td>
    </tr>
    <tr>
      <td>Escalate after 30 min of silence</td>
      <td>Response timer by severity: no response → page backup → EM</td>
      <td>Faster time to fix</td>
    </tr>
    <tr>
      <td>Forget support team until postmortem</td>
      <td>Notify support immediately: "Here's your script"</td>
      <td>Support not overwhelmed</td>
    </tr>
    <tr>
      <td>End with "cool, it's fixed"</td>
      <td>Post resolution summary + assign postmortem owner</td>
      <td>Learning captured</td>
    </tr>
  </tbody>
</table>

<p>Same engineers, different process.</p>

<h2 id="the-first-5-minutes">The First 5 Minutes</h2>
<p>Incidents live or die in the first 5 minutes. Declare fast, split roles, stabilize. The rest is details.</p>
<h3 id="step-1-declare-in-30-seconds">Step 1: Declare in 30 seconds</h3>
<p>Post this in your incident channel:</p>
<pre><code class="language-slack">🚨 Incident declared. Starting at SEV2 while we investigate.
</code></pre>
<p>Don't debate severity while production is burning.</p>
<p>An EM we interviewed put it bluntly: "We lost 15 minutes once arguing SEV1 vs SEV2. Meanwhile, customers couldn't check out. Just declare it. You can always downgrade later."</p>
<p>If anyone argues, say this:</p>
<blockquote>
<p>"Let's start at SEV2. If it's worse, we escalate. If it's better, we downgrade. Arguing costs more time than fixing."</p>
</blockquote>

<h3 id="step-2-assign-roles-in-60-seconds">Step 2: Assign roles in 60 seconds</h3>
<p>If no one steps up in 60 seconds, YOU do it.</p>
<pre><code class="language-slack">👤 I'm Incident Lead. @bob is Assigned Engineer.
</code></pre>
<p>Or if someone else should lead:</p>
<pre><code class="language-slack">👤 @alice is Incident Lead. I'll assist as needed.
</code></pre>
<p>Incident Lead coordinates. Assigned Engineer fixes. Split the work.</p>
<blockquote>
<p>[!TIP]<br />If you don't pin the incident state immediately, you'll repeat yourself to every latecomer.</p>
</blockquote>

<h3 id="step-3-stabilize-first-root-cause-later">Step 3: Stabilize first, root cause later</h3>
<p>Your goal is to restore service FIRST, understand SECOND. Every minute of downtime costs money and trust. Root cause analysis comes after customers are unblocked.</p>
<p>Use this priority list:</p>
<ol>
<li><strong>Rollback</strong> - If you deployed recently, roll it back. Now.</li>
<li><strong>Failover</strong> - Switch to backup region, database, or cluster.</li>
<li><strong>Kill switch</strong> - Disable the failing feature. Stop the bleeding.</li>
<li><strong>Fix forward</strong> - Only if rollback is riskier than a patch.</li>
</ol>
<blockquote>
<p>[!IMPORTANT]<br />Fix-forward is usually slower than rollback. If it's not trivial, prefer rollback.</p>
</blockquote>

<h3 id="step-4-set-severity-start-the-response-timer">Step 4: Set severity + start the response timer</h3>
<p>Post this:</p>
<pre><code class="language-slack">🔥 SEV2 - Checkout API errors, ~40% of transactions failing
</code></pre>
<table>
  <caption>Severity level quick reference guide showing when to use each level, example scenarios, and whether to page on-call</caption>
  <thead>
    <tr>
      <th>Severity</th>
      <th>When to Use</th>
      <th>Example</th>
      <th>Page on-call?</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>SEV0</strong></td>
      <td>All customers down, business not operating</td>
      <td>Checkout completely broken, 0% transactions</td>
      <td>YES, immediately</td>
    </tr>
    <tr>
      <td><strong>SEV1</strong></td>
      <td>Major feature broken, significant impact</td>
      <td>API down, 50%+ customers affected</td>
      <td>YES, immediately</td>
    </tr>
    <tr>
      <td><strong>SEV2</strong></td>
      <td>Partial outage, some customers affected</td>
      <td>Degraded performance, ~20% affected</td>
<td>Yes, if ≥20% of requests failing for 10+ min or checkout/revenue impacted</td>
    </tr>
    <tr>
      <td><strong>SEV3</strong></td>
      <td>Minor issues, limited impact</td>
      <td>Single feature broken, &lt;5% affected</td>
      <td>No</td>
    </tr>
  </tbody>
</table>

<h4 id="escalation-rules-by-severity">Escalation rules (by severity)</h4>
<table>
  <caption>Escalation timeline by severity level showing when to page backup and engineering manager</caption>
  <thead>
    <tr>
      <th>Severity</th>
      <th>Time to backup</th>
      <th>Time to EM (if IC + backup unresponsive)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>SEV0/1</strong></td>
      <td>5 minutes</td>
      <td>10 minutes</td>
    </tr>
    <tr>
      <td><strong>SEV2</strong></td>
      <td>10 minutes</td>
      <td>30 minutes</td>
    </tr>
    <tr>
      <td><strong>SEV3+</strong></td>
      <td>Handle async</td>
      <td>Only if impact grows</td>
    </tr>
  </tbody>
</table>

<p>Use the timer. Don't hesitate. For more on escalation paths, see our <a href="/blog/on-call-rotation-guide">on-call rotation guide</a>.</p>
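<p>The escalation table is mechanical enough to hand to a bot. A sketch of the timer logic (the function and its wiring are illustrative, not a specific vendor's API):</p>
<pre><code class="language-python"># Encode the escalation table so a bot (or a human with a timer) can answer
# "who do we page now?". Thresholds are minutes without a response.
ESCALATION = {"SEV0": (5, 10), "SEV1": (5, 10), "SEV2": (10, 30)}

def next_escalation(severity, minutes_without_response):
    if severity not in ESCALATION:
        return "handle async"                 # SEV3+: no timer
    backup_after, em_after = ESCALATION[severity]
    if minutes_without_response >= em_after:
        return "page engineering manager"
    if minutes_without_response >= backup_after:
        return "page backup on-call"
    return "wait"

print(next_escalation("SEV1", 7))   # page backup on-call
print(next_escalation("SEV2", 35))  # page engineering manager
</code></pre>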

<h3 id="step-5-create-the-incident-channel">Step 5: Create the incident channel</h3>
<p>One place for updates. If you jump on a call, paste a 2–3 line summary back into the channel afterward.</p>
<p>Name it clearly: <code>#inc-checkout-api-2026-01-07</code> or <code>#incidents-123</code></p>
<p>Post this as your first message:</p>
<pre><code class="language-slack">🚨 INCIDENT DECLARED

📊 Severity: SEV2
👤 Incident Lead: @alice
🔧 Assigned Engineer: @bob
📝 Status: Investigating high error rate on checkout API
🕐 Started: 2:47 AM

💬 Updates: SEV0 10m · SEV1 15m · SEV2 15–30m · SEV3 30–60m
📌 Latest update will be pinned here
</code></pre>
<p>Pin that message. Latecomers shouldn't have to scroll.</p>

<h3 id="not-running-the-incident-stay-out-of-the-way">Not running the incident? Stay out of the way.</h3>
<p>If you're not the Incident Lead or the Assigned Engineer, here's how to help without getting in the way.</p>
<p><strong>Don't:</strong></p>
<ul>
<li>DM the assigned engineer asking for updates</li>
<li>Hop on a call uninvited</li>
<li>Offer unsolicited advice</li>
</ul>
<p><strong>Do:</strong></p>
<ul>
<li>Check the pinned message</li>
<li>Post relevant info in the channel (logs, context, recent changes)</li>
<li>Let them work</li>
</ul>
<p>The most helpful thing you can do is not add noise.</p>

<h2 id="incident-response-roles-who-does-what">Incident Response Roles: Who Does What</h2>
<p>Clear roles stop two things: silence and duplicate work.</p>
<h3 id="incident-lead-also-called-a-hreflearnincident-commanderincident-commandera">Incident Lead (also called <a href="/learn/incident-commander">Incident Commander</a>)</h3>
<p><strong>Your job:</strong></p>
<ul>
<li>Keep updates flowing (SEV0: 10m, SEV1: 15m, SEV2: 15–30m, SEV3: 30–60m)</li>
<li>Ask "what do you need?" not "what's the fix?"</li>
<li>Make the call: rollback vs fix forward, escalate vs wait, add people vs stay focused</li>
<li>Run interference so the Assigned Engineer can work</li>
</ul>
<p><strong>Your job is NOT:</strong></p>
<ul>
<li>Debugging</li>
<li>Writing code</li>
<li>Fixing the problem</li>
</ul>
<p>If you catch yourself debugging, say this:</p>
<blockquote>
<p>"I'm Incident Lead, I shouldn't be debugging. @charlie, can you take over investigation? I'll coordinate."</p>
</blockquote>

<h3 id="assigned-engineer">Assigned Engineer</h3>
<p><strong>Your job:</strong></p>
<ul>
<li>Fix the problem</li>
<li>Post updates when you have them (Incident Lead will remind you)</li>
<li>Ask for what you need</li>
</ul>
<p><strong>Your job is NOT:</strong></p>
<ul>
<li>Explaining what you're doing every 3 minutes</li>
<li>Managing the channel</li>
<li>Coordinating other people</li>
</ul>
<p>If people keep DMing you:</p>
<blockquote>
<p>"I'm heads down fixing. Check the pinned message in #incidents-123. If you need something, ping @incident-lead."</p>
</blockquote>

<h3 id="ops-lead-optional-sev01-only">Ops Lead (optional, SEV0/1 only)</h3>
<p>Add if: 3+ services failing OR 2+ teams involved OR access/permissions blocking progress</p>
<p>Don't add if: Single service, single team incident with clear path forward</p>
<pre><code class="language-slack">🛠️ Operations Lead here. Access issues? Permission problems? Coordination across teams? Ping me.
</code></pre>

<h3 id="comms-lead-optional-sev01-only">Comms Lead (optional, SEV0/1 only)</h3>
<p>Add if: SEV0/SEV1 OR need public status page OR support team getting hammered</p>
<p>Don't add if: SEV3 or no customers impacted</p>
<pre><code class="language-slack">📣 Comms Lead here. Working on support script + status page. Engineers: focus on fixing. I'll handle the "any ETA?" questions.
</code></pre>

<h3 id="scribe-recommended-for-sev0sev2">Scribe (recommended for SEV0–SEV2)</h3>
<p>Job: Capture timeline + key decisions for postmortem. In high-stakes incidents, Incident Lead is too busy to take notes.</p>

<h3 id="why-split-roles">Why split roles?</h3>
<p>One person trying to coordinate AND debug? Both suffer.</p>
<p>A 50-person fintech company told us: "Splitting roles was the single biggest improvement to our MTTR. We used to have one person doing everything - coordinating, debugging, talking to support. Both suffered. Now we split it and incidents are way shorter."</p>
<blockquote>
<p>[!TIP]<br />If nobody owns communication, customers assume the worst.</p>
</blockquote>

<h2 id="incident-update-cadence-by-severity">Incident Update Cadence by Severity</h2>
<p>Post this cadence line: <code>⏱️ UPDATE CADENCE: SEV0 10m · SEV1 15m · SEV2 15–30m · SEV3 30–60m</code></p>
<pre><code class="language-slack">📍 Current: [1 line, what users see]
🔄 Next: [specific action you're taking]
⏱️ ETA: [time or "unknown"]
🚫 Blockers: [what's blocking, or "None"]

(Next update at: [time])
</code></pre>
<p>Every time you post an update, pin it. Remove the old pin.</p>
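<p>The "(Next update at:)" line is trivial to compute, which means a bot can nag for you. A sketch using the cadence above (taking the tighter bound for the ranged cadences):</p>
<pre><code class="language-python">from datetime import datetime, timedelta

# Minutes between updates per severity; SEV2 (15-30m) and SEV3 (30-60m)
# use the tighter bound here.
CADENCE_MINUTES = {"SEV0": 10, "SEV1": 15, "SEV2": 15, "SEV3": 30}

def next_update_at(severity, posted_at):
    return posted_at + timedelta(minutes=CADENCE_MINUTES[severity])

print(next_update_at("SEV1", datetime(2026, 1, 7, 2, 47)))  # 2026-01-07 03:02:00
</code></pre>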

<h2 id="escalation-use-the-timer">Escalation: Use the Timer</h2>
<p>Use the timer. Don't hesitate.</p>
<p><strong>For "no response from IC":</strong> SEV0/1 → backup 5 min, EM 10 min. SEV2 → backup 10 min, EM 30 min. SEV3 → async.</p>
<p><strong>For blocked decisions / multi-team:</strong> Page EM immediately.</p>
<p>If someone hesitates to escalate:</p>
<blockquote>
<p>"This isn't about bothering people. It's about fixing the problem. If they're asleep and unresponsive, we need someone who isn't."</p>
</blockquote>

<h2 id="incident-response-timeline-example">Incident Response Timeline Example</h2>
<ul>
<li><strong>02:13</strong> — PagerDuty: high error rate in <code>checkout-api</code></li>
<li><strong>02:14</strong> — SEV1 declared in #incidents</li>
<li><strong>02:15</strong> — @alice takes Incident Lead, @bob is Assigned Engineer</li>
<li><strong>02:16</strong> — #inc-checkout-api-2026-01-07 created, incident state pinned</li>
<li><strong>02:21</strong> — Rollback decision (recent deploy noticed)</li>
<li><strong>02:28</strong> — Customer update posted + support script sent</li>
<li><strong>02:35</strong> — Rollback complete, errors dropping</li>
<li><strong>02:41</strong> — Stabilized, monitoring</li>
<li><strong>03:05</strong> — Resolved, postmortem owner assigned</li>
</ul>
<p>52 minutes total. The key wasn't brilliant debugging. It was clear roles, regular updates, fast rollback.</p>

<h2 id="customer-amp-support-communication-during-incidents">Customer &amp; Support Communication During Incidents</h2>
<p>Support messages need four things: <strong>Issue / Customer impact / Action / Next update time</strong></p>
<p>SUPPORT SCRIPT:</p>
<pre><code>Issue: We're investigating an issue affecting [service/feature]
Impact: [who is affected + what they can't do]
Status: [investigating / identified / mitigating / monitoring]
Workaround: [if any, otherwise "None at this time"]
Next update: [time] (we'll post again even if ETA is unknown)
</code></pre>
<h3 id="status-page-updates">Status page updates</h3>
<table>
  <caption>Guidelines for when to post public status page updates by severity level</caption>
  <thead>
    <tr>
      <th>Severity</th>
      <th>Post public status update?</th>
      <th>What to say</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>SEV0</strong></td>
      <td>YES, immediately</td>
      <td>"We're investigating an issue affecting [service]. More details soon."</td>
    </tr>
    <tr>
      <td><strong>SEV1</strong></td>
      <td>YES</td>
      <td>"We're investigating degraded performance on [feature]."</td>
    </tr>
    <tr>
      <td><strong>SEV2</strong></td>
      <td>Probably</td>
      <td>If enough customers impacted, post an update</td>
    </tr>
    <tr>
      <td><strong>SEV3</strong></td>
      <td>No</td>
      <td>Minor issues don't need public posts</td>
    </tr>
  </tbody>
</table>

<p><strong>Status page progression:</strong></p>
<ol>
<li>"We're investigating"</li>
<li>"Identified the issue"</li>
<li>"Fixing"</li>
<li>"Resolved"</li>
</ol>
<h3 id="internal-stakeholders">Internal stakeholders</h3>
<p>Management will ask for updates. Give them a summary, don't let them micromanage.</p>
<p>Post this in #incidents-leadership or DM your EM:</p>
<pre><code class="language-slack">👔 LEADERSHIP UPDATE

Incident: [Brief description]
Severity: [SEV0/1/2/3]
Status: [What's happening]
Who's fixing: @assigned-engineer
ETA: [If known]
Need anything: [What you need from leadership, or "Nothing, just keeping you informed"]
</code></pre>
<p>If leadership starts micromanaging:</p>
<blockquote>
<p>"I understand this is stressful. The best thing you can do is let the team focus. I'll post an update in 15 minutes."</p>
</blockquote>

<h2 id="closing-the-incident-resolution-amp-postmortem">Closing the Incident: Resolution &amp; Postmortem</h2>
<p>Without proper closure, you're just firefighting. With it, you have an actual incident process.</p>
<p>Resolution needs: <strong>What broke / Why / What fixed / Preventing recurrence + postmortem owner + due date</strong></p>
<p>✅ RESOLUTION SUMMARY:</p>
<pre><code>What broke: [system/component]
Customer impact: [who/what/how long]
Why it broke: [cause, or "unknown"]
What fixed it: [rollback/fix/flag/scale]
What we'll do to prevent it: [1–3 bullets]

📝 Postmortem owner: @name
⏰ Postmortem due: [date, local time]
📎 Links: [incident channel] [dashboards] [PRs] [status page]
</code></pre>
<h3 id="assign-postmortem-owner">Assign postmortem owner</h3>
<p>NOT necessarily the Incident Lead. They're probably tired.</p>
<pre><code class="language-slack">📝 POSTMORTEM

@bob — you're up. Postmortem due by end of next business day (local time).
Focus on: What happened, why it happened, how to prevent it.
Incident timeline is in the pinned message.
</code></pre>
<p>Use our <a href="/blog/post-incident-review-template">post-incident review templates</a> to make postmortems faster.</p>
<p>If anyone pushes back:</p>
<blockquote>
<p>"No deadline = no postmortem. Even a rough draft is better than nothing. End of next business day. If you need help, ask."</p>
</blockquote>
<h3 id="close-the-incident">Close the incident</h3>
<pre><code class="language-slack">🔚 INCIDENT CLOSED

Thanks everyone. Clearing roles.
Channel will be archived in 24 hours (or per policy).
Postmortem discussion will happen in #postmortem-api-outage-2026-01-07
</code></pre>

<h2 id="incident-response-anti-patterns-to-avoid">Incident Response Anti-Patterns to Avoid</h2>
<p>These patterns show up in almost every team we talk to.</p>
<h3 id="hero-mode">Hero mode</h3>
<p>One person trying to fix everything alone. "I've got this."</p>
<p>Problem: Burnout and slower resolution. One person at 3 AM after 4 hours misses things that two fresh people would catch.</p>
<p>If you see hero mode:</p>
<pre><code class="language-slack">🛑 @hero-engineer — you've been at this for 3 hours. Take a break. @backup-1 is taking over investigation for the next hour.
</code></pre>

<h3 id="silent-debugging">Silent debugging</h3>
<p>No updates for 45 minutes while people wonder what's happening.</p>
<p>Problem: Latecomers ask the same questions over and over. Stakeholders DM random engineers.</p>
<p>If you see silent debugging:</p>
<pre><code class="language-slack">⏰ @assigned-engineer — haven't seen an update in 30 minutes. Can you post a status? Even "still investigating" helps.
</code></pre>

<h3 id="blame-hunting">Blame hunting</h3>
<p>"Who deployed this?" "Who wrote this code?"</p>
<p>Problem: Kills psychological safety. People hide incidents next time. Problems get worse.</p>
<p>If you see blame hunting:</p>
<pre><code class="language-slack">🛑 STOP.

We don't care who deployed this. We care about:
1. What broke
2. Why it broke
3. How to fix it
4. How to prevent it

Save the "who" for the postmortem, and even then focus on systems, not people.
</code></pre>
<p>This maintains a <a href="/learn/blameless-postmortem">blameless culture</a> where people feel safe reporting issues.</p>

<h3 id="meeting-while-it39s-burning">Meeting while it's burning</h3>
<p>"Hop on a Zoom call" before you even know what's broken.</p>
<p>Problem: 10 people staring at each other while 1 person types. 9 people could be doing something useful.</p>
<p>A <a href="/learn/war-room">war room</a> meeting during active mitigation is usually a coordination failure. Investigate first. Figure out what's broken. Only call a meeting if you need rapid, multi-person back-and-forth.</p>

<h3 id="optimism-bias">Optimism bias</h3>
<p>"Should be fixed in 5 minutes" - repeated every 5 minutes for an hour.</p>
<p>Problem: Repeated missed ETAs destroy trust.</p>
<p>Say this instead:</p>
<pre><code class="language-slack">⏱️ ETA: Unknown. Investigating.
</code></pre>

<h2 id="quick-reference-checklist">Quick Reference Checklist</h2>
<p><strong>FIRST 5 MINUTES:</strong></p>
<ul>
<li> Declare it: "This is an incident, SEV2"</li>
<li> Name Incident Lead: "I'm taking Incident Lead" or "@alice is Incident Lead"</li>
<li> Name Assigned Engineer: "@bob is Assigned"</li>
<li> Pick severity (use cheat sheet)</li>
<li> Create channel: <code>#inc-name-date</code></li>
<li> Post template and pin it</li>
</ul>
<p><strong>DECISION TREE:</strong></p>
<pre><code>3+ services failing? → Add Ops Lead
2+ teams involved? → Add Ops Lead
SEV0/SEV1? → Add Comms Lead, page immediately
SEV2? → Updates every 15-30 min
SEV3? → Updates every 30-60 min
Missed update interval (SEV0/1)? → Page backup/EM
Missed update interval (SEV2)? → Check in
Stuck? → Say it early, page expert
</code></pre>
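For teams that script their incident tooling, the staffing half of this tree can be expressed as a small function. A hedged sketch (role names come from the tree above; the function and its parameters are illustrative):
<pre><code class="language-python">def extra_roles(services_failing, teams_involved, severity):
    """Roles to add beyond Incident Lead + Assigned Engineer, per the tree."""
    roles = []
    if services_failing >= 3 or teams_involved >= 2:
        roles.append("Ops Lead")
    if severity in ("SEV0", "SEV1"):
        roles.append("Comms Lead")
    return roles

print(extra_roles(3, 1, "SEV2"))  # ['Ops Lead']
print(extra_roles(1, 1, "SEV0"))  # ['Comms Lead']
</code></pre>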
<p><strong>UPDATE TEMPLATE:</strong></p>
<pre><code>📍 Current: [1 line]
🔄 Next: [specific action]
⏱️ ETA: [time or "unknown"]
🚫 Blockers: [what's blocking or "None"]
</code></pre>
<p><strong>ESCALATION:</strong></p>
<pre><code>SEV0/1: 5 min → "@backup — you're up"
SEV0/1: 10 min → "@em — need escalation"
SEV2: 10 min → "@backup — you're up"
SEV2: 30 min → "@em — need escalation"
</code></pre>
<p><strong>CLOSEOUT:</strong></p>
<pre><code>✅ What broke, why, what fixed it, preventing recurrence
📝 Postmortem owner + deadline
🔚 Close incident
</code></pre>

<h2 id="the-bottom-line">The Bottom Line</h2>
<p>After talking to dozens of teams about their incidents, the same pattern keeps showing up: the teams that are good at this keep it simple.</p>
<p>Running a good incident isn't about frameworks. It's about five things:</p>
<ol>
<li><strong>Declare fast</strong>: 30 seconds, not 10 minutes. You can always downgrade.</li>
<li><strong>Name roles</strong>: Incident Lead coordinates, Assigned Engineer fixes. Split the work.</li>
<li><strong>Update regularly</strong>: On the severity cadence, pinned. No silent debugging.</li>
<li><strong>Escalate when stuck</strong>: Use the response timer. Don't hero alone.</li>
<li><strong>Close properly</strong>: Resolution summary, postmortem owner, done.</li>
</ol>
<p>The best teams don't over-engineer. They don't have 50-page <a href="/learn/runbook">runbooks</a>. They have a simple, repeatable playbook. Not sure which one you need? See <a href="/blog/runbook-vs-playbook">Runbook vs Playbook: the difference explained</a>.</p>
<p>Keep it simple.</p>

<h2 id="faq">FAQ</h2>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    How long should an incident last?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Timelines vary. If a SEV2 is running &gt;4 hours, reassess severity, staffing, and rollback/failover options.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Do we need a call for every incident?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    No. Most incidents are better handled async in Slack. Calls make sense when you need rapid back-and-forth (usually SEV0/SEV1 with multiple teams).
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    When should we page people?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    SEV0/SEV1 always. SEV2 only if ≥20% requests failing for 10+ min or checkout/revenue impacted, or if you're stuck. Otherwise handle async.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What if we can't find the root cause?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Write "unknown" in the postmortem and make investigation an action item. Honesty beats guessing.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Do we need a postmortem for every incident?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    No. SEV3s might just need a short note. SEV0/SEV1 should always get a proper postmortem. SEV2s are a judgment call—did we learn anything?
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What's the difference between Incident Lead and Assigned Engineer?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Incident Lead coordinates communication, makes decisions, and keeps the incident moving. Assigned Engineer fixes the problem. Split the work so the person debugging can focus without interruption.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    How do I decide severity level?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Declare first, debate later. Start with SEV2 if you're unsure. You can always escalate or downgrade. Don't waste 10 minutes debating SEV1 vs SEV2 while production is broken.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What if someone refuses to be Incident Lead?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    If no one steps up in 60 seconds, YOU do it. "I'm taking Incident Lead." Someone will likely speak up if they disagree. The cost of 30 seconds of wrong leadership is zero compared to 30 minutes of no leadership.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What should I tell customers during an incident?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Four things: what's broken, who's affected, what we're doing, and when the next update comes. Even if ETA is unknown, say "next update in 15 minutes" and follow through.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Should we use a war room or handle incidents in Slack?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Most incidents (SEV2-SEV3) are better handled async in Slack. Reserve war rooms/calls for SEV0-SEV1 incidents with multiple teams where rapid back-and-forth is essential.
  </div>
</details>

<p><strong>Want more?</strong></p>
<ul>
<li><a href="/blog/post-incident-review-template">Post-Incident Review Templates: What Works (3 Ready-to-Use)</a> — Copy-paste templates for postmortems</li>
<li><a href="/blog/on-call-rotation-guide">On-Call Rotation: Primary + Backup Schedule, Escalation Rules, and Handoffs</a> — How to set up on-call that doesn't suck</li>
<li><a href="/blog/scaling-incident-management">Scaling Incident Management: What We Learned from 25+ Teams</a> — Research on how teams evolve incident management</li>
</ul>

<h2 id="looking-for-incident-response-automation">Looking for Incident Response Automation?</h2>
<p>We're building <a href="/slack">Runframe</a> to automate this playbook in Slack: automatic on-call paging, structured incident channels, forced update cadence, and timeline capture—all without leaving Slack.</p>
<p><a href="https://runframe.io/auth?mode=signup" target="_blank" rel="noopener noreferrer">Get started free</a></p>

]]></content:encoded>
      <pubDate>Wed, 07 Jan 2026 00:00:00 GMT</pubDate>
      <author>Niketa Sharma</author>
      <category><![CDATA[incident-response]]></category>
      <category><![CDATA[incident-management]]></category>
      <category><![CDATA[incident-lead]]></category>
      <category><![CDATA[on-call]]></category>
      <category><![CDATA[sre]]></category>
      <category><![CDATA[devops]]></category>
      <category><![CDATA[incident-response-playbook]]></category>
      <category><![CDATA[production-incident]]></category>
      <category><![CDATA[incident-response-template]]></category>
      <category><![CDATA[incident-commander]]></category>
      <category><![CDATA[slack-incident-management]]></category>
      <category><![CDATA[mttr]]></category>
    </item>
    <item>
      <title><![CDATA[On-Call Rotation: Schedules, Handoffs & Templates]]></title>
      <link>https://runframe.io/blog/on-call-rotation-guide</link>
      <guid>https://runframe.io/blog/on-call-rotation-guide</guid>
      <description><![CDATA[On a call last month, an engineering manager said:

"We have an on-call schedule in a Google Sheet. The problem is, nobody looks at it. When something breaks at 2 AM, everyone waits for someone else t...]]></description>
      <content:encoded><![CDATA[<p>On a call last month, an engineering manager said:</p>
<blockquote>
<p>"We have an on-call schedule in a Google Sheet. The problem is, nobody looks at it. When something breaks at 2 AM, everyone waits for someone else to speak up first. By the time someone actually responds, you've lost 20 minutes."</p>
</blockquote>
<p>That's the moment the "informal" system starts costing real minutes. "Whoever's around" can work at 10–15 people. Around 40–50 people, it starts failing in predictable ways.</p>
<p>You have two options: keep winging it, or put in a rotation that's boring, explicit, and repeatable.</p>
<p>Across dozens of conversations, the teams that avoid burnout tend to converge on the same structure. Here's what works.</p>
<p><strong>TL;DR:</strong> Primary + backup (weekly). No-response rule (5 min). Written handoff (2 min). Visible in Slack daily. Recovery after overnight pages.</p>
<p><strong>This guide includes:</strong></p>
<ul>
<li>3 copy-paste templates (handoff, escalation, rotation schedule)</li>
<li>Severity matrix (SEV-0 through SEV-3)</li>
<li>Compensation benchmarks ($200-500/week)</li>
<li>When to use spreadsheets vs tools</li>
<li>8 FAQ covering real edge cases</li>
</ul>
<p>Based on conversations with 25+ engineering teams. Bookmark this; you'll come back to it.</p>

<h2 id="what-is-on-call-rotation">What Is On-Call Rotation?</h2>
<p>On-call rotation is a scheduled system where your <strong>incident response team</strong> takes turns being the primary responder for production incidents. It includes:</p>
<ul>
<li><strong>Primary responder</strong> - First person contacted when something breaks</li>
<li><strong>Backup responder</strong> - Steps in if primary doesn't respond in 5 minutes</li>
<li><strong>Clear escalation rules</strong> - When and how to page backup or manager. See: <a href="/learn/escalation-policy">escalation policy</a></li>
<li><strong>Defined time boundaries</strong> - Usually weekly (Monday 9 AM → Monday 9 AM)</li>
<li><strong>Written handoffs</strong> - 2-minute transfer of context between shifts</li>
</ul>
<p>The goal: 24/7 coverage without burning out any single person.</p>
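One way to make "who's on call" deterministic is to derive it from the calendar instead of a spreadsheet. A minimal sketch assuming a simple ISO-week rotation (the roster names are made up; real schedules also need overrides for vacations and swaps):
<pre><code class="language-python">from datetime import date

ROSTER = ["alice", "bob", "carol", "dave"]  # illustrative names

def on_call(today):
    """Weekly rotation: primary changes each ISO week; backup is next in line."""
    week = today.isocalendar()[1]  # ISO week number
    primary = ROSTER[week % len(ROSTER)]
    backup = ROSTER[(week + 1) % len(ROSTER)]
    return primary, backup

primary, backup = on_call(date(2026, 1, 7))
print(f"On-call: @{primary} · Backup: @{backup}")
</code></pre>
Because the assignment is a pure function of the date, everyone computes the same answer, and there is never a gap between shifts.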

<h2 id="on-call-rotation-approaches-compared">On-Call Rotation Approaches Compared</h2>
<table>
  <caption>Comparison of on-call rotation approaches showing team size fit, failure point, and why each approach fails</caption>
  <thead>
    <tr>
      <th>Approach</th>
      <th>Works For</th>
      <th>Breaks At</th>
      <th>Why It Fails</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>"Whoever's around"</td>
      <td>&lt;15 people</td>
      <td>40+ people</td>
      <td>Assumes everyone knows who to call</td>
    </tr>
    <tr>
      <td>Solo on-call</td>
      <td>Almost never</td>
      <td>Immediately</td>
      <td>No backup when they're unavailable</td>
    </tr>
    <tr>
      <td>Daily rotation</td>
      <td>Rarely</td>
      <td>Always</td>
      <td>Constant anxiety, no clean "off" time</td>
    </tr>
    <tr>
      <td><strong>Weekly primary + backup</strong></td>
      <td><strong>20-100 people</strong></td>
      <td><strong>Rarely (if done right)</strong></td>
      <td><strong>Only if you skip recovery time</strong></td>
    </tr>
    <tr>
      <td>Enterprise tools</td>
      <td>100+ people</td>
      <td>Cost-sensitive &lt;100</td>
      <td>Overkill for team size</td>
    </tr>
  </tbody>
</table>


<h2 id="why-on-call-breaks-as-teams-grow">Why On-Call Breaks as Teams Grow</h2>
<p>These were the most common failure modes:</p>
<p><strong>Solo on-call.</strong> One person is "it" for the week. If they're sick, unreachable, or asleep through a page, you lose time fast. One 30-person team told me their on-call engineer was out sick mid-week; an incident ran for 3 hours before someone finally called the CTO directly, because nobody knew who else to escalate to. Everyone paid for the ambiguity.</p>
<p><strong>Office-hours-only coverage.</strong> "Maria is on-call 9–5." Then production breaks at 8 PM and people hesitate because "it's not covered." The "schedule" becomes an excuse to delay escalation.</p>
<p><strong>Unknown escalation path.</strong> Who do you call when on-call doesn't respond? A Series B company wasted 45 minutes during a database outage because nobody knew who to escalate to. They had a backup on paper, but nobody could name them under pressure.</p>
<p><strong>Daily rotations.</strong> They look fair, but they keep people anxious because they're always "up next." You never get a clean "off" period. One team tried this and morale collapsed within weeks.</p>
<p><strong>On-call as punishment.</strong> "You broke it, you're on-call." I heard this from three teams. It teaches people to delay reporting and quietly patch around problems.</p>
<p><strong>No compensation or recovery time.</strong> Three teams told me they expected engineers to do on-call "as part of the job" with no stipend, no comp time, no acknowledgment. Two had someone quit within 6 months specifically citing on-call burden as the reason.</p>

<h2 id="the-worst-on-call-setup-i39ve-seen">The Worst On-Call Setup I've Seen</h2>
<p>A 35-person startup had a monthly rotation with no backup and no escalation path. One person was expected to be available 24/7 for 30 days straight.</p>
<p>Three things happened:</p>
<p><strong>Their best senior engineer quit after two rotations.</strong> "I couldn't plan anything for a month at a time. Every weekend was 'maybe I'll get paged, maybe not.' I couldn't commit to anything."</p>
<p><strong>During one rotation, the on-call was at a wedding with no cell service.</strong> A database failure went undetected for 4 hours. Customers started emailing support before the team even knew there was a problem.</p>
<p><strong>Junior engineers started refusing to do on-call.</strong> The rotation fell apart. The VP of Engineering personally covered 3 months straight until they redesigned it.</p>
<p>They switched to weekly rotations with backup. Turnover dropped. Nobody quit over on-call again.</p>
<p>Don't do monthly solo on-call. Just don't.</p>

<h2 id="why-teams-move-away-from-pagerduty-and-opsgenie">Why Teams Move Away From PagerDuty and Opsgenie</h2>
<p><strong>Migrating from OpsGenie?</strong> <a href="/blog/opsgenie-migration-guide">Read our complete migration guide with timelines, pricing, and step-by-step plans</a>.</p>
<p>Before we get to what works, here's what doesn't: enterprise on-call tools for teams under 100 people.</p>
<p>The teams we talked to had similar complaints:</p>
<p><strong>"Too complex for our size."</strong> A 40-person team: "PagerDuty has features we'll never use. We just need scheduling and escalation."</p>
<p><strong>"Expensive for what we need."</strong> Another team: "We're paying $50+/seat. For our size, that's overkill."</p>
<p><strong>"Not where we work."</strong> Multiple teams: "Our team lives in Slack. PagerDuty feels like another tool to check."</p>
<p>Most teams sit in this gap: too big for spreadsheets, too small (or too budget-conscious) for PagerDuty.</p>

<h2 id="an-on-call-rotation-setup-that-prevents-burnout">An On-Call Rotation Setup That Prevents Burnout</h2>
<p>Most sustainable setups look like:</p>
<h3 id="primary-backup-escalation-rules">Primary + Backup + Escalation Rules</h3>
<p>The primary is the first person to respond when something breaks. If the primary hasn't responded in 5 minutes, page the backup (any severity). If the backup hasn't responded in another 5 minutes, escalate to the engineering manager for SEV-0/SEV-1. For SEV-2 and below, escalate at 30 minutes (or next business hours), unless impact increases.</p>
<p>A 40-person fintech team told me: "Primary for the week, backup as a safety net. The rule is simple enough that nobody argues in the moment."</p>
<p>The 5-minute rule is for <em>no response</em>, not technical escalation. It removes hesitation: when nobody responds, the clock decides. It forces visibility: if nobody responds, you've found a broken escalation path-fast.</p>
<p>Backup should be lower load by design. They're not expected to hover-just to be reachable. This fairness matters-backup burns people out less than being solo on-call.</p>
<p><strong>Severity levels guide escalation timing:</strong></p>
<table>
  <caption>Severity levels response targets and escalation rules</caption>
  <thead>
    <tr>
      <th>Severity</th>
      <th>Description</th>
      <th>Response Target</th>
      <th>Escalation Rule</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>SEV-0</strong></td>
      <td>Complete outage, all customers down</td>
      <td>Immediate</td>
      <td>5 min → backup, 10 min → EM</td>
    </tr>
    <tr>
      <td><strong>SEV-1</strong></td>
      <td>Major feature down, significant impact</td>
      <td>&lt;5 minutes</td>
      <td>5 min → backup, 10 min → EM</td>
    </tr>
    <tr>
      <td><strong>SEV-2</strong></td>
      <td>Minor feature down, some users affected</td>
      <td>&lt;15 minutes</td>
      <td>30 min or next business day</td>
    </tr>
    <tr>
      <td><strong>SEV-3</strong></td>
      <td>Degraded performance, no customer impact</td>
      <td>Next business day</td>
      <td>No escalation needed</td>
    </tr>
  </tbody>
</table>

<p>Use these response targets to maintain <strong>SLA compliance</strong> for your customers while protecting your team from burnout.</p>
<p>(More on compensation and recovery time below-it matters more than most teams realize.)</p>
<p><strong>Page Policy (to prevent burnout):</strong></p>
<p>Page only for customer impact, data loss risk, security, or hard downtime. Everything else becomes a ticket for business hours.</p>
<h3 id="weekly-rotations-default-for-most-teams">Weekly Rotations (Default for Most Teams)</h3>
<p>Daily rotations are too stressful. Monthly rotations are too long. Weekly is the simplest cadence most teams can sustain.</p>
<p>"The Monday handoff became a predictable ritual. Everyone knew their week was coming and could plan around it," a staff engineer told me.</p>
<p>Some teams move to 2-week rotations once they have enough redundancy. Weekly is still the default for most.</p>
<h3 id="time-zones-don39t-page-people-at-2-am-local-time">Time Zones: Don't Page People at 2 AM Local Time</h3>
<p>If your team spans time zones, on-call needs to account for that.</p>
<p>A global team (SF/London/Singapore) told me: "We used to have one global on-call. The person in SF was getting paged at 2 AM constantly. They fixed it with regional coverage blocks. SF covers SF hours. London covers EMEA. Singapore covers APAC. Much more humane."</p>
<p>If you can't do regional coverage, align on-call with your riskiest window (deploys, peak traffic, known batch jobs). If you're doing a big deploy on Friday, the on-call that week is someone who's around Friday, not someone taking Friday off.</p>
<p>Rule of thumb: if you routinely page someone at 2 AM their time, the system is mis-designed (rotation, alerts, or both).</p>
<h3 id="handoffs-2-minutes-written-in-public">Handoffs: 2 Minutes, Written, In Public</h3>
<p>The teams that scale on-call keep handoff friction close to zero. Outgoing posts a short handoff note: what paged, what's unresolved, what to watch. Incoming replies to confirm ownership. If someone misses handoff, they post as soon as they're online (no silent gaps).</p>
<p>A 30-person infrastructure team: "Our handoff takes 2 minutes. Post what happened, acknowledge receipt, done. The teams that struggled had handoff meetings that nobody attended. Friction kills adoption."</p>
<p>These handoffs feed directly into <a href="/blog/post-incident-review-template">post-incident reviews</a>: document what happened so the whole team learns.</p>
<h3 id="make-quotwho39s-on-callquot-impossible-to-miss">Make "Who's On-Call?" Impossible to Miss</h3>
<p>The most common complaint I heard: "Nobody knows who's on-call."</p>
<p>The fix: make it visible where the work happens. Put it in Slack: channel topic + pinned message + a daily post. Ensure incident declaration tags the primary (and names the backup).</p>
<p>Pattern that works: a bot posts daily in #incidents - "On-call: @primary · Backup: @backup". That's it. Now everyone knows who to ping.</p>
<p>The teams that struggled had the information hidden in a spreadsheet. The teams that worked made it impossible to miss.</p>
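<p>The daily bot post is small enough to build yourself. A minimal Python sketch using a Slack incoming webhook (the webhook URL below is a placeholder, and the scheduling itself is left to cron or any daily scheduler):</p>
<pre><code class="language-python">import json
import urllib.request

# Placeholder -- replace with your own Slack incoming-webhook URL.
WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"

def daily_oncall_post(primary: str, backup: str) -> str:
    """Build the one-line daily post for #incidents."""
    return f"On-call: @{primary} · Backup: @{backup}"

def post_to_slack(text: str) -> None:
    """Send a plain-text message via a Slack incoming webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    req = urllib.request.Request(
        WEBHOOK_URL, data=body,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # raises on a non-2xx response

# Run from cron (or similar) every weekday morning:
# post_to_slack(daily_oncall_post("alice", "bob"))
</code></pre>
<p>That's the whole "Slack layer" many teams need before reaching for dedicated tooling.</p>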
<h3 id="compensation-and-recovery-time">Compensation and Recovery Time</h3>
<p>This came up in almost every conversation: on-call deserves recognition.</p>
<p><strong>Money is the clearest signal.</strong> What I saw teams actually doing: $200-300/week for startups under 50 people, $400-500/week at larger companies. It's direct, it's fair, and it acknowledges that on-call is work outside normal hours.</p>
<p><strong>If you can't do stipends, recovery time is non-negotiable.</strong> If you get paged overnight, start later or take the morning off; no permission needed. Many teams offer TOIL (time off in lieu): if you spend 2 hours at 2 AM fixing an incident, you get 2+ hours off to recover. This directly addresses burnout.</p>
<p><strong>Other recognition patterns:</strong> No on-call before or after vacations. Swap-friendly so people can trade shifts if they have conflicts. Public acknowledgment of on-call contributions.</p>
<p>A 25-person startup: "We give $200/week for on-call plus a comp day if paged overnight. It's not about the money. It's about recognizing the burden."</p>
<p>On-call has a real cost. If you can't pay for it, at minimum give time back. If you ignore both, you'll pay for it in attrition.</p>
<h3 id="why-quotfollow-the-sunquot-on-call-is-usually-overrated">Why "Follow the Sun" On-Call Is Usually Overrated</h3>
<p>A lot of advice says: "If you have global teams, do follow-the-sun on-call where each region covers their hours." Sounds great in theory. In practice? Many teams under 100 people don't need true follow-the-sun.</p>
<p><strong>It can fragment context.</strong> When APAC hands off to EMEA, which hands off to the US, context gets lost. "Redis was flaky" becomes "something was weird" and the thread resets. One team told me: "We tried follow-the-sun. Half our incidents got worse because the person picking it up had no context."</p>
<p><strong>It can hide a noisy-alert problem.</strong> If you're getting paged at 3 AM every night, the issue isn't your rotation; it's your monitoring. This causes <a href="/learn/alert-fatigue">alert fatigue</a>, where your team stops responding because they're conditioned to ignore pages. Reduce pages first: tighten alerting, add runbooks, automate common fixes. Don't build a 24/7 rotation to work around noisy alerts.</p>
<p><strong>Regional coverage is often enough.</strong> You don't need "follow the sun." You need "don't wake up someone at 2 AM in their timezone." Have a US on-call and an EMEA on-call. That covers 16+ hours. For the gap, either accept delayed response or rotate who covers it.</p>
<p>Exception: if you have true 24/7 SLAs <em>and</em> real usage across all time zones, follow-the-sun can be worth the complexity. But most startups have follow-the-sun guilt, not follow-the-sun need.</p>
<p>For more on managing incidents at scale, read our <a href="/blog/engineering-productivity-incident-management">engineering productivity guide</a>.</p>

<h2 id="on-call-rotation-template">On-Call Rotation Template</h2>
<p>This works for 20-100 person teams. Adapt it to your needs.</p>
<p><strong>Setup time:</strong> ~10 minutes if you keep it simple.</p>
<h3 id="the-setup">The Setup</h3>
<p>Set a clear boundary: Monday 9 AM → Monday 9 AM (local time). Coverage is primary (first responder) plus backup (5-min escalation). Handoff happens Monday morning in #on-call (written, not a meeting).</p>
<p><strong>Example rotation for 6 engineers:</strong></p>
<pre><code>Week 1: Alice (primary), Bob (backup)
Week 2: Bob (primary), Charlie (backup)
Week 3: Charlie (primary), Dana (backup)
Week 4: Dana (primary), Evan (backup)
Week 5: Evan (primary), Fiona (backup)
Week 6: Fiona (primary), Alice (backup)
[Repeat]
</code></pre>
<p>For larger teams, add more people first; only then consider 2-week rotations.</p>
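<p>If you'd rather generate the schedule than maintain it by hand, the pattern above is a few lines of Python. A sketch assuming the next engineer in the list serves as backup before taking primary the following week (names and dates are illustrative):</p>
<pre><code class="language-python">from datetime import date, timedelta

def rotation(engineers, start: date, weeks: int):
    """Weekly (monday, primary, backup) tuples. The next engineer
    in the list is backup, then takes primary the following week,
    so they enter their primary week already warm."""
    n = len(engineers)
    return [
        (start + timedelta(weeks=i),
         engineers[i % n],          # this week's primary
         engineers[(i + 1) % n])    # next week's primary as backup
        for i in range(weeks)
    ]

for monday, primary, backup in rotation(
        ["alice", "bob", "charlie"], date(2026, 1, 5), 4):
    print(f"Week of {monday}: {primary} (primary), {backup} (backup)")
</code></pre>
<p>Print the next quarter once, paste it into the pinned message, and regenerate when the roster changes.</p>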
<h3 id="handoff-message-template">Handoff Message Template</h3>
<p>Every Monday morning, the outgoing on-call posts in #on-call:</p>
<pre><code>👋 On-Call Handoff - Week of Jan 13 (Mon 9 AM → Mon 9 AM)

Outgoing: @alice → Incoming: @bob

Pages / incidents this week:
- Tuesday: Database alert, false positive
- Thursday: API latency, fixed by restarting cache

Notes for next week:
- Cache has been flaky, keep an eye on it
- Check the [runbook](/learn/runbook) for cache restarts if latency spikes again

@bob - can you confirm you're primary for this week?
</code></pre>
<p>Incoming on-call confirms:</p>
<pre><code>✅ Confirmed, I'm on-call for this week
</code></pre>
<p>That's it. Two minutes. Done.</p>
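<p>If a bot posts the handoff skeleton for you, the outgoing on-call only has to fill in the bullets. A sketch that formats the template above (all field values are illustrative):</p>
<pre><code class="language-python">def handoff_message(week_of, outgoing, incoming, pages, notes):
    """Format the Monday handoff post for #on-call."""
    lines = [
        f"👋 On-Call Handoff - Week of {week_of} (Mon 9 AM → Mon 9 AM)",
        "",
        f"Outgoing: @{outgoing} → Incoming: @{incoming}",
        "",
        "Pages / incidents this week:",
        *[f"- {p}" for p in pages],
        "",
        "Notes for next week:",
        *[f"- {n}" for n in notes],
        "",
        f"@{incoming} - can you confirm you're primary for this week?",
    ]
    return "\n".join(lines)

print(handoff_message(
    "Jan 13", "alice", "bob",
    pages=["Tuesday: Database alert, false positive"],
    notes=["Cache has been flaky, keep an eye on it"],
))
</code></pre>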
<h3 id="escalation-path-no-response-rule">Escalation Path (No-Response Rule)</h3>
<p>Write this down and put it everywhere:</p>
<ol>
<li>Page primary (wait 5 minutes)</li>
<li>If no response: page backup at 5 minutes (wait 5 minutes)</li>
<li>If no response from backup: escalate to engineering manager at 10 minutes total (for Sev-0/Sev-1)</li>
</ol>
<p>Note: For Sev-2+ incidents, escalate at 30 minutes or next business hours unless impact increases.</p>
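<p>The escalation path above reduces to a small lookup, which is worth encoding in whatever bot or script does your paging so nobody has to remember it at 2 AM. A sketch using the thresholds from the rules above:</p>
<pre><code class="language-python">def escalation_target(severity: int, minutes_elapsed: float) -> str:
    """Who should currently be paged, per the no-response rule:
    primary first, backup at 5 minutes, and the engineering manager
    at 10 minutes for Sev-0/Sev-1 (30 minutes for Sev-2+)."""
    if minutes_elapsed < 5:
        return "primary"
    em_threshold = 10 if severity <= 1 else 30
    if minutes_elapsed < em_threshold:
        return "backup"
    return "engineering-manager"

assert escalation_target(0, 3) == "primary"
assert escalation_target(1, 7) == "backup"
assert escalation_target(0, 12) == "engineering-manager"
assert escalation_target(2, 12) == "backup"  # Sev-2 waits until 30 min
</code></pre>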
<h3 id="slack-channels-to-create">Slack Channels to Create</h3>
<p>Create #on-call for handoffs, schedule updates, and meta discussion. Create #incidents for incident declarations and coordination only. Optionally create #incidents-private for customer details and security issues.</p>

<h2 id="common-on-call-rotation-scenarios-copypaste-rules">Common On-Call Rotation Scenarios (Copy/Paste Rules)</h2>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    On-call person doesn't respond?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    If there's no response: page backup at 5 minutes. For Sev-0/Sev-1: escalate to EM at 10 minutes total.
  </div>
</details>
<p>"We used to wait 30 minutes because we didn't want to bother people. Now we escalate at 5 minutes. It's not rude; it's responsible," a senior engineer told me.</p>
<p>Waiting feels polite, but it's expensive. Every minute you spend wondering "should I escalate?" is a minute where the incident is getting worse. Make escalation automatic.</p>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Someone is sick or unavailable?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Make it okay to say "I can't do this week." Post in #on-call for a swap, or have the engineering manager cover.
  </div>
</details>
<p>If the process punishes real life, it won't survive contact with reality.</p>
<p>For more on coordinating across the team during incidents, see our <a href="/blog/scaling-incident-management">guide to scaling incident management</a>.</p>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Someone refuses to do on-call?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    First, make sure your on-call isn't miserable. Are they getting paged for non-urgent things? Are they responding at 2 AM for things that could wait? Do they have proper backup? Are they being compensated or recognized?
  </div>
</details>
<p>If the process is solid and someone still refuses, have a direct conversation. One VP of Engineering: "We made it clear: on-call is part of the role. If you're not willing to do it, we need to talk about role fit. Harsh but fair."</p>
<p>Most resistance I saw wasn't about on-call itself; it was about bad on-call. Fix the process first.</p>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Too small for formal on-call?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    If you're under 20 people and pages are rare, you probably don't need a formal rotation. Just document "who to contact for what" and make sure coverage isn't falling on the same 1-2 people.
  </div>
</details>
<p>A CTO at a 12-person startup: "We don't have on-call rotations. Infrastructure issues go to @alice, frontend goes to @bob. It works. We'll revisit when we're bigger."</p>
<p>Don't add ceremony before you have the problem.</p>

<h2 id="when-spreadsheets-stop-working-and-what-to-add-first">When Spreadsheets Stop Working (and What to Add First)</h2>
<p>Most teams start with a spreadsheet. That's fine.</p>
<p>The pain shows up when:</p>
<ul>
<li>Nobody remembers to update the sheet</li>
<li>People miss handoffs because there's no reminder</li>
<li>You waste time figuring out "who's on-call right now?" during an incident</li>
<li>Shift swaps require manual coordination</li>
<li>You're coordinating on-call across multiple services or time zones</li>
</ul>
<p>At that point, either add a small Slack layer (visibility + reminders) or adopt scheduling software.</p>
<p>PagerDuty/Opsgenie make sense when you have multiple services, complex schedules, and real 24/7 requirements. They're powerful but often overkill for smaller teams.</p>
<p>A lighter option can help earlier if it lives in Slack and removes "who's on-call?" confusion.</p>
<p>A platform lead at 100 people: "We used a Google Sheet for years. Once we hit 80 people and multiple services, we switched. The sheet was getting unwieldy."</p>
<p>Another team at 40 people: "The sheet works for us. But we built a Slack bot to post who's on-call every morning. That solved 90% of our pain."</p>

<h2 id="start-this-week-20-minutes">Start This Week (20 Minutes)</h2>
<p>Keep it boring. Here's the minimum viable setup:</p>
<ol>
<li>Pick a primary + backup for this week (write it down)</li>
<li>Post in #on-call: "Primary: @alice · Backup: @bob · No-response rule: 5 minutes → backup · Sev-0/1: 10 minutes → EM"</li>
<li>Set a recurring reminder for Monday 9 AM handoff</li>
<li>Document common fixes in a <a href="/learn/runbook">runbook</a> so the next person doesn't start from scratch</li>
<li>Keep the rules stable for 4 weeks, then adjust based on pages and misses</li>
</ol>
<p>That's it. Start simple, add complexity only when you hit pain points.</p>
<p>The goal isn't elegance. It's eliminating "who owns this?" when production is on fire.</p>

<h2 id="faq">FAQ</h2>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    How often should on-call rotate?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Weekly hits the sweet spot for most teams under 50 people. Daily is too stressful. Monthly is too long. Some larger teams with 80+ people do 2-week rotations.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What if the on-call person is on vacation?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Plan ahead. Don't schedule people for on-call right before or after vacations. If emergencies happen, let people swap shifts or have the engineering manager cover.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Should on-call get paid extra?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Most teams do some form of compensation: flat stipend of $100-500/week, comp days if paged overnight, or extra PTO. It's not required but it recognizes the burden.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What if someone refuses to do on-call?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    First, make sure your on-call process isn't miserable. Are they getting paged for non-urgent things? Do they have backup? If the process is solid and someone still refuses, have a direct conversation about role expectations.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    How do we handle time zones?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Prefer regional coverage blocks. If you can't, align on-call with known risk windows and avoid repeated 2 AM pages.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    When should we switch from spreadsheets to on-call software?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    When "who's on-call?" costs minutes, swaps are frequent, or you're coordinating across time zones/services. If a spreadsheet + Slack bot works, stick with it.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What's the difference between on-call and incident management?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    On-call is who responds. Incident management is how the team coordinates, documents, and communicates once the response starts. You need both. <a href="/blog/post-incident-review-template">Read our post-incident review template guide with 3 downloadable formats and action-item tracking</a> for the documentation part.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    How do we handle on-call for engineers with families?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Same as everyone else: primary plus backup plus 5-minute escalation. Some teams offer "family-friendly" rotations where people with young children can opt into backup-heavy roles or take shifts during school hours. But the structure stays the same. Don't assume people with families can't do on-call; ask them what they need.
  </div>
</details>

<p><strong>Want the next step?</strong> Read <a href="/blog/post-incident-review-template">our post-incident review template guide with 3 downloadable formats and action-item tracking</a>.</p>

<h2 id="looking-for-on-call-management-software">Looking for On-Call Management Software?</h2>
<p>We're building <a href="/slack">on-call management for Slack</a>: auto-handoff reminders, one-click escalation, rotation visible in your #incidents channel. No separate app to check. Built for teams 20-100 people who think PagerDuty is overkill.</p>
<p><a href="/tools/oncall-builder">Build your on-call rotation</a> | <a href="/auth?mode=signup">Get started free</a></p>

]]></content:encoded>
      <pubDate>Fri, 02 Jan 2026 00:00:00 GMT</pubDate>
      <author>Niketa Sharma</author>
      <category><![CDATA[on-call]]></category>
      <category><![CDATA[on-call-rotation]]></category>
      <category><![CDATA[on-call-schedule]]></category>
      <category><![CDATA[on-call-policy]]></category>
      <category><![CDATA[escalation-policy]]></category>
      <category><![CDATA[on-call-handoff]]></category>
      <category><![CDATA[devops-on-call]]></category>
      <category><![CDATA[incident-management]]></category>
      <category><![CDATA[sre]]></category>
      <category><![CDATA[engineering-productivity]]></category>
      <category><![CDATA[pagerduty-alternative]]></category>
      <category><![CDATA[opsgenie-alternative]]></category>
    </item>
    <item>
      <title><![CDATA[Post-Incident Review Template: 3 Free Examples [Copy & Paste]]]></title>
      <link>https://runframe.io/blog/post-incident-review-template</link>
      <guid>https://runframe.io/blog/post-incident-review-template</guid>
      <description><![CDATA[A few months ago, an engineering manager told us something that stuck:

"We write these postmortems like college essays. Then we never open them again."

He wasn't wrong. We've seen the same pattern a...]]></description>
      <content:encoded><![CDATA[<p>A few months ago, an engineering manager told us something that stuck:</p>
<blockquote>
<p>"We write these postmortems like college essays. Then we never open them again."</p>
</blockquote>
<p>He wasn't wrong. We've seen the same pattern across dozens of teams.</p>
<p>Someone spends two days crafting a 5-page Google Doc. Everyone nods during the review meeting. Then the doc gets filed away, never to be seen again, and six months later the same incident happens.</p>
<p>That's theater. It looks like learning, but nothing actually changes.</p>
<p>After interviewing 25+ engineering teams about how they handle incidents, we found a clear pattern: the teams that actually learn from incidents do things differently. Not more process. Simpler process that people actually use.</p>
<p>Here is what works, plus three postmortem templates you can copy and use right now. We call these post-incident reviews (PIRs), also known as postmortems. This is based on what teams told us actually gets used, not what sounds good in a doc. Need the full incident response workflow first? Start with our <a href="/blog/incident-response-playbook">Slack incident response playbook</a>.</p>

<h2 id="what-is-a-post-incident-review-postmortem">What Is a Post-Incident Review (Postmortem)?</h2>
<p>A <a href="/learn/post-incident-review">post-incident review</a> (also called a postmortem or <strong>incident retrospective</strong>) is a structured process for analyzing what happened during a production incident, why it happened, and how to prevent it from happening again. The goal isn't to assign blame—it's to learn from failures and improve systems.</p>
<p>Key components of an effective post-incident review:</p>
<ul>
<li><strong>Timeline</strong> - What happened and when</li>
<li><strong>Root cause</strong> - Why it happened (system-level, not person-level). See: <a href="/learn/root-cause-analysis">root cause analysis</a></li>
<li><strong>Impact assessment</strong> - Who was affected and how</li>
<li><strong>Action items</strong> - Specific steps to prevent recurrence</li>
<li><strong>Shared learning</strong> - Documentation others can reference</li>
</ul>
<p>Done right, post-incident reviews turn incidents from costly failures into valuable learning opportunities for the entire team.</p>

<h2 id="post-incident-review-approaches-compared">Post-Incident Review Approaches Compared</h2>
<table>
  <caption>Post-incident review approaches compared by time investment, team size fit, and when they fail</caption>
  <thead>
    <tr>
      <th>Approach</th>
      <th>Time Investment</th>
      <th>Works For</th>
      <th>Breaks When</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>No postmortem</td>
      <td>0 minutes</td>
      <td>Never</td>
      <td>Immediately - same incidents repeat</td>
    </tr>
    <tr>
      <td>Verbal debrief only</td>
      <td>15 minutes</td>
      <td>&lt;10 people, low stakes</td>
      <td>Nothing documented, learning lost</td>
    </tr>
    <tr>
      <td>5+ page document</td>
      <td>2+ hours</td>
      <td>Compliance requirements</td>
      <td>Nobody reads it, action items ignored</td>
    </tr>
    <tr>
      <td><strong>1-page template (our approach)</strong></td>
      <td><strong>30-45 minutes</strong></td>
      <td><strong>Most teams 10-100 people</strong></td>
      <td><strong>Blame culture or no follow-through</strong></td>
    </tr>
    <tr>
      <td>Enterprise RCA tools</td>
      <td>3+ hours</td>
      <td>200+ people, formal processes</td>
      <td>Overkill for smaller teams</td>
    </tr>
  </tbody>
</table>


<h2 id="what-most-teams-get-wrong">What Most Teams Get Wrong</h2>
<p>Let's start with what doesn't work. If you've been through a few incidents, this will feel familiar:</p>
<p><strong>The 5-page document problem</strong></p>
<p>Teams write lengthy postmortems covering every possible angle: timeline, root cause analysis using five different frameworks, customer impact graphs, process flow diagrams, action items spread across three different sections, and a "lessons learned" section that's basically generic filler.</p>
<p>Nobody reads this. People who weren't in the incident won't read it. People who were in the incident already lived it, and they don't need a novel.</p>
<p><strong>The blame problem</strong></p>
<p>Even when teams say "no blame," the postmortem often reads like "what Sarah did wrong" or "how the database team broke production again." This is the opposite of a <a href="/learn/blameless-postmortem">blameless postmortem</a> culture where teams focus on systems, not people.</p>
<p>A Series B infrastructure team showed us a doc where every action item was assigned to a person, not a system. That killed the tone. The next time something broke, people waited until someone else spoke up first.</p>
<p><strong>The timing problem</strong></p>
<p>Some teams wait two weeks to do postmortems. By then, details are fuzzy. The urgency is gone. The emotional impact has faded. Action items feel optional.</p>
<p><strong>The action item graveyard</strong></p>
<p>We've seen so many postmortems with 15 action items, zero of which ever get done. There's no owner. There's no deadline. There's no follow-up. They're wishful thinking, not actual commitments.</p>

<h2 id="what-actually-works-based-on-25-team-interviews">What Actually Works (Based on 25+ Team Interviews)</h2>
<p>The teams that actually learn from incidents keep it simple and repeatable. Here's the pattern we keep seeing:</p>
<ol>
<li><p><strong>Keep it short: one page max</strong><br />The best postmortems we saw fit on one page. Sometimes less. A timeline, a root cause, and a few action items. Done.</p>
<p>A staff engineer at a 50-person fintech startup put it this way: "If we can't read it in five minutes, we're not reading it."</p>
</li>
<li><p><strong>Do it within 48 hours</strong><br />The fresher the incident, the better the postmortem. Details are still clear. Emotions are still raw enough that people care.</p>
<p>Two weeks later, the writeup gets vague. We heard this from a 20-person infrastructure team: "We kept pushing it out, then nobody wanted to reopen it."</p>
</li>
<li><p><strong>Focus on systems, not people</strong><br />Instead of "Sarah forgot to update the config," write "The deployment process doesn't validate config files." The fix isn't "Sarah should be more careful"; it's "add config validation to the deployment pipeline." This is the heart of a <strong>blameless postmortem</strong> culture.</p>
</li>
<li><p><strong>Action items with owners and deadlines</strong><br />Every action item needs a specific owner (not "the team"), a deadline (not "soon"), and a definition of done (not "investigate further").</p>
<p>A postmortem from a 40-person devops team had a single action item: "Add config validation to deployment pipeline." Owner: Maria. Due: Friday. Done. And guess what, it got done.</p>
<p>Aim for 1 to 3 action items per incident.</p>
</li>
<li><p><strong>Share the learning</strong><br />Postmortems shouldn't live in a Google Doc graveyard. Share them in Slack. Post them in a visible place. Make sure people who weren't in the incident still learn from it. This <strong>incident documentation</strong> becomes your team's knowledge base.</p>
<p>A Series B payments company keeps a single "#postmortems" Slack channel and links every doc there. That's enough.</p>
<p>A 15-person backend team at a developer tools startup told us: "We ship the fix fast, but if the postmortem isn't linked in the incident channel by end of day, it never happens." That simple rule made the habit stick.</p>
</li>
</ol>
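<p>The action-item rules in point 4 (specific owner, real deadline, definition of done) are mechanical enough to lint before a postmortem is closed. A minimal sketch (field names and values are illustrative):</p>
<pre><code class="language-python">from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    task: str
    owner: str       # a specific person, never "the team"
    due: date        # a real date, never "soon"
    done_when: str   # definition of done, never "investigate further"

    def validate(self) -> list:
        """Return a list of problems; empty means the item is well-formed."""
        problems = []
        if self.owner.lower() in {"the team", "everyone", "tbd"}:
            problems.append("needs a specific owner")
        if not self.done_when.strip():
            problems.append("needs a definition of done")
        return problems

item = ActionItem(
    task="Add config validation to deployment pipeline",
    owner="maria",
    due=date(2026, 1, 16),
    done_when="Deploys with invalid config fail CI with a clear error",
)
assert item.validate() == []
</code></pre>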

<h2 id="three-postmortem-templates-you-can-use">Three Postmortem Templates You Can Use</h2>
<p>Here are three <strong>downloadable</strong> post-incident review templates, from ultra-short to comprehensive. Copy whichever fits your team. We've used these with real teams and they work. If you just need an <strong>editable postmortem template</strong> to copy and paste, start with Template 2.</p>
<p><strong>Download the templates:</strong></p>
<ul>
<li><a href="https://docs.google.com/document/d/1YUYJjwKeXWXYQyuDtPOiThHSJt-Lj4dfPuyf1NHmz3k/copy" target="_blank" rel="noopener noreferrer">15-Minute Postmortem Template (Download, Editable)</a></li>
<li><a href="https://docs.google.com/document/d/1OJO2oMVDBLTeKOml1ZlnHb_0MgTUkKEmbhVIHG2VQoE/copy" target="_blank" rel="noopener noreferrer">Standard Postmortem Template (Download, Editable)</a></li>
<li><a href="https://docs.google.com/document/d/1FVmuhp5ZBlhFHk4kLatfX8F7tmrsWmilpPY-V94iCWI/copy" target="_blank" rel="noopener noreferrer">Comprehensive Postmortem Template (Download, Editable)</a></li>
</ul>

<h3 id="template-1-the-15-minute-version">Template 1: The 15-Minute Version</h3>
<p>For small incidents that don't warrant a full meeting. Fill it out in the incident channel or a shared doc.</p>
<p><strong>What you'll capture:</strong></p>
<ul>
<li>Incident summary (one sentence)</li>
<li>Impact (who, how long)</li>
<li>Root cause</li>
<li>One thing that went well</li>
<li>One thing to improve</li>
<li>One action item</li>
</ul>
<p><strong>Time to complete:</strong> 15 minutes max</p>
<p><a href="https://docs.google.com/document/d/1YUYJjwKeXWXYQyuDtPOiThHSJt-Lj4dfPuyf1NHmz3k/copy" target="_blank" rel="noopener noreferrer"><strong>Copy the 15-Minute Template →</strong></a></p>

<h3 id="template-2-the-standard-version">Template 2: The Standard Version</h3>
<p>For most incidents. Detailed enough to be useful, short enough to actually complete.</p>
<p><strong>What you'll capture:</strong></p>
<ul>
<li>Incident details (severity, duration, impact)</li>
<li>Timeline (5 key moments)</li>
<li>Root cause analysis</li>
<li>What went well + what to improve</li>
<li>Action items with owners, deadlines, and status tracking</li>
<li>Follow-up tracking</li>
</ul>
<p><strong>Time to complete:</strong> 30-45 minutes</p>
<p><a href="https://docs.google.com/document/d/1OJO2oMVDBLTeKOml1ZlnHb_0MgTUkKEmbhVIHG2VQoE/copy" target="_blank" rel="noopener noreferrer"><strong>Copy the Standard Template →</strong></a></p>

<h3 id="template-3-the-comprehensive-version">Template 3: The Comprehensive Version</h3>
<p>For major incidents (SEV0/SEV1s, customer-facing outages) that warrant a formal review.</p>
<p><strong>What you'll capture:</strong></p>
<ul>
<li>Full impact analysis (systems, customers, business, detection)</li>
<li>Detailed timeline with who was involved</li>
<li>Root cause analysis (immediate, contributing, systemic)</li>
<li>Customer communication breakdown</li>
<li>Action items with definition of done</li>
<li>Prevention checklist (alerts, runbooks, deploys, resilience, testing)</li>
<li>Optional SOC 2 / Compliance addendum</li>
</ul>
<p><strong>Time to complete:</strong> 60-90 minutes</p>
<p><a href="https://docs.google.com/document/d/1FVmuhp5ZBlhFHk4kLatfX8F7tmrsWmilpPY-V94iCWI/copy" target="_blank" rel="noopener noreferrer"><strong>Copy the Comprehensive Template →</strong></a></p>

<h2 id="when-post-incident-review-templates-won39t-work">When Post-Incident Review Templates Won't Work</h2>
<p>These templates are built for 10-100 person teams who want to move fast. If that's not you, here's what to consider:</p>
<p><strong>Heavily regulated companies</strong> (SOC 2, HIPAA, FedRAMP): Template 3 includes a SOC 2 / Compliance addendum with incident classification, data impact, control mapping, and evidence links. If you need more than that, you likely have formal compliance requirements beyond these templates.</p>
<p><strong>Large organizations</strong> (200+ people, multiple teams): You likely have formal incident processes, change approval boards, and executive reporting requirements. A one-pager won't cover your stakeholders. Use these as a starting point, but expect to expand.</p>
<p><strong>Blame cultures</strong>: If your organization uses postmortems to assign fault, these templates will backfire. They're designed for systems-focused, blameless analysis. Fix the culture first, then fix the documentation.</p>
<p>Everything else? Start with Template 2.</p>

<h2 id="how-to-actually-make-these-stick">How to Actually Make These Stick</h2>
<p>Templates are easy. Consistency is hard. Here's what the teams that stick with it actually do:</p>
<ol>
<li><p><strong>Schedule the postmortem immediately</strong><br />Don't wait. Schedule it within 48 hours while the context is fresh. Put it on the calendar as soon as the incident is stable.</p>
</li>
<li><p><strong>Keep the meeting under 30 minutes</strong><br />If you can't cover it in 30 minutes, your postmortem is too long or the incident was too complex. Break complex incidents into smaller pieces.</p>
</li>
<li><p><strong>Assign an owner</strong><br />Someone needs to own the postmortem process. Not the <a href="/learn/incident-commander">incident commander</a>; they're tired. Pick someone else who can gather info, draft the template, and make sure action items get tracked.</p>
<p>A 25-person platform team rotates this responsibility weekly so it never becomes "that one person's job."</p>
</li>
<li><p><strong>Track action items to completion</strong><br />The teams that actually learn from incidents don't just list action items; they track them. Effective <strong>action item tracking</strong> means someone checks: "Did we actually do what we said we'd do?" A 30-person infrastructure team uses a spreadsheet. A Series C SaaS company uses their issue tracker. What matters is that someone is verifying completion.</p>
</li>
<li><p><strong>Share the learning</strong><br />Post the postmortem in a visible place. Slack, a shared drive, or your internal wiki all work. Make sure people who weren't in the incident can still learn from it.</p>
<p>A healthcare startup with 12 engineers has a "#postmortems" Slack channel where every postmortem gets posted. Anyone can read them. Anyone can learn from them. It's simple. It works.</p>
</li>
</ol>

<h2 id="post-incident-review-faqs">Post-Incident Review FAQs</h2>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    How long should a post-incident review be?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    As short as possible while still being useful. The best ones we've seen are one page. If you're writing five pages, you're probably overthinking it.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Who should run the postmortem?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Not the incident commander; they're usually tired of thinking about the incident. Pick someone else who was involved but not in the thick of it. Or rotate this responsibility across the team.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What if we don't know the root cause?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    It happens. Write "Unknown; need to investigate" as the root cause and make that an action item. Honesty is better than guessing in your <strong>root cause analysis</strong>.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What if the same thing happens again?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    That's a signal that your action items aren't working. Either they're not specific enough, there's no follow-through, or you're not addressing the systemic issue. Go back to the postmortem and ask: "Why did our fix not fix this?"
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Do we need a meeting for every postmortem?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    No. Small incidents? Fill out the template, share it, done. Major incidents? Schedule the meeting, get everyone in a room, talk it through.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    When should we skip a post-incident review?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    If it was a one-off noise alert, a test that tripped something minor, or a brief blip with zero customer impact, write a two-sentence note and move on. Teams told us the fastest way to kill the habit is to force a formal postmortem for every tiny hiccup.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What if there's blame happening?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Call it out. "Hey, this feels like it's blaming Sarah. Can we reframe this as a systems problem?" Psychological safety matters. If people don't feel safe, they'll hide incidents next time.
  </div>
</details>

<h2 id="post-incident-review-best-practices-the-bottom-line">Post-Incident Review Best Practices: The Bottom Line</h2>
<p>Postmortems don't have to be theater. They don't have to be lengthy documents nobody reads. The teams that actually learn from incidents keep it simple: one page max, within 48 hours, systems not people, action items with owners and deadlines, and shared learning. The <strong>lessons learned</strong> from each incident should improve your systems, not just document failures.</p>
<p>If you want a template, grab one of the three above. If you want to go deeper, read <a href="/blog/scaling-incident-management">our research on scaling incident management with 25+ engineering teams and common coordination bottlenecks</a>. For more on <strong>incident response</strong> and <strong>incident management</strong> workflows, see <a href="/blog/on-call-rotation-guide">our guide to on-call rotations</a>. <a href="/tools/oncall-builder">Build your schedule → Free On-Call Builder</a></p>
<p>The goal isn't to write a perfect document. The goal is to learn something and make sure it doesn't happen again.</p>
<p>Everything else is noise.</p>

<p><strong>Want the next step?</strong> Read <a href="/blog/on-call-rotation-guide">our on-call rotation guide with the 2-minute handoff framework and primary+backup escalation rules</a>.</p>

<h2 id="looking-for-incident-management-software">Looking for Incident Management Software?</h2>
<p>We're building post-incident review tools that <a href="/slack">integrate with Slack</a>: auto-populate timelines from your incident channel, template suggestions based on severity, action item tracking that doesn't get lost. Built for teams 20-100 people who want simple, not enterprise complexity.</p>
<p><a href="/auth?mode=signup">Get started free</a></p>

]]></content:encoded>
      <pubDate>Mon, 29 Dec 2025 00:00:00 GMT</pubDate>
      <author>Niketa Sharma</author>
      <category><![CDATA[incident-management]]></category>
      <category><![CDATA[postmortem]]></category>
      <category><![CDATA[post-incident-review]]></category>
      <category><![CDATA[postmortem-template]]></category>
      <category><![CDATA[blameless-postmortem]]></category>
      <category><![CDATA[incident-retrospective]]></category>
      <category><![CDATA[root-cause-analysis]]></category>
      <category><![CDATA[engineering-productivity]]></category>
      <category><![CDATA[devops]]></category>
      <category><![CDATA[sre]]></category>
      <category><![CDATA[incident-response]]></category>
      <category><![CDATA[lessons-learned]]></category>
    </item>
    <item>
      <title><![CDATA[Incident Coordination: Cut Context Switching, Fix Faster]]></title>
      <link>https://runframe.io/blog/engineering-productivity-incident-management</link>
      <guid>https://runframe.io/blog/engineering-productivity-incident-management</guid>
      <description><![CDATA[The outage isn't the problem. It starts the second after the alert fires. You're trying to diagnose what broke, but first you're fielding questions: who's leading this? Which channel? What do we tell...]]></description>
<content:encoded><![CDATA[<p>The outage isn't the problem. The real problem starts the second after the alert fires. You're trying to diagnose what broke, but first you're fielding questions: who's leading this? Which channel? What do we tell support? Ticket or doc?<br />That coordination tax compounds fast, and nobody talks about it. But incident management coordination overhead silently kills engineering productivity more than most team leads realize.<br />We talked to engineers and leads about how their teams handle incidents. Same story everywhere: no one needed another dashboard. They needed a way to coordinate without context-switching themselves to death.<br />This is what we learned, with no fluff. If you're looking for practical ways to reduce coordination overhead during incidents, keep reading.</p>

<h2 id="what-is-incident-management-coordination">What Is Incident Management Coordination?</h2>
<p>Incident management coordination is how your team shares updates, assigns ownership, and stays aligned during a production incident. It's the communication and organizational layer that sits on top of the technical troubleshooting.</p>
<p>Effective incident coordination includes:</p>
<ul>
<li><strong>Clear ownership</strong> - Who's leading the response (usually the <a href="/learn/incident-commander">incident commander</a>)</li>
<li><strong>Status visibility</strong> - Current state and next steps</li>
<li><strong>Context preservation</strong> - Key decisions and <strong>incident timeline</strong></li>
<li><strong>Role clarity</strong> - Who does what during the incident</li>
<li><strong>Handoff protocols</strong> - How to transfer ownership</li>
<li><strong>Escalation path</strong> - When and how to escalate <strong>incident severity</strong> levels</li>
</ul>
<p>The problem: Most teams focus on technical diagnosis tools (monitoring, logs, traces) but neglect coordination tools. The result is context switching, duplicate work, and constant "what's happening?" questions that slow down resolution. This directly impacts <a href="/learn/mttr">MTTR</a> (mean time to recovery) and <strong>mean time to resolution</strong>.</p>
<p>Good coordination doesn't fix the outage faster, but it removes friction so engineers can focus on the actual fix.</p>

<h2 id="incident-coordination-approaches-compared">Incident Coordination Approaches Compared</h2>
<table>
  <caption>Incident coordination approaches compared by setup time, team size fit, and failure conditions</caption>
  <thead>
    <tr>
      <th>Approach</th>
      <th>Setup Time</th>
      <th>Works For</th>
      <th>Breaks When</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Ad-hoc in Slack DMs</td>
      <td>0 min</td>
      <td>&lt;10 people</td>
      <td>Multiple incidents or unclear ownership</td>
    </tr>
    <tr>
      <td>Single #incidents channel</td>
      <td>5 min</td>
      <td>10-50 people</td>
      <td>Multiple concurrent incidents</td>
    </tr>
    <tr>
      <td><strong>Dedicated incident threads</strong></td>
      <td><strong>10 min</strong></td>
      <td><strong>20-100 people</strong></td>
      <td><strong>Nobody enforces the pattern</strong></td>
    </tr>
    <tr>
      <td>Enterprise incident tools</td>
      <td>Hours/days</td>
      <td>100+ people, compliance needs</td>
      <td>Too much overhead for team size</td>
    </tr>
    <tr>
      <td class="text-sm text-[var(--text-secondary)] italic">
        <strong>Note:</strong> If you're migrating from OpsGenie (shutting down April 2027), see our <a href="/blog/opsgenie-migration-guide">complete migration guide</a> with timelines and pricing comparisons.
      </td>
    </tr>
    <tr>
      <td>Custom internal tools</td>
      <td>Weeks</td>
      <td>Large orgs with dedicated platform teams</td>
      <td>Maintenance burden</td>
    </tr>
  </tbody>
</table>


<h2 id="how-coordination-overhead-kills-engineering-productivity">How Coordination Overhead Kills Engineering Productivity</h2>
<h3 id="1-context-switching-kills-flow-when-you-need-it-most">1) Context switching kills flow when you need it most</h3>
<p>During an incident, you're jumping between Slack, tickets, monitoring tools, a Google doc, and maybe a Zoom call (or virtual <a href="/learn/war-room">war room</a>). Each switch feels like thirty seconds. But it adds up, and it murders your focus at the worst possible time.<br />Mid-sentence in the <a href="/learn/runbook">runbook</a>, and suddenly you've forgotten what you were about to try. That lost flow repeats throughout the entire incident. Following the <strong>runbook</strong> becomes impossible when you're constantly context-switching.<br />The fix isn't another tool. It's fewer surfaces. Teams that felt less burned out had one place where coordination happened, usually Slack. The technical diagnosis still happened in Datadog or wherever, but status updates, decisions, and handoffs stayed in one thread.<br />What works? Make Slack your incident workspace, not just your alerting channel. Current status, who owns what, next steps: all in one place.</p>
<p><img src="/images/articles/engineering-productivity-incident-management/context-switching-diagram.svg" alt="Context switching diagram showing tool hops that slow incident response and engineering productivity" /></p>
<h3 id="2-your-on-call-schedule-is-invisible-when-it-matters">2) Your on-call schedule is invisible when it matters</h3>
<p>Most teams have an on-call schedule. The problem? It's disconnected from where the incident is actually happening.<br />Small teams just know who to ping. As you grow past 30-40 people, that breaks down. Someone pings the wrong person, or everyone waits while the right person is in a meeting. Now you're playing operator instead of fixing the problem.</p>
<p>For more on <strong>on-call coordination</strong>, see <a href="/blog/on-call-rotation-guide">our on-call rotation guide with weekly schedules, 5-minute no-response rules, and compensation benchmarks</a>.<br />A team lead told us: "We had coverage. We just never knew who was actually paying attention right now."<br />The fix: Surface on-call info directly in the incident channel. Not a link to the on-call tool. The actual person's name, their backup, and how to <strong>escalate</strong>. Right there. Clear <strong>escalation</strong> paths prevent confusion during <strong>SEV-0</strong> and <strong>SEV-1</strong> incidents when every second counts.</p>
<h3 id="3-your-postmortems-exist-but-nobody-reads-them">3) Your postmortems exist but nobody reads them</h3>
<p>Every team writes postmortems. Almost nobody reads them during the next incident.<br />They're too long. Too formal. Buried in Confluence. When you're in the middle of fixing something at 2am, you want a short list of what to check and what not to do. Format matters more than completeness.<br />An engineering manager put it: "We write these things like college essays and then never open them again."<br />Instead: Keep the learning short and keep it in the incident channel. A few bullets. What changed. What to watch for. Make it show up when the next similar incident starts. This <strong>incident timeline</strong> should be easily accessible during the next outage.</p>
<p>For <strong>post-incident review templates</strong> that work, see <a href="/blog/post-incident-review-template">our post-incident review template guide with 3 downloadable formats</a>.</p>
<h2 id="incident-management-best-practices-from-fast-moving-teams">Incident Management Best Practices from Fast-Moving Teams</h2>
<p>The teams that moved fast didn't chase perfect process. They cut overhead. Same patterns kept showing up.</p>
<h3 id="work-where-people-already-are">Work where people already are</h3>
<p>If your team lives in Slack, making them use another tool is friction. This isn't about being "Slack-native" for marketing reasons. Engineers already have Slack open when the alert fires. That's just reality.<br />A team adopted a fancy incident tool and dropped it after a week. Their reason? One more tab to check while everything's on fire. The tool was fine; the workflow wasn't.<br />Make the incident channel your home base. Pin the current status. Post updates every 15-30 minutes. If someone joins late, they should read the pinned message and know what's happening. For customer-facing incidents, the <strong>incident commander</strong> should also update the <strong>status page</strong> to keep customers informed.</p>
<p><img src="/images/articles/engineering-productivity-incident-management/runframe-slack-incident-workflow.png" alt="Runframe Slack incident workflow showing incident summary, actions, and ownership context" /></p>
<h3 id="automate-the-boring-stuff-not-the-thinking">Automate the boring stuff, not the thinking</h3>
<p>Light automation goes a long way. The best teams automated mechanical tasks, not judgment calls. They didn't want a bot making decisions. They wanted it to handle the busywork.<br />Good automation:</p>
<ul>
<li>Creates the channel and invites the right people</li>
<li>Posts a status template</li>
<li>Logs <strong>incident timeline</strong> timestamps automatically</li>
<li>Assigns an <strong>incident commander</strong> automatically</li>
</ul>
<p>Bad automation:</p>
<ul>
<li>Spam notifications</li>
<li>Forces rigid steps when things are chaotic</li>
<li>Creates work just to feed the tool</li>
</ul>
<p>Automate what clears the path. Don't automate what sets the route.</p>
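<p>To make the "good automation" list concrete, here's a minimal Python sketch of the mechanical half: naming the channel and rendering the status template. The channel-naming scheme, the template text, and the <code>declare_incident</code> helper are all hypothetical choices for illustration; a real bot would hand the returned payload to Slack's Web API (<code>conversations.create</code>, <code>chat.postMessage</code>, <code>pins.add</code>, e.g. via <code>slack_sdk</code>), which this sketch only notes in comments.</p>
<pre><code>from datetime import datetime, timezone

# Pinned status template -- hypothetical format, adapt to your team.
STATUS_TEMPLATE = """:rotating_light: *{title}*
*Status:* investigating
*Commander:* {commander}
*Severity:* {severity}
*Started:* {started} UTC
Next update in {cadence_min} min."""

# Update cadence by severity, mirroring the checklist below.
CADENCE_MIN = {"SEV0": 10, "SEV1": 15, "SEV2": 30, "SEV3": 60}

def declare_incident(title: str, commander: str, severity: str = "SEV2") -> dict:
    """Build the channel name and pinned status message for a new incident.

    A real bot would then call conversations.create / chat.postMessage /
    pins.add via the Slack Web API; this sketch just returns the payload.
    """
    now = datetime.now(timezone.utc)
    # Slugify the title for the channel name: keep alphanumerics, dash the rest.
    slug = "".join(c if c.isalnum() else "-" for c in title.lower()).strip("-")
    channel = f"inc-{now:%Y%m%d}-{slug[:40]}"
    message = STATUS_TEMPLATE.format(
        title=title,
        commander=commander,
        severity=severity,
        started=f"{now:%H:%M}",
        cadence_min=CADENCE_MIN[severity],
    )
    return {"channel": channel, "message": message}
</code></pre>
<p>Calling <code>declare_incident("Checkout API 500s", "@alice", "SEV1")</code> yields a channel name like <code>inc-20260408-checkout-api-500s</code> plus a ready-to-pin message. Note what's <em>not</em> here: no decision logic. The bot fills in the form; humans decide what goes in it.</p>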
<h3 id="stay-invisible-until-needed">Stay invisible until needed</h3>
<p>Nobody wants a tool that nags them on quiet days. The best systems disappear until an incident starts. That's how you get adoption: people don't feel like they're "using a tool" constantly.<br />If I have to update some system every time I make a config change, I'll just stop. That's human nature, not laziness.<br />Normal days should feel normal. Incident days should feel supported.</p>
<h2 id="three-incident-coordination-patterns-from-real-teams">Three Incident Coordination Patterns from Real Teams</h2>
<p>These aren't perfect playbooks. Just examples of what worked.</p>
<h3 id="the-team-that-kept-it-simple">The team that kept it simple</h3>
<p>They ran everything through a single #incidents channel. When something broke, they'd create a thread, name the owner in the first message, and keep all updates there. No separate ticket during the incident. Just one summary afterward.<br />Basic, but it worked because everyone agreed to follow it. The ritual was light.</p>
<h3 id="the-team-that-needed-more-structure">The team that needed more structure</h3>
<p>As they grew, communication overhead got painful. They added primary and backup on-call rotations and made one rule: all updates go in the incident channel. No side DMs. None.<br />That one rule cut confusion immediately. People stopped asking for updates because the updates were already there. More tools didn't help. More consistency did.</p>
<h3 id="the-team-that-stopped-overengineering">The team that stopped overengineering</h3>
<p>A larger team evaluated an enterprise incident tool, tried it, and found it overwhelming. They switched to a lightweight workflow that ran entirely in Slack. Their test: If a new engineer can't run an incident after a 10-minute walkthrough, we simplify it.<br />They weren't anti-tool. They just hated friction.</p>
<h2 id="why-simple-incident-management-beats-complex-tools">Why Simple Incident Management Beats Complex Tools</h2>
<p>Incident response is one of those areas where complexity feels responsible. More fields, more statuses, more process. But the teams with better outcomes cut complexity first.<br />Here's the thing: mature teams have clear practices. Not necessarily more practices. They know what to do when an incident starts. They don't waste time debating the process.<br />The easiest way to add complexity? Buy a tool that makes you define everything upfront. Feels safe. Feels comprehensive. Usually results in half-finished setup and partial adoption.<br />If you can't explain your incident process to a new hire in five minutes, it's too complicated.</p>
<h2 id="5-step-incident-management-checklist">5-Step Incident Management Checklist</h2>
<p>Follow these steps for every incident:</p>
<p><strong>1. Declare and assign (30 seconds)</strong></p>
<ul>
<li>Create incident thread in #incidents or dedicated channel</li>
<li>First message: "@alice is incident commander for checkout API errors"</li>
<li>Name severity level if clear (SEV0/1/2/3)</li>
</ul>
<p><strong>2. Post initial status (1 minute)</strong></p>
<ul>
<li>What's broken: "Checkout API returning 500 errors"</li>
<li>Current hypothesis: "Recent deploy may have broken payment processing"</li>
<li>Who's investigating: "@bob is debugging, @carol on standby"</li>
</ul>
<p><strong>3. Set update cadence and pin it (30 seconds)</strong></p>
<ul>
<li>Post: "Updates every: SEV0 10 min · SEV1 15 min · SEV2 30 min · SEV3 60 min"</li>
<li>Pin this message to the channel</li>
</ul>
<p><strong>4. Capture decisions as they happen (ongoing)</strong></p>
<ul>
<li>Rollback decision: "Rolling back deploy #1234 due to checkout errors"</li>
<li>Escalation: "Escalating to EM, stuck on database connection issue"</li>
<li>Workaround: "Disabled feature flag for affected region"</li>
</ul>
<p><strong>5. Post resolution summary (2 minutes)</strong></p>
<ul>
<li>What broke: [system/component]</li>
<li>Why it broke: [cause]</li>
<li>What fixed it: [rollback/fix/flag/scale]</li>
<li>Postmortem owner + deadline: "@alice, due EOD Thursday" (<a href="/blog/post-incident-review-template">use our templates</a>)</li>
</ul>
<p><strong>Total overhead: ~10 minutes for entire incident</strong></p>
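<p>Step 5 is the one teams skip most, and a tiny formatter makes it copy-paste cheap. This is a sketch with made-up field names, not a prescribed format; it just pins the resolution summary to a fixed shape so every incident closes out the same way:</p>
<pre><code>def resolution_summary(broke: str, why: str, fix: str,
                       postmortem_owner: str, due: str) -> str:
    """Render the step-5 resolution summary in a fixed shape.

    The emoji and field labels are illustrative; post the result as the
    final message in the incident thread.
    """
    return (
        f":white_check_mark: *Resolved*\n"
        f"*What broke:* {broke}\n"
        f"*Why it broke:* {why}\n"
        f"*What fixed it:* {fix}\n"
        f"*Postmortem:* {postmortem_owner}, due {due}"
    )
</code></pre>
<p>For example, <code>resolution_summary("Checkout API", "deploy #1234 broke payment retries", "rollback", "@alice", "EOD Thursday")</code> produces the five-line summary in one call, so closing out an incident takes seconds instead of a blank-page moment.</p>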
<h2 id="looking-for-incident-management-software">Looking for Incident Management Software?</h2>
<p>We're building incident coordination for Slack: auto-create incident channels, visible on-call ownership, status templates, and timeline capture without context switching. Built for teams 20-100 people who want coordination, not complexity.</p>
<p><a href="/auth?mode=signup">Get started free</a></p>

<p><strong>Want the next step?</strong> Read <a href="/blog/post-incident-review-template">our post-incident review template guide with action-item tracking</a> or <a href="/blog/on-call-rotation-guide">our on-call rotation guide with burnout-prevention schedules</a>.</p>
<p>Read the full research: <a href="/blog/scaling-incident-management">Scaling Incident Management: What We Learned from 25+ Engineering Teams</a></p>
<h2 id="incident-coordination-faq">Incident Coordination FAQ</h2>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What is incident response coordination?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    How your team shares updates, assigns ownership, and stays aligned during an incident. Good coordination prevents duplicate work, confusion, and constant "what's the status?" pings.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What tools do I need for incident management?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Start with Slack (or your team chat tool), your monitoring system, and a simple doc template. Add dedicated incident management software only when coordination overhead becomes painful (usually 30-50+ people). If you're debating whether to build or buy, read our <a href="/blog/incident-management-build-or-buy">build vs buy analysis</a>.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    How do I reduce context switching during incidents?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Centralize coordination in one place (usually Slack). Post all status updates, decisions, and handoffs in the incident thread. Avoid side DMs and fragmented conversations across multiple tools.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What's the difference between incident management and incident response?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Incident response is the technical work of diagnosing and fixing the issue. Incident management is the coordination layer: who's leading, how to communicate, when to escalate, how to document. You need both.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    When should I assign an incident commander?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    For any SEV-0 or SEV-1 incident, or when multiple people are involved. The incident commander doesn't fix the problem; they coordinate communication, remove blockers, and maintain the timeline.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    How long should incident updates be?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    One to three sentences every 15-30 minutes. "Database queries timing out, investigating replica lag" is enough. Longer updates slow the team down and create context switching. Save detailed analysis for the postmortem.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    How does context switching hurt productivity during incidents?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Every tool switch breaks your flow and forces you to reorient. These interruptions stack up over an incident and slow down resolution even when the technical fix is straightforward. This directly impacts <strong>MTTR</strong> (mean time to resolution).
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What's a good on-call rotation for growing teams?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Start with a primary and a backup, and make the rotation visible where incidents happen. The key isn't the perfect schedule. It's fast, reliable routing when something breaks.
  </div>
</details>

]]></content:encoded>
      <pubDate>Mon, 22 Dec 2025 00:00:00 GMT</pubDate>
      <author>Niketa Sharma</author>
      <category><![CDATA[incident-management]]></category>
      <category><![CDATA[incident-response]]></category>
      <category><![CDATA[coordination]]></category>
      <category><![CDATA[context-switching]]></category>
      <category><![CDATA[engineering-productivity]]></category>
      <category><![CDATA[incident-coordination]]></category>
      <category><![CDATA[mttr]]></category>
      <category><![CDATA[incident-workflow]]></category>
      <category><![CDATA[sre]]></category>
      <category><![CDATA[devops]]></category>
      <category><![CDATA[incident-commander]]></category>
    </item>
    <item>
      <title><![CDATA[Scaling Incident Management: A Guide for Teams of 40-180 Engineers]]></title>
      <link>https://runframe.io/blog/scaling-incident-management</link>
      <guid>https://runframe.io/blog/scaling-incident-management</guid>
      <description><![CDATA[Before building anything, we wanted to understand how teams actually handle incidents in production. Not the polished version from case studies or the theoretical best practices from SRE books (the me...]]></description>
<content:encoded><![CDATA[<p>Before building anything, we wanted to understand how teams actually handle incidents in production. Not the polished version from case studies or the theoretical best practices from <a href="https://sre.google/books/" target="_blank" rel="noopener noreferrer">SRE books</a>, but the messy, 3 AM reality of what happens when the database goes down.</p>
<p>Over the past few months, we conducted 22 calls and collected 5 async writeups from engineering teams ranging from 12-person startups to 180-person scale-ups (skewing toward teams already using Slack heavily). Some were using established incident management platforms, some were using newer tools, and a surprising number were still running incidents through ad-hoc Slack channels and Python scripts.</p>
<p>Looking for the practical guide? Read <a href="/blog/engineering-productivity-incident-management">The Silent Killer of Engineering Productivity in Incident Management</a>.</p>
<p>We asked the same questions: What works in your incident response? What breaks? What do you wish existed?</p>
<p>The conversations challenged a lot of our assumptions. We expected to hear about cost barriers and alert fatigue. Instead, the problems that kept teams up at night were setup complexity and coordination breakdowns.</p>

<h2 id="what-is-scaling-incident-management">What Is Scaling Incident Management?</h2>
<p>Scaling incident management is the process of evolving your incident response practices as your engineering team grows. What works for a 10-person startup (informal Slack coordination) breaks down at 50 people (needs formal on-call rotations, dedicated tools, clear escalation paths). <a href="/tools/oncall-builder">Build your schedule → Free On-Call Builder</a></p>
<p>Most teams go through four predictable stages:</p>
<ol>
<li><strong>Single Slack channel</strong> (5-15 people)</li>
<li><strong>Python scripts</strong> (15-40 people)</li>
<li><strong>"Should buy a tool" limbo</strong> (40-100 people) ← where most teams get stuck</li>
<li><strong>Formal tool adoption</strong> (100+ people)</li>
</ol>
<p>The challenge isn't technical—it's organizational. As teams grow, informal coordination ("whoever's around handles it") stops working. You need clear ownership, documented processes, and tools that reduce coordination overhead rather than add complexity.</p>
<p>This research examines how 25+ engineering teams navigated these transitions, what blocked them, and what actually worked.</p>

<h3 id="key-findings">Key Findings</h3>
<p><strong>✓ Most teams get stuck at Stage 3</strong> (40-100 people). They've outgrown Python scripts but can't commit to enterprise tools.</p>
<p><strong>✓ Setup complexity blocks adoption, not cost.</strong> Almost no teams mentioned price as the primary barrier.</p>
<p><strong>✓ Coordination matters more than speed.</strong> The technical fix is usually straightforward; getting everyone aligned is the hard part.</p>
<p><strong>✓ 40-50 people is the inflection point.</strong> That's when informal "whoever's around" on-call stops working and formal rotations become necessary.</p>
<h2 id="the-4-stages-of-incident-management-maturity">The 4 Stages of Incident Management Maturity</h2>
<table>
  <caption>The four stages of incident management maturity from startup to enterprise with team sizes, setup time, what works, and what breaks at each stage</caption>
  <thead>
    <tr>
      <th>Stage</th>
      <th>Team Size</th>
      <th>Setup</th>
      <th>What Works</th>
      <th>What Breaks</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>1. Single Slack Channel</strong></td>
      <td>5-15 people</td>
      <td>5 min</td>
      <td>Informal coordination, founder-led</td>
      <td>Multiple concurrent incidents</td>
    </tr>
    <tr>
      <td><strong>2. Python Scripts</strong></td>
      <td>15-40 people</td>
      <td>1 day</td>
      <td>Auto-channel creation, some automation</td>
      <td>Script maintenance, API changes, no docs</td>
    </tr>
    <tr>
      <td><strong>3. "Should Buy Tool" Limbo</strong></td>
      <td>40-100 people</td>
      <td>Months of indecision</td>
      <td>Nothing—stuck evaluating</td>
      <td>Setup complexity, decision fatigue</td>
    </tr>
    <tr>
      <td><strong>4. Formal Tool</strong></td>
      <td>100+ people</td>
      <td>1-2 weeks</td>
      <td>Structured process, clear ownership</td>
      <td>Feature overload, workflow mismatch</td>
    </tr>
  </tbody>
</table>

<p>Most teams get stuck at Stage 3 for 6-12 months before a crisis forces Stage 4 adoption.</p>
<h2 id="what-you39ll-learn">What You'll Learn</h2>
<ul>
<li><a href="#four-stages">The 4 stages every team goes through</a> (and why most get stuck at Stage 3)</li>
<li><a href="#real-reason">Why teams avoid adopting tools</a> (hint: it's not cost)</li>
<li><a href="#just-works">The "just works" gap</a> that tools are missing</li>
<li><a href="#coordination">What actually matters: coordination vs speed</a></li>
<li><a href="#on-call">The on-call rotation inflection point</a></li>
<li><a href="#pattern">The pattern for success</a> based on what worked for teams</li>
</ul>
<h2 id="four-stages">The 4 Stages of Scaling Incident Management (And Why Teams Get Stuck at Stage 3)</h2>

<p>This pattern showed up in roughly 20 of the 25+ conversations; the wording differed, but the structure was consistent.</p>
<p><strong>Stage 1: The Single Slack Channel (5-15 people)</strong></p>
<p>Everything goes into #incidents. One of the founders or senior engineers declares "we have an incident," people jump in, someone figures it out, and everyone moves on.</p>
<p>One CTO at a 10-person startup told us: "We have maybe two real incidents a month. Why would I pay $200/month for a tool when a Slack channel works fine?"</p>
<p>Fair point. At this stage, the Slack channel IS the incident management system.</p>
<p><strong>Stage 2: The Python Script Phase (15-40 people)</strong></p>
<p>Once you hit two concurrent incidents, the single channel breaks down. Conversations overlap. People lose track of who's working on what. The history becomes impossible to parse.</p>
<p>So someone (usually a senior engineer who's annoyed by the chaos) spends an afternoon writing a script that:</p>
<ul>
<li>Creates a dedicated Slack channel per incident</li>
<li>Posts to a Notion page or Linear issue</li>
<li>Maybe tags the right people based on keywords</li>
</ul>
<p>This works great. For a few months.</p>
<p>Then something changes: the engineer who wrote it leaves, gets promoted, or just stops maintaining it. Sometimes Slack's API changes. And weird things start happening.</p>
<p>We heard from an engineering manager at a Series B company: "Our script created 11 channels for the same incident last month. Turns out it was triggering on every alert notification, not just the initial one. Nobody caught it because the person who wrote it had left six months ago, and honestly, we were all scared to touch the code."</p>
<p>We asked to see the script. It was 380 lines of Python with zero comments and variable names like <code>ch_id</code> and <code>usr_grp_2</code>.</p>
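<p>For contrast, the core of such a script, with the idempotency guard that would have prevented the 11-channel bug, fits in a few dozen commented lines. This is a sketch, not any team's actual code: the function names and the in-memory <code>seen</code> set are illustrative, and a real version would call Slack's <code>conversations_create</code> API and persist its state somewhere durable.</p>

```python
import re
from datetime import date
from typing import Optional

# Incident keys we've already opened a channel for. Persisting this set
# (a file, Redis, a database row) is what prevents the "11 channels for
# one incident" bug: repeated alert notifications hit the guard below.
seen = set()

def channel_name(title: str, on: Optional[date] = None) -> str:
    """Build a Slack-safe channel name like 'inc-2026-04-08-db-timeouts'."""
    on = on or date.today()
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")[:40]
    return f"inc-{on.isoformat()}-{slug}"

def create_incident_channel(incident_key: str, title: str, slack=None):
    """Create one channel per incident; ignore duplicate triggers."""
    if incident_key in seen:  # idempotency guard
        return None
    seen.add(incident_key)
    name = channel_name(title)
    if slack is not None:
        slack.conversations_create(name=name)  # slack_sdk WebClient call
    return name
```

<p>The guard is the part that matters; everything else is string formatting. The scripts that go feral in Stage 2 are almost always missing exactly this check.</p>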
<p>Before you rebuild, consider the real cost. Building custom incident tooling typically costs <a href="/blog/incident-management-build-or-buy">3-8x more than buying over 3 years</a>.</p>
<p><strong>Stage 3: The "We Should Probably Buy a Tool" Discussion (40-100 people)</strong></p>
<p>This is where we found most teams stuck. It's also the point where formal on-call rotations become necessary. See <a href="/blog/on-call-rotation-guide">our on-call rotation guide with weekly primary+backup schedules and 5-minute escalation rules</a>.</p>
<p>They've outgrown the janky script. Incidents are happening more frequently, maybe 8-12 per month now. The script breaks in new and creative ways. Everyone agrees they need something more robust.</p>
<p>So they start evaluating tools.</p>
<p>And then... nothing happens for months.</p>
<p>At first, we thought this was about price. The tools are expensive (roughly $15-20 per user per month from what teams shared, so ballpark $750-1,000/month for a 50-person team).</p>
<p>But when we dug deeper, price wasn't the main blocker.</p>
<p>A VP of Engineering explained: "We got budget approved for an incident management platform. Then our platform lead spent two weeks trying to set it up. He got frustrated with the escalation policies config and basically gave up. We're still using the script."</p>
<p>Another team had bought a tool, used it for one incident, and then just stopped. When we asked why, the EM said: "I think people found it easier to just create the Slack channel manually. We still pay for it, we just don't use it."</p>
<p><strong>Stage 4: Finally Adopting a Tool (Usually Post-Crisis)</strong></p>
<p>The teams that successfully adopted a tool almost always had the same trigger: a bad incident that exposed the gaps in their janky setup.</p>
<p>"We had a P0 on Black Friday," a CTO shared. "Our Python script was down (ironically) and we ended up with three different incident channels that people created manually, each with different subsets of the team. It was chaos. The next Monday I told our platform team: find a tool, get it set up, I don't care what it costs."</p>
<p>They adopted a modern incident management platform and were live within a week.</p>
<p>What struck us: this company had been "planning to adopt a tool" for over a year. The incident finally forced the decision.</p>
<h2 id="real-reason">Why Teams Avoid Incident Management Tools (It's Not Cost)</h2>

<p>We went into these conversations assuming the barrier was price. SaaS incident management tools are expensive, and startups are budget-conscious.</p>
<p>Cost came up, but it wasn't the first thing teams complained about.</p>
<p>The real barrier? <strong>Decision fatigue and setup overhead.</strong></p>
<h3 id="the-enterprise-platform-problem">The Enterprise Platform Problem</h3>
<p>Eight teams had tried to set up enterprise incident management platforms and abandoned the process mid-way.</p>
<p>The pattern was similar across most teams: an engineer starts the setup, gets to the escalation policies configuration, realizes they need to make a dozen decisions they don't have answers for, and just stops. We also heard about integration complexity and change-management resistance as secondary blockers.</p>
<p>"I opened the setup guide and it was 40+ pages," an engineer mentioned. "Questions like: How many severity levels do we need? What's our escalation policy? Who's the primary, secondary, and tertiary on-call for each service? We're 30 people. I don't know the answer to these questions. So I closed the tab and went back to our script."</p>
<p>Enterprise platforms are comprehensive. But comprehensive means complex. And complex means decisions.</p>
<p>For teams that already have a mature incident response process, these tools are powerful. They give you the flexibility to model complex <strong>incident response workflows</strong> with clear roles.</p>
<p>But for teams still figuring out their process? All that flexibility is overwhelming. You're still defining your <a href="/learn/incident-commander">incident commander</a> role, building your first <a href="/learn/runbook">runbook</a>, and establishing a <a href="/learn/blameless-postmortem">blameless culture</a> around postmortems.</p>
<h3 id="the-feature-overload-problem">The Feature Overload Problem</h3>
<p>Some newer tools have improved the setup experience compared to legacy platforms. But teams mentioned a different issue: feature overload.</p>
<p>"The tool we tried is great," one EM said. "But we maybe use 20% of the features. AI postmortems, <a href="/learn/status-page">status page</a> updates, call integrations... nice to have, but not what we actually needed. We just wanted a way to create incident channels and track what happened."</p>
<p>Another team had a more specific complaint: "The voice call feature is cool, but we're async-first. Nobody wants to jump on a call at 11 PM when an incident happens. We just want a Slack channel and good thread organization."</p>
<p>The insight: Tools often impose a specific incident response philosophy (synchronous, structured, process-heavy) that doesn't match how all teams actually work.</p>
<h2 id="just-works">The "Just Works" Gap in Incident Management Tools</h2>

<p>The pattern became clear after the fifth conversation:</p>
<blockquote>
<p>"I just want incident management to work out of the box. I don't want to become an expert in incident response theory just to configure a tool. I want reasonable defaults that make sense for a team our size."</p>
</blockquote>
<p>This quote is from a tech lead at a 45-person startup. But we heard variations of this repeatedly.</p>
<p>What does "just works" actually mean? We asked teams to be specific.</p>
<p><strong>Reasonable defaults:</strong></p>
<ul>
<li>"If someone is primary on-call, try them first. Wait 5 minutes, then escalate to their backup. Don't make me design an escalation policy from scratch."</li>
<li>"Give me 3 severity levels: P0 (customer-facing), P1 (degraded), P2 (non-urgent). Don't make me define my own severity matrix." <a href="/tools/incident-severity-matrix-generator">Build your matrix → Free Severity Matrix Generator</a></li>
<li>"Auto-create an incident channel with a sensible name. Post updates there. That's 90% of what we need."</li>
</ul>
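<p>Those defaults are small enough to write down. Here's a sketch of what they might look like in code; the names (<code>EscalationPolicy</code>, <code>escalate_after_min</code>) are our own illustration, not any vendor's configuration schema:</p>

```python
from dataclasses import dataclass

# Illustrative defaults for a ~50-person team; every name here is an
# assumption for the sketch, not a real tool's configuration schema.
SEVERITIES = {
    "P0": "customer-facing outage",
    "P1": "degraded service",
    "P2": "non-urgent issue",
}

@dataclass
class EscalationPolicy:
    primary: str
    backup: str
    escalate_after_min: int = 5  # try the primary first, then the backup

    def who_to_page(self, minutes_unacked: int) -> str:
        """Page the primary until the timeout passes unacknowledged."""
        if minutes_unacked < self.escalate_after_min:
            return self.primary
        return self.backup
```

<p>Defaults like these cover the 90% case. The point is that a team shouldn't have to invent them from scratch before creating their first incident channel.</p>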
<p><strong>Low maintenance:</strong></p>
<ul>
<li>"When someone joins or leaves the team, it should just update automatically from Slack/email. I don't want to maintain a separate user list."</li>
<li>"If our integrations break, tell me clearly what broke and how to fix it. Don't make me dig through error logs."</li>
</ul>
<p><strong>Progressive complexity:</strong></p>
<ul>
<li>"Let me start simple: one on-call rotation, basic alerts, Slack channels. Then when we grow, let me add more complexity. Don't force me to set up stakeholder notifications and status pages on day one; I'll add those when I need them."</li>
</ul>
<p>One technical founder summed it up: "I want the Heroku of incident management. Just make it work. I'll customize it later if I need to."</p>
<h2 id="the-alert-fatigue-myth">The Alert Fatigue Myth</h2>
<p>We expected to hear a lot about alert fatigue: too many alerts, teams ignoring notifications, etc.</p>
<p>And we did hear about it. But not in the way we expected.</p>
<p>The conventional wisdom is: "Companies have too many alerts. They need better monitoring and smarter alerting rules."</p>
<p>But what we heard was more nuanced.</p>
<p><strong>The problem wasn't volume. It was relevance.</strong></p>
<p>"We get maybe 15 alerts per day," an SRE explained. "That's not overwhelming. The problem is that 12 of them don't actually need a response. So we've learned to ignore alerts. Which means when a real incident happens, it takes us longer to notice because we're conditioned to ignore the notifications."</p>
<p>Another team had the opposite problem: too few alerts.</p>
<p>"We're worried we're under-alerting," an engineering lead said. "We've tuned our alerts to be very conservative because we don't want to wake people up for nothing. But I think we're missing real issues because we're not alerting enough."</p>
<p>What both teams wanted: better signal-to-noise ratio.</p>
<p>One team had found a creative solution: "We have two alert channels. #alerts-info for things that are off but not urgent. And #alerts-action for things that need immediate response. The key is that #alerts-action is almost always quiet. When something hits that channel, everyone knows it's real."</p>
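<p>The routing rule behind that setup is tiny. A sketch, with <code>customer_facing</code> and <code>auto_recovered</code> standing in for whatever signals your monitoring actually emits:</p>

```python
def route_alert(customer_facing: bool, auto_recovered: bool) -> str:
    """Only alerts that need a human land in #alerts-action; everything
    else goes to #alerts-info. Keeping #alerts-action quiet is the point."""
    if customer_facing and not auto_recovered:
        return "#alerts-action"
    return "#alerts-info"
```

<p>The code is trivial; the three months went into agreeing on which signals belong on which side of the split.</p>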
<p>Simple, but apparently this took them three months of experimentation to figure out. For the industry-wide data, our <a href="/blog/state-of-incident-management-2025">State of Incident Management 2025 research</a> found 73% of organizations had outages from ignored alerts.</p>
<h2 id="coordination">Incident Coordination vs Speed: What Actually Matters</h2>

<p>The most counterintuitive finding?</p>
<p>We expected teams to focus on <a href="https://www.atlassian.com/incident-management/kpis/common-metrics" target="_blank" rel="noopener noreferrer">MTTR (Mean Time to Resolution)</a>: how quickly they fix incidents.</p>
<p>When we asked "What matters most in your incident response?" few teams mentioned MTTR. To be clear: leaders still track MTTR as a KPI. But the engineers and on-call responders we spoke with pointed somewhere else entirely in their <strong>incident response workflow</strong>.</p>
<p>The most common answer? <strong>Coordination and communication.</strong></p>
<p>"The technical fix is usually straightforward," a CTO noted. "The hard part is making sure everyone knows what's happening, who's working on what, and what's already been tried."</p>
<p>A technical lead at a 60-person company told us: "Our worst incidents aren't the ones that take longest to fix. They're the ones where communication breaks down. Three people debugging the same thing. The customer support team not knowing we're working on it. Management asking for updates every 10 minutes because they haven't heard anything."</p>
<p>This kept coming up: <strong>The incident itself is usually solvable. The coordination problem is harder.</strong></p>
<p>After every incident, teams need effective postmortems. See our <a href="/blog/post-incident-review-template">post-incident review templates with 3 ready-to-use formats (15-minute, standard, and comprehensive)</a>.</p>
<p>This explains why <a href="/slack">Slack-native incident management</a> is popular. It's not that Slack is the best tool for incident management. It's that Slack is where coordination already happens.</p>
<p>One engineer put it perfectly: "During an incident, I need to coordinate with 5-10 people. If your tool requires me to leave Slack to manage the incident, you're adding overhead at the worst possible time. I'll just coordinate in Slack and skip your tool."</p>
<p>Interestingly, three teams mentioned they had <em>more</em> incidents after adopting a formal tool. When we dug in, it turned out they weren't creating problems. They were finally tracking incidents they'd previously ignored. The tool didn't increase incidents; it made existing problems visible. As one team put it: "We realized we were having 15-20 incidents a month, not the 5-6 we thought. We just weren't counting the ones we fixed quickly."</p>
<h2 id="on-call">The On-Call Rotation Problem: When Teams Hit 40-50 People</h2>

<p>We asked teams about their on-call setup. This was eye-opening.</p>
<p><strong>Most teams didn't have formal on-call rotations.</strong></p>
<p>Their approach? "Whoever's around handles it."</p>
<p>At first this seemed dysfunctional. But when we dug into it, we found it was often intentional.</p>
<p>"We tried doing formal on-call," an EM shared. "It created more problems than it solved. People would wait for the on-call person instead of just fixing things. And our incidents are unpredictable. Sometimes they need the database person, sometimes the frontend person. A generic on-call rotation didn't make sense."</p>
<p>Their solution: "We have a #incidents channel. When something breaks, someone posts. Usually 2-3 people who are around and know that system jump in. It's informal but it works."</p>
<p>For teams under 40 people, this informal approach was common.</p>
<p>But teams over 50 people almost always had formal rotations. "You can't rely on 'whoever's around' when you're 80 people across 5 timezones," a VP of Engineering explained.</p>
<p>The inflection point seemed to be around 40-50 people. That's when informal coordination stops scaling.</p>
<h2 id="pattern">Incident Management Best Practices: What Works at Each Stage</h2>

<p>Based on these conversations, here's what we'd suggest:</p>
<p><strong>If you're at the "single Slack channel" stage:</strong></p>
<p>Don't rush to adopt a tool. If incidents are rare (&lt; 5/month) and the team is small (&lt; 20 people), a Slack channel is probably fine. For teams under 20 people, see <a href="/blog/engineering-productivity-incident-management">our guide to reducing context switching during incidents with a 10-minute coordination framework</a>.</p>
<p>But do document your incident response process. Even just a simple doc: "Here's how we handle incidents. Here's who owns what system."</p>
<p><strong>If you're maintaining a janky Python script:</strong></p>
<p>You're probably at the point where a proper tool makes sense. But don't just start evaluating tools randomly.</p>
<p>The successful teams we talked to did this first: They audited their process.</p>
<ul>
<li>How many incidents/month are we handling?</li>
<li>What breaks in our current process?</li>
<li>Do we need formal on-call or is informal okay?</li>
<li>What actually matters: speed, coordination, documentation?</li>
</ul>
<p>Then they evaluated tools based on those answers.</p>
<p><strong>If you're evaluating tools:</strong></p>
<p><strong>Migrating from OpsGenie?</strong> With OpsGenie shutting down April 2027, <a href="/blog/opsgenie-migration-guide">read our complete migration guide</a> with real timelines, pricing comparisons, and step-by-step plans from teams who've already migrated.</p>
<p>For general tool evaluation:</p>
<p>Don't just do free trials. Actually run a real incident through each tool.</p>
<p>Pay attention to:</p>
<ul>
<li><strong>Setup time</strong> - If you get frustrated during setup, your team will too</li>
<li><strong>Workflow match</strong> - Does it fit how you actually work (async vs sync, lightweight vs process-heavy)?</li>
<li><strong>Appropriate complexity</strong> - Is it sized right for your team, or built for a different scale?</li>
</ul>
<p>The right tool is the one that matches YOUR workflow, not what's popular or feature-rich.</p>
<p><strong>If you already have a tool but nobody uses it:</strong></p>
<p>This was more common than we expected. Teams paying for tools they've abandoned.</p>
<p>Figure out why. Usually it's one of:</p>
<ul>
<li>Setup was too complex (nobody finished configuring it)</li>
<li>It didn't match the team's workflow (tool is synchronous, team is async)</li>
<li>It added overhead instead of reducing it</li>
</ul>
<p>Sometimes the answer is "switch tools." Sometimes it's "finish the setup you abandoned." Sometimes it's "go back to Slack and cancel the subscription."</p>
<p>For the tactical playbook, read our <a href="/blog/engineering-productivity-incident-management">incident coordination guide</a>.</p>
<h2 id="looking-for-incident-management-software">Looking for Incident Management Software?</h2>
<p>We're building Runframe based on these insights: reasonable defaults that work out of the box, low maintenance overhead, lives in Slack where teams coordinate, and progressive complexity as you grow. Built for teams stuck between Python scripts and enterprise platforms (20-100 people).</p>
<p>We're in private beta. If you're dealing with these challenges, we'd love to hear about your setup.</p>
<p><a href="/auth?mode=signup">Get started free</a> or email us at <a href="mailto:hello@runframe.io" target="_blank" rel="noopener noreferrer">hello@runframe.io</a></p>

<p><strong>Want the next step?</strong> Read <a href="/blog/engineering-productivity-incident-management">our incident coordination guide to reduce context switching</a>, <a href="/blog/post-incident-review-template">post-incident review templates that work</a>, or <a href="/blog/on-call-rotation-guide">our on-call rotation guide</a>.</p>

<h2 id="scaling-incident-management-faq">Scaling Incident Management FAQ</h2>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    At what team size should I adopt an incident management tool?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Most teams successfully adopt tools between 40-100 people. Below 40, a Python script or Slack channel often works fine. Above 100, you need structured incident management with formal on-call rotations.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    Why do teams get stuck at Stage 3?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Setup complexity and decision fatigue. Enterprise tools require dozens of upfront decisions (severity levels, escalation policies, on-call schedules) that teams don't have answers for yet. This blocks adoption for months.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What's the biggest incident management mistake teams make?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Choosing tools based on features rather than workflow fit. A feature-rich tool that doesn't match how your team actually works (async vs sync, lightweight vs process-heavy) won't get adopted.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    When should I move from a Python script to a real tool?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    When the script breaks frequently, the person who wrote it has left, or you're handling 8+ incidents per month. If setup complexity is blocking you, look for tools with "just works" defaults rather than enterprise platforms.
  </div>
</details>
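<p>For concreteness, the "Python script" stage mentioned above is often little more than a Slack webhook wrapper along these lines. This is an illustrative sketch, not any team's actual script: the webhook URL, message format, and <code>declare_incident</code> helper are all placeholders.</p>
<pre><code class="language-python">```python
# Minimal "Stage 1" incident script: post an alert to a Slack channel
# via an incoming webhook. Everything here is a placeholder sketch.
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder


def declare_incident(title: str, severity: str = "sev2") -> str:
    """Build the incident message and return the JSON payload as a string."""
    payload = {
        "text": f":rotating_light: [{severity.upper()}] {title}"
                f" - reply in thread to coordinate"
    }
    body = json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    # urllib.request.urlopen(req)  # uncomment once a real webhook URL is set
    return body.decode("utf-8")
```</code></pre>
<p>This works fine right up until it doesn't: no escalation, no acknowledgment, and one person who knows how it's deployed. That's the breaking point the answer above describes.</p>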

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What's more important: MTTR or coordination?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Coordination. Engineers consistently cited coordination breakdowns as their biggest pain point, not incident duration. The technical fix is usually straightforward; getting everyone aligned is harder.
  </div>
</details>

<details class="faq-item group border-b border-border/10 pb-6 mb-2">
  <summary class="cursor-pointer font-semibold text-base text-foreground flex items-center justify-between hover:text-blue-600 dark:hover:text-blue-400 transition-colors list-none py-3 select-none">
    What's the on-call rotation inflection point?
    <svg class="w-5 h-5 transform group-open:rotate-180 transition-transform duration-200 flex-shrink-0 text-muted-foreground" fill="none" viewBox="0 0 24 24" stroke="currentColor">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M19 9l-7 7-7-7" />
    </svg>
  </summary>
  <div class="mt-3 text-muted-foreground leading-relaxed pl-4 border-l-2 border-blue-500/20">
    Around 40-50 people. Below that, informal "whoever's around" coordination often works. Above 50, you need formal rotations with clear primary/backup ownership.
  </div>
</details>

<p><em>Thanks to the 25+ engineering teams who shared their incident war stories with us. Several of you will probably recognize your quotes in this piece (anonymized). If we got anything wrong, let us know. We're still learning.</em></p>
]]></content:encoded>
      <pubDate>Mon, 15 Dec 2025 00:00:00 GMT</pubDate>
      <dc:creator>Niketa Sharma</dc:creator>
      <category><![CDATA[incident-management]]></category>
      <category><![CDATA[scaling-incident-management]]></category>
      <category><![CDATA[engineering-teams]]></category>
      <category><![CDATA[devops]]></category>
      <category><![CDATA[sre]]></category>
      <category><![CDATA[research]]></category>
      <category><![CDATA[incident-response]]></category>
      <category><![CDATA[on-call-rotation]]></category>
      <category><![CDATA[postmortem]]></category>
      <category><![CDATA[mttr]]></category>
      <category><![CDATA[coordination]]></category>
      <category><![CDATA[slack-incident-management]]></category>
    </item>
  </channel>
</rss>