When should I escalate an incident?

Three situations: the primary on-call hasn't acknowledged within the timeout window, you're stuck and don't have the access or knowledge to fix it, or the severity just got worse (a SEV2 turned into a SEV1 because more customers are affected). When in doubt, escalate. Nobody has ever been fired for paging backup too early.
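The three triggers above can be sketched as a small check. This is only an illustration: `IncidentState`, `escalation_reasons`, and every field name are hypothetical, and the severity ranking assumes the SEV0 to SEV3 scale where a lower number is worse.

```python
from __future__ import annotations
from dataclasses import dataclass

# Assumed ranking: a lower number means a more severe incident.
SEVERITY_RANK = {"SEV0": 0, "SEV1": 1, "SEV2": 2, "SEV3": 3}

@dataclass
class IncidentState:
    minutes_since_page: float
    acknowledged: bool
    ack_timeout_minutes: float
    responder_is_blocked: bool  # lacks the access or knowledge to fix it
    initial_severity: str
    current_severity: str

def escalation_reasons(state: IncidentState) -> list[str]:
    """Return every reason to escalate; an empty list means keep working."""
    reasons = []
    if not state.acknowledged and state.minutes_since_page >= state.ack_timeout_minutes:
        reasons.append("ack timeout")
    if state.responder_is_blocked:
        reasons.append("responder blocked")
    if SEVERITY_RANK[state.current_severity] < SEVERITY_RANK[state.initial_severity]:
        reasons.append("severity worsened")
    return reasons
```

Any non-empty result is a reason to page the next person. The function deliberately returns all matching reasons so the page can say why it was sent.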

What is the difference between functional and hierarchical escalation?

Functional is horizontal: you pull in someone with different expertise, like a database specialist. Hierarchical is vertical: you pull in someone with more authority, like a manager or VP, because the blast radius is too big for you to own alone. Most real incidents involve both.

Why should last week's primary be this week's secondary?

Because they just spent a week watching the system. They know what got deployed, what broke, and what's been flaky. When they get paged as secondary, they can help without a context dump. The alternative is a secondary who has to ask "wait, what changed?" while the site is down.

How many levels should an escalation policy have?

Three or four: primary, secondary, engineering manager, VP or Incident Commander. If you need more than four levels, your timeouts are probably too short or you need to split into per-service escalation chains.

Escalation Policy

Predefined rules for when and how to escalate incidents to additional resources or management.

By Niketa Sharma, Founder at Runframe · Last updated Mar 2026

Call for backup

An escalation policy says: if the first person can't fix it (or is asleep), call the next person. That's it. Timeouts, channels, rotation order? All just ways to make that idea work at 3 AM when nobody is thinking straight.

Three kinds of escalation

Functional: "I don't have access to the database." You call the database person.
Hierarchical: "This is bigger than me." You call the manager or VP.
Automatic: "Primary didn't acknowledge in 5 minutes." The system calls the secondary.

Most real incidents use a mix. The alert auto-escalates to the secondary, who realizes it's a billing issue and functionally escalates to the payments team.

Primary/secondary rotation

The setup is simple: the primary on-call gets all alerts. If they don't acknowledge within a timeout, the secondary gets paged.

Rotation order matters here. Many teams make last week's primary this week's secondary. The person who just finished primary duty knows what broke, what got deployed, and what's been flaky. When an incident escalates to them as secondary, they don't need a context dump. They already have the mental model.

This also helps the incoming primary. If they hit something they haven't seen before, the secondary can say "yeah, that's the thing from Thursday, here's the runbook" instead of starting from scratch.
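A minimal sketch of that rotation order, assuming a fixed roster and a simple week counter. The function name `assign_roles` and the roster names are made up for illustration; a real scheduler would also handle swaps and time off.

```python
from __future__ import annotations

def assign_roles(roster: list[str], week: int) -> dict[str, str]:
    """Primary walks through the roster; secondary is last week's primary."""
    if len(roster) < 2:
        raise ValueError("need at least two people for a distinct secondary")
    primary = roster[week % len(roster)]
    # Python's % always returns a non-negative index, so week 0 wraps around.
    secondary = roster[(week - 1) % len(roster)]
    return {"primary": primary, "secondary": secondary}

roster = ["ana", "ben", "chloe"]  # hypothetical team
for week in range(3):
    print(week, assign_roles(roster, week))
```

Whoever is primary in week N shows up as secondary in week N + 1, which is the property the rotation depends on.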

We wrote a longer guide on how to set this up: On-Call Rotation Guide.

What a multi-tier chain looks like

Here's a typical SEV1 escalation chain:

0 min: Alert fires. Primary gets a Slack DM and push notification.
5 min: No ack. Phone call to primary.
10 min: Still nothing. Page the secondary via Slack and phone.
20 min: Neither responded. Page the engineering manager.
30 min: Page the VP or Incident Commander.

Notice that consecutive steps switch channels. Slack didn't work? Phone call. Phone didn't work? Try the next person through a different channel. Single-channel escalation is how alerts go unnoticed for hours.
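The SEV1 chain above can be written down as data, which makes it easy to ask "who should have been notified by now?". This is a sketch, not any paging tool's real configuration format: `Step`, `SEV1_CHAIN`, and `steps_fired` are hypothetical names, and the `"page"` channel for the manager and VP is a placeholder because the chain does not say which channel those pages use.

```python
from __future__ import annotations
from typing import NamedTuple

class Step(NamedTuple):
    after_minutes: int          # minutes without an acknowledgement
    target: str
    channels: tuple[str, ...]

SEV1_CHAIN = [
    Step(0, "primary", ("slack_dm", "push")),
    Step(5, "primary", ("phone",)),
    Step(10, "secondary", ("slack", "phone")),
    Step(20, "engineering_manager", ("page",)),       # channel unspecified in the chain
    Step(30, "vp_or_incident_commander", ("page",)),  # channel unspecified in the chain
]

def steps_fired(chain: list[Step], minutes_unacked: float) -> list[Step]:
    """Every step whose timeout has elapsed while the alert stayed unacknowledged."""
    return [step for step in chain if step.after_minutes <= minutes_unacked]

for step in steps_fired(SEV1_CHAIN, 12):
    print(step.after_minutes, step.target, step.channels)
```

At 12 minutes without an acknowledgement, the first three steps have fired: both attempts to reach the primary, plus the page to the secondary.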

Pick your timeouts by severity

A SEV0 and a SEV3 shouldn't use the same timeout window. That seems obvious, but a lot of teams set one timeout for everything.

SEV0 (total outage): 5-minute timeouts. You're losing money every minute.
SEV1 (major degradation): 10-15 minutes. Bad, but partial functionality remains.
SEV2 (minor impact): 30 minutes. Needs attention, not a fire.
SEV3 (low impact): No auto-escalation. Handle it during business hours.
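One way to encode the list above is a lookup table keyed by severity. The names `ACK_TIMEOUT_MINUTES` and `should_auto_escalate` are illustrative, and using 15 minutes for SEV1 is an assumption that takes the upper end of the 10-15 minute range.

```python
from typing import Optional

# Minutes to wait for an acknowledgement before paging the next level.
# None means the severity never auto-escalates.
ACK_TIMEOUT_MINUTES: "dict[str, Optional[int]]" = {
    "SEV0": 5,
    "SEV1": 15,   # assumption: upper end of the 10-15 minute range
    "SEV2": 30,
    "SEV3": None, # handled during business hours
}

def should_auto_escalate(severity: str, minutes_unacked: float) -> bool:
    timeout = ACK_TIMEOUT_MINUTES[severity]
    return timeout is not None and minutes_unacked >= timeout
```

Keeping the timeouts in one table makes it obvious when two severities share a value, which is the mistake this section warns about.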

More on severity definitions: Incident Severity Levels.

The golden rule

Escalating is not a sign of weakness. It's good judgment. Better to wake up the boss than to let the site stay down for 4 hours while you try to figure it out alone.

Example: The Silent Pager

“The primary on-call engineer dropped their phone in the toilet. The alert fired, but they didn't answer.”

Impact

Without an escalation policy, the alert sat unacknowledged for 2 hours.

Resolution

Added a policy: if the alert is not acknowledged within 15 minutes, page the secondary. If it is still unacknowledged at 30 minutes, page the manager.

Why Escalation Policy Matters

Without an escalation policy, an alert sits there until someone notices. With one, the next person gets paged automatically.

On-call engineers sleep better when they know someone else will get called if they miss it.

If you rotate last week's primary into this week's secondary slot, the backup already knows what broke recently. Escalations go faster.

Common Pitfalls

Dead Ends

Escalation paths that end with a generic email address, such as a shared team inbox. Escalation must go to a named person with a phone number.

Single-Channel Notifications

A Slack message won't wake someone up at 3 AM. If every escalation level uses Slack, your policy has a single point of failure. Alternate channels: Slack, then phone, then SMS.

Same Timeout for Every Severity

A total outage should not wait 30 minutes to escalate. SEV0 gets 5-minute timeouts. SEV2 gets 30. Treat them differently.

Stale Contacts

People leave teams, change roles, go on vacation. If your escalation policy isn't tied to the on-call schedule, it goes stale within a month. Automate it.
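Three of these pitfalls (dead ends, single-channel notifications, stale contacts) can be caught by a small lint that runs whenever the policy or the schedule changes. This is a sketch under assumptions: `Level`, `lint_policy`, and the idea of passing in the current roster as a set of names are all hypothetical, not part of any specific paging tool.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

@dataclass
class Level:
    contact_name: Optional[str]  # None for a group alias or shared inbox
    phone: Optional[str]
    channels: Tuple[str, ...]

def lint_policy(levels: List[Level], current_roster: Set[str]) -> List[str]:
    """Return a human-readable list of problems; empty means the policy passed."""
    problems = []
    for i, level in enumerate(levels, start=1):
        if not level.contact_name or not level.phone:
            problems.append(f"level {i}: dead end (no named person with a phone number)")
        elif level.contact_name not in current_roster:
            problems.append(f"level {i}: stale contact ({level.contact_name} is not on the current roster)")
    # Fewer than two distinct channels across the whole policy is a single point of failure.
    all_channels = {channel for level in levels for channel in level.channels}
    if len(all_channels) < 2:
        problems.append("policy: single-channel notifications")
    return problems
```

Running a check like this in CI, or on every schedule change, is one way to act on the "Automate it" advice above.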

How to Use Escalation Policy

Time-Based: Set different timeouts per severity. 5 min for SEV0, 15 for SEV1, 30 for SEV2.
Named People: Every level points to a person with a phone number, not a group alias or shared inbox.
Mix Channels: Slack first, then phone, then SMS. If one channel fails, the next one should be different.
Rotation Order: Last week's primary becomes this week's secondary. They already know what's broken.

Related Terms

On-Call Rotation, Incident Commander
