Escalation Policy

Predefined rules for when and how to escalate incidents to additional resources or management.

By Niketa Sharma, Founder at Runframe · Last updated Mar 2026

Call for backup

An escalation policy says: if the first person can't fix it (or is asleep), call the next person. That's it. Timeouts, channels, rotation order? All just ways to make that idea work at 3 AM when nobody is thinking straight.

Three kinds of escalation

  1. Functional: "I don't have access to the database." You call the database person.
  2. Hierarchical: "This is bigger than me." You call the manager or VP.
  3. Automatic: "Primary didn't acknowledge in 5 minutes." The system calls the secondary.

Most real incidents use a mix. The alert auto-escalates to the secondary, who realizes it's a billing issue and functionally escalates to the payments team.

Primary/secondary rotation

The setup is simple: the primary on-call gets all alerts. If they don't acknowledge within a timeout, the secondary gets paged.

Rotation order matters here. Many teams make last week's primary this week's secondary. The person who just finished primary duty knows what broke, what got deployed, and what's been flaky. When an alert escalates to them, they don't need a context dump. They already have the mental model.

This also helps the incoming primary. If they hit something they haven't seen before, the secondary can say "yeah, that's the thing from Thursday, here's the runbook" instead of starting from scratch.
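
The "last week's primary is this week's secondary" rule is easy to compute from a plain roster. Here is a minimal sketch, assuming a fixed weekly rotation; the roster names and the `on_call_pair` function are hypothetical and not taken from any particular tool.

```python
# Hypothetical roster; the names are placeholders.
ROSTER = ["ana", "ben", "chen", "dee"]


def on_call_pair(week_index: int, roster: list[str]) -> tuple[str, str]:
    """Return (primary, secondary) for a given week number.

    The secondary is whoever was primary the week before.
    """
    if len(roster) < 2:
        raise ValueError("need at least two people for a distinct secondary")
    primary = roster[week_index % len(roster)]
    # One slot back in the roster is last week's primary.
    secondary = roster[(week_index - 1) % len(roster)]
    return primary, secondary
```

With this roster, week 1 gives `("ben", "ana")`: Ana was primary in week 0, so she backs up Ben in week 1.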

We wrote a longer guide on how to set this up: On-Call Rotation Guide.

What a multi-tier chain looks like

Here's a typical SEV1 escalation chain:

  1. 0 min: Alert fires. Primary gets a Slack DM and push notification.
  2. 5 min: No ack. Phone call to primary.
  3. 10 min: Still nothing. Page the secondary via Slack and phone.
  4. 20 min: Neither responded. Page the engineering manager.
  5. 30 min: Page the VP or Incident Commander.

Notice each level adds a different channel. Slack didn't work? Phone call. Phone didn't work? Try the next person through a different channel. Single-channel escalation is how alerts go unnoticed for hours.
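
The chain above can be written down as data. The sketch below is one possible shape, not a real paging tool's schema: the `Step` fields and `steps_due` helper are made up for illustration, and the channel for the last two levels is a placeholder because the chain doesn't specify one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    after_minutes: int          # minutes since the alert fired
    target: str                 # role to notify
    channels: tuple[str, ...]   # how to reach them


# Mirrors the SEV1 chain above. "page" is a placeholder channel:
# the chain does not say how the manager or VP is reached.
SEV1_CHAIN = [
    Step(0, "primary", ("slack_dm", "push")),
    Step(5, "primary", ("phone",)),
    Step(10, "secondary", ("slack", "phone")),
    Step(20, "engineering_manager", ("page",)),
    Step(30, "vp_or_incident_commander", ("page",)),
]


def steps_due(chain: list[Step], minutes_unacked: float) -> list[Step]:
    """Steps that should have fired for an alert unacknowledged this long."""
    return [s for s in chain if s.after_minutes <= minutes_unacked]
```

For an alert that has gone 12 minutes without an ack, `steps_due(SEV1_CHAIN, 12)` returns the first three steps: both primary notifications and the secondary page.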

Pick your timeouts by severity

A SEV0 and a SEV3 shouldn't use the same timeout window. That seems obvious, but a lot of teams set one timeout for everything.

  • SEV0 (total outage): 5-minute timeouts. You're losing money every minute.
  • SEV1 (major degradation): 10-15 minutes. Bad, but partial functionality remains.
  • SEV2 (minor impact): 30 minutes. Needs attention, not a fire.
  • SEV3 (low impact): No auto-escalation. Handle it during business hours.
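
The severity list above maps naturally to a lookup table. This is a rough sketch with hypothetical names (`ESCALATION_TIMEOUT_MIN`, `should_escalate`). It assumes SEV1 uses the upper end of its 10-15 minute range; pick whatever fits your team.

```python
from typing import Optional

# Minutes without an ack before auto-escalating.
# None means no auto-escalation (handle during business hours).
ESCALATION_TIMEOUT_MIN: dict[str, Optional[int]] = {
    "SEV0": 5,
    "SEV1": 15,   # assumption: upper end of the 10-15 minute range
    "SEV2": 30,
    "SEV3": None,
}


def should_escalate(severity: str, minutes_unacked: float) -> bool:
    """True once an unacknowledged alert has passed its severity's timeout."""
    timeout = ESCALATION_TIMEOUT_MIN[severity]
    if timeout is None:
        return False
    return minutes_unacked >= timeout
```

A SEV2 that has sat for 10 minutes stays with the primary, while a SEV0 at 5 minutes moves on. A SEV3 never auto-escalates, no matter how long it waits.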

More on severity definitions: Incident Severity Levels.

The golden rule

Escalating is not a sign of weakness. It's good judgment. Better to wake up the boss than to let the site stay down for 4 hours while you try to figure it out alone.

Example: The Silent Pager

The primary on-call engineer dropped their phone in the toilet. The alert fired, but they didn't answer.

Impact
Without an escalation policy, the alert sat unacknowledged for 2 hours.
Resolution
Added a policy: if the alert is not acknowledged within 15 minutes, page the secondary. If it is still unacknowledged at 30 minutes, page the manager.

Why an Escalation Policy Matters

Without an escalation policy, an alert sits there until someone notices. With one, the next person gets paged automatically.

On-call engineers sleep better when they know someone else will get called if they miss it.

If you rotate last week's primary into this week's secondary slot, the backup already knows what broke recently. Escalations go faster.

Common Pitfalls

Dead Ends
Escalation paths that end with a generic email address, such as a shared ops inbox. Escalation must go to a named person with a phone number.
Single-Channel Notifications
A Slack message won't wake someone up at 3 AM. If every escalation level uses Slack, your policy has a single point of failure. Alternate channels: Slack, then phone, then SMS.
Same Timeout for Every Severity
A total outage should not wait 30 minutes to escalate. SEV0 gets 5-minute timeouts. SEV2 gets 30. Treat them differently.
Stale Contacts
People leave teams, change roles, go on vacation. If your escalation policy isn't tied to the on-call schedule, it goes stale within a month. Automate it.
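
Some of these pitfalls can be caught by a simple check before a policy goes live. The sketch below covers only the first two, dead ends and single-channel policies. The `Level` shape and `lint_policy` function are hypothetical, and the phone numbers in the usage note are fictional.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Level:
    contact: str                  # a person's name, or an alias/inbox
    phone: Optional[str]          # None if no phone number is on file
    channels: list[str]           # e.g. ["slack"], ["phone", "sms"]
    is_named_person: bool = True  # False for group aliases or shared inboxes


def lint_policy(levels: list[Level]) -> list[str]:
    """Return a list of human-readable problems; empty means none found."""
    if not levels:
        return ["policy has no levels"]
    problems = []
    last = levels[-1]
    if not last.is_named_person or not last.phone:
        problems.append("dead end: last level is not a named person with a phone number")
    # Collect every channel used anywhere in the policy.
    all_channels = {c for lvl in levels for c in lvl.channels}
    if len(all_channels) < 2:
        problems.append("single channel: every level uses the same channel")
    return problems
```

A one-level policy pointing at a Slack-only alias trips both checks. A two-level policy such as `Level("Ana", "+1-555-0100", ["slack"])` followed by `Level("Ben", "+1-555-0101", ["phone"])` passes clean.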

How to Use an Escalation Policy

  • Time-Based: Set different timeouts per severity: 5 minutes for SEV0, 10-15 for SEV1, 30 for SEV2.
  • Named People: Every level points to a person with a phone number, not a group alias or shared inbox.
  • Mix Channels: Slack first, then phone, then SMS. If one channel fails, the next one should be different.
  • Rotation Order: Last week's primary becomes this week's secondary. They already know what's broken.
