Escalation Policy
Predefined rules for when and how to escalate incidents to additional resources or management.
Call for backup
An escalation policy says: if the first person can't fix it (or is asleep), call the next person. That's it. Timeouts, channels, rotation order? All just ways to make that idea work at 3 AM when nobody is thinking straight.
Three kinds of escalation
- Functional: "I don't have access to the database." You call the database person.
- Hierarchical: "This is bigger than me." You call the manager or VP.
- Automatic: "Primary didn't acknowledge in 5 minutes." The system calls the secondary.
Most real incidents use a mix. The alert auto-escalates to the secondary, who realizes it's a billing issue and functionally escalates to the payments team.
Primary/secondary rotation
The setup is simple: the primary on-call gets all alerts. If they don't acknowledge within a timeout, the secondary gets paged.
Rotation order matters here. Many teams make last week's primary this week's secondary. The person who just finished primary duty knows what broke, what got deployed, and what's been flaky. When an alert escalates to them as secondary, they don't need a context dump. They already have the mental model.
This also helps the incoming primary. If they hit something they haven't seen before, the secondary can say "yeah, that's the thing from Thursday, here's the runbook" instead of starting from scratch.
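The "last week's primary becomes this week's secondary" rule is easy to sketch in code. This is a minimal illustration, not a real scheduling tool: the `Shift` class, the `build_rotation` function, and the engineer names are all made up for the example, and it assumes weekly shifts with at least two people in the pool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shift:
    week: int
    primary: str
    secondary: str


def build_rotation(engineers: list[str], weeks: int) -> list[Shift]:
    """Return one Shift per week; each week's secondary is the previous week's primary."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers so primary and secondary differ")
    shifts = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        # (week - 1) wraps around in week 0, so the last engineer in the
        # list covers secondary for the very first week.
        secondary = engineers[(week - 1) % len(engineers)]
        shifts.append(Shift(week, primary, secondary))
    return shifts


# Hypothetical three-person team, four weeks of shifts.
rotation = build_rotation(["ana", "ben", "chloe"], weeks=4)
for shift in rotation:
    print(f"week {shift.week}: primary={shift.primary}, secondary={shift.secondary}")
```

From week 1 onward, every secondary is the person who was primary the week before, which is the whole point: the backup already carries last week's context.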
We wrote a longer guide on how to set this up: On-Call Rotation Guide.
What a multi-tier chain looks like
Here's a typical SEV1 escalation chain:
- 0 min: Alert fires. Primary gets a Slack DM and push notification.
- 5 min: No ack. Phone call to primary.
- 10 min: Still nothing. Page the secondary via Slack and phone.
- 20 min: Neither responded. Page the engineering manager.
- 30 min: Page the VP or Incident Commander.
Notice each level adds a different channel. Slack didn't work? Phone call. Phone didn't work? Try the next person through a different channel. Single-channel escalation is how alerts go unnoticed for hours.
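One way to picture the chain above is as plain data plus a function that answers "who should have been notified by now?". This is a rough sketch under assumptions: the `Step` class, the `steps_due` function, and the channel names are hypothetical, and the generic `"page"` channel for the manager and VP tiers is a placeholder because the chain above doesn't say which channel those tiers use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    after_minutes: int          # minutes since the alert fired
    target: str                 # who gets notified
    channels: tuple[str, ...]   # how they get notified


# Mirrors the SEV1 chain listed above. Channel names are illustrative.
SEV1_CHAIN = [
    Step(0, "primary", ("slack_dm", "push")),
    Step(5, "primary", ("phone",)),
    Step(10, "secondary", ("slack", "phone")),
    Step(20, "engineering_manager", ("page",)),       # channel assumed
    Step(30, "vp_or_incident_commander", ("page",)),  # channel assumed
]


def steps_due(chain: list[Step], minutes_since_alert: float,
              acked_at: Optional[float] = None) -> list[Step]:
    """Return the steps that should have fired so far.

    A step fires once its time has passed, unless someone acknowledged
    the alert before that step was due.
    """
    return [
        step for step in chain
        if step.after_minutes <= minutes_since_alert
        and (acked_at is None or step.after_minutes < acked_at)
    ]


# Twelve minutes in with no ack: primary (twice) and secondary have been tried.
for step in steps_due(SEV1_CHAIN, minutes_since_alert=12):
    print(step.after_minutes, step.target, step.channels)
```

Keeping the chain as data makes it easy to review in one glance and to check that no two consecutive steps rely on the same single channel.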
Pick your timeouts by severity
A SEV0 and a SEV3 shouldn't use the same timeout window. That seems obvious, but a lot of teams set one timeout for everything.
- SEV0 (total outage): 5-minute timeouts. You're losing money every minute.
- SEV1 (major degradation): 10-15 minutes. Bad, but partial functionality remains.
- SEV2 (minor impact): 30 minutes. Needs attention, not a fire.
- SEV3 (low impact): No auto-escalation. Handle it during business hours.
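The severity-to-timeout mapping above could be expressed as a small lookup table. Treat this as a sketch: `ESCALATION_TIMEOUT_MINUTES` and `should_escalate` are hypothetical names, and picking 10 minutes for SEV1 is an assumption (the lower end of the 10-15 minute range).

```python
from typing import Optional

# Minutes to wait for an acknowledgement before paging the next tier.
# None means "never auto-escalate".
ESCALATION_TIMEOUT_MINUTES: dict[str, Optional[int]] = {
    "SEV0": 5,
    "SEV1": 10,   # assumed: lower bound of the 10-15 minute range
    "SEV2": 30,
    "SEV3": None,  # handled during business hours
}


def should_escalate(severity: str, minutes_unacked: float) -> bool:
    """Return True when an unacknowledged alert has outlived its timeout."""
    timeout = ESCALATION_TIMEOUT_MINUTES[severity]
    return timeout is not None and minutes_unacked >= timeout


print(should_escalate("SEV0", 6))    # past the 5-minute window
print(should_escalate("SEV2", 6))    # still inside the 30-minute window
print(should_escalate("SEV3", 600))  # SEV3 never auto-escalates
```

An unknown severity raises a `KeyError` here on purpose: a typo in a severity label should fail loudly rather than silently skip escalation.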
More on severity definitions: Incident Severity Levels.
The golden rule
Escalating is not a sign of weakness. It's good judgment. Better to wake up the boss than to let the site stay down for 4 hours while you try to figure it out alone.
Example: The Silent Pager
“The primary on-call engineer dropped their phone in the toilet. The alert fired, but they didn't answer.”
Why Escalation Policy Matters
Without an escalation policy, an alert sits there until someone notices. With one, the next person gets paged automatically.
On-call engineers sleep better when they know someone else will get called if they miss it.
If you rotate last week's primary into this week's secondary slot, the backup already knows what broke recently. Escalations go faster.