ON-CALL MANAGEMENT

Root Cause Analysis

A systematic method for identifying the underlying causes of problems or incidents.

By Niketa Sharma, Founder at Runframe · Last updated Mar 2026

"Digging Deeper"

Root Cause Analysis (RCA) is the detective work of SRE. It moves past "The server crashed" to "Why did the server crash?"

The "5 Whys" Technique

Ask "Why?" five times to get to the root.

  1. Why? The database locked up.
  2. Why? It ran out of connections.
  3. Why? The new "Recommendations" service leaked connections.
  4. Why? The connection pool library was outdated.
  5. Why? (Root Cause): We don't have automated dependency scanning to catch outdated libraries.
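
The chain above can also be kept as data, which makes it easy to attach to an incident record. The following is a minimal sketch in Python; the names `ANSWERS`, `root_cause`, and `format_chain` are hypothetical and chosen only for illustration, not taken from any particular tool.

```python
# The five answers from the database incident above, in the order they were found.
ANSWERS = [
    "The database locked up.",
    "It ran out of connections.",
    'The new "Recommendations" service leaked connections.',
    "The connection pool library was outdated.",
    "We don't have automated dependency scanning to catch outdated libraries.",
]


def root_cause(answers: list[str]) -> str:
    """Return the last answer in the chain, which 5 Whys treats as the root cause."""
    if not answers:
        raise ValueError("a 5 Whys chain needs at least one answer")
    return answers[-1]


def format_chain(answers: list[str]) -> str:
    """Render the chain as a numbered list, labeling the final step."""
    lines = []
    for depth, answer in enumerate(answers, start=1):
        label = "Why? (Root Cause):" if depth == len(answers) else "Why?"
        lines.append(f"{depth}. {label} {answer}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_chain(ANSWERS))
    print("Fix to prioritize:", root_cause(ANSWERS))
```

Treating the last answer as the root cause is a simplification: the number five is a guideline, and a real chain may need fewer or more steps.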

The goal is Prevention

If you fix the connection leak (symptom) but don't fix the dependency scanning (root cause), another library will break next month.

Example: The Jefferson Memorial

“The stone of the Jefferson Memorial was eroding. Why? They washed it too often. Why? Too many birds pooped on it. Why? Birds ate spiders there. Why? Spiders ate midges (bugs). Why? Midges swarmed the lights at dusk.”

Impact
Complex chain of causality.
Resolution
Root Cause: The lights turned on 1 hour too early. Solution: Turn lights on 1 hour later. The midges (and birds) left.

Why Root Cause Analysis Matters

Treating symptoms without finding root causes guarantees the problem will recur.

Good RCA prevents incidents from happening again and improves system reliability over time.

Common Pitfalls

Stopping at Human Error
If your RCA ends with "Engineer made a typo," you failed. Ask why the system allowed a typo to take down prod.
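
One way to make this pitfall harder to fall into is a small check on the stated root cause before a postmortem is accepted. The sketch below is a crude heuristic under stated assumptions: the phrase list and the function name `stops_at_human_error` are hypothetical, and a keyword match is no substitute for a human review.

```python
import re

# Hypothetical phrases suggesting the analysis stopped at a person, not a system.
HUMAN_ERROR_PATTERNS = [
    r"\bhuman error\b",
    r"\btypo\b",
    r"\bmistake\b",
    r"\bforgot\b",
]


def stops_at_human_error(root_cause: str) -> bool:
    """Return True if the stated root cause reads like blame on an individual."""
    text = root_cause.lower()
    return any(re.search(pattern, text) for pattern in HUMAN_ERROR_PATTERNS)


if __name__ == "__main__":
    for statement in [
        "Engineer made a typo",
        "We don't have automated dependency scanning to catch outdated libraries.",
    ]:
        verdict = "keep asking why" if stops_at_human_error(statement) else "looks systemic"
        print(f"{statement!r}: {verdict}")
```

A flagged statement is a prompt to ask one more "Why?", such as why the system allowed a typo to reach prod.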

How to Use Root Cause Analysis

❓ Ask 5 Whys: Dig deep to find the real cause.
🐟 Fishbone Diagram: Visualize possible causes.
📝 Document Everything: Keep a record for future reference.
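
A fishbone diagram groups possible causes under categories that branch off the problem statement. The sketch below shows that grouping as plain text in Python, reusing causes from the database incident; the category names ("Code", "Dependencies", "Process") and the function names are assumptions made for illustration.

```python
from collections import defaultdict

PROBLEM = "The database locked up."

# (category, possible cause) pairs. The categories are hypothetical labels.
CANDIDATES = [
    ("Code", 'The "Recommendations" service leaked connections.'),
    ("Dependencies", "The connection pool library was outdated."),
    ("Process", "No automated dependency scanning."),
]


def group_causes(candidates: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group causes by category, preserving the order categories first appear."""
    grouped = defaultdict(list)
    for category, cause in candidates:
        grouped[category].append(cause)
    return dict(grouped)


def render(problem: str, grouped: dict[str, list[str]]) -> str:
    """Render the problem and its categorized causes as an indented outline."""
    lines = [f"Problem: {problem}"]
    for category, causes in grouped.items():
        lines.append(f"  {category}")
        for cause in causes:
            lines.append(f"    - {cause}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(PROBLEM, group_causes(CANDIDATES)))
```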
