# Cookbook: Service Dependency & Failure Analysis
A complete walkthrough from service logs to cascade failure detection.
This cookbook demonstrates how to model microservice dependencies as a temporal graph to detect failure cascades, identify bottlenecks, and understand how incidents propagate through your infrastructure.
## The Challenge
Modern distributed systems fail in complex, cascading patterns:
- Database timeout at 2:00 AM
- API gateway buffers fill at 2:02 AM
- User-facing service errors at 2:05 AM
- Load balancer health checks fail at 2:08 AM
Static monitoring sees four separate alerts. Temporal graph analysis sees one incident with a root cause.
What we'll analyze:
- Service dependency mapping
- Failure cascade reconstruction
- Latency degradation trends
- Critical path identification
## The Data Model
## Step 1: Generate Service Mesh Data
We'll create synthetic distributed tracing data representing a microservice architecture.
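A minimal, library-agnostic sketch of this step using only the Python standard library. The service names, record schema, call rates, and failure-onset minutes are illustrative assumptions; the injected cascade mirrors the 2 AM timeline described above.

```python
import random
from datetime import datetime, timedelta

random.seed(42)

# Illustrative topology: caller -> callee edges we emit traces for.
EDGES = [
    ("load-balancer", "api-gateway"),
    ("api-gateway", "order-service"),
    ("api-gateway", "user-service"),
    ("order-service", "order-db"),
    ("user-service", "user-db"),
]

def generate_traces(start, minutes=15, calls_per_minute=20):
    """Synthesize per-call trace records with an injected cascade:
    order-db begins erroring at t+2min, order-service at t+4min,
    api-gateway at t+7min."""
    failure_start = {"order-db": 2, "order-service": 4, "api-gateway": 7}
    traces = []
    for minute in range(minutes):
        for src, dst in EDGES:
            failing = minute >= failure_start.get(dst, minutes + 1)
            for _ in range(calls_per_minute):
                traces.append({
                    "timestamp": start + timedelta(minutes=minute,
                                                   seconds=random.uniform(0, 60)),
                    "source": src,
                    "target": dst,
                    # Failing services answer slowly and mostly with errors.
                    "duration_ms": random.gauss(250, 60) if failing
                                   else random.gauss(40, 10),
                    "status": "error" if failing and random.random() < 0.8
                              else "ok",
                })
    return traces

traces = generate_traces(datetime(2024, 3, 1, 2, 0))
print(f"{len(traces)} trace records across {len(EDGES)} edges")
```

Each record mimics one span of a distributed trace: who called whom, when, how long it took, and whether it failed.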
## Step 2: Build the Service Dependency Graph
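Continuing the sketch (same illustrative record schema as Step 1), one simple temporal-graph representation is an edge list keyed by `(source, target)` with time-ordered call observations:

```python
from collections import defaultdict
from datetime import datetime

# Minimal stand-in for the trace records generated in Step 1.
traces = [
    {"timestamp": datetime(2024, 3, 1, 2, 2), "source": "order-service",
     "target": "order-db", "duration_ms": 310.0, "status": "error"},
    {"timestamp": datetime(2024, 3, 1, 2, 0), "source": "order-service",
     "target": "order-db", "duration_ms": 41.0, "status": "ok"},
    {"timestamp": datetime(2024, 3, 1, 2, 0), "source": "api-gateway",
     "target": "order-service", "duration_ms": 38.0, "status": "ok"},
]

def build_temporal_graph(traces):
    """Index call records as a temporal edge list:
    (source, target) -> observations sorted by timestamp."""
    graph = defaultdict(list)
    for t in traces:
        graph[(t["source"], t["target"])].append(
            (t["timestamp"], t["duration_ms"], t["status"]))
    for observations in graph.values():
        observations.sort(key=lambda obs: obs[0])
    return graph

graph = build_temporal_graph(traces)
print(sorted(graph))  # the edges actually observed at runtime
```

Keeping every observation, rather than collapsing edges to a single weight, is what lets the later steps ask time-aware questions.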
## Step 3: Map Active Dependencies
Identify which services actually call which others at runtime, not just the dependencies listed in documentation.
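A sketch against the temporal edge list from Step 2 (edge names and window are illustrative): call an edge "active" if it carried at least one call inside the analysis window. This surfaces both runtime edges that diagrams omit and documented edges that nothing calls anymore.

```python
from datetime import datetime

# Temporal edge list (Step 2's shape): (source, target) -> [(ts, ms, status), ...]
graph = {
    ("api-gateway", "order-service"): [
        (datetime(2024, 3, 1, 2, 0), 38.0, "ok"),
        (datetime(2024, 3, 1, 2, 5), 44.0, "ok"),
    ],
    # Documented in the architecture diagram, but no calls this month:
    ("api-gateway", "legacy-billing"): [
        (datetime(2024, 2, 1, 9, 0), 51.0, "ok"),
    ],
}

def active_dependencies(graph, since):
    """Edges that carried at least one call at or after `since`:
    the runtime dependency map rather than the documented one."""
    return {edge for edge, obs in graph.items()
            if any(ts >= since for ts, _, _ in obs)}

print(active_dependencies(graph, since=datetime(2024, 3, 1)))
# → {('api-gateway', 'order-service')}
```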
## Step 4: Detect the Failure Cascade
Find the sequence of failures that propagated through the system.
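One way to reconstruct the cascade, assuming the record schema sketched earlier: take the earliest error per service and sort chronologically. The timestamps below are hand-picked to match the incident this cookbook describes.

```python
from datetime import datetime

# Error-status records from the 2 AM incident (illustrative timestamps).
errors = [
    {"timestamp": datetime(2024, 3, 1, 2, 2), "target": "order-service"},
    {"timestamp": datetime(2024, 3, 1, 2, 0), "target": "order-db"},
    {"timestamp": datetime(2024, 3, 1, 2, 5), "target": "api-gateway"},
    {"timestamp": datetime(2024, 3, 1, 2, 3), "target": "order-db"},
]

def cascade_timeline(error_records):
    """Earliest observed error per service, sorted chronologically.
    The first entry is the root-cause candidate."""
    first = {}
    for rec in error_records:
        svc = rec["target"]
        if svc not in first or rec["timestamp"] < first[svc]:
            first[svc] = rec["timestamp"]
    return sorted(first.items(), key=lambda kv: kv[1])

for svc, ts in cascade_timeline(errors):
    print(f"{ts:%H:%M}  {svc}")  # order-db surfaces first
```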
**Root Cause Identified:** The cascade shows order-db failed first, causing order-service to fail two minutes later, which in turn exhausted resources in api-gateway and degraded every downstream service.
## Step 5: Analyze Latency Degradation
Detect gradual performance problems before they become outages.
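A simple degradation detector, sketched with illustrative numbers: fit a least-squares line to per-minute mean latency on an edge and alert on a sustained positive slope, well before error rates spike.

```python
from statistics import mean

# Per-minute mean latency for order-db before the outage:
# (minute index, mean duration_ms) -- illustrative samples.
samples = [(0, 40.0), (1, 55.0), (2, 70.0), (3, 85.0), (4, 100.0)]

def latency_slope(samples):
    """Least-squares slope in ms/minute. A sustained positive slope
    is a degradation signal that precedes the hard failure."""
    xs = [m for m, _ in samples]
    ys = [d for _, d in samples]
    xbar, ybar = mean(xs), mean(ys)
    num = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    den = sum((x - xbar) ** 2 for x in xs)
    return num / den

print(latency_slope(samples))  # → 15.0
```

Here latency climbs 15 ms every minute while every call still returns `ok`, which is exactly the window in which an alert is actionable.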
## Step 6: Identify Critical Paths
Find which services are most critical to overall system health.
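One criticality ranking, along the lines of the PageRank Centrality reference in Next Steps: plain power-iteration PageRank over the call graph. Because edges point caller → callee, rank accumulates on the most depended-upon services. The edge list and damping factor here are illustrative.

```python
from collections import defaultdict

# Caller -> callee edges from the active dependency map (Step 3).
EDGES = [
    ("load-balancer", "api-gateway"),
    ("api-gateway", "order-service"),
    ("api-gateway", "user-service"),
    ("order-service", "order-db"),
    ("user-service", "user-db"),
]

def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank. Rank flows along caller -> callee edges,
    so heavily depended-upon services (like databases) rank highest."""
    nodes = {n for edge in edges for n in edge}
    out = defaultdict(list)
    for src, dst in edges:
        out[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        nxt = {n: (1 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            if out[n]:
                share = damping * rank[n] / len(out[n])
                for m in out[n]:
                    nxt[m] += share
            else:  # dangling node: redistribute its rank uniformly
                for m in nodes:
                    nxt[m] += damping * rank[n] / len(nodes)
        rank = nxt
    return rank

ranks = pagerank(EDGES)
for svc, r in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{svc:15s} {r:.3f}")
```

Services at the top of this ranking are the ones whose failure the cascade analysis in Step 4 would trace everything else back to.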
## Summary
This cookbook demonstrated a complete service dependency analysis pipeline:
| Step | What We Did |
|---|---|
| 1. Load Data | Ingested distributed tracing / service mesh logs |
| 2. Build Graph | Temporal graph of service-to-service calls |
| 3. Map Dependencies | Active call patterns (not just config) |
| 4. Detect Cascade | Traced failure propagation timeline |
| 5. Latency Trends | Identified degradation before failure |
| 6. Critical Paths | Ranked services by system criticality |
Key temporal insights:
- Cascade timeline: See exactly how failures propagate minute-by-minute
- Gradual degradation: Latency increases before the outage
- Dynamic dependencies: Runtime calls differ from architecture diagrams
## Next Steps
- Platform Engineer Tutorial – Deploy monitoring at scale
- PageRank Centrality – Criticality ranking
- Temporal Windows – Point-in-time analysis