# Cookbook: Service Dependency & Failure Analysis
A complete walkthrough from service logs to cascade failure detection.
This cookbook demonstrates how to model microservice dependencies as a temporal graph to detect failure cascades, identify bottlenecks, and understand how incidents propagate through your infrastructure.
## The Challenge
Modern distributed systems fail in complex, cascading patterns:
- Database timeout at 2:00 AM
- API gateway buffers fill at 2:02 AM
- User-facing service errors at 2:05 AM
- Load balancer health checks fail at 2:08 AM
Static monitoring sees four separate alerts. Temporal graph analysis sees one incident with a root cause.
What we'll analyze:
- Service dependency mapping
- Failure cascade reconstruction
- Latency degradation trends
- Critical path identification
## The Data Model
## Step 1: Generate Service Mesh Data
We'll create synthetic distributed tracing data representing a microservice architecture.
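A minimal, library-agnostic sketch of this step using only the Python standard library. The service names, record schema, call rates, and failure-onset minutes are illustrative assumptions; the injected cascade mirrors the 2 AM timeline described above.

```python
import random
from datetime import datetime, timedelta

random.seed(42)

# Illustrative topology: caller -> callee edges we emit traces for.
EDGES = [
    ("load-balancer", "api-gateway"),
    ("api-gateway", "order-service"),
    ("api-gateway", "user-service"),
    ("order-service", "order-db"),
    ("user-service", "user-db"),
]

def generate_traces(start, minutes=15, calls_per_minute=20):
    """Synthesize per-call trace records with an injected cascade:
    order-db begins erroring at t+2min, order-service at t+4min,
    api-gateway at t+7min."""
    failure_start = {"order-db": 2, "order-service": 4, "api-gateway": 7}
    traces = []
    for minute in range(minutes):
        for src, dst in EDGES:
            failing = minute >= failure_start.get(dst, minutes + 1)
            for _ in range(calls_per_minute):
                traces.append({
                    "timestamp": start + timedelta(minutes=minute,
                                                   seconds=random.uniform(0, 60)),
                    "source": src,
                    "target": dst,
                    # Failing services answer slowly and mostly with errors.
                    "duration_ms": random.gauss(250, 60) if failing
                                   else random.gauss(40, 10),
                    "status": "error" if failing and random.random() < 0.8
                              else "ok",
                })
    return traces

traces = generate_traces(datetime(2024, 3, 1, 2, 0))
print(f"{len(traces)} trace records across {len(EDGES)} edges")
```

Each record mimics one span of a distributed trace: who called whom, when, how long it took, and whether it failed.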
## Step 2: Build the Service Dependency Graph
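Continuing the sketch (same illustrative record schema as Step 1), one simple temporal-graph representation is an edge list keyed by `(source, target)` with time-ordered call observations:

```python
from collections import defaultdict
from datetime import datetime

# Minimal stand-in for the trace records generated in Step 1.
traces = [
    {"timestamp": datetime(2024, 3, 1, 2, 2), "source": "order-service",
     "target": "order-db", "duration_ms": 310.0, "status": "error"},
    {"timestamp": datetime(2024, 3, 1, 2, 0), "source": "order-service",
     "target": "order-db", "duration_ms": 41.0, "status": "ok"},
    {"timestamp": datetime(2024, 3, 1, 2, 0), "source": "api-gateway",
     "target": "order-service", "duration_ms": 38.0, "status": "ok"},
]

def build_temporal_graph(traces):
    """Index call records as a temporal edge list:
    (source, target) -> observations sorted by timestamp."""
    graph = defaultdict(list)
    for t in traces:
        graph[(t["source"], t["target"])].append(
            (t["timestamp"], t["duration_ms"], t["status"]))
    for observations in graph.values():
        observations.sort(key=lambda obs: obs[0])
    return graph

graph = build_temporal_graph(traces)
print(sorted(graph))  # the edges actually observed at runtime
```

Keeping every observation, rather than collapsing edges to a single weight, is what lets the later steps ask time-aware questions.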
## Step 3: Map Active Dependencies
Identify which services actually call which others at runtime, not just the dependencies listed in documentation.
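A sketch against the temporal edge list from Step 2 (edge names and window are illustrative): call an edge "active" if it carried at least one call inside the analysis window. This surfaces both runtime edges that diagrams omit and documented edges that nothing calls anymore.

```python
from datetime import datetime

# Temporal edge list (Step 2's shape): (source, target) -> [(ts, ms, status), ...]
graph = {
    ("api-gateway", "order-service"): [
        (datetime(2024, 3, 1, 2, 0), 38.0, "ok"),
        (datetime(2024, 3, 1, 2, 5), 44.0, "ok"),
    ],
    # Documented in the architecture diagram, but no calls this month:
    ("api-gateway", "legacy-billing"): [
        (datetime(2024, 2, 1, 9, 0), 51.0, "ok"),
    ],
}

def active_dependencies(graph, since):
    """Edges that carried at least one call at or after `since`:
    the runtime dependency map rather than the documented one."""
    return {edge for edge, obs in graph.items()
            if any(ts >= since for ts, _, _ in obs)}

print(active_dependencies(graph, since=datetime(2024, 3, 1)))
# → {('api-gateway', 'order-service')}
```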
## Step 4: Detect the Failure Cascade
Find the sequence of failures that propagated through the system.
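One way to reconstruct the cascade, assuming the record schema sketched earlier: take the earliest error per service and sort chronologically. The timestamps below are hand-picked to match the incident this cookbook describes.

```python
from datetime import datetime

# Error-status records from the 2 AM incident (illustrative timestamps).
errors = [
    {"timestamp": datetime(2024, 3, 1, 2, 2), "target": "order-service"},
    {"timestamp": datetime(2024, 3, 1, 2, 0), "target": "order-db"},
    {"timestamp": datetime(2024, 3, 1, 2, 5), "target": "api-gateway"},
    {"timestamp": datetime(2024, 3, 1, 2, 3), "target": "order-db"},
]

def cascade_timeline(error_records):
    """Earliest observed error per service, sorted chronologically.
    The first entry is the root-cause candidate."""
    first = {}
    for rec in error_records:
        svc = rec["target"]
        if svc not in first or rec["timestamp"] < first[svc]:
            first[svc] = rec["timestamp"]
    return sorted(first.items(), key=lambda kv: kv[1])

for svc, ts in cascade_timeline(errors):
    print(f"{ts:%H:%M}  {svc}")  # order-db surfaces first
```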
**Root Cause Identified:** The cascade shows order-db failed first, causing order-service to fail two minutes later, which in turn exhausted resources in api-gateway and degraded every downstream service.
## Step 5: Analyze Latency Degradation
Detect gradual performance problems before they become outages.
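A simple degradation detector, sketched with illustrative numbers: fit a least-squares line to per-minute mean latency on an edge and alert on a sustained positive slope, well before error rates spike.

```python
from statistics import mean

# Per-minute mean latency for order-db before the outage:
# (minute index, mean duration_ms) -- illustrative samples.
samples = [(0, 40.0), (1, 55.0), (2, 70.0), (3, 85.0), (4, 100.0)]

def latency_slope(samples):
    """Least-squares slope in ms/minute. A sustained positive slope
    is a degradation signal that precedes the hard failure."""
    xs = [m for m, _ in samples]
    ys = [d for _, d in samples]
    xbar, ybar = mean(xs), mean(ys)
    num = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    den = sum((x - xbar) ** 2 for x in xs)
    return num / den

print(latency_slope(samples))  # → 15.0
```

Here latency climbs 15 ms every minute while every call still returns `ok`, which is exactly the window in which an alert is actionable.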
## Step 6: Identify Critical Paths
Find which services are most critical to overall system health.
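One criticality ranking, along the lines of the PageRank Centrality reference in Next Steps: plain power-iteration PageRank over the call graph. Because edges point caller → callee, rank accumulates on the most depended-upon services. The edge list and damping factor here are illustrative.

```python
from collections import defaultdict

# Caller -> callee edges from the active dependency map (Step 3).
EDGES = [
    ("load-balancer", "api-gateway"),
    ("api-gateway", "order-service"),
    ("api-gateway", "user-service"),
    ("order-service", "order-db"),
    ("user-service", "user-db"),
]

def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank. Rank flows along caller -> callee edges,
    so heavily depended-upon services (like databases) rank highest."""
    nodes = {n for edge in edges for n in edge}
    out = defaultdict(list)
    for src, dst in edges:
        out[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        nxt = {n: (1 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            if out[n]:
                share = damping * rank[n] / len(out[n])
                for m in out[n]:
                    nxt[m] += share
            else:  # dangling node: redistribute its rank uniformly
                for m in nodes:
                    nxt[m] += damping * rank[n] / len(nodes)
        rank = nxt
    return rank

ranks = pagerank(EDGES)
for svc, r in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{svc:15s} {r:.3f}")
```

Services at the top of this ranking are the ones whose failure the cascade analysis in Step 4 would trace everything else back to.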
## Summary
This cookbook demonstrated a complete service dependency analysis pipeline:
| Step | What We Did |
|---|---|
| 1. Load Data | Ingested distributed tracing / service mesh logs |
| 2. Build Graph | Temporal graph of service-to-service calls |
| 3. Map Dependencies | Active call patterns (not just config) |
| 4. Detect Cascade | Traced failure propagation timeline |
| 5. Latency Trends | Identified degradation before failure |
| 6. Critical Paths | Ranked services by system criticality |
Key temporal insights:
- Cascade timeline: See exactly how failures propagate minute-by-minute
- Gradual degradation: Latency increases before the outage
- Dynamic dependencies: Runtime calls differ from architecture diagrams
## Next Steps
- Platform Engineer Tutorial – Deploy monitoring at scale
- PageRank Centrality – Criticality ranking
- Temporal Windows – Point-in-time analysis