Dapper, a Large-Scale Distributed Systems Tracing Infrastructure

Google’s Needs for Distributed System Tracing

Services depend on other services in complex ways, and a service’s overall performance can be dictated by its dependencies (including dependencies of dependencies).

Diagnosing latency or performance issues is complicated:
- multitenancy creates intermittent problems
- dependencies may change, and service internals are not obvious

Dapper: Ubiquitous Monitoring

Monitoring needs to be ubiquitous because:

Objectives

Instrumentation

Trace, Span, Tree
dapper_span.PNG
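
A minimal sketch of how the spans of one trace can be linked into a tree, assuming each span records its own id and the id of its parent (a root span has no parent). The class name, field names, helper functions, and example span names are illustrative assumptions, not Dapper's actual data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    trace_id: int              # shared by every span in the same trace
    span_id: int               # identifies this span within the trace
    parent_id: Optional[int]   # None marks the root span (assumed convention)
    name: str
    children: List["Span"] = field(default_factory=list)


def build_tree(spans: List[Span]) -> Span:
    """Link the spans of one trace via parent_id and return the root."""
    by_id: Dict[int, Span] = {s.span_id: s for s in spans}
    root = None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            by_id[s.parent_id].children.append(s)
    if root is None:
        raise ValueError("trace has no root span")
    return root


def depth(span: Span) -> int:
    """Number of levels in the tree rooted at this span."""
    return 1 + max((depth(c) for c in span.children), default=0)


# Hypothetical trace: a frontend request fanning out to two backends,
# one of which calls a storage service.
spans = [
    Span(1, 10, None, "frontend.Request"),
    Span(1, 11, 10, "backendA.Call"),
    Span(1, 12, 10, "backendB.Call"),
    Span(1, 13, 12, "storage.Read"),
]
root = build_tree(spans)
```

Because the parent link is stored on each span, the tree can be rebuilt after the fact from spans collected independently on different hosts.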

Trace Collection

Traces are written to local disk, collected from hosts, and stored in Bigtable
Collection is out-of-band so that it does not affect production network traffic
Latency, from first logging to presence in Bigtable:

Security

Tracing Overhead

Runtime costs:

Measured overhead:
dapper_perf.PNG

Collection Overhead

Sampling
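
A minimal sketch of trace-level sampling, assuming the decision is derived from the trace id so that every span of a trace gets the same keep/drop answer. The function name `should_sample`, the use of SHA-256, and the rate parameter are illustrative assumptions, not the mechanism described in the paper.

```python
import hashlib


def should_sample(trace_id: int, rate: float) -> bool:
    """Return True if the trace should be kept at the given sampling rate.

    The trace id is hashed to a value z in [0, 1); the trace is kept when
    z < rate. Hashing (an assumption for this sketch) makes the decision
    deterministic, so all hosts agree without coordinating.
    """
    digest = hashlib.sha256(str(trace_id).encode()).digest()
    z = int.from_bytes(digest[:8], "big") / 2**64
    return z < rate


# Roughly `rate` of all traces are kept.
kept = sum(should_sample(t, 0.1) for t in range(10_000))
```

Lowering `rate` reduces tracing overhead at the cost of fewer recorded traces; an adaptive scheme would adjust `rate` over time instead of fixing it.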

Adaptive Sampling

Uses