Dapper, a Large-Scale Distributed Systems Tracing Infrastructure

Google’s Needs for Distributed System Tracing

Services have complex dependencies on other services, and a service’s dependencies can dictate that service’s overall performance (including dependencies of dependencies).

Looking for latency or performance issues is complicated:
- multitenancy creates intermittent problems
- dependencies may change, service internals not obvious

Dapper: Ubiquitous Monitoring

Monitoring needs to be ubiquitous because:

performance issues can occur in unexpected or edge parts of the system
because some issues cannot be easily reproduced

Objectives

low overhead
- marginal overhead so that there’s no incentive to turn it off
- small codebase ~2k LOC
- uses sampling
application-level transparency
- developers should not be aware of the system
- need for continued developer care to maintain the system makes it fragile
scalability
trace results quickly available
- faster reaction to production anomalies

Instrumentation

Trace, Span, Tree

All Google services use common core libraries for:
- communication (RPC, async callbacks)
- threading, thread pools
Dapper is integrated in this common core to trace inter-component communication and intra-component work
- some applications still use Dapper-unsupported communication e.r. raw sockets, SOAP
Dapper allows developers to add annotations (textual, key-value pairs)
Annotation sizes can be capped

Trace Collection

Traces are written to disk, collected from hosts and stored in BigTable
Collection is out-of-bound to not affect network in production
Latency, from first logging to presence in Big Table:

Median: 15 seconds
98th percentile:
- 75% of time, <2 minutes
- 25% hours, hours

Security

RPC/callback payloads are not automatically logged; opt-in only
Dapper monitors communication endpoints for incorrect encryption, or unauthorized inter-component communication

Tracing Overhead

Runtime costs:

span creation ~200 ns
skipped annotation ~9 ns
average annotation ~4 0ns

Measured overhead:

Collection Overhead

Low CPU use < 1%
Low network usage < 0.01%
Average 426 bytes / span

Sampling

sampling of 1/16 is barely measurable
yet sampling of 1/1024 is enough data for meaningful data

Adaptive Sampling

sampling per unit of time, not per request volume
- e.g. #traces/second instead of #traces/requests
i.e. low traffic apps will increase the #traces/requests, high traffic will decrease #traces/request
an error in high-throughput services will occur thousands of times (and be caught by a trace)
an error in low-throughput services will occur more rarely so more aggressive tracing is needed

Uses

Integration with exception handling
Inferring dependencies
API used for both tools and human inspection