Distributed Systems Observability

Need for Observability

Applications move to cloud environments with modern concepts:

Applications are becoming distributed
Classic monitoring workflows and techniques no longer work

New failure modes:

What is Observability

[Figure: observability, testing, and monitoring]

A property of a system that provides:

“In its most complete sense, observability is a property of a system that has been designed, built, tested, deployed, operated, monitored, maintained, and evolved in acknowledgment of the following facts:

Observability Signals

Performance Considerations

Observability Considerations

Observability vs Monitoring

Observability is a superset of monitoring

Monitoring

Blackbox Monitoring: mostly detects symptoms; no insight into the internal system state or into the cause
Whitebox Monitoring: insight into the internal system state
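
The difference between the two views can be sketched with a toy service. This is only an illustration: the `Service` class, its fields, and `blackbox_probe` are hypothetical names invented for the example, not part of any real monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Service:
    """Toy service with a bounded internal queue (hypothetical example)."""
    queue_depth: int = 0
    max_queue: int = 3
    rejected: int = 0

    def handle(self) -> bool:
        # A request is rejected once the internal queue is full.
        if self.queue_depth >= self.max_queue:
            self.rejected += 1
            return False
        self.queue_depth += 1
        return True

    def internal_state(self) -> dict:
        # Whitebox view: state the service itself chooses to expose.
        return {
            "queue_depth": self.queue_depth,
            "max_queue": self.max_queue,
            "rejected": self.rejected,
        }


def blackbox_probe(service: Service) -> str:
    # Blackbox view: only the externally visible outcome of a request.
    return "up" if service.handle() else "failing"


svc = Service()
results = [blackbox_probe(svc) for _ in range(4)]
print(results)               # the probe sees the symptom only
print(svc.internal_state())  # the internal state points at the cause
```

The probe reports that the fourth request fails but not why; the exposed internal state shows the queue is full.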

Alerting
Alerting scope has shrunk:

Alerts should link to monitoring data to:

Alerts need to be actionable

Monitoring Signals for Alerting

Debugging Failures

Navigating Monitored Signals
“Dire need for higher-level abstractions (such as good visualization tooling) to make sense of the mountain of disparate data points from various sources cannot be overstated”

Exposing System Information
What data to expose to monitoring/observability?
How to examine and interpret the data?
Requires:

Coding and Testing for Observability

Coding for Failure
Systems will fail: debugging them is paramount

Operational Characteristics of the Application
Developers cannot ignore operational details, as these influence performance and failure behavior
Understanding:

Examples:

Debuggable Code
Study the pros and cons of each form of instrumentation and choose one that fits the whole picture: code, dependencies, infrastructure dependencies, etc.
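
As one possible form of instrumentation, a minimal sketch using only the Python standard library is shown here. The `instrumented` decorator, the `call_stats` counter, and `parse_port` are hypothetical names chosen for illustration; a real system would more likely rely on a dedicated metrics or tracing library.

```python
import functools
import logging
import time
from collections import Counter

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("app")
call_stats = Counter()  # hypothetical in-process metric store


def instrumented(func):
    """Wrap func so each call records a count, any failure, and its duration."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        call_stats[f"{func.__name__}.calls"] += 1
        try:
            return func(*args, **kwargs)
        except Exception:
            # Count the failure and log the traceback, then re-raise unchanged.
            call_stats[f"{func.__name__}.errors"] += 1
            log.exception("call failed: %s", func.__name__)
            raise
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            log.info("%s took %.2f ms", func.__name__, elapsed_ms)
    return wrapper


@instrumented
def parse_port(value: str) -> int:
    return int(value)


parse_port("8080")
try:
    parse_port("not-a-port")
except ValueError:
    pass
print(dict(call_stats))
```

The trade-off in this sketch is typical: the decorator keeps instrumentation out of the business logic, but it adds a small cost to every call and only captures what happens at the function boundary.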

Testing for Failure

Three Pillars of Observability

Event Logs

Pros and Cons

Logging Performance

Logging as a Stream Processing Problem

Metrics

Pros and Cons

Tracing
“A trace is a representation of a series of causally related distributed events that encode the end-to-end request flow through a distributed system”
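
To make the idea of causally related events concrete, the sketch below models a trace as a set of spans that share a trace ID and point to their parent span. The `Span` fields, `start_span`, `render`, and the service names are assumptions made for this illustration and do not follow any specific tracing library's API.

```python
import uuid
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    trace_id: str             # shared by every span in the same request
    span_id: str              # unique per unit of work
    parent_id: Optional[str]  # causal link to the caller; None for the root
    service: str
    operation: str


def start_span(service: str, operation: str, parent: Optional[Span] = None) -> Span:
    # A child inherits the trace ID and records its parent's span ID.
    trace_id = parent.trace_id if parent else uuid.uuid4().hex
    parent_id = parent.span_id if parent else None
    return Span(trace_id, uuid.uuid4().hex[:16], parent_id, service, operation)


def render(spans: List[Span], parent_id: Optional[str] = None, depth: int = 0) -> List[str]:
    # Rebuild the end-to-end request flow from the parent links.
    lines = []
    for span in spans:
        if span.parent_id == parent_id:
            lines.append("  " * depth + f"{span.service}:{span.operation}")
            lines.extend(render(spans, span.span_id, depth + 1))
    return lines


root = start_span("gateway", "GET /checkout")
auth = start_span("auth", "verify_token", parent=root)
cart = start_span("cart", "load_cart", parent=root)
db = start_span("cart-db", "SELECT items", parent=cart)
spans = [root, auth, cart, db]

tree = render(spans)
print("\n".join(tree))
```

Because every span carries the same trace ID and a parent link, the request flow can be reassembled even when the spans are emitted by different services.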

Pros and Cons

Service Meshes