“Chaos & Intuition Engineering at Netflix” by Casey Rosenthal

Control Plane at Netflix

Covers user administration, authentication, and DRM negotiation (but not the streaming itself)
Entirely in the cloud, in several regions

Focus of Optimization: Performance, Fault Tolerance and Availability

A less experienced team chooses one to the detriment of the others
A more experienced team balances all three

Microservice Architecture

Great for feature velocity
A microservice often depends on another microservice, which itself depends on another, and so on
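
To make the dependency chain concrete, here is a minimal sketch that walks a dependency graph and collects everything a service ends up relying on. The service names and the graph are hypothetical, made up for illustration; they are not Netflix's actual topology.

```python
from collections import deque

# Hypothetical dependency graph: service -> services it calls directly.
DEPENDENCIES = {
    "api-gateway": ["auth", "recommendations"],
    "auth": ["user-store"],
    "recommendations": ["user-store", "catalog"],
    "user-store": [],
    "catalog": [],
}

def transitive_dependencies(service, graph):
    """Return every service reachable from `service` by following dependency edges."""
    seen = set()
    queue = deque(graph.get(service, []))
    while queue:
        dep = queue.popleft()
        if dep in seen:
            continue  # already visited via another path
        seen.add(dep)
        queue.extend(graph.get(dep, []))
    return seen

print(sorted(transitive_dependencies("api-gateway", DEPENDENCIES)))
```

Even in this tiny graph, the gateway depends on four services although it calls only two directly.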

Emergence of Undesirable System-Level Behavior

Interactions between a system’s components can make the whole system behave poorly even if each component behaves reasonably in isolation.

Imagined Example of a Positive Feedback Loop

User erroneously spams ‘refresh’ on a page
Requests are buffered due to the network
Requests all hit the same server at the same time
- same server because they’re coming from the same user
Due to (erroneous) rate of requests, server fetches (possibly-stale) data from its cache instead of making network requests
- Network access wouldn’t keep up with the rate of (erroneous) requests
- Business logic dictates stale data is better than no data e.g. recommended movies, favorites list
Because requests are served from cache, CPU load goes down
The scaling service notices the lower CPU load and kills a few servers
Remaining servers have to deal with even more load
In turn, more of the remaining servers fall back on fetching (possibly stale) data from cache
At this point, more users are noticing stale data
Some users will start refreshing, exacerbating the issue
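
The loop above can be sketched as a toy simulation. Everything numeric here is an assumption chosen for illustration (the capacity, the 20% CPU cost of a cache hit, the 50% scale-in threshold, the 10% refresh increase), and all servers are treated as identical, which the scenario itself does not require.

```python
def simulate(steps=6, servers=10, requests=1200.0,
             capacity=100.0, refresh_boost=1.1, min_servers=1):
    """Toy model: overload -> cache fallback -> low CPU -> scale-in -> more overload."""
    history = []
    for _ in range(steps):
        per_server = requests / servers
        overloaded = per_server > capacity  # overloaded servers answer from cache
        # Assumption: a cache hit costs 20% of the CPU of a full request.
        cpu = (0.2 if overloaded else 1.0) * min(per_server / capacity, 1.0)
        history.append((servers, round(per_server, 1), overloaded, round(cpu, 2)))
        if cpu < 0.5 and servers > min_servers:
            servers -= 1  # autoscaler sees low CPU and removes a server
        if overloaded:
            requests *= refresh_boost  # users see stale data and refresh more
    return history

for row in simulate():
    print(row)  # (servers, requests per server, overloaded, cpu)
```

With the defaults, every step is overloaded, the server count drops by one per step, and the load per remaining server keeps climbing. Starting from `requests=600.0` instead, the servers are never overloaded and the fleet size stays put.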

[Without more details, this scenario sounds a bit silly; sounds like the scaling service messed up or the cache-only policy messed up]

Chaos Monkey and Chaos Kong

Chaos Monkey

Randomly turns off a prod server during employee working hours
Failures therefore surface while engineers are at work and able to notice and fix them
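
A minimal sketch of the selection logic, not the actual Chaos Monkey implementation. The working-hours window (weekdays, 9:00 to 17:00) and the function names are assumptions, and the real termination call is left out because it depends on the cloud provider.

```python
import random
from datetime import datetime

def in_working_hours(now, start_hour=9, end_hour=17):
    """True on Monday-Friday between start_hour (inclusive) and end_hour (exclusive)."""
    return now.weekday() < 5 and start_hour <= now.hour < end_hour

def pick_victim(instances, now, rng=random):
    """Return one instance to terminate, or None outside working hours."""
    if not instances or not in_working_hours(now):
        return None
    return rng.choice(instances)

# Hypothetical instance ids; a Wednesday morning is inside the window.
print(pick_victim(["i-aaa", "i-bbb", "i-ccc"], datetime(2024, 1, 3, 10, 0)))
```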

Chaos Kong

Turns off the servers of an entire, randomly chosen region

Chaos Engineering (Principles of Chaos)

“Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production”

Build a hypothesis about steady-state behavior
Vary real-world events
Experiment in production
- this cannot be replaced by synthetic tests
Automate experiments to run continuously
Minimize blast radius
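
One way to picture these principles together is a sketch that compares a steady-state metric between a control group and a small experiment group. The success-rate metric, the 5% cap on exposed traffic, and all names below are assumptions for illustration, not a description of Netflix's tooling.

```python
from dataclasses import dataclass

@dataclass
class Experiment:
    name: str
    traffic_fraction: float  # blast radius: share of traffic exposed to the fault
    tolerance: float         # allowed drop in the steady-state metric

def evaluate(experiment, control_success_rate, experiment_success_rate):
    """Check the steady-state hypothesis: the experiment group stays close to control."""
    if not 0 < experiment.traffic_fraction <= 0.05:
        raise ValueError("blast radius too large for this sketch (max 5%)")
    drop = control_success_rate - experiment_success_rate
    if drop <= experiment.tolerance:
        return "hypothesis holds"
    return "abort and investigate"

exp = Experiment("kill-one-cache-node", traffic_fraction=0.01, tolerance=0.005)
print(evaluate(exp, control_success_rate=0.999, experiment_success_rate=0.998))
```

Running such a check on a schedule, rather than once, is what the "automate" principle asks for.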

Intuition Engineering

Idea that large complex systems need to be understood at some intuitive level

Example: Visualization Tool (Vizceral)

Tool that displays incoming requests, their destination and latency
Takes advantage of human ability for visual pattern recognition
Gives an intuitive sense of normality
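
As a rough idea of the data behind such a view, the sketch below groups request records by source and destination and computes volume and mean latency per edge. The record shape and names are hypothetical; this is not Vizceral's input format.

```python
from collections import defaultdict

def summarize(requests):
    """Group (source, destination, latency_ms) records by edge."""
    totals = defaultdict(lambda: [0, 0.0])  # edge -> [count, latency sum]
    for source, destination, latency_ms in requests:
        entry = totals[(source, destination)]
        entry[0] += 1
        entry[1] += latency_ms
    return {
        edge: {"count": count, "mean_latency_ms": latency_sum / count}
        for edge, (count, latency_sum) in totals.items()
    }

sample = [
    ("internet", "us-east-1", 100.0),
    ("internet", "us-east-1", 200.0),
    ("internet", "eu-west-1", 50.0),
]
print(summarize(sample))
```

A visualization would then draw each edge with a thickness or particle density proportional to its count, so an unusual pattern stands out at a glance.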