“Queueing Theory in Practice: Performance Modeling for the Working Engineer” by Eben Freeman

Queueing Theory helps us understand our systems:

Models:

Modeling Serial systems

System in question: Honeycomb ingestion API

How do we allocate appropriate resources for a service (e.g. on AWS) while keeping costs minimal?

Small Experiment

Find the max request throughput for a single core (with a queue):

Experiment Data

Everything is smooth until it’s not:

performance_modeling_working_engineer_data.png

What’s the model that describes this?

Step 1: identify the question

Step 2: identify the assumptions

Step 3: model the system
At 6:13, he shows that, given these assumptions, the following curve describes throughput vs. wait time:

performance_modeling_working_engineer_model.png

Model fits the curve well:

performance_modeling_working_engineer_fit.png
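The notes don't reproduce the formula itself, but the curve shown is the classic single-server queue result; a minimal sketch, assuming an M/M/1 model (Poisson arrivals, exponentially distributed service times):

```python
# Mean time in system for an M/M/1 queue: W = S / (1 - lambda * S),
# valid only while utilization rho = lambda * S < 1.
def mean_wait(service_time: float, throughput: float) -> float:
    rho = throughput * service_time  # server utilization
    if rho >= 1:
        raise ValueError("unstable: throughput exceeds capacity 1/S")
    return service_time / (1 - rho)

# Wait explodes as throughput approaches the single-core capacity 1/S,
# which is why "everything is smooth until it's not".
for lam in [0.5, 0.9, 0.99]:          # requests/sec, with S = 1 sec
    print(lam, round(mean_wait(1.0, lam), 2))
```

The hockey-stick shape comes from the `1 / (1 - rho)` term: wait is nearly flat at low utilization, then blows up near saturation.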

Improving service time has non-linear effects. The model tells us that halving service time while doubling throughput still results in a faster system. (S is service time, λ is throughput, W is wait.)

performance_modeling_working_engineer_service_throughput.png
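Assuming the same M/M/1 wait formula as above, this non-linearity is easy to check numerically: halving S while doubling λ keeps utilization ρ = λS constant, yet the wait still halves, because W scales with S at fixed ρ:

```python
def mean_wait(S, lam):
    # M/M/1 mean time in system; requires rho = lam * S < 1
    return S / (1 - lam * S)

before = mean_wait(1.0, 0.5)   # S = 1s, lambda = 0.5 req/s
after  = mean_wait(0.5, 1.0)   # halve S, double lambda: same rho = 0.5
print(before, after)           # the faster server wins at double the load
```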

Variability is bad
Variability in request arrival times and in request processing times is what causes queueing
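A toy single-server simulation (my sketch, not from the talk) makes this concrete: at the same average utilization, perfectly regular arrivals and service produce no waiting at all, while exponentially variable ones produce real queueing delay:

```python
import random

# Single-server queue: same average load (rho = 0.5), different variability.
def avg_wait(interarrival, service, n=100_000, seed=1):
    rng = random.Random(seed)
    clock = free_at = total_wait = 0.0
    for _ in range(n):
        clock += interarrival(rng)        # next request arrives
        start = max(clock, free_at)       # wait only if the server is busy
        total_wait += start - clock
        free_at = start + service(rng)
    return total_wait / n

# Deterministic (D/D/1): arrivals every 2s, service 1s -> zero waiting.
det = avg_wait(lambda r: 2.0, lambda r: 1.0)
# Exponential (M/M/1): same means, but variability creates a queue.
exp = avg_wait(lambda r: r.expovariate(0.5), lambda r: r.expovariate(1.0))
print(det, exp)
```

Both runs push the same average load through the server; only the variability differs, and it alone accounts for the waiting time.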

What we can do:

Modeling Parallel Systems

With multiple machines, we now need to decide how to dispatch incoming requests.

Optimal Assignment: Coordination is Expensive

Dispatching work requires a coordinator, e.g. a load balancer

performance_modeling_working_engineer_coordinator.png

The coordinator needs to check the load on each candidate server, which gets expensive as the fleet grows

High Parallelism and Coordination

Paper to read: [The Power of Two Choices in Randomized Load Balancing](https://www.eecs.harvard.edu/~michaelm/postscripts/tpds2001.pdf)
Systems to read about: Kraken, Scuba, Sparrow

Approximate Optimal Assignment
Instead of querying all N servers, query 2 chosen at random and pick the less-loaded of the pair.
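A balls-into-bins sketch of why this works (a static illustration, not the paper's full dynamic queueing model): comparing purely random assignment against best-of-two-random-choices, the maximum per-server load drops sharply with just the second probe:

```python
import random

# Assign requests to servers, probing `choices` random servers per request
# and picking the least-loaded probe. choices=1 is pure random assignment.
def max_load(n_servers=100, n_requests=100_000, choices=1, seed=42):
    rng = random.Random(seed)
    load = [0] * n_servers
    for _ in range(n_requests):
        picks = rng.sample(range(n_servers), choices)
        best = min(picks, key=lambda i: load[i])
        load[best] += 1
    return max(load)   # worst-case load across the fleet

print(max_load(choices=1))  # random: noticeably above the 1000/server average
print(max_load(choices=2))  # best of two: close to perfectly balanced
```

Only 2 probes per request are needed, so the coordination cost stays constant no matter how many servers there are.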

Iterative Partitioning