“Designing for Performance” by Martin Thompson
What is Performance?
Many aspects:
- throughput
- service time (time for processing an item)
- latency, response time (service time + wait time in a queue)
- scalability (touches on performance)
These aspects are not independent e.g. increasing throughput can cause latency to increase (which seemed fine at low throughput)
Queueing Theory
System utilization vs response time is not linear:
- We can reduce utilization if we reduce service time
- We need to make sure there’s capacity so as to not reach above that 70% too often
Parallel Speedup
Parallel can help but don’t forget Amdahl’s law; how much of a request can be processed in parallel?
- 50% parallelizable task: max 2x speedup
- 95% parallelizable task: max 20x speedup
Universal Scalability Law
Amdahl didn’t take into account the sync overhead of parallel tasks (sharing common world view): the coherence penalty.
Coherence penalty increases as parallelism increases (e.g. more CPU, threads, machines).
At some point, the coherence penalty eats away at a good portion of the benefit.
AWS experiment with a normal 150 microsecond coherence penalty:
Penalties of Logging Frameworks
They usually cannot do any work in parallel.
Clean & Representative
Anecdotally, clean code tends to also be performant
Abstractions: Create Them When Certain of the Benefits
- Big frameworks generally don’t deliver as much as they cost in terms of performance and understanding the code
- Abstractions have a cost, they need to make sense and pay off, not just be convenient at the moment [pay off in sense of maintainable code?]
- Avoid building abstractions too early: find at least 3 things that would fit the abstraction
- Too-early abstractions can be bad abstractions: they can force new code to adhere to an abstraction that doesn’t fit the problem
Leaky Abstractions
- Abstractions should not be leaky but instead:
- “… the purpose of abstracting is not to be vague, but to create a new semantic level in which one can be absolutely precise (Dijkstra)
Memory System Abstraction
Hardware perf:
- accessing main memory: 65-100ns
- accessing L1 cache: ~1ns
Hardware caching is counting on software behavior:
- temporal: code/data will be reused in near future (e.g. loops)
- spatial: things that work together will be near each other
- e.g. fields in a class are cohesive and therefore used in computations alongside each other; we’ll cache the entire object so as to increase cache hits
- patterns of access: pre-fetch data based on access pattern assumptions
Performance Ideas
Coupling & Cohesion
- bad cohesion: feature envy
- one class accesses members of another: maybe those member should be moved
- fields should be where they are used
- allows spatial, temporal optimization
- anecdotally, 30% performance increase
Relationships
- Use the correct data structures for the correct behavior
- Lists and maps are not enough
Batching
Amortize the expensive costs using batches:
- when doing something expensive, batch things to amortize
- e.g. writing logs to disk: allow a batcher to write several at once instead of one disk access per log line
Branches
- Branches hurt performance, even speculative execution can’t save us
- Some unneeded branches are there because the code isn’t clean
Loops
- x86/x64 instructions are decoded into smaller micro-ops in the CPU
- Decoding instructions is (relatively) expensive
- Best to stay inside loops whose decoded form can fit inside the L0 cache (micro-ops cache)
- i.e. keep loops small, elegant, few branches
Composition: Size Matters
- Prefer smaller, focused things
- Building things with single responsibility: easy to reuse in other contexts
- More likely to inline
APIs
- Give the caller the choice: don’t impose a specific pattern
- The caller can find optimizations that suit her
Data Organization
- Consider replacing array of objects (spread all over the heap) with a collection of arrays
- Each array representing a field
- Allows vectorization, memory optimizations
- Embrace Set Theory
- Embrace Functional Programming
Performance Testing
- Needs numeric targets, not just “faster”
- e.g. specific throughput
- e.g. specific latency at specific throughput
- Know the limits of the software
- e.g. what throughput makes latency unacceptable
Measuring Response Time
- Averages are not sufficient
- Histograms, quantile, percentile: much better
Measure the Right Things
- Bad idea: measuring the service time but not wait time
- Coordinated omission
Benchmarking
- CPU performance counters
- exposes cache misses, branch misses
- Run benchmarks as part of continuous integration
Measuring Running System
- Telemetry should be built into production system
- Counters of: queue lengths, concurrent users, exceptions, transactions, etc.
- Histograms of: response times, service times, queue lengths, concurrent users, etc.
- See: Aeron system
“It does not matter how intelligent you are, if you guess and that guess cannot be backed by experimental evidence - then it is still a guess” - Feynman