“A Frontend Server, Front to Back” by Zach Tellman
Story of a front-end server aka a gateway server, a routing server
For Fitbit, needed because a monolith needed to be broken up [i.e. the way microservices need routing]
Fitbit Frontend Server Responsibilities
- Route traffic
- Validate traffic
- Shape traffic
4 Important Properties (in order of importance-ish)
- Transparent
- extensive, quality metrics
- easy to reason about internal workings
- Stable
- not crash
- robust to malicious requests
- robust to high traffic (e.g. shed traffic)
- Fast
- minimal overhead
- predictable performance
- mitigate slow backend
- Extensible
- small, easy to understand, easy to change, adapt
- yet change should make it hard to violate other 3 properties
- i.e. “indicate how it’s meant to be changed”
System Design: Looking at the Extremes
Consider:
- What extremes should the system handle e.g. peak req/sec
- What should the system do when going above/under the extremes
Frontend Uses Async: Issues
- Faster, more throughput
- Less transparent
- no immediately obvious stack traces
- no easy-to-reason about single thread of execution
- Less extensible
- callbacks everywhere, less obvious for newcomers to reason about
Adding Transparency: State Machine & Passports
- Possible code paths are represented on a FSM-like diagram
- Code stamps a passport-object that maps to the diagram
Requests in Queues and Handling: Latency, Throughput, Resource Usage
Somewhat conflicting possible optimizations: - resource usage - make sure no resource is idle, min resources - latency per request - all requests get served immediately - throughput - volume but not necessarily low latency though
Load Balancing Techniques
Goal: - Minimize latency - Distribute work - Deal with workers that are unpredictably slow
- Round Robin:
- simple, predictable
- slow workers gets as many requests as fast workers
- amplifies slowdowns
- Least In-Flight:
- keeps track of queued requests per worker
- least busy gets next request
- more complex, less transparent, more overhead
- great: slow worker gets few requests
- bad: fast worker can just mean rapid erroring out
- amplifies requests erroring out as all traffic is sent to erroring out worker
- keeps track of queued requests per worker
- Least In-Correct-Flight
- also keep track of worker errors
- don’t entirely starve erroring servers
Important Metrics
"You improve what you measure"
Or
"Everything you ignore get worse"
- p999 Overhead
- introduced by frontend server e.g. GC, computing routing
- request rate
- error rate
Lessons Learned
- Articulate goals and prioritize
- Model extremes of the system, decide on behavior
- Choose metrics