Testing in Production, the safe way

[The author mentions a wealth of tools, papers to read, and examples throughout]

Testing in Staging

Can’t rely solely on staging. Testing only in a staging environment:

Staging Too Different From Prod

Staging can’t imitate production well enough.

Testing in Production

Testing in prod is:

3 phases, as a general concept

Phase 1: Deploy

Deploy phase means:

Deploy phase properties:

Integration Testing
Still useful to:

A service (under test) receives test requests from the testing framework.
We test that service against prod components, to which it makes its own requests, e.g. reads/writes to the DB.

Stateful components receiving requests from our service-under-test:

Strategies:

Strategies with Service Mesh:

Shadowing, Dark Traffic Testing, Mirroring
Capture (or mirror) prod traffic, replay it against a deployed service.
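
A minimal sketch of mirroring, assuming plain in-process callables stand in for the prod service and the newly deployed (shadow) service. The names `mirror`, `prod_handler`, `shadow_handler` and `shadow_errors` are hypothetical, not taken from the article or from any tool. The caller only ever sees the prod response; the shadow's response is dropped and its failures are recorded rather than propagated.

```python
from typing import Callable, Dict, List

Request = Dict[str, str]
Handler = Callable[[Request], str]


def mirror(request: Request, prod_handler: Handler, shadow_handler: Handler,
           shadow_errors: List[str]) -> str:
    """Serve the request from prod and replay a copy against the shadow."""
    response = prod_handler(request)
    try:
        # The shadow gets its own copy so it cannot mutate the prod request.
        shadow_handler(dict(request))
    except Exception as exc:
        # Shadow failures must never reach the user; keep them for inspection.
        shadow_errors.append(repr(exc))
    return response
```

In a real setup the copy would usually be sent asynchronously (or by the proxy / service mesh) so the shadow cannot add latency to the prod path; the synchronous call here only keeps the sketch short.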

Limitations:

Tap Compare
Same as shadowing, but compare the response from the existing prod service with the one from the deployed service.
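
A rough sketch of the comparison step, assuming responses are flat dictionaries and that some fields (timestamps, request IDs) are expected to differ. `tap_compare` and `ignore_fields` are hypothetical names chosen for illustration.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Response = Dict[str, Any]
Handler = Callable[[Dict[str, Any]], Response]

_MISSING = object()  # distinguishes "field absent" from "field is None"


def tap_compare(request: Dict[str, Any], prod_handler: Handler,
                candidate_handler: Handler,
                ignore_fields: Iterable[str] = ()) -> Tuple[Response, List[str]]:
    """Return the prod response and the sorted list of fields that differ."""
    ignored = set(ignore_fields)
    prod_resp = prod_handler(request)
    cand_resp = candidate_handler(request)
    diffs = []
    for key in sorted(set(prod_resp) | set(cand_resp)):
        if key in ignored:
            continue
        if prod_resp.get(key, _MISSING) != cand_resp.get(key, _MISSING):
            diffs.append(key)
    return prod_resp, diffs
```

An empty diff list for a large sample of real requests is the signal that the deployed service behaves like the existing one; the ignore list is where most of the tuning effort tends to go.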

Load Testing
Use a load-generation tool and appropriate monitoring to see how much load the service can support.
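
To make the idea concrete, here is a toy load generator using only the standard library. It is a sketch, not a substitute for a dedicated tool; `run_load` and its parameters are hypothetical, and `call` stands for whatever issues one request to the service.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Tuple


def run_load(call: Callable[[], object], total_requests: int,
             concurrency: int) -> Dict[str, float]:
    """Issue total_requests calls with a fixed-size thread pool and time them."""
    if total_requests < 1 or concurrency < 1:
        raise ValueError("total_requests and concurrency must be >= 1")

    def timed_call(_: int) -> Tuple[bool, float]:
        start = time.perf_counter()
        try:
            call()
            ok = True
        except Exception:
            ok = False  # count the failure, keep the run going
        return ok, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(total_requests)))

    latencies = sorted(duration for _, duration in results)
    return {
        "requests": total_requests,
        "errors": sum(1 for ok, _ in results if not ok),
        "p50_s": latencies[len(latencies) // 2],
        "max_s": latencies[-1],
    }
```

Raising `concurrency` step by step while watching the error count and latency on the service's own dashboards is what turns this into a capacity estimate.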

Config Tests

Phase 2: Release

Release (or rollout) phase means:

Release phase properties:

Canarying

  1. Promote part of the cluster to the new version.
  2. Observe metrics on the new population.
  3. Roll back if things go awry.
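
The three steps above can be sketched as a decision function that compares the canary population with the rest of the cluster. This assumes error rate is the metric being watched; the `Metrics` type, `canary_verdict`, and the `min_requests` / `tolerance` thresholds are hypothetical and would need tuning per service.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(baseline: Metrics, canary: Metrics,
                   min_requests: int = 100, tolerance: float = 0.01) -> str:
    """Return 'wait', 'rollback' or 'promote' for the canary population."""
    if canary.requests < min_requests:
        return "wait"  # not enough traffic yet to judge
    if canary.error_rate > baseline.error_rate + tolerance:
        return "rollback"
    return "promote"
```

For example, with a baseline of 10 errors in 1000 requests, a canary showing 10 errors in 200 requests is rolled back, while one showing 2 errors in 200 is promoted.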

Canarying has its issues

Monitoring
Challenge: identify which signals are important (3-10 signals). Examples:

Exception Tracking
Monitor requests that caused exceptions, helps in debugging.
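
A minimal sketch of the idea, assuming requests are dictionaries and that an in-memory list stands in for a real exception-tracking backend. `track_exceptions` and `sink` are hypothetical names.

```python
import traceback
from typing import Any, Callable, Dict, List

Request = Dict[str, Any]


def track_exceptions(handler: Callable[[Request], Any],
                     sink: List[Dict[str, Any]]) -> Callable[[Request], Any]:
    """Wrap a handler so the offending request is recorded with each failure."""

    def wrapped(request: Request) -> Any:
        try:
            return handler(request)
        except Exception as exc:
            sink.append({
                "request": dict(request),  # snapshot of what triggered it
                "error": type(exc).__name__,
                "message": str(exc),
                "traceback": traceback.format_exc(),
            })
            raise  # tracking must not change the handler's behaviour

    return wrapped
```

Keeping the request next to the stack trace is what makes the failure reproducible later; in practice sensitive fields would need scrubbing before being stored.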

Traffic Shaping
Slowly divert more and more traffic to the canary services.
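
One way to picture this is a weighted router plus a ramp schedule. The sketch below assumes a geometric ramp (each step multiplies the canary's share by a fixed factor); `route` and `ramp_schedule` are hypothetical helpers, and in practice the weights usually live in a load balancer or service mesh config rather than in application code.

```python
import random
from typing import List


def route(canary_weight: float, rng: random.Random) -> str:
    """Pick a backend; canary_weight is the fraction of traffic for the canary."""
    return "canary" if rng.random() < canary_weight else "stable"


def ramp_schedule(start: float, factor: float, cap: float = 1.0) -> List[float]:
    """Weights for successive rollout steps, growing by `factor` up to `cap`."""
    if start <= 0 or factor <= 1:
        raise ValueError("start must be > 0 and factor must be > 1")
    weights = []
    weight = start
    while weight < cap:
        weights.append(weight)
        weight *= factor
    weights.append(cap)  # final step: all traffic on the new version
    return weights
```

Starting at 1% and doubling each step gives eight steps before the canary takes all traffic; each step is only taken if the metrics from the previous one look healthy.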

Phase 3: Post-Release

Post-release means:

Feature Flagging or Dark Launch

A/B Testing
For tuning

Logs/Events, Metrics and Tracing
The “three pillars of Observability”

Profiling

Teeing
Like Shadowing but saving the traffic for debugging.

Chaos Engineering
Deliberately cause faults in the system to assess its reaction

e.g. introduce latency, kill nodes, send fuzzed data
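
A small sketch of fault injection at the handler level, covering the latency and failure cases mentioned above. `with_chaos`, `InjectedFault`, `failure_rate` and `max_delay_s` are hypothetical names; the `sleep` parameter is injectable only so the wrapper can be exercised without actually waiting.

```python
import random
import time
from typing import Any, Callable


class InjectedFault(Exception):
    """Raised on purpose by the chaos wrapper, never by the real handler."""


def with_chaos(handler: Callable[[Any], Any], rng: random.Random,
               failure_rate: float = 0.0, max_delay_s: float = 0.0,
               sleep: Callable[[float], None] = time.sleep) -> Callable[[Any], Any]:
    """Wrap a handler so calls are randomly delayed and randomly fail."""

    def wrapped(request: Any) -> Any:
        if max_delay_s > 0:
            sleep(rng.uniform(0, max_delay_s))  # injected latency
        if rng.random() < failure_rate:
            raise InjectedFault("injected by chaos wrapper")
        return handler(request)

    return wrapped
```

The interesting part of the experiment is not the wrapper but what happens around it: whether callers time out, retry, or fall back gracefully when the wrapped dependency misbehaves.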