Dynamo: Amazon’s Highly Available Key-Value Store

Amazon Services DB Needs
Dynamo presents the idea of a non-relational database tailored to the needs of Amazon services. Amazon observed that the traditional relational databases present the wrong set of features for too high a price.
Many Amazon services need to frequently store and retrieve small amounts data associated with an identity; they have no need for a complex schema or of many of the features a relational database offers, they simply want a key-value store.
Furthermore, the availability of the key-value store is more important to these services than its consistency guarantees, which departs again from traditional database expectations.
Lastly scalability, also of high priority, cannot be easily achieved with traditional relational databases. This is due to the guarantees and large feature sets that relational databases possess but that do not fit the needs of Amazon’s services.

Using traditional databases to fit these needs would be convoluted (and so prone to errors), expensive to operate and would probably still not meet the availability and scalability demands.

The authors of the paper explore the importance of Service Level Agreements (SLA) that Dynamo intends to meet. Indeed, Amazon has stringent performance expectations of their services and, consequently, of the underlying storage.
It seems that all other distributed key-value stores have chosen tradeoffs that either made attaining the SLAs challenging or encumbered the key-value store with features that add unneeded complexity to the system.
For example, Dynamo would be running in a trusted environment, which means that systems considering authorization, authentication or Byzantine failures would be doing too much. Dynamo would not need the (relatively) complex storage schema offered by Google’s BigTable nor would it need to ensure complex data reconciliation when consistency eventually becomes an issue.

We can say that Dynamo’s goals are specialized: key-value store with high performance, high availability and scalability. To attain them, the designers were willing to make tradeoffs with consistency, forego many database features and shift more responsibility to the users of the store.

Usage
Dynamo has a simple interface with put() and get() operations.
The value is treated as a blob, and the key is hashed to create a 128-bit ID.

The get() operation is a read operation that returns data but also a context.
The context carries the version info (as vector clocks) of the data being retrieved. When using put() to update the value of a given key, put() expects the context the user was given from the last get().

Data Reconciliation
The version info in the context, carried by the user between gets and puts, is used to allow for syntactical reconciliation; essentially, if a causal relationship can be inferred between data stored on different nodes, the system can decide which data is truly canonical and discard the other.
This syntactical reconciliation happens on a get operation.

The most interesting responsibility shift is the more complex semantic reconciliation: if Dynamo cannot use the context/version info to perform syntactical reconciliation, it asks the user to do semantic reconciliation.
Thus, when performing a get, a user may receive several versions of the same data; it’s up to the user to reconcile the conflicting data versions. This makes sense, given the business logic would know best how to best reconcile imperfectly consistent data in the least ‘damaging’ way.
The example given for semantic reconciliation is a shopping cart, where a get operation returns two shopping carts.
The user’s semantic reconciliation strategy can simply be to add missing items from one cart to the other; this strategy makes sense for shopping cart reconciliation but may not make sense for other business cases.

Dynamo prefers perform to attempt data reconciliation on read operations, making the system write friendly.

Partitioning
Data is partitioned and replicated among the cluster nodes using consistent hashing in a ring formation, as explained in detail in the Chord paper.
A read or write operation for a particular key is handled by a coordinator. For write operations, the coordinator is one node from a set of nodes designated as “responsible” for a particular key (the coordinators are derived from hashes of the nodes’ identity and a hash of the key). For read operations, any node can be a coordinator.
On a read or write request, the coordinator talks to other nodes “responsible” for the key, either reading from several of them (in case one of the nodes has a more recent version) or asking them to write the data for the key, and only reporting back success to the caller once enough nodes have confirmed to the coordinator that they have written the data locally.

If a node, not “responsible” for a key, receives a write request, it will not act as the coordinator and instead forward the request to one of the healthy “responsible” nodes for that key; one of those will be the coordinator for that write request.

High Availability
Dynamo achieves high availability by making all nodes symmetrical (e.g. no special master node, all nodes are coordinators for some set of keys), diminishing points of failure.
It also attempts to replicate data across diverse hosts located in different data centers, again diminishing points of failure.
While each key(-value pair) is assigned to certain set of “responsible” nodes, it’s possible that one of these preferred nodes are unavailable. In such a case, other less preferable nodes handle a write operation, writing the data locally, on behalf of the missing preferred node. They then periodically check whether the preferred nodes have become available again so as to transfer to them the data they missed while unavailable (hinted handoff).

Tuning
An interesting choice is to allow the user to configure the N, R and W parameters, where
N: total number of nodes that should store copies of the data
W: minimum number of nodes that report a successful write before a put() operation is considered completed
R: minimum number of nodes that return data for a given key before the result is returned to the user

For W, the coordinator still tries to eventually write the data to all N responsible nodes for a given key, but for performance reasons, the user might be happy to know only a few nodes (only W nodes) have written the data and that, eventually, later, all N nodes will write the data.

For R, the coordinator queries all N nodes for data and is satisfied to return the data it’s gotten so far when it’s received R responses. This is when different versions of the same data is returned to the user, if the system isn’t able to perform syntactic reconciliation.

This means that the user can evaluate and decide on the importance of availability, performance and durability. Setting W to 1 will return rapidly but increases the chances data loss if, for example, the single node fails catastrophically after confirming the write but before the data is replicated to other nodes. Similarly, setting R to 1 also returns rapidly but increases the change of getting stale data.

Common (N, R, W) values are (3,2,2)

Adding/Removing Nodes
Dynamo does not automatically add/remove nodes from the cluster: adding/removing requires human intervention.
Nodes use a gossip protocol to communicate membership changes as well as communicate which key(-value pairs) a particular node is responsible for.

High Performance Tricks

Writing buffer: a write is considered successful once it reaches main memory (and is scheduled to be written to disk). The coordinator does asks one of the N nodes to perform a durable on-disk write before that node considers the write successful. This durable write does not affect performance because generally W < N so that W nodes will quickly report the write complete to the coordinator while the last node will churn along actually writing to disk; the coordinator doesn’t need to wait after that last node before reporting the put operation as being completed to the user.
Uniform load distribution: Dynamo uses different partitioning strategies to ensure uniform load across all nodes, at the cost of more coordination when adding/removing nodes.
Writing to the fastest reader: the context token returned from a get operation also records which node had the fastest response time for that operation. This fastest node can then be used as the write coordinator for any upcoming put operation. Hopefully, the node will once more be the fastest to be reached for the put operation and might also be read from for any subsequent get operations (thus, the user will read their latest write).
Cluster-aware clients: If the Dynamo clients are aware of the nodes in the cluster, aware of the data partition/key distribution among the nodes, and regularly update their view of the system, requests don’t need to go through a load-balancer. Indeed, client requests can directly go to the appropriate coordinator for reads or writes, skipping at least once hop.
Heterogeneous hardware: the nodes can take on requests in proportion to their hardware capabilities; not requiring homogeneous hardware means hardware resources aren’t wasted if more powerful machines are added to the cluster, and updates of cluster nodes can be performed progressively.
Zero-hop distributed hash table: (unlike Chord) each node maintains routing information to reach any other node directly; more data to maintain but smaller latency as less hops are needed.
Foreground operations priority: foreground tasks (put/get operations) take precedence over background tasks (updating membership with neighbors, hinted handoff). A monitoring process on each node throttles background tasks based on resource utilization on the machine.