
Partitioning in databases Part 3

This post is a continuation of Database Partitioning - 2 and focuses mainly on strategies for partition rebalancing and request routing.

Strategies for re-balancing

  • Hash Mod N

    When partitioning by the hash of a key, we saw that it’s best to divide the possible hashes into ranges and assign each range to a partition (e.g., assign a key to partition 0 if 0 <= hash(key) < b0, to partition 1 if b0 <= hash(key) < b1, and so on). A common thought is: why not just use the mod (%) operator? For example, hash(key) mod 10 returns a number between 0 and 9 (if we write the hash as a decimal number, hash mod 10 is its last digit). If we have 10 nodes, numbered 0 to 9, this seems like an easy way of assigning each key to a node.

    But the problem with the mod N approach is that if the number of nodes N changes, most of the keys will need to be moved from one node to another. For example, say hash(key) = 123456. If we initially have 10 nodes, this key starts out on node 6 (because 123456 mod 10 = 6). When we grow to 11 nodes, the key needs to move to node 3 (123456 mod 11 = 3), and when we grow to 12 nodes, it needs to move to node 0 (123456 mod 12 = 0). Such frequent moves make rebalancing an excessively costly operation; a short sketch after this list shows just how many keys move.

  • Fixed Number of Partitions

    We could instead create many more partitions than there are nodes and assign several partitions to each node. For example, a database running on a 10-node cluster may be split into 1,000 partitions from the outset, so that approximately 100 partitions are assigned to each node. Now, if a node is added to the cluster, the new node can steal a few partitions from every existing node until partitions are fairly distributed once again. This process is shown below, and a code sketch after this list illustrates it.

    [Figure: Fixed partitioning]

    If a node is removed from the cluster, the same happens in reverse. Only entire partitions are moved between nodes: the number of partitions does not change, nor does the assignment of keys to partitions. The only thing that changes is the assignment of partitions to nodes. This change of assignment is not immediate, since it takes some time to transfer a large amount of data over the network, so the old assignment of partitions is used for any reads and writes that happen while the transfer is in progress. In some cases we can also account for mismatched hardware in the cluster: by assigning more partitions to nodes that are more powerful, we can make those nodes take a greater share of the load. This approach is used in Riak, Elasticsearch, and Couchbase.

    In this configuration, the number of partitions is usually fixed when the database is first set up and not changed afterward. Although in principle it’s possible to split and merge partitions, a fixed number of partitions is operationally simpler, and so many fixed-partition databases choose not to implement partition splitting.

    Choosing the right number of partitions is difficult if the total size of the dataset is highly variable (for example, if it starts small but may grow much larger over time). Since each partition contains a fixed fraction of the total data, the size of each partition grows proportionally to the total amount of data in the cluster. If partitions are very large, rebalancing and recovery from node failures become expensive; but if partitions are too small, they incur too much maintenance overhead, so it’s counterproductive to choose a very high number.

  • Dynamic partitioning

    For databases that use key-range partitioning, a fixed number of partitions with fixed boundaries would be very inconvenient: if we get the boundaries wrong, we could end up with all the data in one partition and all the other partitions empty. Reconfiguring the partition boundaries manually is a tedious task.

    Instead, we can configure a size threshold at which a partition is split: when a partition grows to exceed this threshold, it is split into two partitions so that approximately half of the data ends up on each side of the split. Conversely, if lots of data is deleted and a partition shrinks below some threshold, it can be merged with an adjacent partition. This is similar to what happens at the top level of a B-tree; a sketch after this list shows this splitting logic in code.

    Each partition is assigned to one node, and each node can handle multiple partitions, as in the fixed-number-of-partitions case. After a large partition has been split, one of its two halves can be transferred to another node in order to balance the load. The main advantage here is that the number of partitions adapts to the total data volume.

    If there is only a small amount of data, a small number of partitions is sufficient, so the overhead is small; if there is a huge volume of data, the size of each individual partition is limited to a configurable maximum.

    However, there is a caveat to this approach: an empty database starts off with a single partition, since there is no prior information about where to draw the partition boundaries. Until the dataset grows to the point at which the first partition is split, all writes have to be processed by a single node while the others sit idle.

  • Partitioning proportionally to nodes

    With dynamic partitioning, the number of partitions is proportional to the size of the dataset, since the splitting and merging process keeps the size of each partition between some fixed minimum and maximum. With a fixed number of partitions, on the other hand, the size of each partition is proportional to the size of the dataset. In both cases, the number of partitions is independent of the number of nodes.

    Another option is to make the number of partitions proportional to the number of nodes, in other words, to have a fixed number of partitions per node. In this case, the size of each partition grows proportionally to the dataset size while the number of nodes remains unchanged, but when we increase the number of nodes, the partitions become smaller again. Since a larger data volume generally requires a larger number of nodes to store it, this approach also keeps the size of each partition fairly stable.

    When a new node joins the cluster, it randomly chooses a fixed number of existing partitions to split, and then takes ownership of one half of each of those split partitions while leaving the other half of each partition in place. The randomization can produce unfair splits, but when averaged over a large number of partitions, the new node ends up taking a fair share of the load from the existing nodes. The last sketch after this list shows the mechanism.
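To make the mod N problem concrete, here is a small Python sketch (the key format and hash choice are arbitrary, not tied to any particular database) that counts how many keys change nodes when a 10-node cluster grows to 11 nodes:

```python
import hashlib

def node_for(key: str, n_nodes: int) -> int:
    # Use a stable hash; Python's built-in hash() is randomized per process.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n_nodes

keys = [f"user:{i}" for i in range(100_000)]
moved = sum(1 for k in keys if node_for(k, 10) != node_for(k, 11))
print(f"{moved / len(keys):.0%} of keys change nodes")  # typically around 90%
```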
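The fixed-number-of-partitions scheme avoids that churn because keys map to a partition once and for all; only the partition-to-node assignment changes. A hypothetical sketch, with made-up node names and partition count:

```python
import hashlib

NUM_PARTITIONS = 1000  # fixed when the database is first set up, never changed

def partition_for(key: str) -> int:
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % NUM_PARTITIONS  # safe here, since NUM_PARTITIONS never changes

# Initial cluster of 10 nodes, roughly 100 partitions each.
assignment = {p: f"node-{p % 10}" for p in range(NUM_PARTITIONS)}

def add_node(assignment, new_node, num_nodes_after):
    """Steal whole partitions from existing nodes until the new node has a fair share."""
    target = NUM_PARTITIONS // num_nodes_after
    stolen = 0
    for p in list(assignment):
        if stolen == target:
            break
        if assignment[p] != new_node:
            assignment[p] = new_node  # only the partition's owner changes,
            stolen += 1               # never the keys inside it

add_node(assignment, "node-10", 11)
moved = sum(1 for p in assignment if assignment[p] == "node-10")
print(f"{moved} of {NUM_PARTITIONS} partitions reassigned")  # roughly 1/11 of them
```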
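Dynamic partitioning is, at its core, split-on-threshold. The toy in-memory model below (the threshold, key format, and data structures are invented for illustration) splits a key-range partition at its median key once it grows too large:

```python
from dataclasses import dataclass, field

SPLIT_THRESHOLD = 4  # keys per partition; real systems use a size such as ~10 GB

@dataclass
class Partition:
    lo: str                 # inclusive lower key bound ("" means unbounded)
    hi: str                 # exclusive upper key bound ("" means unbounded)
    keys: list = field(default_factory=list)

partitions = [Partition(lo="", hi="")]  # an empty database: a single partition

def insert(key: str):
    part = next(p for p in partitions
                if p.lo <= key and (p.hi == "" or key < p.hi))
    part.keys.append(key)
    if len(part.keys) > SPLIT_THRESHOLD:
        split(part)

def split(part: Partition):
    part.keys.sort()
    mid = part.keys[len(part.keys) // 2]      # the median key becomes the new boundary
    right = Partition(lo=mid, hi=part.hi,
                      keys=[k for k in part.keys if k >= mid])
    part.hi, part.keys = mid, [k for k in part.keys if k < mid]
    partitions.append(right)                  # one half could now move to another node

for k in ["ant", "bee", "cat", "dog", "emu", "fox", "gnu"]:
    insert(k)
print([(p.lo, p.hi, len(p.keys)) for p in partitions])
```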
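Finally, a rough sketch of node-proportional partitioning: every node owns a fixed number of partitions of the hash space, and a joining node splits randomly chosen partitions and takes one half of each. The numbers and layout here are purely illustrative:

```python
import random
from collections import Counter

PARTITIONS_PER_NODE = 4   # illustrative; real systems use far more per node
HASH_SPACE = 2 ** 32

# Each partition is (lo, hi, owner); start with two nodes owning equal slices.
step = HASH_SPACE // (2 * PARTITIONS_PER_NODE)
cluster = [(i * step, (i + 1) * step, f"node-{i % 2}")
           for i in range(2 * PARTITIONS_PER_NODE)]

def join(cluster, new_node):
    # The joining node splits PARTITIONS_PER_NODE randomly chosen partitions
    # and takes one half of each; the old owner keeps the other half.
    for idx in random.sample(range(len(cluster)), PARTITIONS_PER_NODE):
        lo, hi, owner = cluster[idx]
        mid = (lo + hi) // 2
        cluster[idx] = (lo, mid, owner)
        cluster.append((mid, hi, new_node))

join(cluster, "node-2")
print(Counter(owner for _, _, owner in cluster))  # every node owns exactly 4 partitions
# The halves the new node takes vary in size, but with enough partitions per
# node the load it absorbs averages out across the existing nodes.
```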

Request Routing

Now that we’ve seen how to partition our dataset, there’s another open question: when a client makes a request, how does it know which node to connect to? As partitions are rebalanced, the assignment of partitions to nodes changes. Somebody needs to stay on top of those changes in order to answer the question: if I want to read or write the key foo, which IP address and port number (i.e., which process) do I need to connect to?

This question is an instance of a more general problem called service discovery, which isn’t limited to databases: any software that is accessible over a network has this problem, especially if it aims for high availability.

There are a few approaches to solving this problem:

  • Allow clients to contact any node (e.g., via a round-robin load balancer). If that node coincidentally owns the partition to which the request applies, it can handle the request directly; otherwise, it forwards the request to the appropriate node, receives the reply, and passes the reply along to the client.
  • Send all requests from clients to a routing tier first, which determines the node that should handle each request and forwards it accordingly. This routing tier does not itself handle any requests; it only acts as a partition-aware load balancer.
  • Require that clients be aware of the partitioning and of the assignment of partitions to nodes. In this case, a client can connect directly to the appropriate node, without any intermediary.
[Figure: Different ways of request routing]
In all of the above approaches, the key question is: how does the component making the routing decision (which may be one of the nodes, the routing tier, or the client) learn about changes in the assignment of partitions to nodes?

This is a challenging problem, because it is important that all participants agree; otherwise requests would be sent to the wrong nodes and not handled correctly. There are protocols for achieving consensus in a distributed system, but they are hard to implement correctly (challenge accepted :P).
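Setting aside for a moment how the mapping is kept in sync, here is a minimal, hypothetical sketch of the routing-tier approach (the second option above): the tier holds a partition-to-node map and forwards each request to whichever node owns the key’s partition. The partition function and map layout are made up for illustration:

```python
import hashlib

NUM_PARTITIONS = 1000

# partition id -> (ip, port) of the owning node; illustrative addresses only
assignment = {p: (f"10.0.0.{p % 10}", 6000) for p in range(NUM_PARTITIONS)}

def partition_for(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_PARTITIONS

def route(key: str):
    """Return the (ip, port) that a request for `key` should be forwarded to."""
    return assignment[partition_for(key)]

print(route("foo"))  # the node currently responsible for foo's partition
```

The hard part, of course, is keeping that assignment map up to date, which is where a coordination service comes in.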

Many distributed data systems rely on a separate coordination service such as ZooKeeper to keep track of this cluster metadata, as shown below.

[Figure: ZooKeeper keeping track of the assignment of partitions to nodes]

Each node registers itself in ZooKeeper, and ZooKeeper maintains the authoritative mapping of partitions to nodes. Other actors, such as the routing tier or the partitioning-aware client, can subscribe to this information in ZooKeeper. Whenever a partition changes ownership, or a node is added or removed, ZooKeeper notifies the routing tier so that it can keep its routing information up to date.

Examples: Kafka, SolrCloud, and HBase all use ZooKeeper to track partition assignment.
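As a rough sketch of what such a subscription might look like, the snippet below assumes the third-party kazoo ZooKeeper client and a made-up /partitions znode layout (one child znode per partition, whose data names the owning node); real systems differ in the details:

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()
zk.ensure_path("/partitions")

assignment = {}  # partition id -> "host:port" of the owning node

@zk.ChildrenWatch("/partitions")
def refresh(children):
    # Invoked once at registration and again whenever a partition znode
    # is added or removed.
    for child in children:
        data, _stat = zk.get(f"/partitions/{child}")
        assignment[child] = data.decode()
```

A real routing tier would also watch each child znode for data changes (ownership handoffs) and handle session loss; this only shows the shape of the subscription.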

Kumar D

Software Developer. Tech Enthusiast. Loves coding 💻 and music 🎼.
