Partitioning in databases part 1
Generally, replication is what gives a database its availability. But replication alone is not always enough: some production systems have a very large dataset or a very high query throughput. Read throughput can be improved by directing reads to replicas that are closer to the user initiating the request, but that alone doesn’t scale if the dataset itself is too large for a single machine. This is where partitioning (sharding) of the data comes into the picture.
The term partition is overloaded; different vendors use different names for the same concept:
1. Shard - MongoDB, Elasticsearch, Solr
2. Region - HBase
3. Tablet - Bigtable
4. vnode - Cassandra, Riak
5. vbucket - Couchbase
Usually partitions are defined so that each piece of data (record, row, or document) belongs to exactly one partition. Each partition is effectively a small database of its own, even though the database may need to touch several partitions at once for some operations. The main reason for partitioning is scalability: different partitions can be placed on different nodes in a shared-nothing cluster, so a large dataset can be distributed across many disks and the query load across many processors.
For queries that operate on a single partition, each node can independently execute the queries for its own partition, so query throughput can be scaled by adding more nodes. Large, complex queries can potentially be parallelized across many nodes.
Partitioning is usually combined with replication, so that copies of each partition are stored on multiple nodes. This means that even though each record belongs to exactly one partition, it may still be stored on several different nodes for fault tolerance. A node may also store more than one partition. If leader-follower replication is used, the combination of partitioning and replication can look like this:
In the above scenario, each node acts as the leader for some partitions and as a follower for others. Everything discussed about replication of databases applies equally to replication of partitions. If you haven’t read my posts on replication, please do.
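As a rough illustration of that layout, the assignment of partitions to nodes might look something like the sketch below. The partition and node names, and the cluster size, are made up for illustration; real systems manage this assignment themselves.

```python
# Hypothetical assignment of partitions across a 4-node cluster with
# leader-follower replication: each node leads some partitions and
# follows others.
assignment = {
    "partition_1": {"leader": "node_1", "followers": ["node_2", "node_3"]},
    "partition_2": {"leader": "node_2", "followers": ["node_3", "node_4"]},
    "partition_3": {"leader": "node_3", "followers": ["node_4", "node_1"]},
    "partition_4": {"leader": "node_4", "followers": ["node_1", "node_2"]},
}

def nodes_for(partition: str) -> list[str]:
    """All nodes holding a copy of the partition, leader first."""
    entry = assignment[partition]
    return [entry["leader"], *entry["followers"]]

print(nodes_for("partition_2"))  # ['node_2', 'node_3', 'node_4']
```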
Say we have a large amount of data and we want to partition it: how do we decide which records to store on which nodes? Our goal with partitioning is to spread the data and the query load evenly across nodes. If every node takes a fair share, then in theory 10 nodes should be able to handle 10 times as much data and 10 times the read and write throughput of a single node.
If the partitioning scheme is unfair, so that some partitions end up with more data or queries than others, we call it skewed. Skew defeats the whole purpose of partitioning because it makes it less effective. In an extreme case, all the load could end up on one partition, so that in a cluster of 10 nodes, nine sit idle and the one busy node becomes the bottleneck; such an overloaded partition is called a hot spot.
The simplest way to avoid hot spots would be to assign records to nodes randomly. That would distribute the data quite evenly across the nodes, but it has a big disadvantage: when we want to read a particular record, we have no way of knowing which partition holds it, so we have to query all partitions in parallel.
Let’s assume that we have a simple key-value data model, in which we always access a record by its primary key. One way to partition is to assign a continuous range of keys (from some minimum to some maximum) to each partition, much like the volumes of a paper encyclopedia.
If we know the boundaries between the ranges, we can easily determine which partition contains a given key and send the request directly to the appropriate node. The boundaries need not be evenly spaced, because the data may not be evenly distributed; to spread the data evenly, the partitioning scheme should adapt to the data. Within each partition we can keep the keys in sorted order, which makes range scans easy, and we can treat the key as a concatenated index in order to fetch several related records in one query. A rough sketch of the boundary lookup is shown below.
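The following is a minimal sketch of key-range partition lookup; the boundaries and partition names here are invented for illustration, not taken from any particular system.

```python
import bisect

# Hypothetical boundaries: partition i holds keys in [boundaries[i-1], boundaries[i]).
# They are chosen to fit the data, not necessarily evenly spaced.
boundaries = ["f", "m", "t"]           # splits the keyspace into 4 ranges
partitions = ["p0", "p1", "p2", "p3"]

def partition_for(key: str) -> str:
    """Binary-search the sorted boundaries to find the owning partition."""
    return partitions[bisect.bisect_right(boundaries, key)]

print(partition_for("aardvark"))  # p0  (before "f")
print(partition_for("zebra"))     # p3  (after "t")
```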
Partitioning by hash of a key
Because of this risk of skew and hot spots, many distributed datastores use a hash function to determine the partition for a given key. A good hash function takes skewed data and makes it uniformly distributed. Say we have a 32-bit hash function that takes a string: whenever we give it a new string, it returns a seemingly random number between 0 and 2^32 - 1, and even very similar input strings get hashes that are evenly distributed across that range. For partitioning purposes the hash function need not be cryptographically strong; Cassandra and MongoDB, for example, use MD5. Once we have a suitable hash function for keys, we can assign each partition a range of hashes (rather than a range of keys), and every key whose hash falls within a partition’s range will be stored in that partition, as shown below.
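A minimal sketch of the idea follows; the partition count and the equal split of the hash range are assumptions for illustration, not tied to any specific database.

```python
import hashlib

NUM_PARTITIONS = 8  # assumed cluster size for this sketch

def partition_for(key: str) -> int:
    """Map a key to a partition via a hash of the key (MD5, as in the text)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    h = int.from_bytes(digest[:4], "big")     # take a 32-bit slice of the hash
    # Each partition owns an equal slice of the 0 .. 2**32 - 1 hash range.
    return h * NUM_PARTITIONS // 2**32

# Similar keys land on unrelated partitions, spreading the load,
# but their original sort order is lost.
for key in ["user_1", "user_2", "user_3"]:
    print(key, "->", partition_for(key))
```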
However, with this scheme we lose a nice property of key-range partitioning: the ability to do efficient range queries. Keys that were once adjacent are now scattered across all the partitions, so their sort order is lost.
Cassandra achieves a compromise between the two partitioning strategies above. A table in Cassandra can be declared with a compound primary key consisting of several columns. Only the first part of that key is hashed to determine the partition, while the other columns are used as a concatenated index for sorting the data in SSTables. A query therefore cannot search for a range of values within the first column of the compound key, but if it specifies a fixed value for the first column, it can perform an efficient range scan over the other columns of the key.
This concatenated-index approach enables an elegant data model for one-to-many relationships. For example, on a social media site, one user may post many updates. If the primary key for updates is chosen to be (user_id, update_timestamp), then we can efficiently retrieve all the updates made by a particular user within some time interval, sorted by timestamp. Different users may be stored on different partitions, but within each user the updates are stored on a single partition, ordered by timestamp. A toy sketch of this layout follows.
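To make that concrete, here is a small in-memory sketch of the compound-key idea: hash the first key part to pick the partition, and keep the remaining parts sorted inside it. The partition count, names, and data are invented for illustration.

```python
import bisect
import hashlib

NUM_PARTITIONS = 4  # assumed for this sketch

# Each partition keeps its entries sorted by (user_id, update_timestamp),
# mimicking the concatenated index described above.
partitions: list[list[tuple[str, int, str]]] = [[] for _ in range(NUM_PARTITIONS)]

def partition_for(user_id: str) -> int:
    h = int.from_bytes(hashlib.md5(user_id.encode()).digest()[:4], "big")
    return h * NUM_PARTITIONS // 2**32

def post_update(user_id: str, timestamp: int, text: str) -> None:
    part = partitions[partition_for(user_id)]
    bisect.insort(part, (user_id, timestamp, text))

def updates_between(user_id: str, start: int, end: int):
    """Range scan over one user's updates: a single partition, sorted order."""
    part = partitions[partition_for(user_id)]
    lo = bisect.bisect_left(part, (user_id, start, ""))
    hi = bisect.bisect_right(part, (user_id, end, "\uffff"))
    return part[lo:hi]

post_update("alice", 100, "first post")
post_update("alice", 200, "second post")
post_update("bob", 150, "hello")
print(updates_between("alice", 50, 250))  # alice's updates, sorted by timestamp
```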
Hashing a key to determine its partition helps reduce hot spots, but it cannot avoid them entirely: in the extreme case where all reads and writes are for the same key, all requests still end up on the same partition. This kind of workload is perhaps unusual, but not unheard of. On a social media site, for example, a celebrity user with millions of followers may cause a storm of activity when they do something, resulting in a large volume of writes to the same key. Hashing the key doesn’t help, as the hash of two identical keys is still the same. Today it is up to the application to handle such highly skewed workloads. For example, if one key is known to be a hot spot, a simple technique is to add a random number to the beginning or end of the key, splitting the writes across several keys. However, reads now have to do additional work: they must fetch the data from all of those keys and combine it. It is not efficient to do this for every key, since for the vast majority of keys with low write throughput it would only add overhead, so we need some way of keeping track of which keys have been split. A sketch of the idea is shown below.
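One way to picture this hot-key splitting: writes to a known hot key go to one of several randomly suffixed sub-keys, and reads fan out over all of them. The key names and split factor here are made up, and a real system would still need to track which keys are split, as noted above.

```python
import random

SPLIT_FACTOR = 10                        # assumed number of sub-keys per hot key
HOT_KEYS = {"celebrity_user_feed"}       # hypothetical set of known hot keys

def write_key(key: str) -> str:
    """Writes to a hot key are spread across SPLIT_FACTOR sub-keys."""
    if key in HOT_KEYS:
        return f"{key}_{random.randrange(SPLIT_FACTOR)}"
    return key

def read_keys(key: str) -> list[str]:
    """Reads of a hot key must fan out over all sub-keys and merge the results."""
    if key in HOT_KEYS:
        return [f"{key}_{i}" for i in range(SPLIT_FACTOR)]
    return [key]

print(write_key("celebrity_user_feed"))  # e.g. celebrity_user_feed_7
print(read_keys("celebrity_user_feed"))  # all 10 sub-keys to query and combine
```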
Perhaps future database systems will solve these problems for us automatically, but until then we need to think about these trade-offs in our own applications.