Partitioning in Databases - Part 2
This is a continuation of the post at Database Partitioning - 1, which assumed a key-value data model. If records are only ever accessed via their primary key, determining the partition is fairly straightforward. The situation gets more complicated when secondary indexes are involved. A secondary index usually does not identify a document/row/record uniquely; rather, it is a way of searching for occurrences of a particular value. For example: find all actions by customer XYZ, find all articles containing the word india, find all cars whose color is blue, and so on.
These secondary indexes are essential when searching for documents based on some attribute of the record, so they are very common in SQL databases, NoSQL stores, and full-text search engines. The common problem with them, however, is that they don't map to partitions very neatly.
Partitioning Secondary Indexes by Document

Imagine that we are operating a website for selling used cars, with the partition layout shown in the diagram below. Each listing has a unique ID - let's call it the document ID - and we partition the database by document ID (for example, IDs 0-499 in partition-0, IDs 500-999 in partition-1, and so on).
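As a rough sketch of the routing logic (the names here are hypothetical, not any particular database's API), assigning a document to a partition by ID range could look like this:

```python
# Hypothetical sketch: route a document to a partition by its ID,
# using fixed ranges of 500 IDs per partition (0-499 -> partition-0,
# 500-999 -> partition-1, and so on).
DOCS_PER_PARTITION = 500

def partition_for_document(doc_id: int) -> int:
    return doc_id // DOCS_PER_PARTITION

assert partition_for_document(42) == 0    # lands in partition-0
assert partition_for_document(733) == 1   # lands in partition-1
```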
We would like to let users search for cars, filtering by color and by make, so we need a secondary index on color and make (in a document database these would be attributes; in a relational database, columns). Once we have declared the index, the database performs the indexing automatically. For example, when a red car is added to the database, the partition that stores it automatically adds its document ID to the list for the index entry color:red.
With this approach, each partition is completely separate: each maintains its own secondary indexes, covering only the documents in that partition, and does not care what data is stored in other partitions. Whenever we write to the database - to add, remove, or update a document - we only need to deal with the partition that contains the document ID we are writing. For this reason, a document-partitioned index is also known as a local index.
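A minimal sketch of what a partition with a local secondary index might look like - the class and method names are illustrative, not any particular database's API:

```python
from collections import defaultdict

class Partition:
    """One partition holding documents plus its own local secondary index."""

    def __init__(self):
        self.documents = {}            # doc_id -> document
        self.index = defaultdict(set)  # (field, value) -> set of doc_ids

    def write(self, doc_id: int, doc: dict) -> None:
        # Remove stale index entries if the document is being updated.
        old = self.documents.get(doc_id)
        if old is not None:
            for field, value in old.items():
                self.index[(field, value)].discard(doc_id)
        self.documents[doc_id] = doc
        # Index every attribute, e.g. ("color", "red") -> {doc_id, ...}.
        for field, value in doc.items():
            self.index[(field, value)].add(doc_id)

    def search(self, field: str, value: str) -> set:
        # Only covers documents stored in THIS partition.
        return self.index[(field, value)]
```

Note that a write touches exactly one partition: the one that owns the document ID.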
However, reading from a document-partitioned index requires care: unless we have done something special with the document IDs, there is no reason why all the cars of a particular color or make would end up in the same partition. In the diagram above, red cars appear in both partition-0 and partition-1. Thus, if we want to search for red cars, we need to send the query to all partitions and combine the results we get back.
This approach to querying a partitioned database is known as scatter/gather, and it can make read queries on secondary indexes quite expensive. Even if we query all partitions in parallel, scatter/gather is prone to tail latency amplification. Nevertheless, it is widely used: MongoDB, Riak, Cassandra, and Elasticsearch all use document-partitioned secondary indexes. Some people recommend structuring the partitioning scheme so that all secondary index queries can be served from a single partition, but that is not always possible, especially when a single query uses multiple secondary indexes (such as filtering cars by color and by make at the same time).
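Continuing the Partition sketch above, a scatter/gather read could fan the query out to every partition and union the partial results; the thread pool here is just one of several ways to query partitions in parallel:

```python
from concurrent.futures import ThreadPoolExecutor

def scatter_gather_search(partitions, field, value):
    # Scatter: send the query to ALL partitions in parallel.
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial = list(pool.map(lambda p: p.search(field, value), partitions))
    # Gather: combine the partial results from every partition.
    return set().union(*partial)
```

Even with parallelism, the query is only as fast as the slowest partition - which is exactly the tail latency amplification mentioned above.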
Partitioning Secondary Indexes by Term

Rather than each partition having its own secondary index, we can construct a global index that covers data in all partitions. However, we can't just store that index on one node, since it would likely become a bottleneck and defeat the purpose of partitioning. A global index must itself be partitioned, but it can be partitioned differently from the primary key index.
The diagram above illustrates what this could look like: red cars from all partitions appear under color:red in the index, but the index is partitioned so that colors starting with the letters a to r appear in partition-0 and colors starting with s to z appear in partition-1. The index on the make of the car is partitioned similarly (with the partition boundary falling between f and h).
This kind of index is called term-partitioned, because the term we're looking for determines the partition of the index. We can partition the index by the term itself or by a hash of the term. Partitioning by the term itself is useful for range scans (e.g., on a numeric property, such as the asking price of the car), whereas partitioning by a hash of the term distributes the load more evenly.
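A sketch of how a term could be routed to an index partition, showing both variants; the range boundary mirrors the a-r / s-z example above and is otherwise arbitrary:

```python
import hashlib

# Range-based: terms starting with a-r go to index partition-0, s-z to
# partition-1, mirroring the example above. Preserves range scans over terms.
def index_partition_by_range(term: str) -> int:
    return 0 if term[0].lower() <= "r" else 1

# Hash-based: spreads terms evenly across partitions, but gives up the
# ability to do efficient range scans.
def index_partition_by_hash(term: str, num_partitions: int) -> int:
    digest = hashlib.md5(term.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

assert index_partition_by_range("red") == 0
assert index_partition_by_range("silver") == 1
```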
The advantage of a global (term-partitioned) index over a document-partitioned index is that it can make reads more efficient: rather than doing a scatter/gather over all partitions, a client only needs to make a request to the partition containing the term it wants. The downside of a global index is that writes are slower and more complicated, because a write to a single document may now affect multiple partitions of the index (every term in the document might live on a different partition, on a different node). Doing this correctly would require a distributed transaction across all the index partitions affected by a write, which is not supported in all databases.
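To make the write amplification concrete, here is a hypothetical sketch of how a single document write fans out into index updates on several partitions (the hash routing repeats the earlier sketch so this block runs standalone):

```python
import hashlib
from collections import defaultdict

def index_partition_for_term(term: str, num_partitions: int) -> int:
    digest = hashlib.md5(term.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

def index_updates_for_write(doc_id: int, doc: dict, num_partitions: int):
    # Each (field, value) term may route to a different index partition,
    # so one document write fans out into several index writes.
    updates = defaultdict(list)
    for field, value in doc.items():
        term = f"{field}:{value}"
        updates[index_partition_for_term(term, num_partitions)].append((term, doc_id))
    return updates

# A single listing write may touch as many index partitions as it has terms.
print(index_updates_for_write(306, {"color": "red", "make": "Honda"}, 4))
```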
In practice, updates to global secondary indexes are often asynchronous: if we read the index shortly after a write, the change we just made may not yet be reflected in the index.
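A toy illustration of this read-after-write staleness, assuming index updates are queued and applied later by some background process (all names here are invented for the example):

```python
import queue

index = {}               # term -> set of doc_ids (the global index)
pending = queue.Queue()  # index updates waiting to be applied

def write_document(doc_id, term):
    pending.put((term, doc_id))     # the write returns immediately...

def apply_pending_updates():
    while not pending.empty():      # ...and a background task applies it later
        term, doc_id = pending.get()
        index.setdefault(term, set()).add(doc_id)

write_document(306, "color:red")
print(index.get("color:red"))       # None: the read misses the fresh write
apply_pending_updates()
print(index.get("color:red"))       # {306}: eventually reflected
```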
Rebalancing partitions
Many things change in a database over time:
- Query throughput increases, so we want to add more CPUs to handle the load.
- The dataset size grows, so we want to add more disks and RAM to store it.
- A machine fails, and other machines need to take over the failed machine’s responsibilities.
All of these changes call for data and requests to be moved from one node to another. This process of moving load from one node in the cluster to another is called rebalancing. Many rebalancing schemes are used in practice, but no matter which scheme is used, it is expected to meet some minimum requirements:
- After re-balancing, the load (data storage, read and write requests) should be shared fairly between the nodes in the cluster.
- While rebalancing is under way, the database should continue accepting reads and writes.
- No more data than necessary should be moved between nodes, to make rebalancing fast and to minimize the network and disk I/O load (the sketch below makes this concrete).
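As an illustration of the last requirement (and looking ahead to rebalancing schemes), naive "hash mod N" assignment moves far more data than necessary when the number of nodes changes:

```python
# Count how many keys change partition when going from N=4 to N=5 nodes
# under naive "hash(key) mod N" assignment. Ideally only ~1/5 of the data
# needs to move to populate the new node, but mod-N moves most keys.
moved = sum(1 for key in range(10_000) if hash(key) % 4 != hash(key) % 5)
print(f"{moved / 10_000:.0%} of keys moved")  # 80% here - a lot of churn
```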