LSM trees
This post talks about the internals of sorted string tables (SSTables) and log-structured merge (LSM) trees.
Background
A Database Management System (DBMS) is software used for the reliable storage of data and its fast retrieval. A database is a collection of tables, which usually map to entities in our system (Employee, Organization, and so on).
Now, how does the DBMS store all these entities and their data?
Let’s take a step back and see how we can build our own version of a very minimal DBMS that stores key-value pairs with primitive Linux tools (pun not intended!). We are all aware of grep, a poor man’s search tool (a very powerful one indeed), and sed, a powerful text-manipulation tool. Let’s also assume our data is a simple key-value pair, so storing it can be done by executing the command below:
echo "$key,$value" >> database
Retrieval of a key can be done using
grep "^$key," database | sed -e "s/^$key,//" | tail -n 1
The concept here is simple: on every insert we append a record to the end of the file, and on every fetch we return the latest record for the key. Everything works fine until, someday, our simple DBMS becomes popular and gets a lot of insert requests, at which point retrieval takes a hit, because searching for a key means scanning the entire file, an operation that is O(n).
How can we make this O(n) search operation faster? To do that we need an additional data structure: an index. Now that we have a good understanding of the need for indices, let’s see how they are implemented in practice. We have all used maps/dictionaries at some point or other; inspired by this, let’s have our DBMS maintain an in-memory map whose key is the record’s key and whose value is the byte offset (line number / segment address) of the row in the database file. With this approach a lookup takes O(1) time to fetch the offset, and we then seek directly to that position (using fseek, for example) to read the line from the database file. This map is kept in memory: when a new row is inserted, the (key, offset) entry is stored in the map, and when the key is deleted, the entry is simply knocked out. This still does not scale, however, because we do not have infinite memory.
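As a concrete illustration, here is a minimal sketch of this idea in Python. The class name KVStore and the file name database are illustrative choices, the records keep the key,value line format used above, and a real engine would add error handling and keep the file handle open.

```python
# Minimal append-only key-value store with an in-memory hash index.
# Illustrative sketch: names and file layout are not from any real engine.

class KVStore:
    def __init__(self, path="database"):
        self.path = path
        self.index = {}  # key -> byte offset of the latest record for that key

    def set(self, key, value):
        record = f"{key},{value}\n".encode("utf-8")
        with open(self.path, "ab") as f:
            offset = f.tell()          # where this record will start
            f.write(record)
        self.index[key] = offset       # point the index at the newest record

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)             # jump straight to the record, no scan
            line = f.readline().decode("utf-8").rstrip("\n")
        return line.split(",", 1)[1]


store = KVStore()
store.set("42", "hello")
print(store.get("42"))                 # -> hello
```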
We solve this problem by segmenting the database file at a predefined size, say N records. If we have 1K records, we would have (1K/N) + 1 segment files, each of which is in turn sorted by key. When the current (latest) segment we are appending to grows beyond the threshold N, we create a new file and append the record to that new segment. Searching for a record starts with a binary search on the latest segment, which costs O(log N); if that fails, the search moves to the previous segment, and so on backwards until the key is found. With K segments this takes O(K · log N) time in the worst case. To optimize this further, we can run a background worker whose only job is to take these segments and merge their records, so that both the number of records in each segment and the number of segments are reduced. For example, if key 100 is present in both the latest segment and the one before it, the compaction program removes the record from the old segment and rebalances the segments, as sketched below.
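To make the segmentation and compaction flow concrete, here is a minimal sketch in Python. It models segments as in-memory sorted lists rather than files, and SEGMENT_SIZE stands in for the threshold N; both simplifications are mine, not how any particular engine does it.

```python
# Segmented storage with a compaction pass (illustrative, in-memory sketch).
import bisect

SEGMENT_SIZE = 4   # the threshold N from the text (deliberately tiny here)
segments = []      # segments[-1] is the latest segment; each is sorted by key

def append(key, value):
    # Start a new segment once the latest one reaches the threshold.
    if not segments or len(segments[-1]) >= SEGMENT_SIZE:
        segments.append([])
    latest = segments[-1]
    latest[:] = [(k, v) for k, v in latest if k != key]  # newest value wins
    bisect.insort(latest, (key, value))                  # keep it sorted by key

def search(key):
    # Binary-search the newest segment first, then fall back to older ones.
    for segment in reversed(segments):
        i = bisect.bisect_left(segment, (key,))
        if i < len(segment) and segment[i][0] == key:
            return segment[i][1]
    return None

def compact():
    # Keep only the newest value for each key and rebuild smaller segments.
    latest = {}
    for segment in segments:          # oldest to newest, so newer wins
        for key, value in segment:
            latest[key] = value
    merged = sorted(latest.items())
    segments.clear()
    for i in range(0, len(merged), SEGMENT_SIZE):
        segments.append(merged[i:i + SEGMENT_SIZE])
```

In a real engine each segment would be an immutable file on disk, and compaction would write new files before deleting the old ones; the sketch only shows the merge logic.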
If a record has to be deleted, it is marked as deleted using a special marker (called a tombstone); when a search encounters a record carrying a tombstone marker, the record is ignored.
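Continuing the sketch above, a delete can be modelled as just another append that writes a tombstone record; the TOMBSTONE sentinel and the function names are illustrative.

```python
# Extends the segment sketch above: deletes append a tombstone record
# instead of modifying data in place. TOMBSTONE is an illustrative sentinel.

TOMBSTONE = "__tombstone__"

def delete(key):
    append(key, TOMBSTONE)       # a delete is just another append

def get(key):
    value = search(key)
    # A tombstone means the key was deleted after its last real value was written.
    return None if value == TOMBSTONE else value

# During compaction, any key whose newest record is a tombstone can be
# dropped from the merged output entirely, reclaiming its space.
```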
Using this approach, we can summarize as follows:
- Concurrency can be guaranteed by allowing only one thread to do the writes; this is still fast because a write in our case is just an append-only operation. Applications which require high write throughput benefit from this.
- Deleting records can be achieved by adding a tombstone marker to a line to indicate that it is deleted.
- Read operations are also going to be faster because of segmentation and the per-segment binary search.
One factor that has to be considered here is how frequently the compaction program runs: if compaction is very frequent, writes take a hit, and if it is very rare, reads end up becoming slow.
We solve this by maintaining a self-balancing tree (an AVL tree or a red-black tree) of a preconfigured size K, whose nodes hold the latest map’s contents. All writes and reads first hit this tree; on a miss, the operation falls back to the maps of the remaining segments. With this approach we can safely assume that most reads are serviced from the in-memory tree before falling back to the segments, and the compaction program and the database worker program are decoupled. To add resiliency, we can also write the contents of the tree to disk at periodic intervals, so that if a crash happens, reconstruction of the tree is faster. The maps of the other segments can be reconstructed by parsing the segment files in any case.
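Here is a hedged sketch of that in-memory tree plus periodic persistence. Python has no built-in AVL or red-black tree, so a plain dict that is sorted at flush time stands in for it; the write-ahead log file name, FLUSH_THRESHOLD, and the flush_segment callback are all illustrative assumptions.

```python
# Memtable sketch: write-ahead log (WAL) for crash recovery, flush to a
# sorted segment once the preconfigured size K is reached. Illustrative only.
import json

FLUSH_THRESHOLD = 1024   # the preconfigured size K from the text

class Memtable:
    def __init__(self, wal_path="wal.log"):
        self.tree = {}               # stand-in for the balanced tree
        self.wal_path = wal_path
        self._recover()

    def _recover(self):
        # On startup, replay the WAL so a crash does not lose recent writes.
        try:
            with open(self.wal_path, "r", encoding="utf-8") as wal:
                for line in wal:
                    key, value = json.loads(line)
                    self.tree[key] = value
        except FileNotFoundError:
            pass

    def put(self, key, value, flush_segment):
        # Append to the WAL first, then update the in-memory structure.
        with open(self.wal_path, "a", encoding="utf-8") as wal:
            wal.write(json.dumps([key, value]) + "\n")
        self.tree[key] = value
        if len(self.tree) >= FLUSH_THRESHOLD:
            # Write the contents out in sorted order as a new segment,
            # then start over with an empty tree and a fresh WAL.
            flush_segment(sorted(self.tree.items()))
            self.tree.clear()
            open(self.wal_path, "w").close()

    def get(self, key):
        # Reads check the memtable first; the caller falls back to the
        # on-disk segments when this returns None.
        return self.tree.get(key)
```

A caller would route get() misses to the segment search shown earlier, and pass a flush_segment callback that writes the sorted items out as a new segment file.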
This is the approach taken by database engines like Bitcask, which is used in Riak. Write-heavy applications can thus use engines whose basic principle of operation is the same as the one described above. Read-heavy applications, however, follow a different paradigm, which will be explained in the next post.
At the end of the day, which engine or approach to use is purely use-case dependent. A search service is always going to be read-heavy, because once the entities are indexed, updates to them are relatively rare. A catalog service, on the other hand, is write-heavy because entities are frequently inserted, updated, and deleted.