Replication in databases - Part 2
This write up is a continuation of the post at Replication part1 if you haven’t read it, i would recommend you to read it first and then jump into this write up.
Let me tell you how leader based replication works under the hood. There are several methods which are used in practice,
Statement-based replication
In the simplest case, the leader logs every write request(statement) that it executes and sends the statement log to its followers(replicas). For a relational database this means that every INSERT, UPDATE AND DELETE statement is forwarded to followers, and each of the follower parses and executes that SQL statement as if it had been received from a client. Although this sounds reasonable, there are many ways in which this approach to replication can break:
- Any statement that calls a non-deterministic function, such as NOW() to get the current date and time or RAND() to get a random number, is likely to generate a different value on each replica.
- If statements use an auto incrementing column, or if they depend on the existing data in the database(eg, UPDATE…
WHERE
), they must be executed in exactly the same order on each replica, or else they may have a different effect. This can be limiting when there are multiple concurrently executing transactions. - Statements that have side effects (eg, triggers, stored procedures, user-defined functions) may result in different side effects occurring on each replica, unless the side effects are absolutely deterministic.
Even though it is possible to work-around these issues- for example, the leader can replace anny non-deterministic function calls with a fixed return value when the statement is logged so that the followers all get the same value. However, because there are so many edge cases, other replication methods which follow are generally preferred.
Write-ahead log shipping
In my post about SSTables & LSM-Trees SSTables&LSM-Trees, I’ve explained how log becomes the main place of storage and how log segments are compacted and garbage-collected in the background. Similarly in case of a B-Tree(which overwrites individual disk blocks), every modification is first written to a write-ahead log so that the index can be restored to a consistent state after a crash. In either case, the log is an append-only sequence of bytes containing all the writes to the database. We can use the exact same log to build a replica on another node: besides writing the log to disk, the leader also sends it across the network to its followers. The follower upon processing this log, builds a copy of the exact same data structures as found on the leader. The main disadvantage of this approach is that the log describes data at a very low level: a WAL contains details of which bytes were changed in which disk blocks. This makes replication closely coupled with the storage engine(InnoDB, MyISAM etc). So if the database changes its storage format from one version to another, it is typically impossible to run different versions of the database software on the leader and replicas.
Row-Based Replication
This approach uses different formats for replication and for the storage engine, which allows the replication log to be decoupled from the storage engine internals. This kind of replication log is called logical-log, to distinguish it from the storage engine’s(physical) data representation. This logical log is usually a sequence of records describing the writes to database tables at the granularity of a row( In case of RDBMS).
- For an inserted row, the log contains the new values for all columns.
- For a deleted row, the log contains enough information to uniquely identify the row that was deleted(Tombstone). Typically this would be the primary key, but if there is no primary key on the table, the old values of the columns need to be logged(for identification).
- For an updated row, the log contains enough information to uniquely identify the updated row, and the new values for all the columns(or at least the new values of all columns that’ve changed).
The above replication methods are all implemented by the database itself. There are other approaches where replication is handled at the application layer. One among them is trigger-based replication, A trigger let’s you register custom application code that is automatically executed when a data change(write transaction) occurs in the database. The trigger has the opportunity to log this change to a separate table, from which it can be read by an external process. This external process can then apply necessary application logic and replicate the data to another system(not to be confused with Command-Query Responsibility Segregation). Even though this approach has the flexibility of adding custom logic, in practise trigger-based replication is very slow.
Problems with Replication Lag
As described in Part-1, in read-scaling architecture, we increase the capacity of serving read-only requests by adding new replicas. However, this approach works realistically in asynchronous replication mode, since synchronous mode adds single point of failure equally on all of the members of the cluster. But if a client reads from an asynchronous follower, it may see outdated information due to Replication Lag(Follower following behind leader). If we run the same query on both the leader & follower we might see different results, because not all writes have been executed in the follower. This might be just a temporary state, if we stop writing to the leader and wait for a while, the followers will eventually catch up and become consistent with the leader. This effect is known as Eventual-Consistency.
Following are the problems that are likely to occur when there is replication lag:
- Reading your own writes: All the applications allow the client/user to submit some data(write) and then view the data which he/she submitted(read). This might be anything ranging from a customer record, comment on a discussion thread etc. When new data is submitted, it must be sent to the leader, but when the user views the data, it can be read from a follower. This is especially appropriate if the data is frequently viewed, but occasionally written. The problem here is described at
To the user here, it looks as though the data he/she submitted was lost, so they will be unhappy. Even worse, they might try submitting the data multiple times. We need read-after-write consistency here, which guarantees that if the user reloads the page, or reads the data again, they will always see any updates they made.
-
Solution
- When reading something that the user may have modified, read it from the leader otherwise read it from the replica. Question here is how will the system get to know that the data has been modified without actually querying it. A simple thought is to do this on the basis of the nature of data - for Example: an user-profile information is always editable, hence when the user requests for user-profile information, always serve it from leader.
- This again is not effective if most of the things are potentially editable by the users, in this case the approach would be to track the last update time on the follower and for a threshold time say(1 minute) after the last update, serve all the reads from the follower.
- Thirdly the client can remember the timestamp of the most recent write, then the system can ensure that the replica serving any reads for that user reflects updates at least until that timestamp. If the replica isn’t up to date, either the read can be handled by another replica or the query can wait until the replica has caught up.
Another complication is when the same user is accessing the services from multiple devices, in this case we will have to provide cross-device-read-after-write consistency(and my lecturer said databases were easy to understand :| ). In this architecture there are multiple pieces of meta-data which has to be centralised(which goes against availability).
- Monotonic reads The second anomaly that can occur is that it’s possible for an user to see things moving backward in time(I’ve witnessed this while checking for score in ESPN :P ). In the scenario above, user 2345 makes the same query twice, first to follower with little lag, then to a follower with greater lag.(Very likely scenario if the user keeps refreshing the page frequently, so that the requests land up on different nodes). The first query returns a comment that was recently added by user 1234, but the second query doesn’t return anything because the lagging follower has not yet picked up that write. In effect, the second query is observing the system at an earlier point in time than the first query. This wouldn’t be so bad if the first query hadn’t returned anything, because user 2345 probably wouldn’t know that user 1234 had recently added a comment. However, it’s very confusing for user 2345, if he/she first see user 1234’s comment appear, and then disappear again.
Monotonic reads is a guarantee that this kind of anomaly doesn’t happen. It’s a weaker guarantee than strong consistency, but a stronger guarantee than eventual consistency. When a client reads the data, he/she might see an old value, monotonic reads means that even if the same client makes several reads, he/she will not see the time moving backward(D1 > D2 > D3 all the data being ordered with time). One way of solving this is to ensure that a client always reads from the same replica(based on consistent hashing techniques etc). But the pain-point here is that if the replica goes down, then the requests need to be re-routed :(
- Consistent Prefix Reads The third anomaly is concerned with violation of causality. Imagine the following short dialog conversation between Mr. Foo and Mrs. Bar(Teetotaller):
Mr. Foo What have you been doing these days Mrs. Bar? Mrs. Bar Just dozing, nothing much Mr. Foo.
As you can see, there is a causal dependency between these two sentences: Mrs. Bar heard Mr. Foo’s question and answered it. Now, let’s imagine a third person(Eaves dropper), listening to this conversation through replicas. The things said by Mrs. Bar go through a follower with a little lag, but the things said by Mr. Foo have a longer replication lag let’s suppose. Then the eaves dropper would hear the following Mrs. Bar Just dozing, nothing much Mr. Foo. Mr. Foo What have you been doing these days Mrs. Bar. After listening to this Eaves dropper be like
Preventing this anomaly requires another type of guarantee called consistent-prefix-reads. This guarantee holds the invariant that if a sequence of writes happen in a certain order, then anyone reading those writes will see them in the same order. This problem is very evident in partitioned(sharded) databases.
Coping with replication lag
As described above, when working with eventual-consistent systems its worth-while thinking about how the application behaves if the replication lag increases to minutes/hours. If the answer is No-Problem, then we don’t have to worry. However, in most of the cases, these anomalies result in providing a bad user experience, hence it is worth-while to spend some time to design systems which take care of these problems.
All the while until now, I’ve written about Single-Leader databases, the next post will be about Multi-Leader & Leaderless databases.