Replication in databases Part 3
This write-up is a continuation of the post Replication Part 2; the focus here is primarily on replication in multi-leader and leaderless databases.
The previous posts focused on replication architectures that use a single leader. Although this is the most commonly used approach, there are alternatives.
Leader-based replication has one major downside: there is only one leader, and all writes must go through it. If a client cannot connect to the leader for whatever reason, it cannot write to the database.
Two common ways to address this problem are
- Allow multiple leaders in the cluster (multi-leader, also called master-master or active/active replication)
- Make all members of the cluster act as equal replicas, with no leader at all (leaderless replication)
Use cases for Multi-Leader Replication
Having multiple leaders within a single data center rarely makes sense, because the added complexity outweighs the benefits. The following are some situations in which this configuration is reasonable.
- Imagine that we have a database with replicas in several different data centers (perhaps to tolerate an entire data center going down in a disaster). With a normal leader-based replication setup, the leader has to be in one of the data centers, and all writes must go through that data center. In a multi-leader configuration, by contrast, we can have a leader in each data center. The diagram above shows how this architecture might look: within each data center, regular leader-based replication to followers is used, but between data centers, each data center's leader replicates its changes to the leaders in the other data centers. Let's now compare how single-leader and multi-leader configurations fare in a multi-data-center deployment.
- Performance: In a single-leader configuration, every write must go over the internet to the data center that hosts the leader. This can add significant latency to writes and might defeat the purpose of having multiple data centers in the first place. In a multi-leader configuration, on the other hand, every write can be processed in the local data center and replicated asynchronously to the other data centers. The inter-data-center network delay is thus hidden from users, which improves perceived write performance.
- Data center outages: In a single-leader configuration, if the data center with the leader fails, failover can promote a follower in another data center to be the new leader. In a multi-leader configuration, each data center can continue operating independently of the others, and replication catches up when the failed data center comes back online.
- Network partitions: Traffic between data centers usually goes over the public internet, which may be less reliable than the local network within a data center. A single-leader configuration is very sensitive to problems on this inter-data-center link, because writes are made synchronously over it. A multi-leader configuration with asynchronous replication can usually tolerate network problems better: a temporary network interruption doesn't prevent writes from being processed.
Handling write conflicts
The biggest problem with multi-leader replication, as shown in the diagram above, is that write conflicts can occur, which means that conflict resolution is required.
For example, consider a scenario where a wiki page is simultaneously being edited by two users, as shown in the diagram below.
User 1 changes the title of the page from A to B, and user 2 changes the title from A to C at the same time. Each user's change is successfully applied to their local leader. However, when the changes are asynchronously replicated, a conflict is detected. This problem does not occur in a single-leader database: there is only one leader processing writes, so if two concurrent writes for the same record arrive, only one of them is applied at a time and the other has to wait until the first one succeeds or fails.
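To make this concrete, here is a minimal Python sketch (not tied to any particular database; all names and the version-checking logic are made up for illustration) of two leaders each accepting a write against the same version of a record, and only noticing the conflict later when they exchange their replication logs:

```python
# Minimal sketch of how two leaders can both accept conflicting writes and only
# discover the conflict later, when they exchange their replication logs.
# All names here are illustrative, not taken from any specific database.

class Leader:
    def __init__(self, name):
        self.name = name
        self.title = "A"          # current value of the wiki page title
        self.version = 0          # local version counter
        self.log = []             # changes to ship to the other leader

    def local_write(self, new_title):
        # The write succeeds immediately against the local leader.
        self.version += 1
        self.title = new_title
        self.log.append((self.version, new_title, self.name))

    def replicate_from(self, other):
        for version, title, origin in other.log:
            if version <= self.version and title != self.title:
                # Both leaders modified the same version of the record:
                # this is a write conflict that must be resolved somehow.
                print(f"{self.name}: conflict between '{self.title}' and '{title}'")
            else:
                self.version = max(self.version, version)
                self.title = title

leader1, leader2 = Leader("leader-1"), Leader("leader-2")
leader1.local_write("B")            # user 1 changes the title A -> B
leader2.local_write("C")            # user 2 changes the title A -> C at the same time
leader1.replicate_from(leader2)     # the conflict only surfaces now, asynchronously
leader2.replicate_from(leader1)
```

Running this prints a conflict message on both leaders: each one accepted its local write immediately, and the disagreement only shows up during asynchronous replication.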
Synchronous Versus Asynchronous Conflict Detection
- In a single-leader database, the second writer either blocks and waits for the first write to complete, or the second write transaction is aborted and the user is forced to retry. In a multi-leader setup, on the other hand, both writes succeed, and the conflict is only detected asynchronously at some later point in time. At that point it may be too late to ask the user to resolve it. We could make conflict detection synchronous, i.e., wait for the write to be replicated to all replicas before telling the user that the write was successful. However, by doing so we would lose the main advantage of multi-leader replication: allowing each replica to accept writes independently.
- The simplest strategy for dealing with conflicts is to avoid them (prevention is better than cure :P). If the application can ensure that all writes for a particular record go through the same leader, then conflicts cannot occur. For example, in an application where users edit their own data, we can ensure that requests from a particular user are always routed to the same data center, and use the leader in that data center for reading and writing. However, sometimes we may need to change the designated leader for a record, perhaps because the user has moved to a different location or because the data center has gone down in a disaster. In these cases, conflict resolution becomes more complex.
- We could eventually converge towards a consistent state if there were an ordering guarantee, but there is no such guarantee in a multi-leader configuration, since operations are replicated asynchronously. In the diagram above, at leader 1 the title is first updated to B and then to C, whereas at leader 2 it is first updated to C and then to B. Neither order is more correct than the other. If each replica simply applied writes in the order it saw them, the database would end up in an inconsistent state: the final value would be C at leader 1 and B at leader 2. This is not acceptable, because every replication scheme should ensure that all replicas eventually converge to the same value once all changes have been replicated. Thus the database must ensure conflicts are resolved in a convergent way. The following are various ways in which this can be achieved:
- Give each write a unique ID (a long random number, a UUID, etc.; this has its own pitfalls, since if two instances happen to choose the same ID we end up with serious problems), pick the write with the highest ID as the winner, and throw away the other writes. If a timestamp is used as the attribute that determines the winner, this is called last write wins (LWW). This approach is popular, but it is dangerously prone to data loss (see the sketch after this list).
- Give each replica a unique ID, and let writes that originated at a higher-numbered replica always take precedence over writes that originated at a lower-numbered replica. This is also prone to data loss.
- Merge the values together. For example, we can order the values alphabetically and concatenate them into something like B/C. This is very use-case specific, and different applications will want to handle it differently.
- Record the conflict in an explicit data structure that preserves all the conflicting versions, and write application code that resolves the conflict at some later time, perhaps by prompting the user.
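To illustrate why last write wins loses data, here is a rough sketch, assuming each write carries a timestamp and a random unique ID for tie-breaking (these field names are hypothetical, not any specific database's format):

```python
import time
import uuid

# Hypothetical last-write-wins (LWW) resolution: each write carries a
# (timestamp, unique id) pair, and the conflicting value with the highest
# pair wins. The losing write is silently discarded, which is exactly the
# data loss described above.

def lww_resolve(write_a, write_b):
    # Each write is a dict: {"value": ..., "ts": ..., "id": ...}.
    # The id breaks ties when two timestamps happen to be equal.
    return max(write_a, write_b, key=lambda w: (w["ts"], w["id"]))

write_b = {"value": "B", "ts": time.time(), "id": str(uuid.uuid4())}
write_c = {"value": "C", "ts": time.time(), "id": str(uuid.uuid4())}

winner = lww_resolve(write_b, write_c)
print("kept:", winner["value"])   # the other value is gone for good
```

Whichever write ends up with the lower timestamp simply disappears: the client that issued it was told the write succeeded, yet the value is silently discarded.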
Custom conflict resolution logic
Since the most appropriate way of resolving a conflict usually depends on the application, most multi-leader replication tools let the application developer write conflict resolution logic in application code. This code may be executed on write or on read.
- On write: As soon as the database detects a conflict in the log of replicated changes, it calls the conflict resolution handler. This handler cannot prompt a user; it runs in a background process and must finish quickly (a rough sketch of such a handler follows this list).
- On read: When a record that has conflicting concurrently-written versions is read, the database returns all the versions to the application, which can then resolve the conflict, perhaps by prompting the user, and write the resolved value back.
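The exact handler API differs from one database to another; the following is only a rough Python sketch, with hypothetical function and field names, of what an on-write handler might do: keep both conflicting values and flag the record so it can be surfaced to a user on a later read.

```python
# Rough sketch of a custom on-write conflict handler, with hypothetical names.
# The database would call resolve_conflict() from a background process as soon
# as it notices two versions of the same record in the replication stream; the
# handler must return quickly and cannot ask a human for help.

def resolve_conflict(record_key, local_version, remote_version):
    # Application-specific policy: for a wiki title we keep both edits so that
    # a user can merge them later from the UI.
    return {
        "key": record_key,
        "value": sorted([local_version["value"], remote_version["value"]]),
        "needs_review": True,   # flag the record so it can be surfaced on read
    }

print(resolve_conflict("page:42", {"value": "B"}, {"value": "C"}))
```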
Multi-leader Replication Topologies
A replication topology describes the communication paths along which writes are propagated from one node to another. If we have only two leaders, there is only one possible topology; with more than two leaders, various topologies are possible, as shown in the diagram below.
The most general topology is all-to-all, in which every leader sends its writes to every other leader. More restricted topologies are also used: MySQL, for example, by default supports only a circular topology, in which each node receives writes from one node and forwards those writes (plus any writes of its own) to one other node. Another popular topology is the star topology, in which one designated root node forwards all writes to all the other nodes. In both circular and star topologies, a write may need to pass through several nodes before it reaches all replicas, so nodes need to forward data changes they receive from other nodes. To prevent infinite replication loops (flooding), each node is given a unique identifier, and in the replication log each write is tagged with the identifiers of all the nodes it has passed through. When a node receives a data change that is tagged with its own identifier, it ignores that change.
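Here is a small sketch, with purely illustrative names, of how this identifier-tagging breaks replication loops in a ring of three nodes:

```python
# Sketch of loop prevention in circular/star topologies: every forwarded write
# carries the identifiers of the nodes it has already passed through, and a
# node drops any change that is already tagged with its own identifier.

class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.applied = []
        self.next_node = None     # the node we forward writes to (circular topology)

    def receive(self, value, seen_ids):
        if self.node_id in seen_ids:
            return                # already processed this write: break the loop here
        self.applied.append(value)
        if self.next_node is not None:
            self.next_node.receive(value, seen_ids | {self.node_id})

# Build a ring of three nodes: n1 -> n2 -> n3 -> n1
n1, n2, n3 = Node("n1"), Node("n2"), Node("n3")
n1.next_node, n2.next_node, n3.next_node = n2, n3, n1

n1.receive("x = 1", seen_ids=frozenset())   # a client write arriving at n1
print(n1.applied, n2.applied, n3.applied)   # each node applies the write exactly once
```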
The problem with star and circular topologies is that the failure of a single node can bring replication between the other nodes to a halt; replication will not resume until that node comes back online (or the topology is manually reconfigured to route around it).
On the other hand, the all-to-all topology has problems as well. Some network links may be faster than others, with the result that some replication messages can overtake others, as shown in the diagram below. In this scenario, client A inserts a row into a table on leader 1, and client B updates that row on leader 3. Leader 2, however, may receive the writes in a different order: it may first receive the update (which, from its point of view, is an update to a row that doesn't even exist in the database) and only later receive the corresponding insert. This is a problem of causality, similar to the one we saw in Consistent Prefix Reads of Database Replication-Part 2: the update depends on the prior insert, so we need to make sure that all nodes process the insert first and the update afterwards. Simply attaching a timestamp to each operation is not sufficient, because clocks cannot be trusted (trust me, I'll explain more about this in another write-up :P) to be in sync across nodes well enough to correctly order these operations at the receiving leader.
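A small sketch of that failure mode, with purely illustrative names: a naive replica that applies changes strictly in arrival order drops the update, because the row it refers to does not exist yet.

```python
# Sketch of the causality problem: leader 2 receives the UPDATE before the
# INSERT it depends on, because the two changes travelled over different
# (unequally fast) links. Names and structure are purely illustrative.

replica_table = {}   # leader 2's local copy of the table

def apply(change):
    op, key, value = change
    if op == "insert":
        replica_table[key] = value
    elif op == "update":
        if key not in replica_table:
            # From this replica's point of view the row does not exist yet,
            # so a naive implementation silently drops (or errors on) the update.
            print(f"dropping update for unknown row {key!r}")
            return
        replica_table[key] = value

# The insert from leader 1 is delayed; the update from leader 3 arrives first.
apply(("update", "row-1", "world"))   # causally depends on the insert below
apply(("insert", "row-1", "hello"))

print(replica_table)   # {'row-1': 'hello'} -- the update was lost
```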
Thus, the key takeaway is that if you are going to use a system with multi-leader replication, it is worth being aware of these issues. Carefully testing the system with your data, to ensure that it actually provides the guarantees you need, will save a lot of effort resolving issues that could have been avoided.