The trouble with distributed systems part 2
This is a continuation to the write up at Trouble with distributed systems - 1, where the main actor was Networks. This write up will mostly be talking about Clocks
We all know that clocks and time are important(even though there’s no such entity in nature), Software applications depend on clocks in various ways to answer questions which are similar to
- Has this request timed out yet?
- What’s the 99th percentile response time of this Api ?
- How many queries per second did this service handle on average in the last five minutes ?
- How much time did the user spend on our site ?
- How frequently should we probe the health check API on this service ?
- When was this article published ?
- When did this application start ?
Queries 1-4 are durations, on the other hand Queries 6-7 are points in time i.e events which occur on a particular date at a particular time. In a distributed system, time is a very tricky business, because communication is not instantaneous, i.e it takes time for a message to travel across the network from one machine to another. The time when a message is received is always later than the time when it is send, but due to variable delays in the network, we never accurately know how much later it will be received. This fact sometimes makes it difficult to determine the order in which things happened when multiple machines are involved(this gave rise to distributed tracing tools like zipkin).
Another important thing to note is that each machine on the network has its own clock, which is an actual hardware device(usually a quartz crystal oscillator). These devices are not perfectly accurate, so each machine will have its own notion of time, which may be slightly faster or slower than on other machines. However it is possible to synchronize clocks to some degree: the most commonly used mechanism is the Network Time Protocol(NTP), which allows computer clocks to be adjusted according to the time reported by a group of servers.
Type of clocks
Time of day clocks
A time-of-day clock does what we intuitively expect of a clock: it returns the current date and time according to some calendar(also known as wall-clock time). For example, clockgettime(CLOCKREALTIME) on linux, System.currentTimeMills() in Java return the number of seconds(or milliseconds) since the epoch(Midnight UTC on 01-01-1970), according to gregorian calendar, not counting leap seconds. Time-of-day clocks are usually synchronized with NTP, which means that a timestamp from one machine(ideally) means the same as timestamp on another machine.
Monotonic clocks
A monotonic clock is suitable for measuring a duration(time interval), such as a timeout or a service’s response time, clockgettime(CLOCKMONOTONIC) on linux, and System.nanoTime() in Java are monotonic clocks, for example. They are monotonic because they are guaranteed to always more forward(whereas a time-of-day clock may jump back in time, if it is ahead of NTP by several minutes, it will be forced to reset which might look like the machine is jumping back in time). This clock is used for measuring durations, by checking the value now say a, do some operation and then checking the value again say b, b-a should give you the duration it took for the operation to complete. Whereas the absolute value a or b, is of very less importance, since it can denote the number of nanoseconds since the computer started, or something arbitrary. In particular, it doesn’t make any sense to compare monotonic clock values from two different computers.
Accuracy and Synchronization of clocks
Monotonic clocks do not need any kind of synchronization, but time-of-day clocks need to be set according to NTP server or other external time source in order to be useful. But these methods for getting a clock to tell the correct time aren’t always as reliable or accurate as we might hope, Some examples
- The quartz clock in a computer is not very accurate: it drifts(runs faster or slower than it should). Clock drift varies depending on the temperature of the machine.
- If a computer’s clock differs too much from an NTP server, it may refuse to synchronize, or the local clock will be forcibly reset. Any applications observing the time before and after this reset may see time go backward or suddenly jump forward.
- If a machine is accidentally firewalled off from NTP servers, the misconfiguration may go unnoticed for sometime.
- NTP synchronization can only be as good as the network delay, so there is a limit to its accuracy when we’re on a congested network with variable packet delays.
- Some NTP servers are wrong or mis-configured, reporting time that may be off by hours. Thus the NTP clients are required to query multiple servers and ignore outliers.
- In virtual machines, the hardware clock is virtualized, which raises additional challenges for applications that require accurate timekeeping. When the CPU core is shared between virtual machines, each VM will be paused for tens of milliseconds while another VM is running(Preemption). From an application’s point of view, this pause will manifest itself as the clock suddenly jumping forward.
Reliability of Synchronized clocks
Clocks while they might seem simple and easy to use, they come with a number of pitfalls: a day may not have exactly 86,400 seconds, time-of-day clocks may move backwards in time, time on one machine may be different from another. These problems require us to write robust software which needs to be prepared to deal with incorrect clocks. The major problem is that incorrect clocks can go easily unnoticed. If a machine’s CPU is defective or its network is mis-configured, it most likely won’t work at all, so it will quickly be noticed and fixed. On the other hand, if its quartz clock is defective or its NTP client is mis-configured, most things will seem to work fine, even though its clock gradually drifts further and further away from reality. If some piece of software is relying on an accurately synchronized clock, the result is more likely to be silent and subtle data loss than a dramatic crash
Case studies
- I happened to work for a firm which was a B2B software as a service provider, where wee were supposed to offer Single sign on functionality to the users of our clients. One of the clients asked us to use SAML as the protocol for authenticating its users with us. If you need a bit of context on SAML, you can read about it at Single Sign On (SAML), an important piece of information in this type of authentication is the Token expiry time. It so happened that the client’s Identity provider which was supposed to generate the token was running on a window’s server which wasn’t synchronising its time with NTP(aaarrggh windows again). When the user’s of that particular client tried to access our system, they weren’t able to login. This became a very serious issue(you know B2B clients right? they crib about everything) and got escalated to the VP level, me being a developer was told that there’s a production issue and i will have to join the war room to fix it on priority(mehhh). We did all kinds of debugging starting from adding log statements after every line of the authentication handler, till restarting the application. Still nothing worked. We also noticed that the majority of the errors said expired token or token cannot be granted in the future errors. After a day of debugging we found out the culprit(IDP was running on an un synchronised server). This example should help us understand how creepy and irritating timing issues can be sometimes.
- Ordering of events across multiple nodes: the following figure illustrates and example where an arbitrary client say A writes x=1 on node 1, this write is replicated to node 3; Another client say B increments x on node 3 (which leads to x being 2), and both these writes are replicated to node 2.
In the above example, when a write is replicated to other nodes, it is tagged with a timestamp according to the time of the day clock on the node where the write originated. The clock synchronization is very good in this example: the skew between node 1 and node 3 is less than 3ms, which is really better than what we really expect in practice. Nevertheless, the timestamps in the above example fail to order the events correctly, the write x=1 has a timestamp of 42.004s, but the write x=2 has a timestamp of 42.003s, even though x=2 occurred unambiguously later. When node 2 receives these two events, it will incorrectly conclude that x=1 is the most recent value and drop the write x=2. If effect, Client B’s increment operation is lost.
This kind of conflict resolution strategy, is called Last write wins, its widely used in both multi-leader and leaderless databases such as Cassandra and Riak.
Conclusion
Issues related to timestamps are very tricky and very hard to catch. Unless and until we have people who have dealt with such problems in production(these issues will never happen in local/staging! trust me 😜), its gonna take a good amount of caffeine and cognitive effort to identify and derive the resolution. Thus when ever timestamps are involved take some extra bit of care and think through all the scenarios related to the skew and handle it in the application before rolling out to production.