
The trouble with distributed systems part 1

A program running on a single computer normally behaves in a fairly predictable way: it either works or it doesn’t. Buggy software might give the appearance that the computer is sometimes “having a bad day” (which we often fix with a reboot, technically restoring it to a sane state), but most of the time this is just a consequence of badly written software logic.

There is no fundamental reason why software on a single computer should go rogue or turn flaky: given that the hardware is working correctly, the same operation always produces the same result (something like y = f(x); given that the function has no side effects, it is deterministic). If there is a hardware problem (e.g., memory corruption), the consequence is usually a total system failure. An individual computer with well-written software is usually either fully functional or entirely broken (either 0 or 1), but not something in between (not a quantum bit 😜).

On the other hand, when we write software that runs on several computers connected by a network, the situation is totally different. In distributed systems we are no longer operating in an idealized system model. We have no choice but to confront the messy reality of the physical world, in which a remarkably wide range of things can go wrong, as illustrated by these words of Coda Hale:

In my limited experience I’ve dealt with long-lived network partitions in a single data center (DC), PDU [power distribution unit] failures, switch failures, accidental power cycles of whole racks, whole-DC backbone failures, whole-DC power failures, and a hypoglycemic driver smashing his Ford pickup truck into a DC’s HVAC [heating, ventilation, and air conditioning] system. And I’m not even an ops guy.

Thus, in a distributed system, there may well be some parts of the system that are broken in some unpredictable way, even though other parts are working fine. This is known as partial failure. The difficulty is that partial failures are nondeterministic: if we try to do anything involving multiple nodes and the network, it may sometimes work and sometimes unpredictably fail. In some scenarios we might not even know whether something succeeded. If we want to make distributed systems work, we must accept the possibility of partial failure. In a small system it’s quite likely that most of the components are working correctly most of the time (and if something does go wrong, it will not go unnoticed). However, sooner or later, some part of the system will become faulty, and the software will have to handle it somehow.

This fault-handling mechanism should be part of the software design, and we as developers/operators of the software need to know what behaviour to expect from it in case of a fault. It’s unwise to assume that faults are rare and simply hope for the best. Hence, it’s important to consider a wide range of possible faults (even fairly unlikely ones) and to artificially create such situations in our staging/testing environment to see what happens. Netflix’s Chaos Monkey is the classic example.
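
As a toy illustration of that idea (this is not Netflix’s actual Chaos Monkey; the wrapper name and failure rate below are made up for the sketch), a call to a dependency can be wrapped so that it occasionally fails or stalls, forcing the error-handling paths to be exercised in a staging environment:

    import random
    import time


    class InjectedFault(Exception):
        """Raised by the wrapper to simulate a dependency failure."""


    def chaos_call(fn, *args, failure_rate=0.1, max_delay_s=2.0, **kwargs):
        """Call fn, but occasionally fail or stall (staging/testing only)."""
        if random.random() < failure_rate:
            raise InjectedFault(f"simulated failure calling {fn.__name__}")
        # Simulate a slow dependency (e.g. a GC pause or a congested network link).
        time.sleep(random.uniform(0, max_delay_s))
        return fn(*args, **kwargs)


    # Hypothetical usage: the caller must now handle both InjectedFault and
    # responses that arrive much later than usual.
    # result = chaos_call(payment_client.charge, order_id, failure_rate=0.05)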

Trouble 1 - Unreliable Networks

Distributed systems are mainly built on the shared-nothing paradigm, i.e., a bunch of machines connected by a network. The network is the only way those machines can communicate: each machine has its own memory and disk, and one machine cannot access another machine’s memory or disk (except by making requests to a service over the network).

Shared-nothing is not the only way of building systems, but companies have embraced it mainly because it’s comparatively cheap (it requires no special hardware), it can make use of commoditised cloud computing services, and it can achieve high reliability through redundancy across multiple geographically distributed data centers.

The internet is an asynchronous packet network: one node can send a message (a packet, in TCP terms) to another node, but the network gives no guarantees as to when it will arrive, or whether it will arrive at all. If we send a request and expect a response, many things could go wrong:

  1. The request may have been lost (someone might have unplugged a network cable).
  2. The request might be waiting in a queue (packet congestion handled via buffering) and will be delivered later, because the recipient is overloaded.
  3. The recipient might have crashed, or it might have been shut down.
  4. The recipient might have temporarily stopped responding (e.g., it is experiencing a long garbage collection pause).
  5. The recipient may have processed the request, but the response has been lost on the network (e.g., a network switch might have been misconfigured).
  6. The recipient may have processed the request, but the response has been delayed and will be delivered later.
    [Figure: an asynchronous packet network]

In all of the above scenarios the sender cannot even tell whether the packet (message) was delivered; the only option is for the recipient to send a response message, which may in turn be lost or delayed. These issues are indistinguishable in an asynchronous network: the only information we have is that we haven’t received a response yet. If we send a request to another node and don’t receive a response, it’s impossible to tell why.

The usual way of handling this issue is with a timeout, as described in Importance of timeouts. However, even after a timeout, we still don’t know whether the remote node got our request or not.
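
Here is a minimal sketch of that ambiguity, assuming a hypothetical remote node and request format: when the call times out, all we know is that no response arrived in time, not which of the failure modes above actually occurred.

    import socket

    SERVER = ("10.0.0.5", 9000)   # hypothetical remote node
    REQUEST = b"PING\n"           # hypothetical application-level request


    def send_request(timeout_s=2.0):
        try:
            with socket.create_connection(SERVER, timeout=timeout_s) as sock:
                sock.settimeout(timeout_s)
                sock.sendall(REQUEST)
                return sock.recv(1024)   # only here do we know the request succeeded
        except socket.timeout:
            # The request may have been lost, still queued, processed with the
            # reply lost, or the reply may simply be late -- we cannot tell which.
            return None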

Detecting network faults

There are many systems which need to automatically detect faulty workers/nodes. For example:

  1. A load balancer needs to stop sending requests to a node that is dead (i.e., take it out of rotation).
  2. In a distributed database with single-leader replication, if the leader fails, one of the followers needs to be promoted to be the new leader.

But the uncertainty about the network makes it difficult to tell whether a node is working or not. In specific circumstances we might get some feedback that explicitly tells us something is not working:

  1. If we can reach the machine on which the node should be running, but no process is listening on the destination port (e.g., because the process crashed; I usually check this with telnet ip port), the operating system will helpfully close or refuse TCP connections by sending a RST or FIN packet in reply (visible as a “connection refused” error; see the sketch after this list). However, if the node crashed while it was handling our request, we have no way of knowing how much data was actually processed by the remote node.
  2. If a node process crashed (or was killed by an admin) but the node’s operating system is still running, another process can notify other nodes about the crash so that another node can take over quickly without having to wait for a timeout to expire. But this doesn’t work if the entire machine goes down or becomes unreachable (e.g., due to a VPN misconfiguration).
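
A rough sketch of that first case, doing a plain TCP connect the way telnet would (the host and port below are made up): a RST from the remote OS surfaces as ConnectionRefusedError, while an unreachable machine surfaces as a timeout, and neither tells us anything about requests that were already in flight.

    import socket


    def probe(host, port, timeout_s=1.0):
        """Rough liveness check, similar in spirit to `telnet host port`."""
        try:
            with socket.create_connection((host, port), timeout=timeout_s):
                return "listening"   # TCP handshake completed
        except ConnectionRefusedError:
            # The remote OS answered with a RST: the machine is up,
            # but no process is listening on that port.
            return "port closed (process likely down, host up)"
        except (socket.timeout, OSError):
            # No answer at all: machine down, network partitioned, or just slow.
            return "unreachable or not responding"


    # print(probe("10.0.0.5", 5432))   # hypothetical database node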

Rapid feedback about a remote node being down is useful, but we cannot count on it. Even if TCP acknowledges that a packet was delivered, the application may have crashed before handling it. If we want to be sure that a request was successful, we need a positive response from the application itself.

If a timeout is the only way of detecting a fault, then how long should the timeout be? This is an NP-hard problem 😜, for which there is no simple answer.

A long timeout means a long wait until a node is declared dead (and during this time, users may have to wait or see error messages). A short timeout detects faults faster, but carries a higher risk of incorrectly declaring a node dead when in fact it has only suffered a temporary slowdown (e.g., due to a load spike on the node or the network).

Prematurely declaring a node dead is problematic: if the node is actually alive and in the middle of performing some action (e.g., waiting on an IO- or network-bound operation), and another node takes over, the action may end up being performed twice. Moreover, when a node is declared dead, its responsibilities need to be transferred to other nodes, which places additional load on the remaining nodes and the network.

If the system is already struggling with high load, declaring nodes dead prematurely (without bringing in extra resources/nodes) can make the problem worse. In particular, it could happen that the node actually wasn’t dead but was only slow to respond due to overload; transferring its load to other nodes can cause a cascading failure (in the worst case, all the nodes declare each other dead, and everything bites the dust).

Let’s imagine a fictitious system with a network medium that guarantees a maximum delay for packets: every packet is either delivered within some time d, or it is lost, but delivery never takes longer than d. Furthermore, let’s assume that a non-failed node always handles a request within some time r. In this case we could say that every successful request receives a response within 2d + r, and if we don’t receive a response within that time, we know that either the network or the remote node is not working. If this were true, 2d + r would be a reasonable timeout to use.
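
In that idealized model the timeout computation is trivial; a sketch, purely to make the bound concrete:

    def timeout_for(max_one_way_delay_s: float, max_processing_time_s: float) -> float:
        """Upper bound on response time in the idealized model: 2d + r."""
        return 2 * max_one_way_delay_s + max_processing_time_s


    # e.g. d = 50 ms and r = 200 ms would make a 300 ms timeout provably safe:
    # print(timeout_for(0.05, 0.2))   # ~0.3 seconds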

Unfortunately, most systems we work with have neither of those guarantees: asynchronous networks have unbounded delays (that is, they try to deliver packets as quickly as possible, but there is no upper limit on the time a packet may take to arrive), and most server implementations cannot guarantee that they handle requests within some maximum time. For example:

Our production HTTP socket timeout for inter-process communication ranges from 2-7s. An API whose response time is measured in seconds would normally be unacceptable, but we’ve found that this setting works best for our use case.

Network congestion and queueing

The variability of packet delays on computer networks is most often due to queueing (similar to the travel time to the office in Bangalore traffic 😜):

  1. If several different nodes simultaneously try to send packets to the same destination, the network switch must queue them up and feed them into the destination network link one by one. On a busy network link, a packet may have to wait until it can get a slot (network congestion). If there is so much incoming data that the switch queue fills up, the packet is dropped, so it needs to be resent, even though the network is functioning properly.
  2. When the packet reaches the destination machine, if all CPU cores are currently busy, the incoming request from the network is queued up by the operating system until the application is ready to handle it. Depending on the load on the machine, this may take an arbitrary length of time.
  3. In virtualized environments, a running operating system is often paused for tens of milliseconds while another virtual machine uses a CPU core. During this time, the VM cannot consume any data from the network, so the incoming data is queued(buffered) by the virtual machine monitor, further increasing the variability of network delays.
  4. TCP itself performs flow control, in which a node limits its own rate of sending in order to avoid overloading a network link or the receiving node. This means additional queueing at the sender before the data even enters the network.

Another point to note is that TCP considers a packet to be lost if it is not acknowledged within some timeout (which is calculated from observed round-trip times), and lost packets are automatically retransmitted. Although the application doesn’t see the packet loss and retransmission, it does see the resulting delay.
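
To give a flavour of how such a timeout can be derived from observed round-trip times, here is a sketch in the spirit of TCP’s retransmission-timeout estimator (RFC 6298); the constants are the standard ones, but the class itself is only an illustration, not the kernel’s implementation:

    class RttEstimator:
        """Adaptive timeout from observed round-trip times (RFC 6298 style)."""

        ALPHA = 1 / 8   # weight of the newest sample in the smoothed RTT
        BETA = 1 / 4    # weight of the newest sample in the deviation estimate

        def __init__(self, first_sample_s: float):
            self.srtt = first_sample_s
            self.rttvar = first_sample_s / 2

        def observe(self, sample_s: float) -> None:
            # Update the deviation first, then the smoothed RTT, as the RFC does.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - sample_s)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * sample_s

        def timeout(self) -> float:
            # Smoothed RTT plus four "deviations" of safety margin,
            # with the one-second floor the RFC recommends.
            return max(1.0, self.srtt + 4 * self.rttvar)


    # est = RttEstimator(0.2)
    # est.observe(0.25); est.observe(0.8)   # a congestion spike widens the timeout
    # print(est.timeout())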

    [Figure: network congestion and queueing]

All of these factors contribute to the variability of network delays. Queueing delays have an especially wide range when a system is close to its maximum capacity: a system with plenty of spare capacity can easily drain its queues, whereas in a highly utilized system, long queues can build up very quickly.
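
A toy way to see that blow-up, assuming an idealized single-server queue with random arrivals (M/M/1), whose mean time in the system is 1 / (service_rate − arrival_rate):

    def mean_time_in_system_s(arrival_rate, service_rate):
        """Mean latency of an M/M/1 queue; it diverges as utilization nears 1."""
        if arrival_rate >= service_rate:
            return float("inf")   # the queue grows without bound
        return 1.0 / (service_rate - arrival_rate)


    # For a node that can serve 1000 requests/second:
    # 50% load -> 2 ms, 90% -> 10 ms, 99% -> 100 ms, 99.9% -> 1 s
    # for load in (500, 900, 990, 999):
    #     print(load, mean_time_in_system_s(load, 1000))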

Synchronous vs Asynchronous networks

Distributed systems would be a lot simpler if we could rely on the network to deliver packets within some fixed maximum delay, and not to drop packets. So why can’t we solve this at the hardware level and make the network reliable, so that the software doesn’t need to worry about it?

To answer this, let’s compare internet/data-center networks to the traditional fixed-line telephone network, which is extremely reliable (Reliance Jio being an exception): delayed audio frames and dropped calls are very rare. A phone call requires a constantly low end-to-end latency and enough bandwidth to transfer the audio samples of our voice. Wouldn’t it be nice to have similar reliability and predictability in computer networks?

When we make a call over the telephone network, a circuit gets established (if you can recall the first chapter’s section on circuit vs packet switching in the Computer Communication Networks (Garcia) book 😜): a fixed, guaranteed amount of bandwidth is allocated for the call along the entire route between the two callers, and this circuit remains in place until the call ends. This kind of network is synchronous: even as data passes through several routers, it doesn’t suffer from queueing, because the space for the call is already reserved in the next hop of the network. And because there is no queueing, the maximum end-to-end latency is fixed; we have a bounded delay.

In data-center networks, on the other hand, there is no fixed resource allocation for the data transfer between the source and the destination, so the delays are nondeterministic: packets of a TCP connection opportunistically use whatever network bandwidth is available.

Why do data-center networks and the internet use packet switching?

The answer is that they are optimised for bursty traffic. A circuit is good for an audio/video call, which needs to transfer a fairly constant number of bits per second for the duration of the call. On the other hand, requesting a web page, sending an email, or transferring a file doesn’t have any particular bandwidth requirement; we just want it to complete as quickly as possible.

If we wanted to transfer a file over a circuit, we would have to guess the bandwidth allocation. If we guess too low, the transfer is unnecessarily slow, leaving network capacity unused. If we guess too high, the circuit cannot be set up (because the network cannot allow a circuit to be established if its bandwidth allocation cannot be guaranteed). Thus, using circuits for bursty data transfers wastes network capacity and makes transfers unnecessarily slow.

Conclusion

The network plays a very important role in distributed systems, so it’s essential to consider the wide range of problems the network can cause, add proper handling for them in the software, and test that handling rigorously in a staging environment, thus enhancing the resiliency and robustness of the system as a whole.

Kumar D

Software Developer. Tech Enthusiast. Loves coding 💻 and music 🎼.
