25 December 2019/ general

Do's and don'ts of Error handling

This short post focuses on error handling. Before diving into error handling in software systems, we need to know how to identify a fault-tolerant system, because a fault-tolerant system is one where error handling has been done right. A system is deemed fault tolerant if it keeps working (what counts as "working" is very subjective) even when something goes wrong.

Since a software system inherently depends on hardware, there are two types of failures:

Hardware failures: Relatively uncommon. We can reduce their impact by replicating hardware several times. For example, assume we have a hard disk with a failure probability of 10^-3. Adding a second, similar disk does not make either disk more reliable, but if the two fail independently, the probability that both fail at the same time drops to 10^-3 × 10^-3 = 10^-6.
Software failures: Very, very common. There are any number of reasons why a software system can fail.
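
The disk arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `combined_failure_probability` is made up for illustration, and it assumes the replicas fail independently, which real disks in the same rack often do not.

```python
def combined_failure_probability(p_single: float, replicas: int) -> float:
    """Probability that every replica fails, assuming independent failures."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # Independent events: multiply the individual probabilities.
    return p_single ** replicas


if __name__ == "__main__":
    for n in (1, 2, 3):
        p = combined_failure_probability(1e-3, n)
        print(f"{n} disk(s): probability that all fail = {p:.0e}")
```

Each extra replica multiplies the failure probability by another 10^-3, which is why replication is such a cheap way to deal with hardware faults.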

Another thing we should be aware of is that no system is truly (100%) fault tolerant. For example, say the entire earth was taken out by a meteorite. In this scenario a truly fault-tolerant system should continue to work, which would require us to switch to a hot standby hosted in outer space, and that is clearly never gonna happen.

Faults/Errors vs Failures

A fault/error is a state in which the system deviates from the expected behavior. For example: a web API returning a 500 or wrong results.
A failure, on the other hand, is when the entire system is down. For example: a web service that is unavailable during a deployment, under high load, or during a long garbage collection (GC) pause.

Who finds the error

Compiler
Program/Runtime(Virtual Machine)
Developer/Tester

The errors we are primarily concerned with are those found by the run-time.

Fault tolerance cannot be achieved with a single computer, because that computer is a single point of failure and we cannot guarantee that it will never go down. Hence, to achieve fault tolerance we have to use several computers (each of which can itself fail). Using several computers implies that our programs may have one or more of the following properties:

Concurrent
Parallel
Distributed
Message Passing

Among these properties, message passing is unavoidable: it is the basis on which the computers that make up a fault-tolerant system interact with each other.

Building a fault-tolerant system boils down to detecting errors and doing something when errors are detected. Errors can be of the following types:

Errors that can be detected at compile time
Errors that can be detected at run time
Errors that can be inferred
Reproducible errors
Non-Reproducible errors(Works on my system kind of ones :P)

Thus we can adopt the following philosophy

Find methods to show that the software system is correct before it runs (compile-time checks, plus QA, integration tests, unit tests, etc.)
Assume that the software is incorrect and will fail at run time and then do something about it at run time.

Types of Systems

Highly reliable (nuclear power plant control, air-traffic control): very expensive if they fail
Reliable (driverless cars, banks, telephones): moderately expensive if they fail
Dodgy (Netflix, Facebook): we can live with the failure of such systems
No guarantee (free apps): nobody cares if they go down or not

How can we make software that works reasonably well even if there are errors? The requirements are as follows:

Concurrency - Many things should happen at the same time
Error encapsulation - One error shouldn’t take the entire system down
Fault detection - We should be able to detect that an error has occurred
Fault identification - We should be able to identify which fault has occurred
Code upgrade - The system should support code upgrades without downtime
Stable storage - There should be stable storage where the state at the time of the fault is saved, so it can be debugged later

The “method”

Detect all errors and crash (maybe)
If we cannot do what we want to do, then try to do something simpler, and have a hierarchy that encloses these smaller tasks at the leaf level (e.g. supervision trees)

[Figure: Supervision Trees]

Handle errors “remotely” (detect errors and ensure that the system is put into a safe state defined by an invariant)
Identify the “error kernel” (the part of the system which must be correct, e.g. the JVM garbage collector)

What should the run-time do, when it finds an error/fault?

Ignore it (Completely unacceptable)
Try to fix it (Not so reliable)
Dump the state to the log and then crash immediately, so that the supervisor can restart the run-time (the restart shouldn’t take much time). (Works!)
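
The third option can be sketched in Python roughly as follows. All names here (`run_worker`, `handle_request`, `WorkerCrash`) are hypothetical and only for illustration. The worker makes no attempt to repair anything: it logs the state it was in and then crashes, leaving the restart to whoever supervises it.

```python
import json
import logging

logger = logging.getLogger("worker")


class WorkerCrash(Exception):
    """Raised once the worker has logged its state and given up."""


def handle_request(state: dict, request: dict) -> dict:
    # Hypothetical business logic; it blows up on a zero divisor.
    state["processed"] += 1
    return {"result": 100 / request["divisor"]}


def run_worker(state: dict, requests: list) -> list:
    results = []
    for request in requests:
        try:
            results.append(handle_request(state, request))
        except Exception as exc:
            # Dump what we know for later debugging, then crash
            # instead of guessing at a fix.
            logger.error(
                "crashing: state=%s request=%s error=%r",
                json.dumps(state), json.dumps(request), exc,
            )
            raise WorkerCrash("worker crashed; state was logged") from exc
    return results
```

The important design choice is that the `except` block does not swallow the error. It records enough context to debug later and then re-raises, so the caller always finds out that the worker died.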

What should the developer do when they do not know how to handle an error/fault?

Ignore it (and get fired :P)
Log it (Yes definitely)
Try to fix it (possibly, but we don’t wanna make matters worse)
Crash the process immediately to allow a swift restart. (This will not help in sequential environments with a single thread of execution, e.g. Node.js, where crashing that one thread takes the whole program down.)

So why is concurrency such a big deal?

Following is a sequential program with 1 process

[Figure: Sequential Running Program]

Following is a crashed sequential program before the restart

[Figure: Sequential Dead Program]

Following is a program with fine-grained parallel/concurrent processes

[Figure: Parallel Running Program]

Following is a program with 1 concurrent process which has crashed

[Figure: Parallel Program with 1 process crash]

Now do you see the difference in the parallel/concurrent program? Probably not at first glance: you have to look closely to spot the crashed process, but the program as a whole isn’t affected.
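
Here is a small Python sketch of the same idea: several tasks run concurrently, one of them crashes, and the rest finish normally. The names `task` and `run_all` are made up for this example. Note that it uses threads from `concurrent.futures`, which share memory, so it only illustrates the behaviour; it is not the isolated, lightweight processes that a run-time built for fault tolerance would give you.

```python
from concurrent.futures import ThreadPoolExecutor


def task(n: int) -> int:
    # Hypothetical unit of work; task 3 always crashes.
    if n == 3:
        raise RuntimeError(f"task {n} crashed")
    return n * n


def run_all(inputs) -> dict:
    """Run every task concurrently and record how each one ended."""
    outcomes = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {n: pool.submit(task, n) for n in inputs}
        for n, future in futures.items():
            error = future.exception()  # waits for this task to finish
            if error is not None:
                outcomes[n] = ("crashed", repr(error))
            else:
                outcomes[n] = ("ok", future.result())
    return outcomes


if __name__ == "__main__":
    for n, outcome in run_all(range(1, 6)).items():
        print(n, outcome)
```

The crash of task 3 shows up as one entry in `outcomes`, and the other four tasks still deliver their results.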

What happens after we find the process which has crashed? We could have this process registered as a leaf node in a supervision tree, where its parent node monitors it. When the process crashes, the parent process gets notified about the crash. The parent can then either spawn a replacement process or let the system continue executing in its current state.
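
A minimal sketch of such a parent, again in Python and with hypothetical names (`supervise`, `make_flaky_child`), could look like this. It assumes the child is just a callable run in the same thread, which is far simpler than a real supervision tree with separate processes, but the restart-or-escalate decision has the same shape.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def supervise(start_child: Callable[[], T], max_restarts: int = 3) -> T:
    """Run the child; restart it after a crash, escalate after too many."""
    restarts = 0
    while True:
        try:
            return start_child()
        except Exception as exc:
            if restarts >= max_restarts:
                # Give up and let our own parent in the tree decide.
                raise RuntimeError("child kept crashing; escalating") from exc
            restarts += 1
            print(f"child crashed with {exc!r}; restart {restarts}/{max_restarts}")


def make_flaky_child(failures_before_success: int) -> Callable[[], str]:
    """Build a child that crashes a fixed number of times, then succeeds."""
    attempts = {"count": 0}

    def child() -> str:
        attempts["count"] += 1
        if attempts["count"] <= failures_before_success:
            raise ConnectionError(f"attempt {attempts['count']} failed")
        return f"succeeded on attempt {attempts['count']}"

    return child


if __name__ == "__main__":
    print(supervise(make_flaky_child(2)))
```

Capping the number of restarts matters: a child that crashes every time would otherwise be restarted forever, so past the limit the supervisor crashes too and hands the problem one level up the tree.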

What else do we get alongside error handling using the above approach?

Scalability (the monster in the room)
Security (if the individual processes are isolated)
Better resource utilization (if the processes are very lightweight: sand fills up a glass jar more efficiently than pebbles do, and the pebbles here are heavyweight processes)
