Do's and don'ts of Error handling
This short post is mainly about error handling. Before diving into error handling in software systems, we need to know how to identify a fault tolerant system, because a fault tolerant system is one where error handling has been done right. A system is deemed fault tolerant if it continues to work (what counts as “working” is very subjective) even when something goes wrong.
Given that a software system is inherently dependent on hardware, there are two types of failures:
- Hardware failures: Relatively uncommon. We cannot eliminate them, but we can make them far less likely by replicating hardware. For example, assume a hard disk has a failure probability of 10^-3. Adding a second, similar disk that fails independently of the first brings the probability of both failing at the same time down to 10^-3 × 10^-3 = 10^-6.
- Software failures: Very common. There are any number of reasons why a software system can fail.
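The replication arithmetic above can be sketched in a few lines. This is only an illustration: the function name `combined_failure_probability` is made up for this post, and it assumes the replicas fail independently of each other, which real disks in the same rack may not do.

```python
def combined_failure_probability(p: float, replicas: int) -> float:
    """Probability that every replica fails at once.

    Assumes each replica fails independently with the same probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return p ** replicas


one_disk = combined_failure_probability(1e-3, 1)     # a single disk
two_disks = combined_failure_probability(1e-3, 2)    # the example from the text
three_disks = combined_failure_probability(1e-3, 3)  # one more replica
```

Each extra replica multiplies the combined failure probability by p again, which is why replication is so effective against hardware failures.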
Another thing to be aware of is that no system is truly (100%) fault tolerant. For example, say the entire earth was taken out by a meteorite. A truly fault tolerant system should continue to work in this scenario, which would require switching to a hot standby hosted in outer space, and that is clearly never gonna happen.
Faults/Errors vs Failures
- A fault/error is a state in which the system deviates from the expected behavior. For example: a web API returning a 500 or wrong results.
- A failure, on the other hand, is when the entire system is down. For example: a web service that is down during a deployment, or due to high load, GC pauses, and so on.
Who finds the error
- Compiler
- Program/Runtime(Virtual Machine)
- Developer/Tester
The errors we’ll be primarily concerned with are those found by the run-time.
Fault tolerance cannot be achieved with a single computer, because that computer is a single point of failure and we cannot guarantee that it will never go down. Hence, to achieve fault tolerance we have to use several computers, each of which can fail. Using several computers implies that our programs will have one or more of the following properties:
- Concurrent
- Parallel
- Distributed
- Message Passing
Among the above properties, message passing is unavoidable: it is the basis on which the computers that make up a fault tolerant system interact with each other.
Building a fault tolerant system boils down to detecting errors and doing something when they are detected. Errors can be of the following types:
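To make message passing a little more concrete, here is a minimal sketch. It is an assumption-heavy stand-in: two threads in one Python process play the role of two computers, `queue.Queue` plays the role of the network, and the `doubler` worker and the `None` stop message are invented for this example. The point is only that the worker shares no state with its caller and communicates purely through messages.

```python
import queue
import threading


def doubler(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Worker that owns no shared state: it only receives and sends messages."""
    while True:
        message = inbox.get()
        if message is None:  # hypothetical "stop" message
            break
        outbox.put(("result", message * 2))


inbox: queue.Queue = queue.Queue()
outbox: queue.Queue = queue.Queue()

worker = threading.Thread(target=doubler, args=(inbox, outbox))
worker.start()

for n in (1, 2, 3):
    inbox.put(n)   # send work as messages
inbox.put(None)    # ask the worker to stop
worker.join()

# Collect the three replies the worker sent back.
results = [outbox.get() for _ in range(3)]
```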
- Errors that can be detected at compile time
- Errors that can be detected at run time
- Errors that can be inferred
- Reproducible errors
- Non-reproducible errors (the “works on my system” kind :P)
Thus we can adopt the following philosophy:
- Find methods to show that the software system is correct before it runs (compile-time checks, QA, integration tests, unit tests, etc.)
- Assume that the software is incorrect and will fail at run time, and then do something about it at run time.
Types of Systems
- Highly reliable (nuclear power plant control, air-traffic control): very expensive if they fail
- Reliable (driverless cars, banks, telephones): moderately expensive if they fail
- Dodgy (Netflix, Facebook): we can live with the failure of such systems
- No guarantee (free apps): nobody cares whether they go down or not
How can we make software that works reasonably well even if there are errors? The requirements are as follows:
- Concurrency - Many things should happen at the same time
- Error encapsulation - One error shouldn’t take the entire system down
- Fault detection - Should be able to detect that an error has occurred
- Fault identification - Identify which fault has occurred
- Code upgrade - Should support code upgrade without downtime
- Stable storage - Should have stable storage where the state at the time of the fault is kept for debugging later
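A small sketch of the first four requirements working together, under stated assumptions: the `handle` function and the job numbers are hypothetical, and a thread pool stands in for truly isolated processes. Several jobs run concurrently, one of them hits a fault, and the fault is detected and identified without taking the other jobs down.

```python
from concurrent.futures import ThreadPoolExecutor


def handle(job: int) -> int:
    """Hypothetical unit of work; job 3 simulates a fault."""
    if job == 3:
        raise ValueError("simulated fault in job 3")
    return job * 10


def run_isolated(jobs):
    """Run every job concurrently and keep faults separate from results."""
    results, faults = {}, {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {job: pool.submit(handle, job) for job in jobs}
        for job, future in futures.items():
            error = future.exception()  # fault detection (waits for the job)
            if error is None:
                results[job] = future.result()
            else:
                faults[job] = error     # fault identification
    return results, faults


results, faults = run_isolated([1, 2, 3, 4])
```

Here `results` holds the output of the healthy jobs and `faults` maps the failing job to the exception it raised, so one error stays encapsulated in one job.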
The “method”
- Detect all errors and crash (maybe)
- If we cannot do what we want to do, then try to do something simpler, and arrange these smaller tasks as the leaves of a hierarchy that encloses them (ex: supervision trees)
- Handle errors “remotely” (detect errors and ensure that the system is put into a safe state defined by an invariant)
- Identify the “error kernel” (the part of the system that must be correct, for ex: the JVM garbage collector)
What should the run-time do when it finds an error/fault?
- Ignore it (Completely unacceptable)
- Try to fix it (Not so reliable)
- Dump the state to the log and then crash immediately so that the supervisor can restart the runtime, which shouldn’t take much time. (Works!)
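The third option can be sketched roughly like this. Everything here is illustrative: `price_order` and its `quantity` and `unit_price` fields are made-up names, and Python's standard `logging` module stands in for whatever log the real system uses. The important part is that the handler does not try to repair anything. It records the state and lets the error propagate so that whoever supervises this code can restart it.

```python
import json
import logging

logger = logging.getLogger("worker")


def price_order(state: dict) -> dict:
    """Hypothetical task: compute an order total or crash loudly."""
    try:
        total = state["quantity"] * state["unit_price"]
    except Exception as exc:
        # Dump the state for later debugging, then crash immediately:
        # no attempt is made to guess a total or patch the input.
        logger.error("crashing, state=%s error=%r",
                     json.dumps(state, default=str), exc)
        raise
    return {**state, "total": total}


good = price_order({"quantity": 2, "unit_price": 5})
```

Calling `price_order({"quantity": 2})` logs the offending state and re-raises the `KeyError` instead of returning a half-correct result.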
What should the developer do when they do not know what to do about an error/fault?
- Ignore it (and get fired :P)
- Log it (yes, definitely)
- Try to fix it (possibly, but we don’t want to make matters worse)
- Crash the process immediately to facilitate a swift restart. (This will not help in sequential runtimes with a single thread of execution, for ex: Node.js)
So why is there such a big deal about concurrency?
- [Figure: a sequential program with 1 process]
- [Figure: a crashed sequential program before the restart]
- [Figure: a program with fine-grained parallel/concurrent processes]
- [Figure: a program with 1 concurrent process which has crashed]
Now do you see the difference in the parallel/concurrent program? Probably not at first glance: you have to pay some attention to spot the crashed process, but the program as a whole isn’t affected.
What happens after we find the process that has crashed? We could have this process registered as a leaf node in a supervision tree, where the parent node monitors it. When the process crashes, the parent gets notified about the crash. The parent can then either spawn another process or let the system continue executing in its current state.
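Here is a minimal sketch of that parent/child relationship. It is a toy, not a real supervision tree: the `Supervisor` class, `max_restarts`, and `flaky_task` are names invented for this post, a thread stands in for an isolated process, and there is only one child and one restart strategy (spawn a replacement).

```python
import threading
from typing import Callable, List


class Supervisor:
    """Runs one child task in a thread and restarts it when it crashes."""

    def __init__(self, task: Callable[[], None], max_restarts: int = 3) -> None:
        self.task = task
        self.max_restarts = max_restarts
        self.crashes: List[Exception] = []

    def _run_child(self) -> None:
        # Recording the exception stands in for the crash notification
        # that the parent would receive from a monitored process.
        try:
            self.task()
        except Exception as exc:
            self.crashes.append(exc)

    def supervise(self) -> bool:
        """Return True if the child eventually finished, False if we gave up."""
        for _ in range(self.max_restarts + 1):
            crashes_before = len(self.crashes)
            child = threading.Thread(target=self._run_child)
            child.start()
            child.join()
            if len(self.crashes) == crashes_before:
                return True  # child finished without crashing
            # Otherwise loop around and spawn a replacement child.
        return False


attempts = {"count": 0}


def flaky_task() -> None:
    """Hypothetical child that crashes twice before succeeding."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated crash")


supervisor = Supervisor(flaky_task, max_restarts=5)
finished = supervisor.supervise()
```

The child never tries to recover on its own. It simply crashes, and the decision about what to do next lives entirely in the parent.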
What else do we get alongside error handling with the above approach?
- Scalability (the monster in the room)
- Security (if the individual processes are isolated)
- Better resource utilization (if the processes are very lightweight: sand fills a glass jar more efficiently than pebbles do, and heavyweight processes are the pebbles)