In systems programming (and application programming, and life in general) there are two ways to deal with potential problems: you can try to avoid them, or you can handle them intelligently. When we walk across the street, we look both ways: this is simple avoidance. It’s easier to look both ways than to have to intelligently handle the oncoming car. Hopefully, though, we will intelligently handle the oncoming car if our avoidance strategy fails. We will jump out of the way, or throw a brick at the windshield, whatever. It continues to boggle my mind how far people will go to avoid the most random corner cases while failing to intelligently handle even the most obvious exceptions, even ones that continuously rear their heads.
Avoidance scenarios in code are often completely unnecessary and several orders of magnitude more complex than simply handling the inevitable exception. The simplest solution is almost always the best, and always the best when you can’t rely on your avoidance strategy to begin with! If your program calls some code and expects to get back a 0 or 1, what happens when it gets a 2? What happens when some field in a database you flagged as never null (an avoidance) is null for some records?
Avoidance is about assumptions. A lot of programmers love assumptions. They code for them often, sometimes even writing comments in their code like “the blahblah field is set to ‘never null’ so we can assume that to be true”. Really? That holds until an admin turns off constraints and bulk loads some “bad” data. Instead of writing the cheeky comment making the assumption, you could have written a one-line handler to Do The Right Thing. If you toss what you think you know out the window, program with the facts in mind, and handle failure cases elegantly, you end up with a durable system.
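Here is a minimal Python sketch of what that kind of handler can look like for the two cases above. The function names, the "name" field, and the labels are all made up for illustration; the only point is that the "impossible" value gets a defined outcome instead of an assumption.

```python
def describe_status(code):
    """Map a status code to a label, including codes we 'never' expect."""
    if code == 0:
        return "ok"
    if code == 1:
        return "failed"
    # The handler: a 2 (or None, or anything else) is reported, not assumed away.
    return f"unknown status {code!r}"


def display_name(record):
    """Return a usable name even when the 'never null' field is null or missing."""
    name = record.get("name")  # "name" is a hypothetical field
    return name if name else "(no name)"
```

Whether the right response is a placeholder, a log entry, or a loud failure depends on the system. What matters is that the choice is made on purpose.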
Say your application sends a file from one system to another every day. Usually it works, sometimes it doesn’t. When it doesn’t, what happens? Nothing, because you never coded for that eventuality. Why would you? It’s not your fault if the other side doesn’t work, after all. Your code is perfect. Handling this intelligently takes two lines of code: after you send the file, check that the file is actually on the other end, and retry if it’s not (or send yourself an e-mail, or write something to a log… SOMETHING). I can list over 60 different reasons that file may not be on the other side. You can add mitigators for those 60 reasons (plus the other 60 I didn’t bother thinking of), or you can intelligently handle the problem with two lines of code (actually, zero extra lines if you’re elegant).
SendFile(fileName,fromHere,toThere)
could be replaced with:
until (FileExists(toThere + "/" + fileName)) { SendFile(fileName, fromHere, toThere); }
That’s pseudo-code (‘until( .. )’ is the same as ‘while(not .. )’, if you’re using a primitive language) but it will work semantically the same in at least 16 different programming languages. This doesn’t mean you shouldn’t try to figure out why it’s failing and perhaps fix something that’s broken along the way, but your code doesn’t need that complexity; it just needs to Do The Right Thing.
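For the curious, here is one way the loop might look in Python. It is a sketch, not a drop-in: `send` and `delivered` are hypothetical callables standing in for SendFile and FileExists, and the attempt cap and delay are my own additions so that a permanently dead remote end raises an error instead of looping forever.

```python
import time


def send_until_delivered(send, delivered, max_attempts=5, delay_seconds=0.0):
    """Call send() until delivered() is true; return how many sends it took.

    Raises RuntimeError if the file is still missing after max_attempts sends.
    """
    attempts = 0
    while not delivered():
        if attempts >= max_attempts:
            raise RuntimeError(f"not delivered after {attempts} attempts")
        send()
        attempts += 1
        if delay_seconds:
            time.sleep(delay_seconds)  # give the other side a moment
    return attempts
```

Like the pseudo-code, it sends nothing at all if the file is already on the other side, and the raised error is your hook for the e-mail or the log entry.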
Another classic example is how one handles multiple instances. Some software runs fine with multiple copies (like your web browser), some behaves badly (like your e-mail client), and some behaves even worse. Frequently, even when programmers know that running two or more instances at the same time is very, very bad (will-destroy-data bad), they don’t handle that event, they avoid it. They say “well, the code runs via a scheduler, and there’s enough time in between runs that it should be done”. Should. You may destroy data and cause more work, for a “should”. The handler is a 2-line fix:
use Fcntl qw(:flock);   # provides LOCK_EX and LOCK_NB
open ME, "<$0" or exit;
flock ME, LOCK_EX | LOCK_NB or exit;
Lock yourself, and exit if you can’t get a lock. If locking yourself seems too conceptually scary, then pick a lock file (that’s what the /var/lock directory is for, by the way) and lock on that. The code is Perl, but the concept will work in at least a dozen different languages, and is bullet-proof (assuming your host OS supports flock).
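The lock-file variant might look like this in Python. Treat it as a sketch under assumptions: it relies on the standard `fcntl` module (Unix only), the function name is mine, and the path you pass in is entirely up to you.

```python
import fcntl


def acquire_single_instance_lock(lock_path):
    """Return an open, locked file object, or None if someone else holds the lock."""
    handle = open(lock_path, "a")  # creates the lock file if it is missing
    try:
        # Exclusive and non-blocking: fail right away instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle  # keep a reference; closing the file releases the lock
```

At the top of the job, call it with your chosen lock path and exit if it returns None. The OS drops the lock when the process ends, so a crashed run does not leave a stale lock behind.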
There are a lot of reasons people don’t write durable programs, and I don’t pretend to do it all of the time, either. Laziness and ignorance are probably tied for first, followed closely by apathy. If your system is “critical” (to you, or to someone else) and it generates work whenever it doesn’t function, there is almost always a 1-to-2 line solution to help it survive, or at the very least, not do the wrong thing.