78

In a nutshell, should we design death into our programs, processes, and threads at a low level, for the good of the overall system?

Failures happen. Processes die. We plan for disaster and occasionally recover from it. But we rarely design and implement unpredictable program death. We hope that our services' uptimes are as long as we care to keep them running.

A macro-level example of this concept is Netflix's Chaos Monkey, which randomly terminates AWS instances in certain scenarios. They claim it has helped them discover problems and build more redundant systems.

What I'm talking about is lower level. The idea is for traditionally long-running processes to randomly exit. This should force redundancy into the design and ultimately produce more resilient systems.
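
To make this concrete, here is a minimal sketch of the idea (my own illustration only, assuming a POSIX system and Python; the pool size, death probability, and function names are invented for this example). Each worker carries a small chance of exiting on any iteration, so the surrounding design is forced to keep the pool replenished:

```python
import os
import random
import time

# Illustrative values, not recommendations.
WORKER_COUNT = 4          # size of the redundant worker pool
DEATH_PROBABILITY = 0.01  # chance a worker exits on any given iteration

def do_unit_of_work():
    pass                             # stand-in; a real worker would do something here

def worker_loop():
    """Do real work, but exit at random so the design must tolerate loss."""
    random.seed()                    # reseed so forked workers don't share the parent's PRNG state
    while True:
        do_unit_of_work()
        if random.random() < DEATH_PROBABILITY:
            os._exit(0)              # deliberate, clean random death
        time.sleep(0.1)

def spawn_worker():
    pid = os.fork()
    if pid == 0:                     # child: become a worker
        worker_loop()
        os._exit(0)
    return pid                       # parent: remember the child's pid

def supervisor():
    """Keep WORKER_COUNT workers alive, replacing any that die."""
    workers = {spawn_worker() for _ in range(WORKER_COUNT)}
    while True:
        pid, _status = os.wait()     # blocks until some worker exits
        workers.discard(pid)
        workers.add(spawn_worker())  # redundancy: always refill the pool

if __name__ == "__main__":
    supervisor()
```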

Does this concept already have a name? Is it already being used in the industry?

EDIT

Based on the comments and answers, I'm afraid I wasn't clear in my question. For clarity:

  • yes, I do mean randomly,
  • yes, I do mean in production, and
  • no, not just for testing.

To explain, I'd like to draw an analogy to multicellular organisms.

In nature, organisms consist of many cells. The cells fork themselves to create redundancy, and they eventually die. But there should always be enough cells of the right kinds for the organism to function. This highly redundant system also facilitates healing when injured. The cells die so the organism lives.

Incorporating random death into a program would force the greater system to adopt redundancy strategies to remain viable. Would these same strategies help the system remain stable in the face of other kinds of unpredictable failure?

And, if anyone has tried this, what is it called? I'd like to read more about it if it already exists.

jimbo
  • 861
  • 13
    I don't have anything useful to contribute as an answer, but this is definitely an interesting question. It would certainly force a programmer to write a decent component architecture that (correctly) copes with random component failures if those failures were guaranteed by the nature of the components themselves. – Tom W Jun 22 '13 at 15:28
  • 1
    If I understand correctly, this may be slightly related: http://en.wikipedia.org/wiki/Mutation_testing . While mutation testing helps harden your tests, I think you're looking for a randomness based approach to help harden your code. – MetaFight Jun 22 '13 at 15:43
  • Thanks @TomW. I know Erlang's OTP does a lot with Actors and Supervisors to manage processes. But I haven't heard of programs dying randomly on purpose. Hoping someone has already seen or tried this so I can read about it. – jimbo Jun 22 '13 at 15:45
  • 10
    Actually, this concept is as old as computing, it is used in every program, and of course it has a name: it is called: bugs. – mouviciel Jun 22 '13 at 17:50
  • It's not clear whether this question is about deploying the system this way or whether it is just a mode of operation for testing. – Kaz Jun 22 '13 at 19:15
  • 3
    You wouldn't call a communication protocol implementation tested if you didn't test it over an unreliable network, which has to be simulated, since your equipment is reliable. – Kaz Jun 22 '13 at 19:16
  • In production? Sure, why not? – Erik Reppen Jun 22 '13 at 19:57
  • 5
    Microsoft has tried it for a while; they call it by the codename "Windows". Whether it has produced better strategies is debatable... it might have just produced lowered expectations instead. –  Jun 25 '13 at 00:13
  • Why not make them randomly kill others? – Tulains Córdova Jul 10 '13 at 11:54
  • Or better still, to randomly kill their authors? ("Sorry Dave ...") – Stephen C Jul 28 '13 at 06:03

16 Answers

60

No.

We should design proper bad-path handling, and design test cases (and other process improvements) to validate that programs handle these exceptional conditions well. Stuff like Chaos Monkey can be part of that, but as soon as you make "must randomly crash" a requirement, actual random crashes become things testers cannot file as bugs.
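
As one illustration of exercising the bad path deliberately in a test rather than shipping random crashes (a sketch only; the TinySupervisor class is invented for this example and assumes a POSIX system with Python 3):

```python
import os
import signal
import subprocess
import sys
import unittest

class TinySupervisor:
    """Minimal illustrative supervisor: keeps one worker alive, restarting it on death."""

    def __init__(self):
        self.worker = None

    def start(self):
        # The worker here is just a sleeping Python child standing in for real work.
        self.worker = subprocess.Popen(
            [sys.executable, "-c", "import time; time.sleep(60)"])

    def heal(self):
        """Restart the worker if it has died; real code would run this in a loop."""
        if self.worker.poll() is not None:
            self.start()

    def stop(self):
        self.worker.kill()
        self.worker.wait()

class CrashRecoveryTest(unittest.TestCase):
    def test_killed_worker_is_replaced(self):
        sup = TinySupervisor()
        sup.start()
        victim_pid = sup.worker.pid
        os.kill(victim_pid, signal.SIGKILL)   # inject the crash deliberately, in the test
        sup.worker.wait(timeout=5)            # let the dead worker be reaped
        sup.heal()
        self.assertIsNone(sup.worker.poll())  # a live replacement exists
        self.assertNotEqual(sup.worker.pid, victim_pid)
        sup.stop()

if __name__ == "__main__":
    unittest.main()
```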

Telastyn
  • 109,398
  • 10
    Thanks @Telastyn. The cause of the crash could factor in here, I think. A purposeful death could have a side effect (log, error code, signal) that distinguishes it from a crash caused by a code failure. – jimbo Jun 22 '13 at 15:49
  • 1
    Even if it helps uncover a weakness, that doesn't mean it is actionable. The risk (likelihood and degree of consequence) of recurrence is a significant factor in whether you do anything with that bug to mitigate future occurrences. It's a long-term-value tool for high-risk systems. – JustinC Jun 22 '13 at 23:24
  • The idea is that even though sub-components crash randomly, the user shouldn't notice. So when a tester reports that one of the random crashes was visible to them, it would mean a failure to contain the sub-component crash, which would be a fileable bug. – Philipp Jul 10 '13 at 13:47
  • 1
    What is proposed is in fact a live test of bad-path handling. Many deployments, and the Netflix example is a case in point, require realistic load testing, which in many cases is only feasible during actual deployment. Programmatic crashes will be very easy to detect with obvious logging -- what is of interest is the collateral damage and effect on interrelated systems. – ctpenrose Jul 27 '13 at 21:06
  • 1
    You can implement a smart random crasher (like Chaos Monkey) which lets you know when a program has randomly crashed. That way you know when you've hit a legitimate crash and when it's a stability testing crash. – Zain R Jul 30 '13 at 00:30
19

The process of introducing defects in software or in hardware in order to test fault tolerance mechanisms is called fault injection.

From Wikipedia:

The technique of fault injection dates back to the 1970s when it was first used to induce faults at a hardware level. This type of fault injection is called Hardware Implemented Fault Injection (HWIFI) and attempts to simulate hardware failures within a system. The first experiments in hardware fault injection involved nothing more than shorting connections on circuit boards and observing the effect on the system (bridging faults). It was used primarily as a test of the dependability of the hardware system. Later specialised hardware was developed to extend this technique, such as devices to bombard specific areas of a circuit board with heavy radiation. It was soon found that faults could be induced by software techniques and that aspects of this technique could be useful for assessing software systems. Collectively these techniques are known as Software Implemented Fault Injection (SWIFI).
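
As a toy, software-level illustration of the SWIFI idea (a sketch only; the decorator, exception class, rate, and example function are all invented for this example, not an established fault-injection library), one can wrap a function so it sometimes raises instead of returning, which exercises the callers' error handling:

```python
import functools
import random

class InjectedFault(RuntimeError):
    """Distinguishes deliberately injected faults from real failures."""

def inject_faults(rate):
    """Wrap a function so it occasionally raises instead of returning (SWIFI-style)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < rate:
                raise InjectedFault(f"injected fault in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator

# Example: callers of fetch_record must now cope with occasional failures.
@inject_faults(rate=0.05)        # illustrative rate, not a recommendation
def fetch_record(key):
    return {"key": key}          # stand-in for a real lookup
```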

mouviciel
  • 15,491
  • It fits as a second level of stress testing. After the contrived stress tests have passed [to a satisfying degree], insert some randomness to ensure that unexpected environment changes aren't catastrophic. It can be valuable when failure is high-risk (in likelihood or severity of consequence). I would not deploy to live until I was very confident in a lab environment, and then only incrementally for the parts I was most confident in. – JustinC Jun 22 '13 at 23:17