78

In a nutshell, should we design death into our programs, processes, and threads at a low level, for the good of the overall system?

Failures happen. Processes die. We plan for disaster and occasionally recover from it. But we rarely design and implement unpredictable program death. We hope that our services' uptimes are as long as we care to keep them running.

A macro-level example of this concept is Netflix's Chaos Monkey, which randomly terminates AWS instances in certain scenarios. They claim it has helped them discover problems and build more redundant systems.

What I'm talking about is lower level. The idea is for traditionally long-running processes to randomly exit. This should force redundancy into the design and ultimately produce more resilient systems.
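
To make this concrete, here is a minimal sketch of the idea (my own illustration only, assuming a POSIX system and Python; the pool size, death probability, and function names are invented for this example). Each worker carries a small chance of exiting on any iteration, so the surrounding design is forced to keep the pool replenished:

```python
import os
import random
import time

# Illustrative values, not recommendations.
WORKER_COUNT = 4          # size of the redundant worker pool
DEATH_PROBABILITY = 0.01  # chance a worker exits on any given iteration

def do_unit_of_work():
    pass                             # stand-in; a real worker would do something here

def worker_loop():
    """Do real work, but exit at random so the design must tolerate loss."""
    random.seed()                    # reseed so forked workers don't share the parent's PRNG state
    while True:
        do_unit_of_work()
        if random.random() < DEATH_PROBABILITY:
            os._exit(0)              # deliberate, clean random death
        time.sleep(0.1)

def spawn_worker():
    pid = os.fork()
    if pid == 0:                     # child: become a worker
        worker_loop()
        os._exit(0)
    return pid                       # parent: remember the child's pid

def supervisor():
    """Keep WORKER_COUNT workers alive, replacing any that die."""
    workers = {spawn_worker() for _ in range(WORKER_COUNT)}
    while True:
        pid, _status = os.wait()     # blocks until some worker exits
        workers.discard(pid)
        workers.add(spawn_worker())  # redundancy: always refill the pool

if __name__ == "__main__":
    supervisor()
```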

Does this concept already have a name? Is it already being used in the industry?

EDIT

Based on the comments and answers, I'm afraid I wasn't clear in my question. For clarity:

  • yes, I do mean randomly,
  • yes, I do mean in production, and
  • no, not just for testing.

To explain, I'd like to draw an analogy to multicellular organisms.

In nature, organisms consist of many cells. The cells fork themselves to create redundancy, and they eventually die. But there should always be enough cells of the right kinds for the organism to function. This highly redundant system also facilitates healing when injured. The cells die so the organism lives.

Incorporating random death into a program would force the greater system to adopt redundancy strategies to remain viable. Would these same strategies help the system remain stable in the face of other kinds of unpredictable failure?

And, if anyone has tried this, what is it called? I'd like to read more about it if it already exists.

jimbo
  • 861
  • 13
    I don't have anything useful to contribute as an answer, but this is definitely an interesting question. It would certainly force a programmer to write a decent component architecture that (correctly) copes with random component failures if those failures were guaranteed by the nature of the components themselves. – Tom W Jun 22 '13 at 15:28
  • 1
    If I understand correctly, this may be slightly related: http://en.wikipedia.org/wiki/Mutation_testing . While mutation testing helps harden your tests, I think you're looking for a randomness based approach to help harden your code. – MetaFight Jun 22 '13 at 15:43
  • Thanks @TomW. I know Erlang's OTP does a lot with Actors and Supervisors to manage processes. But I haven't heard of programs dying randomly on purpose. Hoping someone has already seen or tried this so I can read about it. – jimbo Jun 22 '13 at 15:45
  • 10
    Actually, this concept is as old as computing, it is used in every program, and of course it has a name: it is called: bugs. – mouviciel Jun 22 '13 at 17:50
  • It's not clear whether this question is about deploying the system this way or whether it is just a mode of operation for testing. – Kaz Jun 22 '13 at 19:15
  • 3
    You wouldn't call a communication protocol implementation tested if you didn't test it over an unreliable network, which has to be simulated, since your equipment is reliable. – Kaz Jun 22 '13 at 19:16
  • In production? Sure, why not? – Erik Reppen Jun 22 '13 at 19:57
  • 5
    Microsoft has tried it for a while; they call it by the codename "Windows". Whether it has produced better strategies is debatable... it might have just produced lowered expectations instead. –  Jun 25 '13 at 00:13
  • Why not make them randomly kill others? – Tulains Córdova Jul 10 '13 at 11:54
  • Or better still, to randomly kill their authors? ("Sorry Dave ...") – Stephen C Jul 28 '13 at 06:03

16 Answers

60

No.

We should design proper bad-path handling, and design test cases (and other process improvements) to validate that programs handle these exceptional conditions well. Stuff like Chaos Monkey can be part of that, but as soon as you make "must randomly crash" a requirement, actual random crashes become things testers cannot file as bugs.
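
As one illustration of exercising the bad path deliberately in a test rather than shipping random crashes (a sketch only; the TinySupervisor class is invented for this example and assumes a POSIX system with Python 3):

```python
import os
import signal
import subprocess
import sys
import unittest

class TinySupervisor:
    """Minimal illustrative supervisor: keeps one worker alive, restarting it on death."""

    def __init__(self):
        self.worker = None

    def start(self):
        # The worker here is just a sleeping Python child standing in for real work.
        self.worker = subprocess.Popen(
            [sys.executable, "-c", "import time; time.sleep(60)"])

    def heal(self):
        """Restart the worker if it has died; real code would run this in a loop."""
        if self.worker.poll() is not None:
            self.start()

    def stop(self):
        self.worker.kill()
        self.worker.wait()

class CrashRecoveryTest(unittest.TestCase):
    def test_killed_worker_is_replaced(self):
        sup = TinySupervisor()
        sup.start()
        victim_pid = sup.worker.pid
        os.kill(victim_pid, signal.SIGKILL)   # inject the crash deliberately, in the test
        sup.worker.wait(timeout=5)            # let the dead worker be reaped
        sup.heal()
        self.assertIsNone(sup.worker.poll())  # a live replacement exists
        self.assertNotEqual(sup.worker.pid, victim_pid)
        sup.stop()

if __name__ == "__main__":
    unittest.main()
```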

Telastyn
  • 109,398
  • 10
    Thanks @Telastyn. The cause of the crash could factor in here, I think. A purposeful death could have a side effect (log, error code, signal) that distinguishes it from a crash caused by a code failure. – jimbo Jun 22 '13 at 15:49
  • 1
    Even if it helps uncover a weakness, that doesn't mean it is actionable. The risk (likelihood and degree of consequence) of recurrence is a significant factor in whether you do anything with that bug to mitigate future occurrences. It's a long-term-value tool for high-risk systems. – JustinC Jun 22 '13 at 23:24
  • The idea is that even though sub-components crash randomly, the user shouldn't notice. So when a tester reports that one of the random crashes was visible to them, it would mean a failure to contain the sub-component crash, which would be a fileable bug. – Philipp Jul 10 '13 at 13:47
  • 1
    What is proposed is in fact a live test of bad-path handling. Many deployments, and the Netflix example is a case in point, require realistic load testing, which in many cases is only feasible during actual deployment. Programmatic crashes will be very easy to detect with obvious logging -- what is of interest is the collateral damage and effect on interrelated systems. – ctpenrose Jul 27 '13 at 21:06
  • 1
    You can implement a smart random crasher (like Chaos Monkey) which lets you know when a program has randomly crashed. That way you know when you've hit a legitimate crash and when it's a stability testing crash. – Zain R Jul 30 '13 at 00:30
19

The process of introducing defects in software or in hardware in order to test fault tolerance mechanisms is called fault injection.

From Wikipedia:

The technique of fault injection dates back to the 1970s when it was first used to induce faults at a hardware level. This type of fault injection is called Hardware Implemented Fault Injection (HWIFI) and attempts to simulate hardware failures within a system. The first experiments in hardware fault injection involved nothing more than shorting connections on circuit boards and observing the effect on the system (bridging faults). It was used primarily as a test of the dependability of the hardware system. Later specialised hardware was developed to extend this technique, such as devices to bombard specific areas of a circuit board with heavy radiation. It was soon found that faults could be induced by software techniques and that aspects of this technique could be useful for assessing software systems. Collectively these techniques are known as Software Implemented Fault Injection (SWIFI).
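
As a toy, software-level illustration of the SWIFI idea (a sketch only; the decorator, exception class, rate, and example function are all invented for this example, not an established fault-injection library), one can wrap a function so it sometimes raises instead of returning, which exercises the callers' error handling:

```python
import functools
import random

class InjectedFault(RuntimeError):
    """Distinguishes deliberately injected faults from real failures."""

def inject_faults(rate):
    """Wrap a function so it occasionally raises instead of returning (SWIFI-style)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < rate:
                raise InjectedFault(f"injected fault in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator

# Example: callers of fetch_record must now cope with occasional failures.
@inject_faults(rate=0.05)        # illustrative rate, not a recommendation
def fetch_record(key):
    return {"key": key}          # stand-in for a real lookup
```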

mouviciel
  • 15,491
  • It fits as a second level of stress testing. After the contrived stress tests have passed [to a satisfying degree], insert some randomness to ensure that unexpected environment changes aren't catastrophic. It can be valuable when failure is high-risk (in likelihood or severity of consequence). I would not deploy to live until I was very confident in a lab environment, and then only incrementally for the parts I was most confident in. – JustinC Jun 22 '13 at 23:17