How chaotic is chaos engineering really?

When you binge-watch the latest Netflix shows tonight, you’ll be enjoying the fruits of a lot of chaotic software testing.

In fact, Netflix is not only the birthplace of streaming hits like Stranger Things and Bird Box, but it is also the birthplace of the discipline of chaos engineering.

Chaos engineering sounds like a freaky combination of mutually opposing forces, but it has become an increasingly important approach to improving complex modern technology architectures.

The origin of chaos engineering was in Netflix’s original online DVD rental business. A single database corruption caused a major systems outage that meant no DVDs could be shipped for three days.

This led Netflix engineers to migrate from a monolithic on-premises software stack to a distributed cloud-based architecture running on Amazon Web Services.

While a distributed architecture and hundreds of microservices eliminated the single point of failure, they created a much more complex system to manage and look after. This led to a counterintuitive realisation: to avoid failure, the Netflix engineering team had to experience failure constantly.

The result was Chaos Monkey, a tool created by Netflix’s software engineers to roam across its complex architecture and proactively cause failures in random places at random intervals throughout their systems. Through Chaos Monkey’s antics, the team quickly learnt whether or not the services they built were robust and resilient enough to survive unplanned failures.

And so was chaos engineering born. It is formally defined as the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

Netflix released Chaos Monkey under an open source licence, and a growing number of organisations, such as Amazon, Google and Nike, are now using chaos engineering.

How chaotic is chaos engineering really?

Actually, it is not strictly chaotic. A better analogy than unleashing a wild monkey on your systems is a medical one. Consider how a doctor injects you with part of a flu virus to prevent you from getting full-blown flu. Similarly, chaos engineering is about carefully injecting harm into systems to test how they respond.

In practice, successful chaos engineering is about doing a series of thoughtful, planned, and controlled experiments that are designed to demonstrate how your systems behave in the face of failure.

The aim is to test the resilience of the technology severely, even to the point of failure, without taking the business down with it. One way to think of this is as minimising the blast radius. To quote Michael Caine in The Italian Job: “You’re only supposed to blow the bloody doors off.”

Running chaos engineering needs a large amount of meticulous planning. This includes planning for the unexpected. For example, if you are experimenting on Kubernetes, make sure Kubernetes engineers are available to mend what your test might have broken.
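
One way to picture a limited blast radius is as two guard rails: expose only a small share of users to the fault, and stop as soon as the damage passes an agreed limit. The Python sketch below is purely illustrative. The names BlastRadius, in_cohort and run_guarded, and the fault_breaks_request callback, are hypothetical and do not come from Chaos Monkey or any real tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass
class BlastRadius:
    """Hypothetical limits agreed before the experiment starts."""
    cohort_percent: int       # share of users exposed to the fault (0-100)
    abort_error_rate: float   # abort once the cohort's error rate exceeds this
    min_samples: int = 20     # do not judge the error rate on too few requests


def in_cohort(user_id: int, cohort_percent: int) -> bool:
    """Deterministically place a fixed share of users in the experiment."""
    return user_id % 100 < cohort_percent


def run_guarded(
    user_ids: Iterable[int],
    fault_breaks_request: Callable[[int], bool],
    limits: BlastRadius,
) -> Dict[str, object]:
    """Apply the fault to cohort users only, aborting if errors climb too high."""
    exposed = 0
    errors = 0
    for user_id in user_ids:
        if not in_cohort(user_id, limits.cohort_percent):
            continue  # users outside the cohort never see the fault
        exposed += 1
        if fault_breaks_request(user_id):
            errors += 1
        enough_data = exposed >= limits.min_samples
        if enough_data and errors / exposed > limits.abort_error_rate:
            return {"aborted": True, "exposed": exposed, "errors": errors}
    return {"aborted": False, "exposed": exposed, "errors": errors}
```

In this sketch, a fault that breaks every request is stopped after the first 20 exposed users rather than being allowed to reach the whole customer base. The specific percentages and thresholds are a judgment call for each team.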

The other misconception about chaos engineering is that it is just about testing systems. Chaos engineering is more akin to doing pure science. Modern software systems are often too complex to fully understand, so chaos engineering is about performing experiments to expose the unknowns in these systems.

While software tests tell us whether something passed or failed a test, a chaos engineering experiment widens our knowledge about systemic weaknesses.

However, chaos engineering experiments do need to be carefully structured:

1. What is your system’s ‘steady state’?

Start by pinpointing metrics that indicate, in real time, that your systems are working the way they should. For example, Netflix uses the rate at which customers press the play button on a video streaming device as its steady-state metric, calling this “streams per second”.

2. Set up a hypothesis

As with any experiment, you need a hypothesis to test. Because you’re trying to disrupt the usual running of your system – the steady state – your hypothesis will be something like, “When we do X, there should be no change in the steady state of this system.” Your chaos engineering activities should involve real experiments, involving real unknowns.

3. Simulate what could happen in the real world

You want to simulate scenarios that have the potential to make your systems unavailable or degrade their performance. Ask yourself, “What could go wrong?” and then simulate that. Be sure to prioritise potential errors, too. Some ideas include forcing system clocks out of sync, executing a routine in driver code that emulates I/O errors, inducing latency between services, or randomly causing functions to throw exceptions.

4. Prove or disprove your hypothesis

Compare your steady-state metrics to those you gather after injecting the disturbance into your system. If you find differences in the measurements, your chaos engineering experiment has succeeded – you can now go on to strengthen and prepare your system so a similar incident in the real world doesn’t cause problems. Alternatively, if you find that your steady state remains steady, you may walk away with a higher degree of trust in that part of your system.
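
The four steps above can be sketched in a few lines of Python. This is a toy simulation under stated assumptions, not how Netflix or any particular tool does it: backend, service, flaky, success_rate and run_experiment are all hypothetical names, the steady-state metric is simply the fraction of requests served, and the injected fault is a randomly thrown exception.

```python
import random
from typing import Callable, Dict


def backend() -> float:
    """Hypothetical dependency; returns a simulated response time in ms."""
    return 20.0


def flaky(func: Callable[[], float], failure_rate: float,
          rng: random.Random) -> Callable[[], float]:
    """Step 3: wrap func so it throws on a random fraction of calls."""
    def wrapper() -> float:
        if rng.random() < failure_rate:
            raise RuntimeError("injected fault")
        return func()
    return wrapper


def service(call_backend: Callable[[], float], retries: int = 3) -> bool:
    """Hypothetical service under test: succeeds if any attempt succeeds."""
    for _ in range(retries):
        try:
            call_backend()
            return True
        except RuntimeError:
            continue  # retry after a failed backend call
    return False


def success_rate(call_backend: Callable[[], float],
                 requests: int = 1000) -> float:
    """Step 1: the steady-state metric, the fraction of requests served."""
    served = sum(service(call_backend) for _ in range(requests))
    return served / requests


def run_experiment(failure_rate: float, tolerance: float,
                   seed: int = 42) -> Dict[str, object]:
    """Step 2 hypothesis: injecting faults moves the metric by at most tolerance."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    baseline = success_rate(backend)
    disturbed = success_rate(flaky(backend, failure_rate, rng))
    # Step 4: compare the disturbed metric with the steady state.
    return {
        "baseline": baseline,
        "disturbed": disturbed,
        "hypothesis_holds": (baseline - disturbed) <= tolerance,
    }
```

Calling run_experiment(failure_rate=0.2, tolerance=0.05) asks whether three retries are enough to absorb a backend that fails one call in five. Raising failure_rate towards 1.0 shows the point at which the retries stop protecting the steady state, which is exactly the kind of weakness the experiment is meant to surface.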

Ultimately, the purpose of chaos engineering is not to break systems but to learn how they might fail. In a way, it is about doing pre-mortems that are in your control rather than post-mortems after a real outage that causes uncontrolled and painful damage.

Teams find that they absorb these pre-mortem findings more profoundly through the experience of running a chaos engineering project. The structured approach of chaos engineering can also free up the expert intuition of the people who know the systems best, letting them test a hypothesis that has niggled at them for some time.

Chaos engineering can seem scary. But, when done in a controlled way, it can be invaluable in helping teams understand how complex modern systems can be made more resilient and robust. Netflix’s faith in chaos engineering has grown to such an extent that it has a large simian army of chaos monkeys (and a gorilla) to check everything from latency and security to conformity.

Don’t be afraid to let some chaos in.

Mark Fieldhouse, VP, Northern Europe, New Relic