How to utilise Chaos Engineering for IT Operations

Chaos Engineering (CE), the discipline of experimenting on a system in order to build confidence in its capability to withstand turbulent conditions in production, is becoming commonplace in the DevOps community. But how would its application extend to IT operations? CE for IT operations offers a similar framework for stress-testing a technology platform to understand its weak points and performance pitfalls under heavy pressure.

CE tends to be used primarily in DevOps during bug testing: setting up experiments to run software under different conditions, such as peak traffic, and monitoring how it functions and performs. This becomes increasingly necessary in cloud-based systems, where failure to understand how the platform responds to extreme load could result in runaway cascade failures. Worse yet, the platform could spin up thousands of extra nodes that handle error conditions while doing no actual work. These same principles, applied to IT operations management (ITOM), help define a functional baseline and tolerances for infrastructure, policies, and processes by clarifying both steady-state and chaotic outputs when extremes are reached.

Applications in IT

Netflix was one of the first companies to gain traction with CE in DevOps as it moved from physical to virtual infrastructure on AWS, and a former Netflix engineer later went on to co-found Gremlin. However, Chaos Engineering is not typically used in IT operations, because ITOM has historically been separated from development. Generally, IT monitors system dynamics, and when a problem occurs, engineering change management or ITSM is brought in to remediate the issue.

Because of the growth of containerisation in cloud applications today, IT infrastructure looks more like a development environment than a classical multi-tier architecture. But the limitless scale of the cloud means failures can also be limitless. Microservices are well served by testing elasticity and scalability, data flows, and resiliency: stress the system to the edge of its tolerances and fix its shortcomings before a public crash.

1 – 2 – 3 Chaos

Implementing chaos engineering for IT operations management provides a systematic approach to identifying weaknesses in a microservices world. In a monolithic environment, you have visibility into performance and event metrics that may be lost with microservices designs. As a result, the need for operational insights becomes even more critical when scaling to unknown workloads. In an attempt to address the gaps in common dev tools’ ability to manage extreme complexity, Netflix created Chaos Monkey, which grew out of CE principles from its own cloud-native community. This methodology is extendable to infrastructure and helps to set guardrails on platform behaviour as a whole. So how should a team bring this thinking into its IT operations management? Follow these fundamental steps:

Define the current steady state. Performing baseline analysis is a standard concept in capacity planning, upgrade strategies, and other high-impact functions. Start with something relatively small and simple so that you don’t get overwhelmed by the data or risk interfering with the business if something goes wrong (as can happen with security Red Teaming). For example, monitor CPU and network utilisation, which are common bottlenecks in any IT shop.
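As a minimal sketch of what such a baseline might look like, the snippet below summarises a handful of CPU utilisation samples into a mean, a 95th percentile, and a maximum. The sample values and the function name `steady_state` are hypothetical; in practice the samples would come from whatever monitoring tool is already in place.

```python
import math
import statistics


def steady_state(samples):
    """Summarise utilisation samples (in percent) into a simple baseline."""
    ordered = sorted(samples)
    # Nearest-rank 95th percentile: the smallest sample that is greater
    # than or equal to 95% of all samples.
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return {
        "mean": statistics.mean(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }


# Hypothetical CPU utilisation readings, one per sampling interval.
cpu_samples = [22, 25, 31, 28, 35, 40, 27, 30, 33, 29]
baseline = steady_state(cpu_samples)
print(baseline)
```

The same function could be pointed at network utilisation samples; keeping the first baseline to one or two metrics is in line with starting small.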

Define optimal conditions. There’s how your system generally operates, and then there’s how it should operate; these typically aren’t the same thing. CPU utilisation and network latency are always affected by application efficiencies, hardware conditions, and a host of other factors. Create a standard that outlines what engineers should expect on a normal day, on an easy day, and on a very hard day. These are the control groups; the extreme day, beyond anything they have seen, will be the stress test.
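One way to write such a standard down is as a small table of tolerance bands, one per kind of day. The sketch below is illustrative only: the band names follow the text, but every threshold is a made-up number, and `classify` is a hypothetical helper rather than part of any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Band:
    name: str
    cpu_max: float         # highest expected CPU utilisation, percent
    latency_max_ms: float  # highest expected network latency, milliseconds


# Hypothetical thresholds, ordered from least to most demanding day.
BANDS = [
    Band("easy", 30.0, 20.0),
    Band("normal", 60.0, 50.0),
    Band("hard", 85.0, 120.0),
]


def classify(cpu, latency_ms):
    """Return the first band that contains the reading, if any."""
    for band in BANDS:
        if cpu <= band.cpu_max and latency_ms <= band.latency_max_ms:
            return band.name
    # Anything beyond the "hard" band is outside the agreed tolerances.
    return "out_of_tolerance"


print(classify(55, 40))   # a reading inside the "normal" band
print(classify(95, 100))  # a reading beyond every band
```

A reading that falls outside every band is exactly the territory the stress test is meant to explore.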

Form a hypothesis. Where will the system break? Suppose you run an application scenario that doubles the peak traffic of your worst day so far. Will CPU utilisation stay within its optimum range (or will the container provisioning engine smoothly deploy additional nodes), as in the control groups? Or will it spike so severely that processes grind to a halt because there isn’t enough memory or network bandwidth left to manage the load?
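A hypothesis is easier to validate later if it is written as a number that can be proven wrong. The sketch below turns the doubled-peak scenario into a node count and a CPU ceiling. It assumes, purely for illustration, that load scales linearly with traffic; the request rates, the 70% target, and the names `Hypothesis` and `nodes_needed` are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    statement: str
    metric: str
    ceiling: float  # highest observed value at which the hypothesis holds

    def holds(self, observed):
        return observed <= self.ceiling


def nodes_needed(peak_rps, multiplier, rps_per_node, target_util):
    """Nodes required to keep each node at or below the target utilisation.

    Assumes load grows linearly with traffic, which real systems often
    violate; that is part of what the experiment is meant to reveal.
    """
    load = peak_rps * multiplier
    return math.ceil(load / (rps_per_node * target_util))


worst_day_peak_rps = 4000  # hypothetical worst-day peak, requests per second
needed = nodes_needed(worst_day_peak_rps, 2, 500, 0.7)

hypothesis = Hypothesis(
    statement=f"At 2x peak, {needed} nodes keep CPU at or below 70%",
    metric="cpu_pct",
    ceiling=70.0,
)
print(hypothesis.statement)
```

If the provisioning engine ends up deploying far more nodes than predicted, or CPU passes the ceiling anyway, the hypothesis has failed and the linear assumption is the first thing to question.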

Execute a real-world event (but contain the blast radius). Do something extreme, like taking down a firewall so that connectivity to one internet service provider is severed. This will confuse the application as it tries to respond to requests with repeated failures, ramping up CPU usage as errors return from a dead network endpoint. Log events will mount, filling the database and saturating the backbone.
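Containing the blast radius usually comes down to two guardrails: an abort threshold and a rollback that runs no matter what. The sketch below shows that shape with a simulated metric. The `inject`, `rollback`, and `probe` callables are stand-ins for whatever actually disables the firewall and reads the monitoring data, and the readings are invented.

```python
def run_experiment(inject, rollback, probe, abort_above, max_checks):
    """Inject a fault, poll a metric, and always roll the fault back."""
    readings = []
    aborted = False
    try:
        inject()
        for _ in range(max_checks):
            value = probe()
            readings.append(value)
            if value > abort_above:
                # Guardrail breached: stop before the damage spreads.
                aborted = True
                break
    finally:
        # Runs on normal completion, on abort, and on unexpected errors.
        rollback()
    return readings, aborted


# Simulated environment: a flag for the fault and a scripted CPU metric.
state = {"fault_active": False}
simulated_cpu = iter([40, 55, 72, 93, 99])

readings, aborted = run_experiment(
    inject=lambda: state.update(fault_active=True),
    rollback=lambda: state.update(fault_active=False),
    probe=lambda: next(simulated_cpu),
    abort_above=90,
    max_checks=5,
)
print(readings, aborted, state)
```

A real run would wait between polls and would limit the fault to one site or one provider; the point of the sketch is only that the rollback sits in a `finally` block so it cannot be skipped.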

Validate the hypothesis. What happened? Monitor utilisation and network throughput during the test and see where the system fell over. Is it what you expected, or did something never previously considered take place? Did new chaos erupt from the fissures in your infrastructure? Stabilise, document, and remediate.
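Validation can be as simple as comparing what was observed against what was predicted, and separately flagging anything nobody predicted at all. The metric names and values below are hypothetical, as is the helper `validate`.

```python
def validate(observed, expected_max):
    """Return the metrics that exceeded their predicted ceiling."""
    return {
        name: (value, expected_max[name])
        for name, value in observed.items()
        if name in expected_max and value > expected_max[name]
    }


# Predicted ceilings from the hypothesis (hypothetical numbers).
expected_max = {"cpu_pct": 70, "net_mbps": 800}

# Peak values seen during the experiment (hypothetical numbers).
observed = {"cpu_pct": 96, "net_mbps": 640, "log_rows_per_s": 12000}

violations = validate(observed, expected_max)
# Metrics that moved during the test but had no prediction at all.
unpredicted = sorted(set(observed) - set(expected_max))

print("exceeded prediction:", violations)
print("never considered:", unpredicted)
```

The second list is often the more useful one: it names the behaviour that was never considered, which is where the documentation and remediation effort should start.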

Never stop not being afraid

Stressing a system to its absolute max, and a little bit further, to see where things go wrong allows you to understand steady-state behaviour and error-handling, so you can fix weaknesses before something breaks in new and unexpected ways. What do traffic spikes look like? What are real-world events and their impacts on your organisation?

Chaos engineering is not just for DevOps. It should be a systemic practice for load-testing (out of your comfort zone) to the point of failure. It’s a responsibility for more than microservices deployments and applies to all sorts of disciplines within the IT organisation.

Written by Michael Fisher, Product Manager for OpsRamp