Modern applications keep growing in complexity. A dizzying number of moving parts, layers of abstraction, reliance on external systems, and distribution all result in a stack that few people fully understand.
Any developer worth hiring now knows the merits of a thorough testing regime, but one of the issues with testing is that you are often testing for predictable outcomes. Despite our 'logical systems,' show-stopping issues are typically unexpected: situations that no one foresaw.
These unforeseen eventualities are what chaos engineering attempts to account for. It's a reasonably new practice, used at Netflix for several years and then formalized in 2015 in a manifesto that sets out its principles.
Naturally, there are critics of the practice, and the comments at the bottom of this TechCrunch article summarize some of them. The typical counterarguments are that the principle is a band-aid for applications that were poorly planned and architected in the first place, or that it's another buzzword-laden excuse to invent shiny new tools that no one knew they needed.
Still, its proponents are a friendly bunch, so in this article, I summarize my findings on the practice and let you decide.
In many ways, while the term 'chaos' is a good eye-catching phrase, it's misleading, summoning images of burning servers and hapless engineers running around an office screaming. A better term is experimental engineering, but I agree that is less likely to get tech blog or conference attention.
The core principles of chaos engineering follow similar lines to those you followed in school or university science classes:
Form a hypothesis.
Communicate to your team.
Run your experiments.
Analyze the results.
Increase the scope.
Automate the experiments.
Early in the lifetime of chaos engineering at Netflix, most engineers thought chaos engineering was about "breaking things in production," and it is in part. But while breaking things is great fun, it's not a useful activity unless you learn something from it.
These principles encourage you to introduce real-world events and events you expect to be able to handle. I wonder if fully embracing the "chaos" might produce more interesting results, i.e., measuring the worst that could happen. True randomness and extremity could surface even more insightful observations.
Let's look at each of these steps in more detail.
1 - Form a hypothesis
To begin, you want to make an educated guess about what will happen in which scenarios. The key word here is "educated"; you need to gather data to support the hypothesis that you'll share with your team.
Decide on your steady state
What is "steady" depends on your application and use case, but decide on a set of metrics that are important to you and what variance in those metrics is acceptable. For example:
When completing checkout, the majority of customers should have a successful payment processed.
Users should experience latency below a particular threshold.
A process should complete within a time frame.
When deciding on these metrics, also consider external factors such as SLAs and KPIs for your team or product(s).
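One way to make a steady state concrete is to write each metric down with the range you consider acceptable. The following is a minimal sketch, not a prescribed format: the metric names and thresholds are hypothetical, chosen only to mirror the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyStateMetric:
    """One metric that defines 'steady', with its acceptable range."""
    name: str
    lower: float  # lowest acceptable value
    upper: float  # highest acceptable value

    def is_steady(self, observed: float) -> bool:
        return self.lower <= observed <= self.upper


# Hypothetical metrics and thresholds, for illustration only.
STEADY_STATE = [
    SteadyStateMetric("payment_success_ratio", lower=0.95, upper=1.0),
    SteadyStateMetric("p95_latency_ms", lower=0.0, upper=300.0),
    SteadyStateMetric("batch_duration_s", lower=0.0, upper=60.0),
]


def violations(observed: dict[str, float]) -> list[str]:
    """Return the names of metrics that are missing or outside their range."""
    failed = []
    for metric in STEADY_STATE:
        value = observed.get(metric.name)
        if value is None or not metric.is_steady(value):
            failed.append(metric.name)
    return failed
```

Treating a missing metric as a violation is a deliberate choice in this sketch: if you cannot observe a metric, you cannot claim the system is steady.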
Introduce real-world events
The sorts of events to test vary depending on your use case, but common to most applications are:
Running out of CPU, memory, or storage space
Run in production
"Testing in production" has long been a tongue-in-cheek reference to an untested code base, but as chaos engineering is likely run against a code base that has already been properly tested, the phrase takes on a different meaning.
The principles we're working with here encourage you to undertake tests in production or, if you have a genuine reason for not doing so, in an environment as close to production as possible. Chaos engineering principles are designed to identify weakness, so they argue that running in production is fundamentally a good thing.
Some banks are already following these principles, and while engineers behind safety-critical systems should be confident of their setup before embarking on chaos engineering, the principles also recommend you design each experiment to have minimal impact and ensure you can abort at any time.
While the most tempting hypothesis is "let's see what happens" (much like "let's just break things"), it's not a constructive one. Try to concoct a hypothesis based on your steady state, for example:
If PayPal is unavailable, successful payments will drop by 20 percent.
During high traffic, latency will increase by 500ms.
If an entire AWS region is unavailable, a process will take 1 second longer to complete.
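A hypothesis like the ones above is easier to evaluate later if it is recorded as a prediction you can check against measurements. This sketch uses the latency example; the class, its field names, and the tolerance value are assumptions made for illustration, since the article does not say how close an observation must be to count as a match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """A prediction about how one metric changes when an event occurs."""
    event: str
    metric: str
    predicted_change: float  # expected change from baseline, in the metric's units
    tolerance: float         # allowed gap between predicted and observed change

    def holds(self, baseline: float, observed: float) -> bool:
        actual_change = observed - baseline
        return abs(actual_change - self.predicted_change) <= self.tolerance


# "During high traffic, latency will increase by 500ms."
# The 100 ms tolerance is a hypothetical value.
high_traffic = Hypothesis(
    event="high traffic",
    metric="latency_ms",
    predicted_change=500.0,
    tolerance=100.0,
)
```

With a baseline of 200 ms, an observed latency of 650 ms supports the hypothesis, while 1200 ms refutes it. Both outcomes are useful, because a refuted hypothesis shows where your understanding of the system was wrong.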
2 - Communicate to your team
As a technical communicator, this is perhaps the most important step to me. If you have a team of engineers running experiments on production systems, then relevant people (if not everyone) deserve to know. It's easy to remember engineers, but don't forget people who deal with the public, too, such as support and community staff who may start receiving questions from customers.
3 - Run your experiments
The way you introduce your experiments varies: some through code deployments, others by injecting calls you know will fail, and others with simple scripts. There are myriad tools available to help simulate these; I've provided links to find them below.
Make sure you have alerting and reporting in place to stop an experiment if needed, but also to analyze results later.
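The two requirements here, being able to stop an experiment and keeping data for later analysis, can be sketched in a few lines. This is an illustrative outline rather than a real tool: `start_fault`, `stop_fault`, and `read_metric` are hypothetical callables you would supply, and a real run would wait between samples.

```python
from contextlib import contextmanager


@contextmanager
def injected_fault(start, stop):
    """Start a fault and guarantee it is rolled back, even on abort or error."""
    start()
    try:
        yield
    finally:
        stop()


def run_experiment(start_fault, stop_fault, read_metric, abort_below, samples):
    """Inject a fault, sample a metric, and abort early if it falls too low.

    Returns (aborted, readings) so the readings can be analyzed afterward.
    """
    readings = []
    with injected_fault(start_fault, stop_fault):
        for _ in range(samples):
            value = read_metric()
            readings.append(value)
            if value < abort_below:
                # Leaving the 'with' block rolls the fault back.
                return True, readings
    return False, readings
```

The context manager is the important design choice: the fault is removed whether the experiment finishes, aborts, or raises an exception.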
4 - Analyze the results
There's no point in running an experiment if you don't take time to reflect on the data you gathered and learn from it. There are many tools you probably already use to help with this stage, but make sure you gather input from any teams whose services were involved in the experiment.
5 - Increase the scope
After defining your ideal metrics and the potential effects on them, it's time to start testing your hypothesis. Much like other aspects of modern software development, be sure to iterate on these experiments, changing the parameters or the events you test for.
Once you've tried one experiment, learned from it, and potentially fixed issues it identified, then move on to the next one. This may be introducing a new experiment or increasing the metrics of an existing one to find out where a system really starts to break down.
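Increasing the scope of an existing experiment amounts to a loop: raise the magnitude step by step and stop at the first level where the steady state no longer holds. The sketch below assumes two hypothetical callables, `run_at` (runs the experiment at a given magnitude and returns a measurement) and `is_steady` (checks that measurement against your steady state).

```python
def find_breaking_point(run_at, is_steady, start, step, limit):
    """Raise an experiment's magnitude until the steady state stops holding.

    Returns the first magnitude at which is_steady(result) is False,
    or None if the system stays steady up to and including limit.
    """
    magnitude = start
    while magnitude <= limit:
        result = run_at(magnitude)
        if not is_steady(result):
            return magnitude
        magnitude += step
    return None
```

For example, `run_at` might replay traffic at a given number of requests per second and return the measured latency. The `limit` argument caps how far the experiment can go, in keeping with the advice to keep each experiment's impact small.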
6 - Automate the experiments
The first few times you run an experiment, doing it manually is fine; you can monitor the outcome and abort it if necessary. But you should (especially with teams that follow continuous deployment) automate your experiments as quickly as possible. This means that the experiment can run when new factors are introduced into an application, and it also makes it easier to change the input parameters and scope of your experiments.
Again, the resources section below lists places to find tools to help with this.
While engineers and developers are divided on the usefulness of chaos engineering, the most interesting aspects to me are not the technical ones, but rather that it tests and checks ego.
The principles state in many places that if you are truly confident in your application, then you shouldn't fear what they propose. They force you to put your money where your mouth is and (albeit in a careful and controlled way) prove your application is as robust as you believe it to be. I can imagine many insightful debriefing sessions after a chaos engineering experiment.
Tools and Resources
The free O'Reilly book on Chaos Engineering
The comprehensive Chaos Engineering awesome list that features a plethora of useful tools and resources
Gremlin, whose staff are often behind a lot of the Netflix-independent chaos engineering resources, runs a 'Failure-as-a-service' platform that commoditizes many of the tools and practices featured in this post