If It Ain’t Broke, Break It: Reflections From the Twin Cities Chaos Day
A look at Chaos Engineering and the cultural shift of experimenting with production failure to design and build more resilient systems.
Recently I was fortunate to attend Chaos Day courtesy of my old pal David Hussman of DevJam Studios. This invite-only, one-day event brought together members of a small but growing community of software engineers in the wonderfully hip basement of Studio 2 in chilly Minneapolis.
Throughout the day, engineers from under-the-radar start-ups and household names like Netflix and Thomson Reuters shared stories of introducing Chaos Engineering into their organizations. As the day went on, I was struck by the feeling that I was somewhere special at the right time, surrounded by some of the leading talent in our industry. Inspired, I felt compelled to share my thoughts.
What is Chaos Engineering?
Chaos Engineering is "the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production." Chaos Engineering is squarely focused on helping teams design, build and run systems that are highly resilient. Despite their extraordinary uptime, services from AWS and Azure can—and do—fail. Data centers lose power. The goal of Chaos Engineering is to help you understand how to design and build apps and services that withstand havoc while still delivering an optimal customer experience.
Chaos practitioners are more likely to be from the ops side of the house. It’s tempting to think of Chaos as "just another form of testing." But practitioners argue that while testing is binary and simply asserts (or disproves) what we think we already know, Chaos is about creating experiments that help us learn more about a system. It helps us re-calibrate our mental model of how things really work and—by closing the feedback loop—helps us design and build more resilient systems.
Chaos Engineering in Practice
A precondition to starting Chaos is defining "normal" (the steady state) for your app so that any deviations can be detected via your monitoring tools. Once you’ve done that, you can start running experiments. The most popular types include:
1. Inject a stack fault to see how your system responds during normal traffic. This exposes what happens when you lose anything from a service to an Availability Zone.
2. Inject a request failure to artificially introduce latency, exceptions or other abnormal behavior from components as they process a modified request.
3. Send an overwhelming number of requests to your app or service to see how resilient it is. As with a DoS or DDoS attack, the requests involved may be valid or faulty.
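The first two experiment types can be sketched as a thin wrapper around a service call. This is a minimal illustration, not any specific tool's API: the probabilities, the delay ceiling and the wrapped function are all assumptions made up for the example.

```python
import random
import time

def chaos_wrap(func, latency_prob=0.1, error_prob=0.05, max_delay=2.0, rng=None):
    """Wrap a service call so a fraction of requests see an injected fault or latency.

    All knobs (latency_prob, error_prob, max_delay) are illustrative assumptions.
    """
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if rng.random() < error_prob:
            # Experiment type 2: fail the request outright
            raise RuntimeError("chaos: injected fault")
        if rng.random() < latency_prob:
            # Experiment type 2 variant: add artificial latency
            time.sleep(rng.uniform(0.0, max_delay))
        return func(*args, **kwargs)

    return wrapper

# Wrap a hypothetical downstream call and watch how its callers cope
get_profile = chaos_wrap(lambda user_id: {"id": user_id}, error_prob=0.05)
```

In a real system the injection would typically live in a network proxy or service-mesh layer rather than in application code, but the principle is the same: what the experiment actually measures is the caller's resilience, i.e., its timeouts, retries and fallbacks.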
These experiments take many forms. Game Days, similar to military war games, pit a red team creating failures against a blue team monitoring the apps or services to see how they are impacted. Game Days can occur at certain project milestones or on a regular cadence, such as every Friday. A more consultative approach involves an ops team working with selected dev teams to inject failures into an app or service during working hours. This is done in close coordination in the early stages, but as teams mature, the failures may come without warning. Surprise!
Netflix engineers take it one step further: They’ve created automation tools that randomly inject different types of faults and latency into their systems to continually test resiliency. Since the release of Chaos Monkey in 2011, Netflix has developed an entire Simian Army to build confidence in their ability to recover from failure. Nora Jones, a senior software engineer at Netflix and Chaos Day speaker, explained that Chaos experiments are automatically enabled for new services.
Introducing Chaos Engineering to your Organization
Ready to start Chaos-ing tomorrow? If you are like most of my customers, I suspect not. This is not because injecting faults is hard; with the proper credentials it can be done in a matter of seconds. But doing it in a way that maximizes the learning while minimizing the risks and pain takes experience. If you’re serious about bringing Chaos Engineering principles and practices into your organization, here are five things to consider:
1. Purposeful tampering with production is a huge culture shift.
Getting buy-in from stakeholders to purposefully break working things in a live production environment is counter to operations’ mission. Since the days of mainframes, production environments have been sacred places where things change gradually and very, very deliberately — all under the watchful eye of an ops team charged with keeping things running smoothly.
Screwing with this carries lots of risks. As Kent Beck mentioned during his lightning talk, it’s one thing if resiliency events occur naturally; it's an entirely different story if these events are self-inflicted, especially if they impact customers. Someone is going to be held accountable, and not many technical leaders I know would sign up for this today. Risk aversion and accountability run deep in nearly every organization, creating an enormous obstacle to injecting Chaos in production. Sure, you can practice Chaos in non-production environments, and sure, you'll see some benefits. But experimenting with live customers is the true litmus test and it will require bold leaders to take this next step.
2. Selling resiliency to product owners will be challenging.
If you want to start experimenting with Chaos Engineering in your organization, you’ll have to sell the idea to someone who prioritizes work for teams. This is often a product owner or a leader who probably doesn’t get the same satisfaction you do from breaking things to see what happens! While better availability and resiliency sound great, unless your app is prone to crashing, getting approval to divert investments away from new features will be difficult.
Metrics can help. Start by quantifying the “additional cost or lost revenue per minute of downtime.” Look for qualitative feedback on how downtime impacts customer experience. If you don't have this data today, you have some homework to do. An alternative approach mentioned during the day is to skip selling it to the product owner entirely: just "bake it in" to the underlying operations costs that are charged back internally.
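The downtime metric itself is back-of-the-envelope arithmetic. The dollar figures and traffic percentage below are made up purely for illustration:

```python
def downtime_cost(revenue_per_hour, minutes_down, traffic_affected):
    """Rough lost revenue for an outage: hourly revenue, pro-rated by
    outage length and the fraction of traffic that was affected.
    All inputs here are illustrative assumptions."""
    return revenue_per_hour / 60 * minutes_down * traffic_affected

# e.g. $120k/hour of revenue, a 15-minute outage, 40% of traffic affected
downtime_cost(120_000, 15, 0.40)  # -> 12000.0
```

Even a crude number like this turns "better resiliency" from an abstract virtue into a line item a product owner can weigh against new features.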
3. Chaos Engineering makes sense for complex distributed systems, but what about everything else?
It's clear that Netflix benefits from Chaos Engineering. And I can see why Amazon, Google and Facebook benefit, too. They run very distributed, complex and highly available systems with millions of customers. But honestly, what percentage of your apps are complex and distributed? Maybe 20%? So what about the other 80%? Does Chaos provide a good return on investment for these?
When viewed through a per-app lens, the answer is likely no. I’ve seen enterprise apps with a few dozen captive users that run on a single server. If it’s down for a day, it’s not the end of the world. It’s hard to make the argument that Chaos makes sense in these situations.
But to paraphrase Bjarne Stroustrup, your organization itself is a “complex distributed system that runs on software”. This includes the 20% and the 80%. From this perspective, the benefits of Chaos Engineering may improve the whole organization's resiliency, not just the complex and distributed apps.
4. When are we ready to adopt Chaos Engineering?
Chaos pioneers boast micro-service architectures, cloud infrastructure, continuous delivery and world-class talent. Does this describe your organization, too? Probably not. So how do you get started?
The MVP is an engineer with the proper credentials and a desire to break things. But knocking things over just to see what happens may not endear you to the dev teams and may ultimately slow adoption. Close collaboration between development and operations teams is critical to the success of Chaos, so start there. Begin with simple experiments in non-production environments and gradually work your way up.
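A minimal version of those "simple experiments" is a verify-inject-verify loop. The three callables below are hypothetical hooks you would wire to your own monitoring and fault-injection tooling; nothing here is a real tool's API:

```python
def run_experiment(steady_state, inject_fault, rollback):
    """Minimal chaos experiment: confirm health, inject a fault, re-check, clean up.

    steady_state, inject_fault and rollback are placeholder callables you
    supply; this sketch only encodes the shape of the experiment.
    """
    if not steady_state():
        raise RuntimeError("system already unhealthy; abort the experiment")
    try:
        inject_fault()
        # The hypothesis under test: the steady state survives the fault
        return steady_state()
    finally:
        rollback()  # always undo the fault, even if the check itself fails
```

The `finally` clause is the important design choice: in a non-production dry run it is easy to forget cleanup, and an experiment that leaves the fault in place teaches the wrong lesson about your tooling.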
5. Will the move to serverless architectures reduce the need for Chaos Engineering?
When you operate virtual servers and containers, there's a broad surface area for injecting faults in networks, compute and storage resources. Running your own data center provides even more opportunities. As serverless architectures gain momentum, your ability to inject faults into the lower layers of your stack is greatly diminished. Your cloud provider may be Chaosing their own services, but they’re not going to let you do that for them!
If the surface area for serverless is smaller, is the need for Chaos Engineering reduced? Not necessarily. Two of the three experiment types I explained earlier are still valid options. Sure, serverless presents fewer opportunities to inject faults in the stack, but that’s part of the tradeoff. If you really want the ability to independently and continually verify resiliency at a much lower level, you’ll have to rethink your architecture.
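Experiment type 3 in particular still works against serverless endpoints, because it operates entirely above the stack. A rough sketch, where the handler, request count and worker count are all assumptions for illustration:

```python
import concurrent.futures

def flood(handler, n_requests=200, workers=20):
    """Experiment type 3: hammer a handler with concurrent requests
    and tally successes vs. failures. The thresholds are illustrative."""
    ok = errors = 0
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(handler, i) for i in range(n_requests)]
        for fut in concurrent.futures.as_completed(futures):
            try:
                fut.result()
                ok += 1
            except Exception:
                errors += 1
    return ok, errors
```

Against a real serverless function, `handler` would be an HTTP call to the deployed endpoint, and the interesting questions become throttling behavior, cold-start latency and concurrency limits rather than lost servers.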
On the flight home, I reflected on all that I’d seen and heard at Chaos Day. It was energizing to be among a community of elite engineers who solve very challenging problems. It reminded me of the early days of the Extreme Programming (XP) community nearly 20 years ago, before "Big A" Agile and the scaled frameworks took over. XP started as a software engineering movement to work in a saner way. But it gradually gave way to Agile project management, certifications, big planning ceremonies and "the enterprise."
There is no question—Agile has lifted all our ships. I've trained and coached Agile for years. But we collectively lost something along the way. While we've learned how to apply Agile to every conceivable problem, the focus on solid engineering practices with “just enough” process has sadly faded away.
As the Chaos community continues its inevitable growth and evolution, my hope is that it stays true to the culture I observed last week: a friendly, creative community of strong problem solvers with a focus on advancing the discipline of software engineering.