What is security chaos engineering and why is it important?

Chaos engineering is, at its root, all about stress testing software systems in order to minimize downtime and maximize resiliency. Security chaos engineering takes these principles forward into the domain of security.

The central argument of security chaos engineering is that current security practices aren’t fit for purpose. “Despite spending more on security, data breaches are continuously getting bigger and more frequent across all industries,” write Aaron Rinehart and Charles Nwatu in a post published on opensource.com in January 2018. “We hypothesize that a large portion of data breaches are caused not by sophisticated nation-state actors or hacktivists, but rather simple things rooted in human error and system glitches.”

The rhetorical question they’re asking is clear: should we wait for an incident to happen before we work on it? Or should we be looking for ways to prevent incidents from happening at all?

Why do we need security chaos engineering today?

There are two problems that make security chaos engineering so important today.

One is the way in which security breaches and failures are understood culturally across the industry. Security breaches tend to be seen as either isolated attacks or ‘holes’ within software - anomalies that should have been thought of but weren’t.

In turn, this leads to a spiral of failures. Rather than thinking about cybersecurity in a holistic and systematic manner, the focus is all too often on simply identifying weaknesses when they happen and putting changes in place to stop them from happening again.

You can see this approach even in the way organizations communicate after high-profile attacks have taken place - ‘we’re taking steps to ensure nothing like this ever happens again.’ While that sentiment is important for both customers and shareholders to hear, it also betrays exactly the problems Rinehart and Nwatu appear to be talking about.

The second problem is more about the nature of software today. As the world moves to distributed systems, built on a range of services, and with an extensive set of software dependencies, vulnerabilities naturally begin to increase too. “Where systems are becoming more and more distributed, ephemeral, and immutable in how they operate… it is becoming difficult to comprehend the operational state and health of our systems' security,” Rinehart and Nwatu explain.

When you take the cultural issues and the evolution of software together, it becomes clear that the only way cybersecurity is going to properly tackle today’s challenges is by doing an extensive rethink of how and why things happen.

What security chaos engineering looks like in practice

The transition to security chaos engineering is best understood as a shift in mindset. It’s a mindset that doesn’t focus on isolated issues but instead on the overall health of the system.

Essentially, you start with a different question: don’t ask ‘where are the potential vulnerabilities in our software?’ Ask ‘where are the potential points of failure in the system?’

Rinehart and Nwatu explain: “Failures can consist not only of IT, business, and general human factors but also the way we design, build, implement, configure, operate, observe, and manage security controls. People are the ones designing, building, monitoring, and managing the security controls we put in place to defend against malicious attackers.”

By focusing on questions of system design and decision making, you can begin to capture security threats that you might otherwise miss. So, while malicious attacks might account for 47% of all security breaches, human error and system glitches combined account for 53%. This means that while we’re all worrying about the hooded hacker that dominates stock imagery, someone may be making a simple mistake that just about any software-savvy criminal could take advantage of.

How is security chaos engineering different from penetration testing?

Security chaos engineering looks a lot like penetration testing, right? After all, the whole point of pentesting is, like chaos engineering, determining weaknesses before they can have an impact. But there are some important differences that shouldn’t be ignored.

Again, the key difference is the mindset behind both. Penetration testing is, for the most part, an event. It’s something you do when you’ve updated or changed something significant. It also has a very specific purpose. That’s not a bad thing, but with such a well-defined testing context you might miss security issues that you hadn’t even considered.

And if you consider the complexity of a given software system, in which its state changes according to the services and requests it is handling, it’s incredibly difficult - not to mention expensive - to pentest an application in every single possible state.

Security chaos engineering tackles that by actively experimenting on the software system to better understand it. The context in which it takes place is wide-reaching and ongoing, not isolated and particular.

ChaoSlingr, the security chaos engineering tool

ChaoSlingr is perhaps the most prominent tool out there to help you actually do security chaos engineering. Built for AWS, it allows you to perform a number of different ‘security chaos experiments’ in the cloud. Essentially, ChaoSlingr pushes failures into the system in a way that allows you to not only identify security issues but also to better understand your infrastructure. This SlideShare deck, put together by Aaron Rinehart himself, is a good introduction to how it works in a little more detail.

Security teams have typically focused on preventive security measures. ChaoSlingr empowers teams to dig deeper into their systems and improve them in ways that mitigate security risks. It allows you to be proactive rather than reactive.
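
To make the idea of a ‘security chaos experiment’ a little more concrete, here is a minimal sketch in Python. It is not ChaoSlingr’s code or API: the SimulatedFirewall class, the detect_drift check and the run_experiment function are all hypothetical names invented for illustration, and the ‘firewall’ is just an in-memory object rather than a real AWS resource. The shape is what matters: inject a misconfiguration on purpose, see whether your detection notices it, and always roll back afterwards.

```python
from dataclasses import dataclass, field


@dataclass
class SimulatedFirewall:
    """In-memory stand-in for a cloud firewall (hypothetical, not a real API)."""
    open_ports: set = field(default_factory=lambda: {443})
    approved_ports: frozenset = frozenset({443})

    def open_port(self, port):
        self.open_ports.add(port)

    def close_port(self, port):
        self.open_ports.discard(port)


def detect_drift(firewall):
    """The control under test: list ports that are open but not approved."""
    return sorted(firewall.open_ports - firewall.approved_ports)


def run_experiment(firewall, detector, port=22):
    """Inject a misconfiguration, check whether it is detected, then roll back."""
    already_open = port in firewall.open_ports
    firewall.open_port(port)  # inject the failure
    try:
        detected = port in detector(firewall)
    finally:
        if not already_open:
            firewall.close_port(port)  # restore the starting state
    return {
        "injected_port": port,
        "detected": detected,
        "rolled_back": already_open or port not in firewall.open_ports,
    }


if __name__ == "__main__":
    print(run_experiment(SimulatedFirewall(), detect_drift))
```

If the detector misses the injected change, the experiment has told you something useful about your monitoring without waiting for a real incident. In a real environment the injection and the check would run against actual infrastructure, which is the part a tool like ChaoSlingr is built to handle.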

The future is security chaos engineering

Chaos engineering has not quite taken off - yet. But it’s clear that the principles behind it are having an impact across software engineering. In particular, at a time when ever-evolving software feels so vulnerable - fragile even - applying it to cybersecurity feels incredibly pertinent and important.

It’s true that the shift in mindset is going to be tough. But if we can begin to distrust our assumptions, experiment on our systems, and try to better understand how and why they work the way they do, we are certainly moving towards a healthier and more secure software world.
