How chaos engineering can help DevSecOps teams find vulnerabilities

DevOps teams have used chaos engineering concepts to find software bugs for years. Tools are now available to help identify security flaws, too.

The words “chaos” and “engineering” aren’t usually found together. After all, good engineers keep chaos at bay. Yet lately software developers are deploying what they loosely call “chaos” in careful amounts to strengthen their computer systems by revealing hidden flaws. The results aren’t perfect, because anything chaotic can’t offer guarantees, but the techniques are often surprisingly effective, and that makes them worthwhile.

The process can be especially useful for security analysts because their job is to find the undocumented and unanticipated backdoors. Chaotic testing can’t identify all security failures, but it can reveal dangerous, unpatched vulnerabilities that the developers never imagined. Good chaos engineering can help both DevSecOps and DevOps teams because problems of reliability or resilience can sometimes be security weaknesses, too. The same coding mistakes can often either crash the system or lead to intrusions.

What is chaos engineering?

The term is a neologism meant to unify different techniques that have found some success. Some use words like “fuzzing” or “glitching” to describe how they tweak a computer system to knock it off balance and perhaps crash it. They inject random behavior that can stress the software while watching carefully for malfunctions or bugs to appear. These are often failure modes that might take years to reveal themselves in regular usage.

John Gilmore, one of the founders of the Electronic Frontier Foundation (EFF) and a member of the development team behind several key open-source projects, says that coding is a process of continual refinement and chaos engineering is one way to speed up the search for all possible execution paths. “The real value of long-running code is that most of the bugs have been shaken out of it by the first 10 million people to run it, the first 20 compilers that have compiled it, the first five operating systems it runs on. Ones that have then been tested by fuzzing and penetration tests (e.g., Google Project Zero) have many fewer unexplored code paths than any new piece of code,” he explains.

Gilmore likes to tell a story from the 1970s, when he worked for Data General, an early minicomputer manufacturer. He found that flipping a power switch at random times would leave the operating system state in disarray. “Rather than fixing the problem, the operating system engineers claimed that flipping the breaker wasn't a valid test,” Gilmore says, before adding, “As a result, Data General is dead now.”

The idea isn’t new to computer manufacturing or to other fields of engineering. Car manufacturers, for instance, test new models in deserts in summer and in northern regions in winter. Architects build test structures and overload them, watching for failure.

Computer science, though, has been a relatively mathematical field. Many security researchers create elaborate logical proofs that offer all the certainties of good mathematics. The complexity of modern software, though, is far greater than our logical tools are capable of modeling. Most areas of computer security are far from understood with the precision we would like, and that opens the door for random acts.

The name deliberately avoids the word “science” and all the traditions of building, testing and eventually understanding the world through carefully built and curated models. Even using the word “engineering” isn’t exactly fair because engineering is often just as rigorous, planned and methodical as what happens in science labs. Chaos engineering is closer to releasing a bull in a china shop or letting loose a greased pig in the high school cafeteria.

Chaos engineering techniques

The techniques used by chaos engineers are often maddeningly simple but surprisingly devious. They involve bending and distorting the normally protective home of the software by subverting many of the services that the programmers took for granted. One simple test, for instance, simply deletes half of the data packets coming through the internet connection. Another might gobble up almost all the free memory so the software is scrambling for places to store data.
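The packet-dropping test can be sketched in a few lines. This is a minimal simulation, not any particular tool’s implementation: the `lossy_channel` and `find_missing` names, the 50% drop rate and the tuple-based packet format are all assumptions made for illustration.

```python
import random

def lossy_channel(packets, drop_rate=0.5, seed=None):
    """Yield packets from an iterable, silently dropping a fraction of them."""
    rng = random.Random(seed)
    for packet in packets:
        # rng.random() is in [0, 1), so drop_rate=1.0 drops everything
        if rng.random() >= drop_rate:
            yield packet

def find_missing(received, expected_count):
    """Hypothetical receiver check: which sequence numbers never arrived?"""
    seen = {seq for seq, _payload in received}
    return [seq for seq in range(expected_count) if seq not in seen]

# Send 20 numbered packets through a channel that loses about half of them.
sent = [(seq, f"payload-{seq}") for seq in range(20)]
received = list(lossy_channel(sent, drop_rate=0.5, seed=1))
missing = find_missing(received, expected_count=len(sent))
print(f"received {len(received)} of {len(sent)}; missing {missing}")
```

The interesting question is what the software under test does with the list of missing packets: whether it retries, stalls, or quietly carries on with incomplete data.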

The tests are often done at a higher level. DevSecOps teams may simply shut down some subset of the servers to see if the various software packages running in the constellation are resilient enough to withstand the failure. Others may simply add some latency to see if the delays trigger more delays that snowball and eventually bring the system to its knees.
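A toy model shows how a small injected delay can snowball. The sketch below assumes a client that waits for a fixed timeout and then retries; the function name, the millisecond figures and the retry policy are hypothetical and chosen only to make the arithmetic visible.

```python
def simulate_request(base_latency_ms, injected_ms, timeout_ms, max_retries):
    """Model one client request against a service with injected latency.

    Returns (elapsed_ms, attempts_made, succeeded).
    """
    elapsed = 0
    for attempt in range(1, max_retries + 2):  # first try plus the retries
        latency = base_latency_ms + injected_ms
        if latency <= timeout_ms:
            return elapsed + latency, attempt, True
        # The client gives up after the timeout and sends the request again,
        # adding load to a service that is already slow.
        elapsed += timeout_ms
    return elapsed, max_retries + 1, False

# Sweep the injected delay for a 20 ms service, 50 ms timeout and 3 retries.
for injected in (0, 20, 40, 100):
    print(injected, simulate_request(20, injected, timeout_ms=50, max_retries=3))
```

In this model, once the injected delay pushes the response past the timeout, every request turns into four requests, and the user waits 200 ms for a failure instead of 20 ms for an answer.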

Almost any resource such as RAM, hard disk space, or database connections is fair game for experimentation. Some tests cut off the resource altogether and others just severely restrict the resource to see how the software behaves when squeezed.

Security flaws are often revealed indirectly. Buffer overflow problems, for instance, are relatively easy for chaos tools to expose by injecting too many bytes into a channel. The tools may not actually break into the software, but they reveal where someone else might exploit this buffer overflow to inject malicious code.
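The probing loop behind this kind of test is simple. Python is memory-safe, so the sketch below cannot produce a real buffer overflow; it only illustrates the approach of feeding ever-longer inputs until something breaks. The `copy_into_fixed_buffer` target and its 16-byte buffer are hypothetical stand-ins for code that lacks a length check.

```python
def copy_into_fixed_buffer(data: bytes) -> bytes:
    """Hypothetical target: copies input into a 16-byte buffer, no length check."""
    buffer = bytearray(16)
    view = memoryview(buffer)
    # A memoryview cannot be resized, so oversized input raises ValueError here.
    view[:len(data)] = data
    return bytes(buffer)

def find_breaking_length(target, max_length=64):
    """Feed ever-longer inputs and report the first length that raises."""
    for length in range(max_length + 1):
        try:
            target(b"A" * length)
        except Exception as exc:
            return length, type(exc).__name__
    return None  # nothing broke within the lengths tried

print(find_breaking_length(copy_into_fixed_buffer))
```

In a language without bounds checking, the same input that raises an error here could instead overwrite adjacent memory, which is why the length at which behavior changes is worth recording.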

Fuzzing is also adept at revealing flaws in parsing logic. Sometimes programmers neglect to anticipate all the different ways that the parameters can be configured, leaving a potential backdoor. Bombarding the software with random and semi-structured inputs can trigger these failure modes before attackers find them.
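A bare-bones random fuzzer might look like the following. It is a sketch under stated assumptions: `parse_length_prefixed` is a made-up parser with a deliberately planted bug, and the fuzzer treats `ValueError` as the parser’s documented way of rejecting bad input, so anything else counts as a finding.

```python
import random

def parse_length_prefixed(data: bytes) -> bytes:
    """Hypothetical parser: the first byte gives the payload length."""
    length = data[0]  # planted bug: empty input is never considered
    payload = data[1:1 + length]
    if len(payload) != length:
        raise ValueError("truncated payload")
    return payload

def fuzz(target, runs=500, max_len=8, seed=0):
    """Call target with random byte strings and collect unexpected exceptions."""
    rng = random.Random(seed)
    failures = []
    for _ in range(runs):
        size = rng.randint(0, max_len)
        data = bytes(rng.randrange(256) for _ in range(size))
        try:
            target(data)
        except ValueError:
            pass  # documented rejection of malformed input
        except Exception as exc:
            failures.append((data, type(exc).__name__))
    return failures

failures = fuzz(parse_length_prefixed)
print(f"{len(failures)} unexpected failures, e.g. {failures[:1]}")
```

Real fuzzers add coverage feedback and input minimization, but the core loop of generating input, calling the target and sorting expected rejections from surprises is the same.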

The area has also grown more sophisticated. Some researchers moved beyond strictly random injection and built sophisticated fuzzing tools that would use knowledge of the software to guide the process using what they often called “white box” analysis. One technique called grammatical fuzzing would begin with a definition of the expected data structure and then use this grammar to generate test data before subverting the definition in hope of identifying a parsing flaw. Deeper strategies can systematically try to identify all possible execution paths in the code. Microsoft, for instance, built a tool called SAGE that it used to flag potential errors by looking at the potential branches and creating inputs that will test them all. 
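The grammar-driven approach can be illustrated with a toy generator. The grammar, the depth limit and the `mutate` step below are assumptions for the sake of the example and are not taken from SAGE or any other tool: valid arithmetic expressions are produced from the grammar, then one character is damaged to subvert the definition.

```python
import random

# Toy grammar for arithmetic expressions such as "(3+4)+7".
GRAMMAR = {
    "<expr>": [["<term>", "+", "<expr>"], ["<term>"]],
    "<term>": [["<digit>"], ["(", "<expr>", ")"]],
    "<digit>": [[d] for d in "0123456789"],
}

def generate(symbol, rng, depth=0, max_depth=6):
    """Expand a grammar symbol into a string of terminals."""
    if symbol not in GRAMMAR:
        return symbol  # terminal character
    options = GRAMMAR[symbol]
    if depth >= max_depth:
        # Past the depth limit, take the shortest expansion so recursion ends.
        options = [min(options, key=len)]
    expansion = rng.choice(options)
    return "".join(generate(s, rng, depth + 1, max_depth) for s in expansion)

def mutate(text, rng):
    """Break the grammar on purpose: delete, duplicate or replace one character."""
    if not text:
        return text
    i = rng.randrange(len(text))
    op = rng.choice(("delete", "duplicate", "replace"))
    if op == "delete":
        return text[:i] + text[i + 1:]
    if op == "duplicate":
        return text[:i] + text[i] + text[i:]
    return text[:i] + rng.choice("()+*x") + text[i + 1:]

rng = random.Random(42)
for _ in range(3):
    valid = generate("<expr>", rng)
    print(valid, "->", mutate(valid, rng))
```

Inputs that are almost valid tend to get past a parser’s early sanity checks and reach the deeper code where the subtle mistakes live, which is the appeal of starting from a grammar rather than pure noise.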

A challenge for any chaos engineer is detecting flaws that might be revealed during the extreme loads. While total shutdowns are usually easy to spot, it can be harder to see when less glaring failures lead to subtle security flaws. Many of the problems that might be uncovered might not compromise any data or access, but they could still reveal issues that should be addressed and fixed.

Chaos engineering tools

The area is rapidly going from a secret sauce deployed by smart DevSecOps teams to a regular part of the development cycle. The tools began as side projects and skunkworks experiments for engineers and are now growing into trusted parts of many CI/CD pipelines. Many of the tools remain open-source projects produced by other DevSecOps specialists and shared openly. Others are attracting commercial attention. A few dedicated companies are supplying proprietary tools to a marketplace that is expanding rapidly.

Some tools are also designed to operate more deeply inside of software stacks. For example, ChaosMachine, a research tool developed at the KTH Royal Institute of Technology in Sweden, injects false exceptions into byte code running on Java virtual machines. These stress the error-handling mechanisms that should be written into the code already.
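ChaosMachine works on JVM byte code, but the underlying idea of exception injection can be sketched in any language. The Python decorator below is a loose analogue written for illustration, not a port of the research tool; the `inject_exceptions` name, the 30% failure probability and the account lookup are all hypothetical.

```python
import functools
import random

class InjectedFault(Exception):
    """Marker exception raised by the chaos wrapper, never by real code."""

def inject_exceptions(probability, rng):
    """Decorator that makes a function fail at random before it runs."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng.random() < probability:
                raise InjectedFault(f"injected before {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator

rng = random.Random(7)

@inject_exceptions(probability=0.3, rng=rng)
def read_balance(account_id):
    """Stand-in for a dependency such as a database call."""
    return {"alice": 120}.get(account_id, 0)

def show_balance(account_id):
    """Code under test: does the caller cope when the dependency throws?"""
    try:
        return f"balance: {read_balance(account_id)}"
    except InjectedFault:
        return "balance unavailable, please retry"

results = [show_balance("alice") for _ in range(10)]
print(results)
```

If `show_balance` had no `except` clause, the injected fault would surface as a crash, which is exactly the kind of missing error handling this technique is meant to flush out.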

Dozens of good chaos engineering tools exist, and each is usually focused on a particular language or platform. Pythonfuzz, for instance, repeatedly calls functions with random data while watching for memory leaks, deadlocks or other failures. Google’s OSS-Fuzz works with multiple languages and projects, and the company uses it to test open-source contributions to its ecosystem like Chrome extensions.

Other tools are focused on particular platforms. Netflix was one of the pioneers in the area and it created a collection of tools to test its infrastructure, which relies heavily on Amazon Web Services. One of the earliest tools, Chaos Monkey, randomly reaches into machine clusters and terminates some to see if the total failure of one instance leads to any problems. Netflix has built similar tools that it calls the Simian Army. It includes Latency Monkey, Chaos Gorilla and Chaos Kong that can slow down networks or even shut down entire collections of machines.
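The core idea of random instance termination fits in a short simulation. The sketch below is not Netflix’s code and makes no cloud API calls: the `Cluster` class is a hypothetical in-memory stand-in for a group of instances, and `chaos_monkey` simply removes some of them at random.

```python
import random

class Cluster:
    """Hypothetical in-memory stand-in for a group of service instances."""

    def __init__(self, names):
        self.alive = set(names)

    def terminate(self, name):
        self.alive.discard(name)

    def handle_request(self):
        """Route a request to any surviving instance, or fail loudly."""
        if not self.alive:
            raise RuntimeError("no healthy instances")
        return sorted(self.alive)[0]

def chaos_monkey(cluster, rng, kill_count=1):
    """Terminate up to kill_count randomly chosen instances; return the victims."""
    candidates = sorted(cluster.alive)
    victims = rng.sample(candidates, k=min(kill_count, len(candidates)))
    for name in victims:
        cluster.terminate(name)
    return victims

cluster = Cluster(["web-1", "web-2", "web-3"])
victims = chaos_monkey(cluster, random.Random(3), kill_count=1)
print(f"terminated {victims}; request served by {cluster.handle_request()}")
```

The experiment passes if requests keep being served after the termination. Raising `kill_count` until `handle_request` fails shows how much redundancy the system really has.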

Most other platforms have tools focused on sowing trouble. Proofdock’s Chaos Platform, for instance, targets Azure. The Google Cloud Chaos Monkey is a version of the original rewritten to work with the GCP API. 

Some projects are designed to share ideas and code across many platforms. One open-source project called Litmus targets all the major clouds and machines running on-premises. The Litmus platform supports “ChaosExperiments” that can target software running in different clouds. Its ChaosEngine deploys the experiments and tracks the results.

Many tools are meant to reach across team boundaries and unite both developers testing code in early stages with DevOps teams that manage CI/CD pipelines and watch over software running in production. The ChaosToolkit, for instance, is an open-source project designed to integrate with any build pipeline to add more complex and chaotic challenges for the new code. It also relies on a large collection of drivers and plugins to work in many different clouds and on-premises installations.

The software development tool vendor Cavisson has a product called “NetHavoc” that injects faults called “havocs” that can expose software failures including security vulnerabilities. One havoc will corrupt packets and distort the DNS results. Others will starve the application of memory or disk space.

“You can kill a server, you can terminate an instance, you can teleport and distort messages on your message queues,” explains Mrigank Mishra, a product manager at Cavisson. “At the end of the day, it all boils down to the use case that the organization is looking at.”

Some tools dig deeper because the problems occur inside the code when input strays away from expectations. Cryptofuzz, for instance, works directly with some cryptography libraries responsible for encrypting SSL connections. It watches for security problems like crashes, memory leaks, buffer overflows and uninitialized variables.

Chaos engineering expanding

The area continues to grow. Many of the open-source options are expanding and spawning new versions filled with enhancements. Some companies are beginning to fit chaos-creating options into their development tools, testing suites and security audits. The area is expanding because it’s not defined in a concrete way. Indeed, it’s more a collection of techniques for doing the wrong thing at the wrong time.

On its face, chaos engineering is easy. Just flip switches and tweak data until something breaks. But the real art is looking for the right places to fiddle around and start adding noise. Chaos engineers aren’t so much engineers as devious jerks and malicious pranksters with a mean streak. They would normally be either ignored or shown the door in polite offices, but their ability to find the weak spots, the Achilles heels, requires embracing anti-social attitudes that deliberately undermine and subvert the assumptions the developers made when writing the first drafts. If a bit of bad behavior can reveal the flaws early in the process, before code goes into production, everyone wins.

Copyright © 2022 IDG Communications, Inc.
