What can we learn from statistical analysis of data breaches? Luther Martin digs in.

One problem that every information security organization faces is how to accurately quantify the risks that it manages. In most cases, there is not enough information available to do this, but there is now enough known about data breaches to let us draw interesting conclusions, some of which may even have implications in other areas of information security.

### Measuring risk

The risk associated with an event is traditionally defined to be the average loss that the event causes. This is the basis for the annual loss expectancy (ALE) approach to risk management. Almost everyone in the field of information security learns about the ALE approach. It’s even part of the CISSP Common Body of Knowledge. On the other hand, almost nobody actually uses it.

The ALE approach tells us to calculate the risk of an event by multiplying the loss that the event causes when it happens by the probability of the event happening. For example, an event that causes $1 million in loss and has a 1 in 1,000 chance of happening each year represents $1,000 in risk per year, which we get by multiplying the $1 million by the probability of 1 in 1,000. *(Editor’s note: See Bruce Schneier’s analysis of ALE in Security ROI: Fact or Fiction?)*
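The arithmetic in that example can be written out as a minimal Python sketch. The function name `annual_loss_expectancy` is just an illustrative label, not part of any standard library or methodology document.

```python
def annual_loss_expectancy(loss_per_event: float, annual_probability: float) -> float:
    """Expected yearly loss: loss per occurrence times yearly probability."""
    if not 0.0 <= annual_probability <= 1.0:
        raise ValueError("annual_probability must be between 0 and 1")
    return loss_per_event * annual_probability


# The example from the text: a $1 million loss with a 1-in-1,000 yearly chance.
ale = annual_loss_expectancy(1_000_000, 1 / 1_000)
print(f"ALE: ${ale:,.0f} per year")
```

The hard part, as the next paragraph explains, is not the multiplication but finding trustworthy values for the two inputs.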

The biggest problem with using the ALE approach in information security is that we usually know neither the loss that a particular event causes nor the chances of the event happening. The chances are probably very good that any web server has an exploitable weakness that security researchers haven’t found yet, but what are the chances of one of these bugs being found in the next year? And if one is found, exactly how do you measure the damage caused by hackers exploiting it? It’s hard to get good answers to questions like these, so traditional risk-management approaches have found very limited uses in information security.

In the case of data breaches, however, there is now lots of data available. The Open Security Foundation (OSF) (http://datalossdb.org/) has done an excellent job of tracking down the details of over 2,000 data breaches, and their data shows some interesting patterns. Here’s a graph that shows the history of data breaches since January 1, 2006.

It’s not easy to see a pattern in this graph, but if we take the logarithm of the data, we find that it’s a very good match to data that you’d expect from a normal distribution, what’s sometimes called a “bell curve.” In particular, the logarithm of the size of data breaches is a very good match to data that has a normal distribution with a mean of 3.4 and a standard deviation of 1.2. This is shown in the following graph, which compares the observed sizes of data breaches with what the normal distribution predicts.

There is definitely a pattern in the number of records exposed by data breaches, and it follows a so-called lognormal distribution. It doesn’t follow the normal distribution, but its logarithm does.
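One way to run this kind of check yourself is sketched below. It does not use the OSF data: the breach sizes are synthetic, generated from the parameters quoted above, and the sketch assumes the logarithm is base 10, which the text does not state. With real data you would replace `breach_sizes` with the observed record counts.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# SYNTHETIC stand-in for breach sizes (records exposed); not real OSF data.
# ASSUMPTION: the logarithm in the text is base 10.
breach_sizes = [10 ** random.gauss(3.4, 1.2) for _ in range(2000)]

# Take logarithms and fit a normal distribution to them.
log_sizes = [math.log10(size) for size in breach_sizes]
mean_log = statistics.fmean(log_sizes)
sd_log = statistics.stdev(log_sizes)
fitted = statistics.NormalDist(mean_log, sd_log)

# Compare the observed fraction below each threshold with the fitted prediction.
for threshold in (2.0, 3.0, 4.0, 5.0):
    observed = sum(x <= threshold for x in log_sizes) / len(log_sizes)
    predicted = fitted.cdf(threshold)
    print(f"log10(size) <= {threshold}: observed {observed:.3f}, "
          f"normal predicts {predicted:.3f}")
```

If the observed and predicted fractions track each other closely across the range, the lognormal model is a reasonable description of the data.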

### Interpreting this information

Knowing that the size of data breaches follows a lognormal distribution tells us some of the things that we’d like to know, but not everything. This is much like what knowing the probabilities associated with flipping a coin tells us.

If we flip a coin 100 times, for example, we expect to get about 50 “heads,” although we can’t predict exactly which flips will come up “heads” or exactly how many of the flips will come up that way. We can also accurately estimate bounds for how many “heads” we’ll see. Knowing the probabilities lets us calculate the chance of getting more than 70 “heads” out of our 100 flips, for example.
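The coin-flip calculation can be done exactly with the binomial distribution. This is a small sketch; the helper name `prob_more_than` is made up for illustration.

```python
from math import comb


def prob_more_than(k: int, n: int, p: float = 0.5) -> float:
    """P(X > k) for X ~ Binomial(n, p), summed term by term."""
    return sum(
        comb(n, i) * p**i * (1 - p) ** (n - i)
        for i in range(k + 1, n + 1)
    )


p_tail = prob_more_than(70, 100)
print(f"P(more than 70 heads in 100 flips) = {p_tail:.2e}")
```

The result is tiny, which is the point: we can’t say what any single flip will do, but we can say with confidence that more than 70 “heads” would be a rare outcome.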

Similarly, knowing that the size of data breaches follows a lognormal distribution doesn’t let us predict when the next data breach will occur or how many records will be exposed when it happens, but it does let us predict what we’ll see over a period of weeks or months. It lets us predict how often we’ll see a data breach that exposes 1 million or more records, for example, or it lets us estimate the chances of at least one data breach happening in the next year that exposes 1 million records or more, and that’s very useful information to have.
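A rough sketch of that kind of estimate follows. It uses the mean of 3.4 and standard deviation of 1.2 quoted earlier, assumes the logarithm is base 10 (the text does not say), assumes breach sizes are independent of one another, and uses a purely hypothetical figure of 500 breaches per year.

```python
import math
from statistics import NormalDist

# Parameters quoted in the text. ASSUMPTION: the logarithm is base 10.
log_size = NormalDist(mu=3.4, sigma=1.2)

# P(a single breach exposes >= 1 million records) = P(log10(size) >= 6)
p_large = 1.0 - log_size.cdf(math.log10(1_000_000))

# HYPOTHETICAL yearly breach count, for illustration only.
breaches_per_year = 500

# Expected number of million-record breaches in a year.
expected_large = breaches_per_year * p_large

# Chance of at least one, ASSUMING breach sizes are independent.
p_at_least_one = 1.0 - (1.0 - p_large) ** breaches_per_year

print(f"P(single breach >= 1M records): {p_large:.4f}")
print(f"Expected such breaches per year: {expected_large:.1f}")
print(f"P(at least one in a year): {p_at_least_one:.4f}")
```

Changing `breaches_per_year` to match an observed breach rate is the only adjustment needed to reuse the sketch.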

The fact that the size of data breaches follows a lognormal distribution might even tell us something about the way in which breaches happen. We get a lognormal distribution when the value that we observe is the result of the multiplication of one or more values that have a random component. Because this is true, it’s natural to think that data breaches are caused by the failure of one or more security mechanisms, the effects of which multiply together to get the total effect of the breach.
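A quick simulation illustrates the multiplicative mechanism. The factor range and the number of factors are arbitrary choices made for illustration, not estimates of anything in real breaches. The sketch multiplies random factors together and compares the skewness of the raw products with the skewness of their logarithms.

```python
import math
import random
import statistics

random.seed(7)  # fixed seed so the sketch is repeatable


def product_of_factors(n_factors: int) -> float:
    """Multiply n independent positive random factors together."""
    value = 1.0
    for _ in range(n_factors):
        value *= random.uniform(0.5, 2.0)  # arbitrary illustrative range
    return value


def skewness(xs: list) -> float:
    """Sample skewness: roughly 0 for symmetric, bell-shaped data."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return statistics.fmean([((x - m) / s) ** 3 for x in xs])


samples = [product_of_factors(20) for _ in range(10_000)]
logs = [math.log10(v) for v in samples]

print(f"skewness of raw products:   {skewness(samples):.2f}")
print(f"skewness of log10 products: {skewness(logs):.2f}")
```

The raw products have a long right tail, while their logarithms are close to symmetric, which is the signature of a lognormal distribution.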

This is actually a reasonable model, but lognormal distributions also occur in many other situations, and some of these can’t really be interpreted this way. In particular, the following values all follow a lognormal distribution (Eckhard Limpert, Werner Stahel and Markus Abbt, “Log-normal Distributions across the Sciences: Keys and Clues,” BioScience, May 2001, Vol. 51, No. 5, pp. 341-352, available at http://stat.ethz.ch/~stahel/lognormal/bioscience.pdf), and in some of these cases it’s not clear that there’s some sort of multiplication of effects happening.

- The concentration of gold or uranium in ore deposits
- The latency period of bacterial food poisoning
- The age of the onset of Alzheimer’s disease
- The amount of air pollution in Los Angeles
- The abundance of fish species
- The size of ice crystals in ice cream
- The number of words spoken in a telephone conversation
- The length of sentences written by George Bernard Shaw or Gilbert K. Chesterton

So although thinking of data breaches as being caused by one or more security failures that multiply in effect may actually be a useful point of view, there are also cases where the lognormal distribution appears that don’t seem to have a meaningful connection to this model.

### Benford’s law

It turns out that the size of data breaches also follows Benford’s law (http://www.benfordonline.net/).

The initial digits of data that follows Benford’s law aren’t equally likely. Smaller initial digits are more likely than larger ones. Starting with a ‘1’ is the most common and happens about 30 percent of the time. Starting with a ‘9’ is the least common and happens about 5 percent of the time.

Not all data follows Benford’s law, but the size of data breaches does, as the following graph shows.

Data that comes from repeated multiplications tends to follow Benford’s law. To convince yourself that this is true, try tracking the growth of an investment with a 10 percent compound annual growth rate over a 30-year period. If you do this, you’ll see that the value of the investment follows Benford’s law. Because the size of data breaches also follows Benford’s law, this might lead us to believe that the idea that data breaches are caused by a series of security failures that multiply in effect is actually a reasonable one.
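The investment experiment is easy to try. The sketch below tallies the leading digits of a value growing at 10 percent per year and compares them with the shares Benford’s law predicts, log10(1 + 1/d) for leading digit d. The starting value of 1,000 and the function names are illustrative choices. With only 31 yearly values the match is rough, so the sketch also runs a much longer horizon, where the shares settle closer to the predicted ones.

```python
import math
from collections import Counter


def leading_digit(x: float) -> int:
    """First significant digit of a positive number."""
    return int(f"{x:.17e}"[0])


def benford_expected(d: int) -> float:
    """Share of values with leading digit d under Benford's law."""
    return math.log10(1 + 1 / d)


def leading_digit_shares(years: int, rate: float = 0.10, start: float = 1000.0) -> dict:
    """Observed leading-digit shares for yearly values of a compounding investment."""
    values = [start * (1 + rate) ** y for y in range(years + 1)]
    counts = Counter(leading_digit(v) for v in values)
    return {d: counts[d] / len(values) for d in range(1, 10)}


short_run = leading_digit_shares(30)
long_run = leading_digit_shares(1000)

print("digit  30 years  1000 years  Benford")
for d in range(1, 10):
    print(f"{d:>5}  {short_run[d]:>8.3f}  {long_run[d]:>10.3f}  {benford_expected(d):>7.3f}")
```

In both runs a leading ‘1’ is the most common digit and larger digits are rarer, which is the qualitative pattern the law describes.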

### Using this information

Now that we know that some aspects of data breaches are predictable, how can we use this information? One way is to measure the effectiveness of industry-wide efforts to reduce data breaches. We might see the mean of the logarithm of data breach sizes decrease from 3.4 to 3.0 over time, for example. If this happens, then we have evidence that we’re winning the fight against breaches. If this number increases, on the other hand, that’s evidence that we’re losing.
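Tracking that number takes very little code. In this sketch the breach records are invented purely for illustration, the base-10 logarithm is an assumption, and `mean_log_size_by_year` is a hypothetical helper name.

```python
import math
from collections import defaultdict
from statistics import fmean

# HYPOTHETICAL (year, records_exposed) pairs; not real breach data.
breaches = [
    (2006, 1_200), (2006, 45_000), (2006, 800),
    (2007, 3_000), (2007, 150), (2007, 9_500),
    (2008, 400), (2008, 2_000), (2008, 60),
]


def mean_log_size_by_year(records):
    """Mean of log10(breach size) for each year, in chronological order."""
    by_year = defaultdict(list)
    for year, size in records:
        by_year[year].append(math.log10(size))  # ASSUMPTION: base-10 logarithm
    return {year: fmean(logs) for year, logs in sorted(by_year.items())}


trend = mean_log_size_by_year(breaches)
for year, mean_log in trend.items():
    print(year, round(mean_log, 2))
```

A sustained downward drift in these yearly means would be the kind of evidence described above. With real data, each year would need enough breaches for the mean to be stable.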

The idea that it’s possible to model information security as a set of one or more mechanisms whose effects multiply together is also one that might have further applications. There’s not much data available for security incidents other than data breaches, but we shouldn’t be too surprised if we were to learn that the damage from other security incidents also follows a lognormal distribution. Maybe we’ll have enough data one day to see if that’s actually true.

What we can learn from the information that’s available about data breaches doesn’t quite get us to the point where we can use traditional risk management methodologies, but it gets us closer than we’ve ever been in the past. Don’t be surprised if we learn enough about other security incidents in the next few years to make the ALE approach useful in the field of information security. Understanding the patterns in data breaches is a good first step in that direction.

*Luther Martin is author of Introduction to Identity-Based Encryption (Information Security and Privacy Series); the IETF standards on IBE algorithms and their use in encrypted e-mail; and numerous reports and articles on varied information security and risk management topics. He is interested in pairing-based cryptography, and the business applications of information security and risk management. Martin is currently chief security architect at Palo Alto, CA-based Voltage Security, Inc. (www.voltage.com).*