Data Breaches: Patterns and Their Implications
What can we learn from statistical analysis of data breaches? Luther Martin digs in.
By Luther Martin, Voltage Security
September 08, 2009 — CSO — One problem that every information security organization faces is how to accurately quantify the risks that they manage. In most cases, there is not enough information available to do this, but there is now enough known about data breaches to let us draw interesting conclusions, some of which may even have implications in other areas of information security.
Measuring risk
The risk associated with an event is traditionally defined to be the average loss that the event causes. This is the basis for the annual loss expectancy (ALE) approach to risk management. Almost everyone in the field of information security learns about the ALE approach. It's even part of the CISSP Common Body of Knowledge. On the other hand, almost nobody actually uses it.
The ALE approach tells us to calculate the risk of an event by multiplying the loss that you incur when the event happens by the probability of the event happening. For example, an event that causes $1 million in loss and has a 1 in 1,000 chance of happening each year represents $1,000 in risk per year, which we get by multiplying the $1 million by the probability of 1 in 1,000. (Editor's note: See Bruce Schneier's analysis of ALE in Security ROI: Fact or Fiction?)
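The arithmetic above can be written as a very small sketch. The function name `annual_loss_expectancy` and its parameter names are made up for illustration; only the figures ($1 million, 1 in 1,000) come from the example in the text.

```python
def annual_loss_expectancy(loss_per_event: float, annual_probability: float) -> float:
    """Return the expected yearly loss: loss per event times yearly probability.

    Both names are illustrative; the formula is the ALE product described above.
    """
    if not 0.0 <= annual_probability <= 1.0:
        raise ValueError("annual_probability must be between 0 and 1")
    return loss_per_event * annual_probability


# The example from the text: a $1 million loss with a 1 in 1,000 yearly chance.
ale = annual_loss_expectancy(1_000_000, 1 / 1_000)
print(f"${ale:,.0f} per year")  # prints $1,000 per year
```

The calculation itself is trivial. The hard part, as the next paragraph explains, is finding trustworthy values for the two inputs.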
The biggest problem with using the ALE approach in information security is that we usually know neither the loss that a particular event causes nor the chances of the event happening. The chances are probably very good that any web server has an exploitable weakness that security researchers haven't found yet, but what are the chances of one of these bugs being found in the next year? And if one is found, exactly how do you measure the damage caused by hackers exploiting it? It's hard to get good answers to questions like these, so traditional risk-management approaches have found very limited uses in information security.
In the case of data breaches, however, there is now lots of data available. The Open Security Foundation (OSF) (http://datalossdb.org/) has done an excellent job of tracking down the details of over 2,000 data breaches, and their data shows some interesting patterns. Here's a graph that shows the history of data breaches since January 1, 2006.
It's not easy to see a pattern in this graph, but if we take the logarithm of the data, we find that it's a very good match to data that you'd expect from a normal distribution, what's sometimes called a "bell curve." In particular, the logarithm of the size of data breaches is a very good match to data that has a normal distribution with a mean of 3.4 and a standard deviation of 1.2. This is shown in the following graph, which compares the observed sizes of data breaches with what the normal distribution predicts.
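As a rough sketch of the procedure described here, the following takes a list of breach sizes, takes the logarithm of each, and estimates the mean and standard deviation of the result. Two things are assumptions rather than facts from the article: the logarithm is taken to be base 10 (the article does not state the base), and the sample sizes are hypothetical values made up for illustration, not OSF data. The function name `fit_log_sizes` is likewise invented.

```python
import math
from statistics import NormalDist


def fit_log_sizes(sizes: list[float]) -> NormalDist:
    """Fit a normal distribution to log10 of the breach sizes.

    ASSUMPTION: base-10 logarithm; the article does not say which base it uses.
    """
    log_sizes = [math.log10(s) for s in sizes]
    # from_samples estimates the mean and the sample standard deviation.
    return NormalDist.from_samples(log_sizes)


# HYPOTHETICAL record counts, invented for illustration only (not real breach data).
sample_sizes = [120, 850, 2_300, 4_000, 15_000, 72_000, 1_300_000]

fitted = fit_log_sizes(sample_sizes)
print(f"mean of log10(size):    {fitted.mean:.2f}")
print(f"std dev of log10(size): {fitted.stdev:.2f}")
```

Run against the real breach records, a fit like this would be the source of figures such as the mean of 3.4 and standard deviation of 1.2 quoted above, if the base-10 assumption holds.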
There is definitely a pattern in the number of records exposed by data breaches, and it follows a so-called lognormal distribution. The number of records itself doesn't follow the normal distribution, but its logarithm does.
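To make the lognormal model concrete, here is a minimal sketch that uses the quoted parameters (mean 3.4, standard deviation 1.2) to estimate how likely a breach is to exceed a given number of records. It assumes the logarithm is base 10, which the article does not state. The names `LOG10_SIZE`, `prob_breach_exceeds`, and `median_breach_size` are invented for illustration.

```python
import math
from statistics import NormalDist

# Parameters quoted in the article for the logarithm of breach size.
# ASSUMPTION: the logarithm is base 10; the article does not state the base.
LOG10_SIZE = NormalDist(mu=3.4, sigma=1.2)


def prob_breach_exceeds(records: float) -> float:
    """Probability that a breach exposes more than `records` records."""
    if records <= 0:
        return 1.0  # every breach exposes more than zero records
    return 1.0 - LOG10_SIZE.cdf(math.log10(records))


def median_breach_size() -> float:
    """Median breach size: the value whose log10 equals the mean of the fit."""
    return 10 ** LOG10_SIZE.mean


for n in (1_000, 100_000, 10_000_000):
    print(f"P(size > {n:,} records) = {prob_breach_exceeds(n):.4f}")
```

Under the base-10 assumption, the median breach would be 10^3.4, or roughly 2,500 records. Half of all breaches would be smaller than that, while a small fraction would be several orders of magnitude larger.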
Interpreting this information
Knowing that the size of data breaches follows a lognormal distribution tells us some of the things that we'd like to know, but not everything. This is much like what knowing the probabilities associated with flipping a coin tells us.