Data breaches have a lognormal distribution

In many parts of information security, there's very little reliable data that you can use to help you make decisions. For data breaches, however, there's a fair amount of data available, and you can get a database of almost 2,000 data breaches from the Open Security Foundation at their web site datalossdb.org. There's lots of information in this database, and it shows some interesting patterns. In particular, here's what you see if you plot the number of records ("TotalAffected" in the OSF database) compromised by each breach from 2006 to the present. It's hard to see a pattern in this data.

Figure 1: Records compromised per breach ("TotalAffected"), 2006 to present

On the other hand, if we make a histogram of the logarithm of the data, we get the following graph. That certainly looks like data from a normal distribution, which means that the size of data breaches follows a lognormal distribution fairly closely; the fitted lognormal model is the blue line in this graph. That's a fairly good fit, isn't it?

Figure 2: Histogram of the logarithm of breach sizes, with the fitted normal curve in blue
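
Here's a minimal sketch of that fit, for anyone who wants to reproduce it. The file name datalossdb.csv and the assumption that breach sizes sit in a TotalAffected column are illustrative; the actual export format from datalossdb.org may differ.

```python
# A minimal sketch of the Figure 2 analysis. The file name and the
# "TotalAffected" column layout are assumptions, not the real export format.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

data = np.genfromtxt("datalossdb.csv", delimiter=",", names=True)
sizes = data["TotalAffected"]
sizes = sizes[sizes > 0]          # drop breaches where the count is unknown

log_sizes = np.log10(sizes)

# If log10 of the breach sizes is normal, the sizes are lognormal.
mu, sigma = stats.norm.fit(log_sizes)
x = np.linspace(log_sizes.min(), log_sizes.max(), 200)

plt.hist(log_sizes, bins=30, density=True, alpha=0.6)
plt.plot(x, stats.norm.pdf(x, mu, sigma), "b-")   # the blue curve in Figure 2
plt.xlabel("log10(TotalAffected)")
plt.ylabel("Density")
plt.show()
```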

  • Patrick Florer

    @Luther
    Approximately 25-30% of the breaches detailed in dataloss.db have null in the TotalAffected field, including the one for Heartland Payment Systems.
    So, the best you can say is that, based upon the 70-75% of breaches where the total number of records has been captured, the data can be graphed as you have shown.
    Given the poor quality of the data we have to work with, the number of breaches that we don’t know about, the number of breaches for which records lost is not known or not reported, and the large skewing effect that the Heartland event could have, about all you can say is that it’s a pretty graph.
    It’s not obvious to me what other conclusions you can draw.
    I suppose that you could assume that the missing data conform to this same pattern, and that there is an underlying dynamic at work, but there is no particular reason to believe that instead of anything else.
    A more useful idea, to my mind, is the spread between the mean and median number of records breached.
    Both the Verizon 2009 DBIR and the dataloss.db data show an approximate 100x difference between the two: Verizon, 5.5 million vs. 38,000; dataloss, 333,400 vs. 4,300 (through 2/6/09). If you assume a big number for Heartland, then the dataloss numbers will move even closer to the ones from Verizon.
    The takeaway from this comparison is that, while there are a few very large breaches, they are outliers; the majority of breaches are relatively small, though painful, I am sure. (A quick check of this mean/median spread is sketched after this comment.)
    Patrick Florer
    Dallas, Texas
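
The mean/median spread Patrick describes is easy to check directly. A minimal sketch against the same assumed CSV export as above; the exact numbers will depend on the snapshot of the database.

```python
# Compare mean and median breach size; a mean far above the median is the
# signature of heavy right skew. Uses the same assumed CSV export as above.
import numpy as np

data = np.genfromtxt("datalossdb.csv", delimiter=",", names=True)
sizes = data["TotalAffected"]
sizes = sizes[sizes > 0]

mean, median = np.mean(sizes), np.median(sizes)
print(f"mean:   {mean:,.0f}")
print(f"median: {median:,.0f}")
print(f"spread: {mean / median:.0f}x")
```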


  • Luther Martin

    While it’s true that many of the records in the OSF database have a 0 for TotalAffected, indicating that the number of records exposed hasn’t yet been determined, there’s still enough data to get statistically significant results. You can use tests like the Shapiro-Wilk test or the Kolmogorov-Smirnov test on this data and show that it really does follow a lognormal model (sketched after this comment). That’s what I think is interesting (if not actually significant).
    In what other part of information security can you find an accurate model of incidents? I don’t know of any other place where you can do this.
    A lognormal distribution is often interpreted as the result of multiplying together many independent positive random factors: the log of such a product is a sum, which tends toward a normal distribution. If we use this interpretation, we might end up thinking that breaches are caused by one or more controls failing, with the bigger breaches being caused by the failure of multiple controls (a toy simulation of this appears below). That might give you an insight into how to prevent large breaches.
    You also might be able to use the fact that the data fits a particular distribution to measure how effective industry-wide efforts to reduce breaches are. If we’re doing our job, we should be able to see this reflected in something like a statistically significant reduction in the logmean over time (also sketched below), for example.
    I also find the symmetry of the distribution interesting. For each breach that’s unusually large, there’s one that’s correspondingly small on the log scale. So for each Heartland there’s also a lost laptop that exposes 1 or 2 records. I wouldn’t have expected that before looking at the data more closely.
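
The normality tests Luther mentions can be run directly on the log of the data. A minimal sketch, again assuming the CSV export described above; note that since the normal's parameters are estimated from the same sample, the raw Kolmogorov-Smirnov p-value is optimistic, and a Lilliefors-style correction would be more rigorous.

```python
# Normality tests on log10 of the breach sizes: if the logs look normal,
# the sizes are consistent with a lognormal model. Same assumed CSV as above.
import numpy as np
from scipy import stats

data = np.genfromtxt("datalossdb.csv", delimiter=",", names=True)
log_sizes = np.log10(data["TotalAffected"][data["TotalAffected"] > 0])

# Shapiro-Wilk tests normality directly.
w, p_sw = stats.shapiro(log_sizes)
print(f"Shapiro-Wilk:       W = {w:.4f}, p = {p_sw:.4f}")

# Kolmogorov-Smirnov against a normal fitted to the same data. Estimating
# mu and sigma from the sample makes this p-value optimistic.
mu, sigma = log_sizes.mean(), log_sizes.std(ddof=1)
d, p_ks = stats.kstest(log_sizes, "norm", args=(mu, sigma))
print(f"Kolmogorov-Smirnov: D = {d:.4f}, p = {p_ks:.4f}")
```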
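The multiplicative interpretation is easy to demonstrate with a toy simulation: multiply together a handful of independent positive factors, and the heavy skew shows up in the raw products but largely disappears on the log scale. The factor distribution below is an arbitrary choice made purely for illustration.

```python
# Toy simulation of the multiplicative interpretation: treat each breach
# size as the product of several independent positive "control failure"
# factors. The log of a product is a sum, so the log tends toward a
# normal distribution and the product toward a lognormal.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_breaches, n_factors = 10_000, 8

factors = rng.uniform(1.0, 100.0, size=(n_breaches, n_factors))
simulated_sizes = factors.prod(axis=1)

print(f"skew of sizes:        {stats.skew(simulated_sizes):7.2f}")           # heavily right-skewed
print(f"skew of log10(sizes): {stats.skew(np.log10(simulated_sizes)):7.2f}")  # much closer to zero
```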
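And the suggested trend measurement amounts to a two-sample comparison of logmeans. A sketch under the assumption that the export also carries a year-of-breach column; the column name Year and the 2007/2008 split are illustrative choices, not part of the original analysis.

```python
# Compare the mean of log10(TotalAffected) across two periods with
# Welch's t-test. "Year" is an assumed column name; the split point is
# arbitrary, chosen only to illustrate the comparison.
import numpy as np
from scipy import stats

data = np.genfromtxt("datalossdb.csv", delimiter=",", names=True)
mask = data["TotalAffected"] > 0
years = data["Year"][mask]
log_sizes = np.log10(data["TotalAffected"][mask])

early, late = log_sizes[years <= 2007], log_sizes[years >= 2008]
t, p = stats.ttest_ind(early, late, equal_var=False)
print(f"logmean {early.mean():.2f} -> {late.mean():.2f}  (t = {t:.2f}, p = {p:.4f})")
```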


  • Patrick Florer

    Dear Luther,
    I am not questioning the fact that the non-null data produces a lognormal graph; nor that this non-null dataset graphs in a way that is non-random at some level of statistical significance.
    What I am questioning is the underlying reality of the situation, given the large number of breaches where the number of records lost is unknown, unreported, or yet to be determined.
    I would be interested to see three other variations of the lognormal graph – with estimates of the records breached at Heartland – say 1 million, 50 million, and 250 million. If each of these three graphs lined up with the one you have presented, then I think the case would be much stronger.
    The new Verizon DBIR reports 285 million records breached in 2008. Even though they don’t say it, based upon what we do know about other breaches, it’s very likely that the Verizon total includes a very large number for Heartland.
    The dataloss.db database is a very useful source of data – what worries me is that graphics like this are often accepted as gospel truth without further critical analysis.
    I enjoy your blog!
    Thanks,
    Patrick


  • Luther Martin

    Patrick,
    If we assume that the Heartland breach exposed a very large number of records, it actually makes the data fit the lognormal model even better.
    Let’s assume that the Heartland breach exposed 50 million records. That would give us a single additional data point out at 7.7 on the Log10_TotalAffected axis. If we up that estimate to 250 million, that gives us a single data point at 8.4 on this axis.
    The reason that this actually makes the fit better is that the upper tail of the data doesn’t fit the model quite as precisely as the lower tail does, and adding a few points on the upper end improves the fit. If you’re really interested, I can post a graph that compares the CDF of the model to the CDF of the data to show this (a sketch of that comparison follows this comment).
    Luthe
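
The what-if Patrick asked for and the CDF comparison Luther offers can both be sketched in a few lines: append a hypothetical Heartland figure, refit, and overlay the empirical CDF on the model's CDF. The three estimates are the ones from the thread; the CSV export is assumed as before.

```python
# Refit the lognormal model with a hypothetical Heartland data point
# appended, and compare the empirical CDF to the fitted model's CDF.
# The 1M / 50M / 250M estimates come from the discussion above.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

data = np.genfromtxt("datalossdb.csv", delimiter=",", names=True)
sizes = data["TotalAffected"]
sizes = sizes[sizes > 0]

for heartland in (1e6, 50e6, 250e6):
    log_sizes = np.sort(np.log10(np.append(sizes, heartland)))
    mu, sigma = stats.norm.fit(log_sizes)
    empirical_cdf = np.arange(1, len(log_sizes) + 1) / len(log_sizes)
    plt.step(log_sizes, empirical_cdf, where="post",
             label=f"data + Heartland at {heartland:.0e}")
    plt.plot(log_sizes, stats.norm.cdf(log_sizes, mu, sigma), "--")

plt.xlabel("log10(TotalAffected)")
plt.ylabel("CDF")
plt.legend()
plt.show()
```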


  • Patrick Florer

    Luthe,
    (is this how you wish to be addressed?)
    Thank you – I would like to see the other graphs, and think that others might as well. Then again, I guess that they wouldn’t look that much different.
    Doing a what-if like this corroborates your initial analysis, and increases my confidence that this distribution could actually mean something.
    Patrick



