Problems with the Ponemon data breach studies

The Ponemon data breach studies are one of the few sources of information that we have about data breaches, but their results may either overestimate or underestimate the true cost of a data breach, because the breaches examined in these studies aren't really representative of all breaches.

As I've noted before, the size of data breaches follows a lognormal distribution fairly closely. Historically this distribution has had a logmean (base 10) of about 3.4 and a log standard deviation (base 10) of about 1.2. In other words, the base-10 logarithm of the breach size follows a normal distribution or "bell curve."
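As a rough illustration, here's a minimal sketch (in Python, assuming the historical parameters above) of what that model implies; the sample size and seed are arbitrary:

```python
from statistics import NormalDist, fmean, median

# Assumed historical parameters (base 10): logmean ~3.4, log standard deviation ~1.2
log10_size = NormalDist(mu=3.4, sigma=1.2)

# Simulate breach sizes: log10(size) is normally distributed, so size = 10 ** log10(size)
sizes = [10 ** x for x in log10_size.samples(100_000, seed=1)]

print(f"median breach size: {median(sizes):,.0f} records")  # about 10**3.4, roughly 2,500
print(f"mean breach size:   {fmean(sizes):,.0f} records")   # pulled far above the median by the long tail
```

The big gap between the median and the mean is exactly what you'd expect from a lognormal distribution: most breaches are small, but a few huge ones dominate the totals.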

But the breaches that the Ponemon studies examine don't seem to be representative of all breaches. The 2010 report (PDF), for example, looked at US breaches that exposed between 5,010 and 101,000 records. Here's what we get when we graph that range of (the log of) breach sizes:

[Figure: the 2010 Ponemon report's range of breach sizes plotted against the normal curve of the log of breach sizes]

So it certainly looks like that range of breaches isn't really representative of all breaches. It only includes breaches that are above average in size but not too large, and that range covers only about 31 percent of all breaches.
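That 31 percent figure is easy to check against the model. Here's a minimal sketch, again assuming a base-10 logmean of 3.4 and a log standard deviation of 1.2:

```python
from math import log10
from statistics import NormalDist

# Assumed model parameters (base-10 logs)
log10_size = NormalDist(mu=3.4, sigma=1.2)

# Range of breach sizes covered by the 2010 Ponemon report
low, high = 5_010, 101_000

# P(low <= size <= high) when log10(size) is normally distributed
p = log10_size.cdf(log10(high)) - log10_size.cdf(log10(low))
print(f"fraction of breaches in the Ponemon range: {p:.1%}")  # roughly 31%
```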

The Ponemon reports claim that they're carefully tailored to be representative of companies that suffer data breaches. As their 2010 U.S. Cost of a Data Breach report said,

This benchmark study examines data breach costs resulting in the loss or theft of protected personal data. As a benchmark study, Cost of a Data Breach differs greatly from the standard survey study, which typically requires hundreds of respondents for the findings to be statistically valid. Benchmark studies are valid because the sample is designed to represent the population studied. They intentionally limit the number of organizations participating and involve an entirely different data-gathering process.

A more representative sample would also include companies that suffered breaches both much larger and much smaller than those covered by the 2010 report. Because those breaches weren't considered in this report, there's a good chance that it either overestimates or underestimates the true cost of data breaches. Maybe we'll find out which one in a future report.

  • Patrick Florer

    Hello, Luther –
    Nice observation.
    I assume that you are using datalossdb as the base dataset for your discussion, right?
    If so, I would agree that it’s the most comprehensive dataset we have to look at.
    But does that prove that the population of all breaches follows a lognormal distribution just because datalossdb does?
    I don’t have the answer – but would be interested to hear your thoughts.
    Patrick Florer

  • Luther Martin

    Yes, that’s based on the datalossdb data.
    There are a couple of reasons why I believe that the population of all breaches follows a lognormal distribution, and not just the datalossdb sample.
    First, the fit to a lognormal distribution is very good for a fairly large sample. That’s not proof, of course, but it should be fairly convincing.
    And there seem to be reasons to believe that we should see that distribution for ANY security loss. A non-rigorous explanation is roughly that the success experienced by hackers multiplies as they penetrate different layers of security instead of adding. If their success added then we’d expect to see a normal distribution in overall losses, but because it multiplies we should expect to see a lognormal one (there’s a short simulation sketch after the comments that illustrates this).

  • Patrick Florer

    Thank you –
    Interesting proposition – I will chew on it.
    Here is additional corroboration for what you show in your post.
    I download datalossdb from time to time and do a bit of my own analysis – most recently through 9/30/2011.
    As of that date, there were 4,253 records in the database, of which only 2,892 reported a non-null number of records lost (Total Affected), totaling ~788M records.
    If you construct a simple percentile table, you will see that the median number of records lost is 2,000. As your graphic shows, the Ponemon range (5K – 100K) falls roughly between the 60th and 90th percentiles. The mean number of records stolen is ~272K and falls between the 90th and 95th percentiles.
    In fairness to Dr. Ponemon, I don’t think that he ever said that his sample was representative – just that it was what it was.
    BTW – I really appreciate the work that you do here – people complain that there isn’t enough data – I complain that too few people take the time and trouble to understand the data that we have.

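Here's a short simulation sketch of the multiplicative argument from the comments above: each simulated breach compounds a number of independent random "success factors," one per layer of security penetrated. The number of layers and the distribution of the factors below are arbitrary assumptions chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

n_breaches = 100_000
n_layers = 12  # assumed number of security layers; illustrative only

# Each layer contributes an independent random "success factor" (illustrative choice of distribution)
factors = rng.uniform(0.5, 3.0, size=(n_breaches, n_layers))

multiplied = factors.prod(axis=1)  # successes multiply across layers
added = factors.sum(axis=1)        # successes add across layers

def skew(x):
    """Sample skewness: near zero for a symmetric bell curve, large for a heavy right tail."""
    z = (x - x.mean()) / x.std()
    return float((z ** 3).mean())

print(f"skewness of raw products:    {skew(multiplied):7.2f}")            # large: heavily right-skewed
print(f"skewness of log10(products): {skew(np.log10(multiplied)):7.2f}")  # close to zero: roughly bell-shaped
print(f"skewness of sums:            {skew(added):7.2f}")                 # close to zero: roughly bell-shaped
```

The point is just the central limit theorem applied in log space: the log of a product of many independent positive factors is a sum of their logs, which is approximately normal, so the product itself is approximately lognormal.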