Finding meaningless trends

There's lots of data available on the data breaches that have happened over the past few years. Let's look at some of it. And when we do, let's pretend that we either never took a statistics class in school or that we've forgotten everything we learned in one.

One thing that we might notice is that the average size of a breach has increased over the past three years. The size of breaches has roughly a lognormal distribution, so let's plot the logmean (the mean of the logarithm of the breach size) for each of the past three years. Here's what that looks like.

That certainly looks like a trend, doesn't it? And since we're pretending that we've forgotten how to look at data carefully, we're going to ignore the fact that a more careful analysis shows that there's actually no statistically significant difference between the logmeans for those three years. Instead, let's use those three points to predict what we'll see in the future.

The logmeans from 2008, 2009 and 2010 actually fit a straight line very well. The model

logmean(year) = -198.157 + 0.100267 * year

fits the data with a correlation coefficient of R = 0.9471, or R² = 0.8970. And because the fit is so good, let's use that model to predict what we'll see in the future.

Here's what it predicts for how the average size of breaches will grow through the year 2020.


Holy cow! The size of the average breach will get bigger by a factor of almost 20! So the breaches that we hear about that expose 1 million records could be breaches that expose 20 million records by then. And the rarer breaches that expose 100 million records or more could be exposing billions of records in not too many years.
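For anyone who wants to play along with the extrapolation, here is a minimal Python sketch that evaluates the fitted line and turns a change in logmean into a growth factor. The coefficients are the ones from the model above. The function names and the base of the logarithm are assumptions: the post doesn't say which base the logmean uses, so base 10 is only a guess, and the factor you get depends on both the base and the year you start from.

```python
# Coefficients copied from the fitted model in the post.
INTERCEPT = -198.157
SLOPE = 0.100267


def logmean(year):
    """Fitted logmean of breach size for a given year."""
    return INTERCEPT + SLOPE * year


def growth_factor(start_year, end_year, base=10.0):
    """Ratio of predicted typical breach sizes between two years.

    ASSUMPTION: `base` is the base of the logarithm behind the logmean.
    The post does not state it, so base 10 is only a guess.
    """
    return base ** (logmean(end_year) - logmean(start_year))


if __name__ == "__main__":
    # Compare two possible baselines for the extrapolation to 2020.
    for start in (2008, 2010):
        print(start, "->", 2020, round(growth_factor(start, 2020), 1))
```

Because the model is a straight line in the logarithm, every extra year multiplies the predicted size by the same constant, which is why the forecast balloons so quickly.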

This interpretation makes absolutely no sense at all, of course. Although the size of a typical data breach did increase from 2008 to 2010, the increase wasn't statistically significant. That means that extrapolating what we'll see in the future from the trend that those three years suggest is totally meaningless. Don't forget this the next time that you see someone claiming that there are trends in data. An increase and a statistically significant increase aren't the same thing.
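One quick way to see how little a good-looking fit through three points tells you is to turn the correlation coefficient into a p-value. The sketch below is not the careful analysis mentioned earlier, which compared the logmeans themselves. It is a different and simpler check: the standard t-test for a correlation, under the usual regression assumptions, applied to the reported R = 0.9471. With three points the test has only one degree of freedom, and a t distribution with one degree of freedom is the Cauchy distribution, so the p-value has a closed form. The function name is made up for this illustration.

```python
import math


def slope_p_value_three_points(r):
    """Two-sided p-value for "no linear trend", given r from n = 3 points.

    The usual statistic t = r * sqrt(n - 2) / sqrt(1 - r^2) has n - 2 = 1
    degree of freedom here. A t distribution with 1 degree of freedom is
    the standard Cauchy distribution, whose CDF is 0.5 + atan(t) / pi.
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must be strictly between -1 and 1")
    t = abs(r) / math.sqrt(1.0 - r * r)
    return 1.0 - 2.0 * math.atan(t) / math.pi


if __name__ == "__main__":
    r = 0.9471  # correlation coefficient reported in the post
    print(f"p = {slope_p_value_three_points(r):.2f}")
```

For R = 0.9471 this comes out at roughly 0.21, nowhere near the conventional 0.05 threshold. Three points that line up this well by chance alone aren't unusual at all.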
