Understanding the apparent drop in data breaches in 2010
In a previous post I conjectured that the apparent decrease of data breaches in 2010 was just due to the lack of really big breaches that year. A closer look at the data seems to suggest that this is true.
Because the size of data breaches follows a lognormal distribution, it's easy to get a rough idea of how surprised we should be by the lack of a really big breach in 2010.
The size of breaches has a (base 10) logmean of 3.4 and a logdeviation of 1.2. If we define a "really big breach" as being one that exposes 5 million or more records, then having a really big breach corresponds to seeing a standard normal z value of
(log 5 million – 3.4) / 1.2 = (6.7 – 3.4) / 1.2 = 2.75
or greater. From a normal distribution table we find the probability of z ≥ 2.75 is 0.003, so that the probability of a breach not being really big is 0.997.
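For readers who would rather compute than look up a table, here is a minimal Python sketch of the same step. It assumes only the figures quoted above (logmean 3.4, logdeviation 1.2, a 5 million record threshold); the names LOG_MEAN, LOG_SD, THRESHOLD and upper_tail are illustrative, not taken from any particular tool. It keeps the unrounded z value, so the tail probability differs very slightly from the table value.

```python
import math

# Figures quoted in the post (base-10 log of breach size).
LOG_MEAN = 3.4         # logmean
LOG_SD = 1.2           # logdeviation
THRESHOLD = 5_000_000  # records exposed in a "really big breach"


def upper_tail(z):
    """P(Z >= z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# Standardize the log of the threshold, then take the upper tail.
z = (math.log10(THRESHOLD) - LOG_MEAN) / LOG_SD
p_big = upper_tail(z)

print(f"z = {z:.2f}")
print(f"P(really big breach) = {p_big:.4f}")
print(f"P(not really big)    = {1 - p_big:.4f}")
```

Rounded to three decimal places, this agrees with the 0.003 and 0.997 used here.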
There are 324 breaches listed in the OSF's data breach database for 2010 for which the size is known. If we treat the breaches as independent, the probability that all of them ended up exposing fewer than 5 million records is
0.997^324 = 0.378
So there's about a 38 percent chance of seeing what we saw in 2010 – no really big breaches – just due to chance alone.
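The arithmetic is easy to check. The short Python sketch below assumes the breaches are independent and uses the rounded per-breach probability of 0.997 from the normal table; the variable names are illustrative.

```python
# Probability that a single breach is NOT really big (rounded table value).
P_NOT_BIG = 0.997

# Breaches in 2010 with a known size.
N_BREACHES = 324

# Assuming independence, the chance that none of them is really big
# is the per-breach probability raised to the number of breaches.
p_none_big = P_NOT_BIG ** N_BREACHES

# Complement: the chance of at least one really big breach in the year.
p_at_least_one = 1 - p_none_big

print(f"P(no really big breach)          = {p_none_big:.3f}")
print(f"P(at least one really big breach) = {p_at_least_one:.3f}")
```

Using the unrounded tail probability instead of 0.997 moves the result only in the third decimal place, so the conclusion does not depend on the rounding.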
That's a probability that's too big to ignore. The usual criterion for saying that an event is rare enough to be significant is that it happens less than 5 percent of the time, and by that definition, the lack of really big breaches in 2010 doesn't really look too surprising.
So it looks like we can probably explain the overall decrease in the number of records exposed by data breaches in 2010 as being caused just by good luck instead of by better security or compliance efforts. That's probably a better explanation than the one that Verizon came up with in their 2011 DBIR: "Our leading hypothesis is that the successful identification, prosecution, and incarceration of the perpetrators of many of the largest breaches in recent history is having a positive effect."