Data Breaches – Lognormal Distribution?
As I have noted before, the sizes of data breaches seem to follow a lognormal distribution closely. If you spend a while collecting what is available about last year's breaches and start writing R programs to analyze it, though, you soon find that the fit no longer looks as good.
Plotting the data suggests a reasonable explanation: most of the data fits a lognormal model fairly well, but there is very little information about very small breaches. This is not too surprising. When data breaches were less common than they are now, people may have been more diligent about reporting every breach, even one that exposed only a few records. Now it is probably the case that nobody really cares about micro-breaches, so they essentially go unreported.
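To make the under-reporting argument concrete, here is a minimal sketch in Python (standard library only, rather than the R I used for the real analysis). It draws synthetic breach sizes from a lognormal distribution, drops everything below a reporting cutoff, and refits. The parameters 8.0 and 3.0, the cutoff of 100 records, and the helper name fit_lognormal are all made up for illustration; none of this is the 2013 data.

```python
import math
import random
import statistics


def fit_lognormal(sizes):
    """Estimate (mu, sigma) as the mean and standard deviation of log(size)."""
    logs = [math.log(s) for s in sizes]
    return statistics.fmean(logs), statistics.stdev(logs)


# Synthetic breach sizes, NOT real data: mu=8.0 and sigma=3.0 are arbitrary.
rng = random.Random(2013)
all_breaches = [rng.lognormvariate(8.0, 3.0) for _ in range(5000)]

# Hypothetical cutoff: assume breaches below 100 records go unreported.
REPORTING_CUTOFF = 100
reported = [s for s in all_breaches if s >= REPORTING_CUTOFF]

mu_all, sigma_all = fit_lognormal(all_breaches)
mu_rep, sigma_rep = fit_lognormal(reported)

print(f"all breaches:  n={len(all_breaches)}, mu={mu_all:.2f}, sigma={sigma_all:.2f}")
print(f"reported only: n={len(reported)}, mu={mu_rep:.2f}, sigma={sigma_rep:.2f}")
```

Dropping the small breaches pushes the fitted mean of the logs up and the spread down, so a naive fit to the reported data can look wrong even when the underlying sizes are still lognormal.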
In any case, here is a graph that summarizes the 2013 breaches that I could easily find information about. The grey bars represent the actual breaches, and the blue line shows a lognormal model for the data. The two agree fairly well except at the very small end, where I could not find any reported breaches. So I would guess that the lognormal model is still fairly accurate today, and that the gap at the small end reflects under-reporting rather than a change in the distribution.
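A comparison like the one in the graph can be sketched as a table of observed counts per power of ten next to the counts a fitted lognormal model predicts. This is only an illustration in Python's standard library, again on synthetic sizes with arbitrary parameters; the binning by decade and the name expected_count are my assumptions here, not a description of how the actual graph was produced.

```python
import math
import random
from collections import Counter
from statistics import NormalDist, fmean, stdev

# Synthetic breach sizes, NOT real data: the parameters are arbitrary.
rng = random.Random(7)
sizes = [rng.lognormvariate(8.0, 3.0) for _ in range(2000)]

# A lognormal size means log10(size) is normal, so fit a normal to the logs.
logs10 = [math.log10(s) for s in sizes]
model = NormalDist(fmean(logs10), stdev(logs10))

# Observed counts per decade: bin d holds sizes in [10^d, 10^(d+1)).
observed = Counter(math.floor(x) for x in logs10)


def expected_count(decade):
    """Number of breaches the fitted model predicts in one decade bin."""
    return len(sizes) * (model.cdf(decade + 1) - model.cdf(decade))


for decade in sorted(observed):
    bar = "#" * round(observed[decade] / 10)  # crude text version of the bars
    print(f"10^{decade:>2}: observed={observed[decade]:4d} "
          f"model={expected_count(decade):7.1f} {bar}")
```

With real data that is missing small breaches, the lowest bins would show observed counts well under what the model predicts, which is the pattern described above.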