Benford’s law and the size of data breaches
In a previous post we noted how the size of data breaches seems to follow a lognormal distribution. A quantity is lognormally distributed when its logarithm, rather than the quantity itself, follows a normal distribution. It turns out that there's some more interesting structure in the size of data breaches: the sizes also appear to follow Benford's law.
Benford's law says that if you look at the first digit of each value in a data set, the digit d appears with a probability of roughly p(d) = log10(1 + 1/d), where the logarithm is base 10. Data that follows Benford's law starts with a '1' about 30% of the time and with a '9' about 5% of the time, for example. Not all data follows Benford's law, but enough data does to make it interesting.
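As a quick check of the formula, here's a minimal Python sketch that tabulates the expected frequency of each leading digit. The function name `benford_probability` is just an illustrative choice, not something from a library.

```python
import math

def benford_probability(d: int) -> float:
    """Expected frequency of leading digit d (1 through 9) under Benford's law."""
    if not 1 <= d <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / d)

# Table of expected frequencies for the digits 1..9.
benford = {d: benford_probability(d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"{d}: {p:.1%}")
```

The nine probabilities sum to 1, because the terms (1 + 1/d) = (d + 1)/d telescope to 10 inside the logarithm.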
Benford's law often holds when data comes from an exponential process, like you might see in population growth. To see why this happens, consider the population of a country that starts with 1 million people and grows at 10% per year. Here's what you get for the population as it grows for 25 years. Note that the first digit of the population is a '1' more often than it's a '2' and so on. That's because getting from 1 million to 2 million takes a 100% increase, while getting from 9 million to 10 million takes only about an 11% increase, so the population spends much more time with a small first digit than with a large one. After 25 years we're back to having '1' for the first digit, so we'll see the same pattern over the next 25 years also.
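The growth example can be reproduced with a short Python sketch. It uses the figures from the text (1 million people, 10% per year, 25 years); the helper names `simulate_growth` and `leading_digit` are made up for illustration.

```python
from collections import Counter

def leading_digit(n: int) -> int:
    """First decimal digit of a positive integer."""
    return int(str(n)[0])

def simulate_growth(start=1_000_000, rate=0.10, years=25):
    """Return (year, population) pairs for year 0 through `years`."""
    return [(year, round(start * (1 + rate) ** year)) for year in range(years + 1)]

rows = simulate_growth()
for year, population in rows:
    print(f"year {year:2d}: {population:>10,d}  first digit {leading_digit(population)}")

# Count first digits over years 0..24, i.e. one full pass from 1 million to 10 million.
digit_counts = Counter(leading_digit(p) for _, p in rows[:-1])
print(sorted(digit_counts.items()))
```

Over years 0 through 24 the first digit is a '1' in eight of the years and a '9' in only one, and in year 25 the population passes 10 million so the first digit is a '1' again.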
The size of data breaches also seems to follow Benford's law. Here's a graph that compares the frequencies of initial digits from two sources. One is the size of the data breaches that are currently listed in the OSF's data breach database. The other is what you'd expect from Benford's law. That looks like a good fit to me, but I'll leave it to someone else to do a more careful analysis of whether or not there's a significant difference between what we see and what we expect.
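For anyone who wants to try that more careful analysis, here's a rough Python sketch of one way to start: count the first digits of a list of sizes and compute a chi-square statistic against the Benford expectation. This is only an assumption about how such a check might be done, not the method behind the graph. The `sizes` list below is placeholder data (powers of 2, which are known to follow Benford's law closely), not real breach data; you would swap in the record counts from the breach database.

```python
import math
from collections import Counter

def first_digit_counts(sizes):
    """Count the first decimal digit of each positive integer in `sizes`."""
    return Counter(int(str(int(n))[0]) for n in sizes if n > 0)

def chi_square_vs_benford(sizes):
    """Chi-square statistic comparing observed first digits with Benford's law."""
    counts = first_digit_counts(sizes)
    total = sum(counts.values())
    stat = 0.0
    for d in range(1, 10):
        expected = total * math.log10(1 + 1 / d)  # Benford's expected count
        stat += (counts.get(d, 0) - expected) ** 2 / expected
    return stat

# PLACEHOLDER DATA: powers of 2 stand in for breach sizes here.
sizes = [2 ** k for k in range(1, 201)]

counts = first_digit_counts(sizes)
total = sum(counts.values())
for d in range(1, 10):
    print(f"{d}: observed {counts.get(d, 0) / total:.3f}, "
          f"expected {math.log10(1 + 1 / d):.3f}")

stat = chi_square_vs_benford(sizes)
print(f"chi-square statistic: {stat:.2f}")
```

With nine digits there are 8 degrees of freedom, so a statistic above roughly 15.51 would suggest a significant difference from Benford's law at the 5% level. A chi-square test is just one possible choice, and it gets stricter as the sample gets larger.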
The fact that the size of data breaches seems to follow Benford's law tells us that there may be a relationship between the size of data breaches and some sort of exponential process. What sort of process could that be?