
# Benford’s law and the size of data breaches

In a previous post we noted that the size of data breaches seems to follow a lognormal distribution. That's what you get when the data itself doesn't follow a normal distribution, but its logarithm does. It turns out that there's more interesting structure in the size of data breaches: the sizes also seem to follow Benford's law.

Benford's law says that if you look at the first digit of each value in a data set, the digit d (from 1 to 9) appears with a probability of roughly p(d) = log10(1 + 1/d), where the logarithm is base 10. Data that follows Benford's law starts with a '1' about 30% of the time and with a '9' only about 4.6% of the time, for example. Not all data follows Benford's law, but enough data does to make it interesting.
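To make the formula concrete, here is a minimal sketch that tabulates the expected frequency of each first digit, assuming the logarithm is taken in base 10. The function name `benford_probability` is just an illustrative choice.

```python
import math


def benford_probability(d):
    """Expected share of values whose first digit is d (1 to 9) under Benford's law."""
    if not 1 <= d <= 9:
        raise ValueError("first digit must be between 1 and 9")
    return math.log10(1 + 1 / d)


# Expected frequency for every possible first digit.
benford = {d: benford_probability(d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"{d}: {p:.1%}")
```

The nine probabilities add up to 1, because the sum of the logarithms collapses to log10(10).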

Benford's law often holds when data comes from an exponential process, like you might see in population growth. To see why this happens, consider the population of a country that starts with 1 million people and grows at 10% per year. Here's what you get for the population as it grows for 25 years. Note that the first digit of the population is a '1' more often than it's a '2', a '2' more often than it's a '3', and so on. That's because going from 1 million to 2 million takes a 100% increase, while going from 2 million to 3 million only takes a 50% increase, so at a constant growth rate the population spends more years starting with a '1'. After 25 years we're back to having '1' for the first digit, so we'll see the same pattern over the next 25 years also.

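| Year | Population |
|------|------------|
| 0 | 1,000,000 |
| 1 | 1,100,000 |
| 2 | 1,210,000 |
| 3 | 1,331,000 |
| 4 | 1,464,100 |
| 5 | 1,610,510 |
| 6 | 1,771,561 |
| 7 | 1,948,717 |
| 8 | 2,143,588 |
| 9 | 2,357,946 |
| 10 | 2,593,740 |
| 11 | 2,853,114 |
| 12 | 3,138,425 |
| 13 | 3,452,267 |
| 14 | 3,797,493 |
| 15 | 4,177,242 |
| 16 | 4,594,966 |
| 17 | 5,054,462 |
| 18 | 5,559,908 |
| 19 | 6,115,898 |
| 20 | 6,727,487 |
| 21 | 7,400,235 |
| 22 | 8,140,258 |
| 23 | 8,954,283 |
| 24 | 9,849,711 |
| 25 | 10,834,682 |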
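The population figures can be reproduced with a few lines of Python. This is a sketch that assumes the population is truncated to a whole number of people each year, which happens to match the figures listed; the helper names `simulate_population` and `first_digit` are illustrative. It also counts how often each first digit shows up over years 0 through 24, one full cycle before the leading digit returns to '1'.

```python
from collections import Counter


def simulate_population(start=1_000_000, growth_percent=10, years=25):
    """Return the population for year 0 through `years`, truncated to whole people."""
    populations = [start]
    for _ in range(years):
        # Integer arithmetic avoids floating-point drift; // drops any fraction.
        populations.append(populations[-1] * (100 + growth_percent) // 100)
    return populations


def first_digit(n):
    """Leading decimal digit of a positive integer."""
    return int(str(n)[0])


populations = simulate_population()

# Years 0..24 cover one full pass from 1 million up to just under 10 million.
digit_counts = Counter(first_digit(p) for p in populations[:25])

for d in range(1, 10):
    print(d, digit_counts[d])
```

In this run a '1' leads in 8 of the 25 years and a '9' in only 1, which is in line with the roughly 30% and 5% that Benford's law predicts.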

The size of data breaches also seems to follow Benford's law. Here's a graph that compares two sets of first-digit frequencies. One comes from the size of the data breaches that are currently listed in the OSF's data breach database. The other is what you'd expect from Benford's law. That looks like a good fit to me, but I'll leave it to someone else to do a more careful analysis of whether or not there's a significant difference between what we see and what we expect.
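One common way to do that more careful comparison is a chi-square goodness-of-fit test on the first-digit counts. The sketch below is not the analysis behind the graph: the function names are illustrative, and the sample data is synthetic lognormal noise standing in for real breach sizes, which you would need to load from the database yourself.

```python
import math
import random
from collections import Counter

# Chi-square critical value for 8 degrees of freedom at the 5% level.
CHI2_CRITICAL_8DF_5PCT = 15.507


def leading_digit(size):
    """Leading decimal digit of a positive integer."""
    return int(str(size)[0])


def benford_chi_square(sizes):
    """Chi-square statistic comparing first-digit counts with Benford's law."""
    sizes = [int(s) for s in sizes if int(s) > 0]
    if not sizes:
        raise ValueError("need at least one positive size")
    observed = Counter(leading_digit(s) for s in sizes)
    n = len(sizes)
    statistic = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        statistic += (observed[d] - expected) ** 2 / expected
    return statistic


# Synthetic stand-in data, NOT real breach sizes (assumed parameters).
random.seed(0)
synthetic_sizes = [random.lognormvariate(8, 3) for _ in range(5000)]

statistic = benford_chi_square(synthetic_sizes)
print(f"chi-square statistic: {statistic:.2f}")
print("consistent with Benford at the 5% level:", statistic < CHI2_CRITICAL_8DF_5PCT)
```

A statistic below the critical value means the test finds no significant difference from Benford's law at that level; it doesn't prove that the data follows the law.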

The fact that the size of data breaches seems to follow Benford's law tells us that there may be a relationship between the size of data breaches and some sort of exponential process. What sort of process could that be?

• #### John

Cool. Why isn’t anyone else analyzing the data that’s out there?

• #### Jon867

This is very interesting. I’d like to hear your thoughts on exactly why data breaches follow this pattern. There must be something going on that tells why this is.