Benford’s law and the size of data breaches

In a previous post we noted that the size of data breaches seems to follow a lognormal distribution: the sizes themselves aren't normally distributed, but their logarithms are. It turns out that there's some more interesting structure in the size of data breaches: the sizes also seem to follow Benford's law.

Benford's law says that if you look at the first digit of each value in a data set, the digit d appears with a probability of roughly p(d) = log10(1 + 1/d), where the logarithm is base 10. Data that follows Benford's law starts with a '1' about 30% of the time and with a '9' about 5% of the time, for example. Not all data follows Benford's law, but enough data does to make it interesting.
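To make the formula concrete, here's a minimal Python sketch that evaluates p(d) for each possible first digit. The function name benford_probability and the dictionary benford are just illustrative names, not anything from a library.

```python
import math

def benford_probability(d):
    """Probability that the first digit is d (1 through 9) under Benford's law."""
    if d not in range(1, 10):
        raise ValueError("first digit must be between 1 and 9")
    return math.log10(1 + 1 / d)

# Expected frequency of each first digit.
benford = {d: benford_probability(d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"{d}: {p:.1%}")
```

The nine probabilities add up to 1, because the sum of log10((d + 1)/d) for d from 1 to 9 telescopes to log10(10).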

Benford's law often holds when data comes from an exponential process, like you might see in population growth. To see why this happens, consider the population of a country that starts with 1 million people and grows at 10% per year. Getting from 1 million to 2 million means doubling, which takes more than seven years at that rate, while getting from 9 million to 10 million takes only a little over one year. Here's what you get for the population as it grows for 25 years. Note that the first digit of the population is a '1' more often than it's a '2' and so on. After 25 years we're back to having '1' for the first digit, so we'll see it more frequently over the next 25 years also.

Year    Population
   0     1,000,000
   1     1,100,000
   2     1,210,000
   3     1,331,000
   4     1,464,100
   5     1,610,510
   6     1,771,561
   7     1,948,717
   8     2,143,588
   9     2,357,946
  10     2,593,740
  11     2,853,114
  12     3,138,425
  13     3,452,267
  14     3,797,493
  15     4,177,242
  16     4,594,966
  17     5,054,462
  18     5,559,908
  19     6,115,898
  20     6,727,487
  21     7,400,235
  22     8,140,258
  23     8,954,283
  24     9,849,711
  25    10,834,682
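The table can be reproduced with a few lines of Python. This is only a sketch: the function names are made up for illustration, and it assumes the population is truncated to a whole number of people each year, which is what appears to match the figures above.

```python
from collections import Counter

def simulate_population(start, years):
    """Grow a population by 10% per year, truncating to whole people each year."""
    populations = [start]
    for _ in range(years):
        # Integer arithmetic: multiply by 1.1 and drop the fraction.
        populations.append(populations[-1] * 11 // 10)
    return populations

def first_digit(n):
    """Leading decimal digit of a positive integer."""
    return int(str(n)[0])

populations = simulate_population(1_000_000, 25)

# Count first digits over years 0 through 24, one full pass from 1 million to 10 million.
digit_counts = Counter(first_digit(p) for p in populations[:25])

for d in range(1, 10):
    print(d, digit_counts[d])
```

Over years 0 through 24 the first digit is a '1' eight times and a '9' only once, which is the same lopsided pattern that Benford's law describes.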

The size of data breaches also seems to follow Benford's law. Here's a graph that compares the frequencies of initial digits from two sources. One is the sizes of the data breaches that are currently listed in the OSF's data breach database. The other is what you'd expect from Benford's law. That looks like a good fit to me, but I'll leave it to someone else to do a more careful analysis of whether or not there's a significant difference between what we see and what we expect.

[Figure: first-digit frequencies of data breach sizes compared with Benford's law]
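For anyone who wants to attempt that more careful analysis, one common starting point is a chi-square goodness-of-fit test on the first-digit counts. The sketch below is an assumption about how such a check might look, not the method used to produce the graph. The function name benford_chi_square is hypothetical, the input is assumed to be a list of positive integer breach sizes, and the powers of two are only a stand-in data set, not breach data.

```python
import math
from collections import Counter

# Chi-square critical value at the 5% level with 8 degrees of freedom
# (nine digit categories minus one).
CHI2_CRITICAL_5PCT_DF8 = 15.507

def benford_chi_square(sizes):
    """Chi-square statistic comparing first-digit counts with Benford's law."""
    digits = [int(str(n)[0]) for n in sizes if n > 0]
    total = len(digits)
    if total == 0:
        raise ValueError("need at least one positive size")
    observed = Counter(digits)
    statistic = 0.0
    for d in range(1, 10):
        expected = total * math.log10(1 + 1 / d)
        statistic += (observed[d] - expected) ** 2 / expected
    return statistic

# Stand-in data: powers of two are known to follow Benford's law closely.
powers_of_two = [2 ** k for k in range(1, 1001)]
stat = benford_chi_square(powers_of_two)
print(f"chi-square = {stat:.2f}, 5% critical value = {CHI2_CRITICAL_5PCT_DF8}")
```

A statistic below the critical value means the test finds no significant difference from Benford's law at the 5% level. A statistic above it suggests the first digits don't fit. With the real breach sizes you would pass the list of record counts in place of powers_of_two.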

The fact that the size of data breaches seems to follow Benford's law tells us that there may be a relationship between the size of data breaches and some sort of exponential process. What sort of process could that be?
