# Zipf’s law for data breaches

As we've seen in previous posts, there's lots of structure in the available data on data breaches. In particular, the size of data breaches seems to follow a lognormal distribution as well as Benford's law. It looks like we can add a third law that this data follows, and that's Zipf's law.

Suppose that we rank our data from largest to smallest. Zipf's law tells us that if we plot the log of the data versus the log of the rank we get a straight line. Zipf's law was first formulated based on the observation by linguist George Zipf that the frequency of words is inversely proportional to the rank of the word in a word frequency table. It also seems to hold for other data sets, like the size of US cities.
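To make the straight-line claim concrete, here's a minimal sketch in Python. It assumes an idealized data set in which the value at rank r is exactly C / r^s. The constants `c` and `s` and the helper names `zipf_values` and `log_log_points` are hypothetical choices for illustration and have nothing to do with the breach data.

```python
import math

def zipf_values(n, c=1_000_000.0, s=1.0):
    """Return n values following an ideal Zipf law: the value at rank r is c / r**s."""
    return [c / r**s for r in range(1, n + 1)]

def log_log_points(values):
    """Rank values from largest to smallest and return (log10 rank, log10 value) pairs."""
    ranked = sorted(values, reverse=True)
    return [(math.log10(rank), math.log10(v)) for rank, v in enumerate(ranked, start=1)]

points = log_log_points(zipf_values(100, s=1.0))

# For ideal Zipf data, the slope between consecutive log-log points is constant
# and equal to -s, which is what "a straight line" means here.
slopes = [
    (y2 - y1) / (x2 - x1)
    for (x1, y1), (x2, y2) in zip(points, points[1:])
]
print(min(slopes), max(slopes))
```

Real data will never be this clean, so in practice we look at how close the log-log points come to a line rather than expecting an exact one.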

Let's look at the data breaches from 2007 through 2009 that are listed in the OSF's data breach database. Here's what we get when we plot the log of the rank of a breach versus the log of the size of the breach. The blue dots represent actual data breaches. The red dots are what we get from fitting a straight line to the log-log data.

Although the fit is much better for the breaches that are neither too big nor too small, the line actually fits the overall data fairly well, with a coefficient of determination of R² = 0.873.
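For readers who want to try this kind of fit themselves, here's a rough sketch in Python of an ordinary least-squares line through log-log data, along with its R². It is not the code behind the plot above. The breach sizes here are synthetic lognormal draws with made-up parameters, not the OSF data, and `fit_line` is a hypothetical helper written for this example.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept; also returns R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Synthetic stand-in for breach sizes (NOT the OSF data): lognormal draws
# with arbitrary parameters, sorted from largest to smallest.
random.seed(0)
sizes = sorted((random.lognormvariate(8.0, 2.5) for _ in range(500)), reverse=True)

log_rank = [math.log10(r) for r in range(1, len(sizes) + 1)]
log_size = [math.log10(s) for s in sizes]

slope, intercept, r_squared = fit_line(log_rank, log_size)
print(f"slope={slope:.3f} intercept={intercept:.3f} R^2={r_squared:.3f}")
```

Swapping in a real list of breach sizes for `sizes` is all it should take to repeat the experiment. For a simple straight-line fit like this, R² comes out the same whichever of the two logs goes on the horizontal axis.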