Collisions in a hash function from DNA testing

The usual metric for the strength of a cryptographic hash function is how much work it takes to find a collision, that is, two inputs that map to the same output. Note that when you're trying to find a collision, you don't get to pick either of the inputs that get mapped to the same output. So if "Alice" gets mapped to the message digest 0x35318264c9a98faf79965c270ac80c5606774df1 and you want another string that gets mapped to that same message digest, a collision won't do that for you.

Instead, you might find that 0x356a192b7913b04c54574d18c28d46e6395428ab and 0xda4b9237bacccdf19c0760cab7aec4a8359010b0 both get mapped to the same message digest, and this sort of thing typically isn't as useful to a hacker as the case where you get to pick one or more of the inputs yourself. (This is a fake example, of course. I used SHA-1 to create those message digests and if I could find a collision in SHA-1, I wouldn't be writing this blog post right now, would I?)
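To make the difference concrete, here's a rough sketch in Python. It assumes a toy hash of my own invention, `tiny_digest`, which is just the first two bytes of SHA-1, so there are only 65,536 possible digests. The function names and the choice of hashing the strings "0", "1", "2", ... are made up for illustration. One search looks for any two inputs that collide; the other looks for an input that matches the digest of a fixed string (what's usually called a second preimage).

```python
import hashlib
from itertools import count


def tiny_digest(data: bytes, nbytes: int = 2) -> bytes:
    """Toy hash (an assumption for this sketch): the first nbytes of SHA-1."""
    return hashlib.sha1(data).digest()[:nbytes]


def find_collision(nbytes: int = 2):
    """Hash "0", "1", "2", ... until any two inputs share a digest."""
    seen = {}  # digest -> first input that produced it
    for i in count():
        msg = str(i).encode()
        d = tiny_digest(msg, nbytes)
        if d in seen:
            return seen[d], msg, i + 1  # the two inputs, and how many we hashed
        seen[d] = msg


def find_second_preimage(target: bytes, nbytes: int = 2):
    """Hash "0", "1", "2", ... until one matches the digest of a fixed target."""
    want = tiny_digest(target, nbytes)
    for i in count():
        msg = str(i).encode()
        if msg != target and tiny_digest(msg, nbytes) == want:
            return msg, i + 1  # the matching input, and how many we hashed


a, b, collision_tries = find_collision()
m, preimage_tries = find_second_preimage(b"Alice")
print("collision:", a, b, "after", collision_tries, "hashes")
print("match for Alice:", m, "after", preimage_tries, "hashes")
```

With a 16-bit digest you'd expect a collision after a few hundred hashes and a match for a fixed target after tens of thousands, though any single run can land well away from those averages. Note that the collision search doesn't let you choose either input; you just take whichever pair turns up first.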

Collisions actually happen fairly often with hash functions that don't have many possible message digests. A good example is the work of crime lab analyst Kathryn Troyer, who found matches in a DNA database between samples that came from different people. Here's how the Los Angeles Times described what she found:

State crime lab analyst Kathryn Troyer was running tests on Arizona's DNA database when she stumbled across two felons with remarkably similar genetic profiles.

The men matched at nine of the 13 locations on chromosomes, or loci, commonly used to distinguish people.

The FBI estimated the odds of unrelated people sharing those genetic markers to be as remote as 1 in 113 billion. But the mug shots of the two felons suggested that they were not related: One was black, the other white.

In the years after her 2001 discovery, Troyer found dozens of similar matches — each seeming to defy impossible odds.

Despite the Times' description of it, this really shouldn't be too surprising.

DNA matching really does use a hash function: lots of people are being mapped to the values found at just a few locations on chromosomes. And because that hash function has relatively few outputs, we should expect to have lots of collisions, particularly since the number of pairs of people that could collide grows with the square of the size of the database. That's just what Troyer found. Her results aren't really "misleading" or "meaningless," as the FBI described them. Instead, they're exactly what we should expect. And they shouldn't really seem to defy impossible odds, but understanding that probably requires more knowledge of hash functions than the typical person has. Or really cares to learn.
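Here's a back-of-the-envelope sketch of that reasoning in Python. It uses the standard birthday-problem approximation and assumes every output is equally likely, which is a simplification. The function names are mine, and the database sizes in the loop are invented for illustration; they are not the size of any real database. Treating the quoted 1 in 113 billion as a uniform per-pair chance is also an assumption made only to show how quickly the number of pairs grows.

```python
from math import comb, exp


def expected_colliding_pairs(n_items: int, n_outputs: float) -> float:
    """Expected number of pairs sharing an output, assuming uniform outputs."""
    return comb(n_items, 2) / n_outputs


def prob_at_least_one_collision(n_items: int, n_outputs: float) -> float:
    """Birthday approximation: 1 - exp(-n(n-1) / (2N))."""
    return 1.0 - exp(-expected_colliding_pairs(n_items, n_outputs))


# Classic sanity check: 23 people, 365 possible birthdays.
print(prob_at_least_one_collision(23, 365))

# Hypothetical database sizes (made up), with the quoted 1-in-113-billion
# figure treated as the chance that any one pair matches.
for n in (10_000, 100_000, 1_000_000):
    pairs = comb(n, 2)
    print(n, pairs, expected_colliding_pairs(n, 113e9))
```

The point of the loop is the middle column: multiplying the number of profiles by ten multiplies the number of pairs by about a hundred, so odds that sound impossible for one pair get much less impressive once every pair in a database is compared.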
