Cryptography for Mere Mortals #7

An occasional feature, Cryptography for Mere Mortals attempts to provide clear, accessible answers to questions about cryptography for those who are not cryptographers or mathematicians.

Q: The previous installment promised to talk about “hashes”. Corned beef?

A: No, a cryptographic hash is something different—not quite as tasty—although it does involve chopping the input into small pieces. This installment is the first of several that present a simplified discussion of hashes and related technologies.

Wikipedia says, in part:

A cryptographic hash function is a hash function, that is, an algorithm that takes an arbitrary block of data and returns a fixed-size bit string, the (cryptographic) hash value, such that an (accidental or intentional) change to the data will (with very high probability) change the hash value.

In other words, a hash function takes data of any size and creates a correlated value that’s a fixed size. Since the hash result is typically shorter, this means there will be “collisions”—the same hash value produced from differing inputs (see the section below for more detail on why collisions must exist). A cryptographic hash function is a hash function that combines the input data in such a way as to generate a result that’s “relatively unique”. Yes, we’re all taught that uniqueness is an absolute, but good cryptographic hash algorithms make the inevitable collisions very rare, and hence “relatively unique” applies.

Consider a very primitive, non-cryptographic hash function, which takes the values of each digit of the input and adds them, throwing away any tens digit. You may have done this in grade school, and called it casting out nines. The result of this function against the value 12345 would be 5: 1+2+3+4+5 = 15; we throw away the tens digit and get 5. Obviously this is a very poor hash function, as there are tons of collisions (one in ten results, on average!). Casting out nines is good for a quick check against basic arithmetic errors, but not much use beyond that.

Modern cryptographic hash algorithms like SHA-256 are much more complex. They massage the input in very complex ways to generate values that fit the “relatively unique” criterion: in fact, nobody has yet found a SHA-256 collision, although they must exist.

Hash Collisions

The reason that we can say we know hash collisions exist is simple: there are more unique input values than can fit into the hashed output. If the hashed value is 256 bits (32 bytes) and you could restrict the input to 32 bytes or fewer, then there need not be any collisions. But one of the many reasons hashes are useful is that you can hash any value—larger or smaller than the hash size—and produce the same size result. So since the input has more possibilities than the output, collisions in the output are inevitable.

What’s interesting is how difficult it is to find a hash collision. Because the SHA-256 collision for a given eight-byte value might be generated from 200,000 bytes of text, searching for collisions is a brute force, trial-and-error process. And even if a collision is found, the odds of it mattering are relatively low: the SHA algorithms make it very unlikely that two values near each other in size will produce a collision, and in most uses, values being compared are somewhat similar in size. In other words, if your password is “password” (let’s hope not!), nobody is likely to enter the “matching” 200,000-byte value as an attempted password.

Leave a Reply

Your email address will not be published. Required fields are marked *