For a hashing function like MD5, how similar can two plaintext strings be and still generate the same hash?

Question

When I say similar, I'm referring to the Hamming distance, the Levenshtein distance, or a similar string distance metric that measures how similar or dissimilar two strings are.

For instance, are there two plaintext strings with a Levenshtein distance of 1 which share the same MD5 hash? If not, do we know the smallest Levenshtein distance possible for a pair of strings which share the same MD5 hash? Is it even possible to determine this with certainty?

I'm asking about MD5 since it's a well-known and simplistic hash. But I'd love to know how this applies to SHA-2, bcrypt, or other common hash functions.

There's a similar question which is looking for the shortest length strings that can generate a collision, but I'm looking for the smallest string distance to generate a collision. The actual length of the source strings isn't important.

(Asking purely out of curiosity; I don't have a real-world use for this)

This is an interesting question, but I'm not sure what utility it could have. MD5 in particular isn't generally broken by knowing some part of the plaintext and trying variations of it; it's such a fast and parallelizable function that you can just try out hundreds of millions of random strings until you get one that matches. If you have any particular purpose beyond just curiosity (which I can understand, and doesn't make this a bad question) then you should mention it, because it might help us give a better answer with more relevant information. — , Jul 09 '19 at 04:41
@NicHartley Good call, thanks. I've edited in my reason for asking. Long story short: I'm just curious :) — , Jul 09 '19 at 04:49
Given pigeonhole principle, I'd be surprised if the answer to this isn't 1-bit. — Lie Ryan, Jul 09 '19 at 05:45
This may be very difficult to determine; from what I've read, MD5 has a very good avalanche effect. https://en.wikipedia.org/wiki/Avalanche_effect — Matthew, Jul 09 '19 at 15:13
@LieRyan: Nope, see AleksanderRas's answer and its comments. If you have N pigeonholes and M>N pigeons, at least one pigeonhole must have >1 pigeon, but you can still have M-1 empty pigeonholes. — MSalters, Jul 09 '19 at 15:17
By the way, Levenshtein distance is too abstract for this question, which is why most people are looking at Hamming distance. In particular, "basically the same except everything is shifted by one place" isn't generally very similar at all! — Josiah, Jul 11 '19 at 12:00
@LieRyan I agree. Even if the accepted answer establishes an upper bound of 2, I can't get rid of the feeling 1 might be possible (possibly with a string of 2^129*2^130 bits or longer). — domen, Jul 11 '19 at 15:00
@domen, it is not possible to use the pigeonhole principle to get the number down to 1, because there are functions for which the answer is 2 no matter how long your input string and you can't prove something that's false. Instead you'd need to use something particular to the hash function you're interested in (e.g. by finding an actual collision) — Josiah, Jul 11 '19 at 19:02
@Josiah, regarding your Levenshtein comment: in hindsight, I agree. Hamming distance is much more appropriate for this type of question. I wouldn't have added Levenshtein distance if I could ask the question again. However, since there are a few scattered comments which reference that metric, I'll leave it in. — , Jul 11 '19 at 19:05

score 56 · Accepted Answer · answered Jul 10 '19 at 19:36

This answer is based on the work by AleksanderRas, although my conclusion is different.

First, to lay out a definition, a hash is a function that takes an arbitrary length input to a fixed length output. For example, MD5 takes any input and produces a 128 bit output.
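
As a quick illustration of the fixed-length output, here is a minimal sketch using Python's standard `hashlib` module. The sample inputs are arbitrary and chosen only for this example.

```python
import hashlib

# Inputs of very different lengths (arbitrary examples, not from the question).
inputs = [b"", b"a", b"hello world", b"x" * 10_000]

for data in inputs:
    digest = hashlib.md5(data).digest()
    # The digest is always 16 bytes = 128 bits, whatever the input length.
    print(len(data), "bytes in ->", len(digest) * 8, "bits out")
```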

A cryptographic hash is a hash function which has certain additional security properties.

Because a hash function takes an arbitrary length input and produces a fixed length output, it is guaranteed that there are some inputs which produce the same outputs. These are collisions.

Finally, the Hamming distance is the number of bits by which two inputs of the same length differ.
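
To make the definition concrete, here is one possible way to compute the Hamming distance between two equal-length byte strings in Python. The function name `hamming_distance` is just a label picked for this sketch.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of bit positions in which two equal-length byte strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs inputs of the same length")
    # XOR leaves a 1 bit wherever the inputs differ; count those bits.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

print(hamming_distance(b"\x00", b"\x01"))  # the two bytes differ in one bit
print(hamming_distance(b"ab", b"ac"))      # 'b' is 0x62, 'c' is 0x63: one bit
```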

For any hash function, whether or not it is a cryptographic hash function, there are inputs with a Hamming distance of 2 which collide. This can also be shown by the pigeonhole principle:

Suppose that the hash function returns an n bit output.
There are 2ⁿ possible outputs.
Consider a string B which is 2ⁿ + 1 bits long.
Consider then the set of all strings which differ from B in exactly one bit. There are 2ⁿ + 1 such strings.
The Hamming distance between any two different strings in this set is 2: a 1 bit change to get back to B and a second 1 bit change to get to the other string in the set.
Because there are more strings in this set than there are possible output hashes, at least two strings must share a hash.
Therefore the hash function has a 2 bit difference collision.
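
The argument above can be tried out at toy scale. The sketch below assumes a hypothetical 8-bit hash (the first byte of MD5) so that the search is tiny, and uses a base string of 264 bits rather than exactly 2⁸ + 1 = 257 bits, purely so that it fits in whole bytes. The names `toy_hash`, `flip_bit` and `find_distance_two_collision` are made up for this illustration.

```python
import hashlib

def toy_hash(data: bytes) -> int:
    """Hypothetical 8-bit hash: the first byte of the MD5 digest."""
    return hashlib.md5(data).digest()[0]

def flip_bit(data: bytes, i: int) -> bytes:
    """Return a copy of data with bit i flipped."""
    out = bytearray(data)
    out[i // 8] ^= 1 << (i % 8)
    return bytes(out)

def find_distance_two_collision(base: bytes):
    """Search the one-bit neighbours of base for two that share a hash."""
    seen = {}  # hash value -> neighbour that produced it
    for i in range(len(base) * 8):
        neighbour = flip_bit(base, i)
        h = toy_hash(neighbour)
        if h in seen:
            return seen[h], neighbour
        seen[h] = neighbour
    return None

# 33 bytes = 264 bits: 264 neighbours, but only 2**8 = 256 possible hashes,
# so the pigeonhole principle guarantees the search succeeds.
B = bytes(33)
x, y = find_distance_two_collision(B)
differing_bits = sum(bin(p ^ q).count("1") for p, q in zip(x, y))
print(toy_hash(x) == toy_hash(y), differing_bits)  # prints: True 2
```

The same search against full 128-bit MD5 would need a base string of more than 2¹²⁸ bits, which is why this is only a thought experiment for real hash functions.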

It is possible to construct a hash function which does not have any collisions between strings with Hamming distance of 1. This can be shown as follows:

Consider a string B
Consider a string C which has Hamming distance of 1 from B.
The parity of B must be different from the parity of C. That is, if there are an odd number of bits set in B, there must be an even number in C and vice versa.
Therefore any hash function which directly encodes the parity of the input, such as regular MD5 with the parity bit appended, cannot collide on two inputs at Hamming distance 1; its minimum collision Hamming distance is 2.
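
A minimal sketch of such a parity-encoding hash follows. It appends a whole byte holding the parity bit rather than a single bit, which is a simplification made here for convenience; `md5_with_parity` is a made-up name.

```python
import hashlib
import os

def parity(data: bytes) -> int:
    """1 if an odd number of bits are set in data, else 0."""
    return sum(bin(b).count("1") for b in data) % 2

def md5_with_parity(data: bytes) -> bytes:
    """MD5 digest followed by one extra byte carrying the input's parity."""
    return hashlib.md5(data).digest() + bytes([parity(data)])

def flip_bit(data: bytes, i: int) -> bytes:
    """Return a copy of data with bit i flipped."""
    out = bytearray(data)
    out[i // 8] ^= 1 << (i % 8)
    return bytes(out)

sample = os.urandom(16)  # any input works; a random one is used as an example
no_neighbour_collides = all(
    md5_with_parity(flip_bit(sample, i)) != md5_with_parity(sample)
    for i in range(len(sample) * 8)
)
print(no_neighbour_collides)  # True: the final byte always differs
```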

There are less trivial hash functions than the parity one which have a minimum collision Hamming distance of 2. For example, CBC-MAC is a family of algorithms which encrypts a bitstring with a fixed key under CBC mode and returns the last block. This meets the definition of a hash function: it takes an arbitrary length input and returns an output fixed at the size of the block. Although (like all hash functions) CBC-MAC is vulnerable to collisions, it cannot have a collision if all changes occur within a single block. (This property comes from the fact that it is an encryption function and therefore a permutation, but further elaboration would be off topic.) Since a Hamming distance of 1 corresponds to a single bit change, and that single bit change is necessarily in just one block, it cannot cause a collision.

This should not be taken to mean that the smallest Hamming distance between collision inputs for every hash function is 2. There are functions with a minimum Hamming distance of 1: for example, the trivial hash function truncate. That is, given an n bit hash function which simply drops all but the first n bits, varying bit n+1 will (because it is ignored by the algorithm) give a collision.
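
The truncate example is easy to write down. In this sketch the zero-padding of short inputs is an extra assumption added only so the output always has the same length; `truncate_hash` is a made-up name.

```python
def truncate_hash(data: bytes, n_bytes: int = 16) -> bytes:
    """Trivial 'hash': keep the first n_bytes of the input (zero-padded if shorter)."""
    return data[:n_bytes].ljust(n_bytes, b"\x00")

a = bytes(16) + b"\x00"
b = bytes(16) + b"\x01"  # differs from a in one bit, located after bit 128

print(truncate_hash(a) == truncate_hash(b))  # True: a Hamming-distance-1 collision
```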

So, when it comes to particular hash functions, the answer could be 1 or 2.

Others have argued that for MD5 and other standard cryptographic hash functions it will probably be 1. This is a purely probabilistic argument, but in the absence of evidence to the contrary it is reasonable to reason probabilistically about hash functions which are designed to behave randomly.

I think this explanation is both the most comprehensive and the easiest-to-follow. +1 — , Jul 10 '19 at 19:52

score 22 · Answer 2 · edited Jun 17 '20 at 08:17

The answer is 1 bit (Hamming-distance = 1) for any cryptographic hash algorithm.

There are definitely collisions, since the digest of the MD5 algorithm is always 128 bits long but there are more than 2¹²⁸ possible inputs.

This follows from the pigeonhole principle.

Mathematical explanation

Let's say we take an input message of 3 bits:

There are 8 possibilities in total, because 2³ = 8:

000
001
010
011
100
101
110
111

So for an input length of n bits we have 2ⁿ possible values.

If you take the first bit-string as an example (000) you can easily see that there are three possibilities that have a Hamming-distance of 1 (001, 010, 100)
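
For readers who want to enumerate such neighbours themselves, here is a small sketch. The helper name `neighbours` is made up for this example.

```python
def neighbours(bits: str) -> list[str]:
    """All bit strings at Hamming distance 1 from the given bit string."""
    flipped = {"0": "1", "1": "0"}
    return [bits[:i] + flipped[bits[i]] + bits[i + 1:] for i in range(len(bits))]

print(neighbours("000"))  # ['100', '010', '001']
```

A string of length n always has exactly n such neighbours, one per bit position.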

In theory you could just take a bit-string of length 2¹²⁹ where all bits are zeros (000...000). We hash this bit-string and call it A. Then replace the first zero with 1 (000...001) and look for a collision with A, if not replace the second zero with 1 (000...010), and so on. This will definitely give you a hash collision since 2¹²⁹ > 2¹²⁸ (you have 2¹²⁹ possible inputs but only 2¹²⁸ possible outputs). This is the simplest example I can think of (although it would take far too long to achieve this).

Note that this is the case if the assumption holds up that MD5 is a perfect hash function (and it definitely isn't). In practice we could perform this experiment with far less than 2¹²⁹ bits and expect a collision.

Note also that you can't be sure to get every possible hash output with the procedure explained above. The pigeonhole principle only says that there are at least some collisions. There could be a hash value that doesn't correspond to any input; for example, there might be no input that generates the hash value of 128 zero bits (000...000). We assume that every hash value is possible, but we can't prove it.

The same experiment could in theory also be performed with other hash functions (MD5, SHA1, SHA2, etc.) if we accept that there really is no limit of inputs (apparently there is an input-size limit). You would just have to change the length of the possible hashes for the experiment. It would even apply to a perfect hash function.

answered Jul 09 '19 at 08:05

AleksanderCH

5

@AleksanderRas A hash function is not required to be secure, e.g. hash maps usually use fast hashes rather than cryptographically secure ones. A hash function with that many collisions is still a crappy hash function, but it is a hash function nevertheless. – allo Jul 09 '19 at 12:02
2

@Josiah the pigeonhole principle does give us that there will be at least one collision in the set of A and its onehot variants. There is just no guarantee that there will be a collision with A itself. – Jul 09 '19 at 19:34
Another simple counterexample: consider the trivial hash created by concatenating two identical MD5 hashes. Obviously this has a Hamming distance of exactly twice that of MD5 itself. If A and B differ in one bit, then AA and BB differ in two bits. – MSalters Jul 09 '19 at 15:14
1

"just take a bit-string of length 2^129" shouldn't that be length 129 bits, for a total options space of 2^129 combinations? – user Jul 09 '19 at 16:01
@Josiah Your points make for an excellent answer. Care to post one? – Jul 09 '19 at 16:48
@John Ellmore, I intend to but have a bit of reading into a hunch to do first. – Josiah Jul 09 '19 at 17:21
1

@MSalters: The question is about inputs that differ by a certain number of bits and map to identical MD5 hashes; it's not about MD5 hashes differing by a certain number of bits. – ruakh Jul 09 '19 at 20:00
@JohnEllmore Since the pigeonhole principle only works to show that there is at least 1 collision within 2 bits of change, you would only need a string of length 2^128, because the zero string is the +1 on top of the 2^128 strings with a single 1 bit. – Jul 10 '19 at 13:12
@Darkhogg Note that a bitstring of length 129 has 2^129 possibilities, a sufficient amount to guarantee collisions in a 128 bit output space. A bitstring of length 2^129 has 2^(2^129) possible values, also sufficient to guarantee collisions in a 128 bit output space, but also quite a bit more than necessary. Note that the latter may actually be impractical to use to actually find a collision due to annoyances like the limited accessible material in the universe and the heat death of the universe. – 8bittree Jul 09 '19 at 21:50
6

@8bittree, while you're correct that a bitstring of length 2^129 has 2^(2^129) possible values, note that not all of these bitstrings are available for the proof. Because we're talking about small hamming distances, we are restricted to bitstrings with exactly one 1. – Josiah Jul 09 '19 at 22:17
9

I also don't understand why this is so highly up voted. This proof is straight up wrong and it's easy to find counterexamples as already stated – DreamConspiracy Jul 10 '19 at 05:06
4

This answer needs to be edited to say that the answer is either 1 or 2 bits, instead of saying that the answer is 1 bit. – Tanner Swett Jul 10 '19 at 10:32
3

Given that this answer is wrong, it should probably either be edited to note as much, or deleted. – Jul 10 '19 at 18:31
Agree with others this should not be the top answer. I find @James_pic's answer closer to the question's intent: it's not a math question about arbitrary functions with 128-bit output, but about cryptographic hashes, which security folks usually model as random. But if there's going to be an answer about arbitrary functions it needs to be correct. – twotwotwo Jul 11 '19 at 03:39
11

To extend the counterexample into more of a hash, consider an algorithm which finds the regular md5 and then appends a 1 if the parity of the input string is odd and appends a zero otherwise. That means there cannot be a collision between two input strings with different parity. – Josiah Jul 09 '19 at 09:23
3

@CVn No, for the pigeonhole principle to apply you need more inputs than possible outputs. if the output space is 2¹²⁸, you need at least 2¹²⁸+1 inputs for you to guarantee a collision. – Jul 09 '19 at 20:51
25

I understand your argument, but it doesn't hold. The pigeonhole principle does not guarantee that there is a collision between the hash of A and that of any of the onehot strings. – Josiah Jul 09 '19 at 09:15
I won't delete my answer. But you can feel free to edit my answer if anything is still bothering you. I would gladly accept any edits that improve on the answer unless it changes the core of the answer too significantly, in that case you should add an answer yourself. – AleksanderCH Jul 12 '19 at 15:02
43

For a counterexample, consider a function which maps all strings with an even number of bits hot to A and all with an odd number of bits hot to B. Of course there are an appalling number of collisions, but none from single bit difference strings. (this function is of course not MD5, so md5 may indeed have a single bit difference collision.) – Josiah Jul 09 '19 at 08:48
@MSalters I think you are solving a different problem than the one in the question – we are not looking for input with 1-bit difference in output, we are looking for collisions with 1-bit difference between the inputs. – Paŭlo Ebermann Jul 12 '19 at 00:38
1

@Josiah I meant that you hash the bit-string that has only zeros and then take this as comparison after you replace 1 bit. Edited the answer to make it more clear. – AleksanderCH Jul 09 '19 at 08:55
@NicHartley sorry, I misunderstood your comment as you wanting the community or any moderators to delete the answer. – Paŭlo Ebermann Jul 12 '19 at 13:05
@NicHartley The author has edited, although only to add the words "cryptographic hash" to the headline summary. – Josiah Jul 12 '19 at 06:16
@PaŭloEbermann I did downvote and comment, as it happens. And in that comment, I suggested the author either edit to correct, edit to make note of its incorrectness, or delete it, because answerers shouldn't leave incorrect information up if they know about it. Now that you've reread my comment and read the entire thing, do you have any problems with what I said? – Jul 12 '19 at 01:14
26

Although there is a wonderful elegance to this explanation, I am not convinced that it shows a 1 bit bound. I think it shows a 2 bit bound. Yes, the pigeonhole principle says that there must be some collision in the 2^129 hashes of possible onehot strings, but two different onehot strings differ in 2 bits. – Josiah Jul 09 '19 at 08:44
@PaŭloEbermann Ah, yeah, rereading it I can see what you mean. Sorry, I phrased it poorly -- I did mean the author should edit or delete. – Jul 12 '19 at 14:22
@NicHartley the thing to do for wrong answers is to downvote them (+ comment that they are wrong, if needed), not to delete them (unless the answerer wants to do that). – Paŭlo Ebermann Jul 12 '19 at 00:39

score 16 · Answer 3 · answered Jul 09 '19 at 15:13

There are two answers to this: one practical, and one theoretical.

First, the practical one: MD5 is a broken hash function and we know of collisions for it; a quick web search turned up a collision with a Hamming distance of 6.

Second, the theoretical one: Most cryptographic hash functions are designed to be a reasonable approximation of a random function (this isn't usually the definition you see in textbooks, but it's an important design goal, due to how hashes are used in practice). MD5 turns out to be a poor approximation of a random function (because it's known to be broken), but let's assume it's not.

If you take some random binary data, and a random neighbour (Hamming distance 1), there's a one in 2^128 chance that there'll be a collision, simply because there's a one in 2^128 chance of any other piece of data being a collision. That's very unlikely, but you can try again with a different piece of data and its neighbour. Every time you try, you've got a 1 in 2^128 chance of finding a collision, so if you keep trying forever (which is a very long time), you're almost certain to find a collision with a neighbour.

So the theoretical minimum collision distance is 1, and we suspect such a collision exists.

But in practice, the time you'd need to take to find this collision is prohibitively large (larger than the age of the universe). Indeed, in a well-designed cryptographic hash function, the time taken to find a collision at all (i.e., not limited to a neighbour) should be prohibitively large.

We shouldn't be able to find any collisions in MD5 at all in a reasonable amount of time. The fact that we can is why we say it is broken.

James_pic

"so if you keep trying forever (which is a very long time), you're almost certain to find a collision with a neighbour" does not follow, because there are at most 2^l neighbors of a length-l input. In the random model, you'd need an input of astronomical length on the order of 2^128 to have a high probability of finding a neighboring collision. – R.. GitHub STOP HELPING ICE Jul 09 '19 at 18:11
1

@R.. Nah, you can make 2^128 pairs differing in one bit with only 129 bits of input: take any 128-bit string followed by 0 and pair it with the same string followed by 1. One in 2^128 pairs collide given 128-bit outputs, so you should be able to make short collisions with one-bit differences in inputs (if you had forever, of course). I added an answer that tries to spell it out more. – twotwotwo Jul 09 '19 at 18:53
@twotwotwo: I'm assuming a fixed input you want to have a neighboring collision with. – R.. GitHub STOP HELPING ICE Jul 09 '19 at 19:20
@twotwotwo, I think you're right that you'd want to use 123 bit strings before you could expect to find a suitable collision. Or more generally for an n bit hash, you want a solution to ceil(x = 2^(n+1-x)) – Josiah Jul 11 '19 at 23:17
1

(Slight rev. to my earlier comment: I don't think 129 bits is the minimum to get 2^128 pairs differing by a bit--maybe it's 123?--just the suffixes trick is easy to describe and makes it easy to count pairs.) – twotwotwo Jul 09 '19 at 21:16
3

I don't see that restriction in the question: it says "are there two plaintext strings with a Levenshtein distance of 1 which share the same MD5 hash", not "is there a collision with a distance of 1 from my fixed example string". It's sort of like the difference between a collision attack and a second-preimage attack. I think here, like in a collision attack, you can choose both inputs. – twotwotwo Jul 09 '19 at 19:26
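
The 123-bit figure floated in the comments above can be checked by counting pairs. The sketch below assumes the criterion is simply "at least 2^n pairs of x-bit strings at Hamming distance 1", so that one colliding pair is expected under the random model. There are x * 2^(x-1) such pairs, since each of the 2^x strings has x neighbours and every pair is counted twice. The function name is made up for this illustration.

```python
def min_length_for_pairs(n: int) -> int:
    """Smallest x such that x-bit strings form at least 2**n neighbouring pairs."""
    x = 1
    # Each of the 2**x strings has x neighbours; every pair is counted twice.
    while x * 2 ** (x - 1) < 2 ** n:
        x += 1
    return x

print(min_length_for_pairs(128))  # 123
```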

score 12 · Answer 4 · edited Jun 17 '20 at 08:17

We can prove an upper bound of 2 bits (Hamming distance = 2) for any algorithm

Upper bound

This upper bound is for hashing algorithms whose output is a bit string of length 128 (like MD5). It can be generalised by replacing 128 with n.

Let A be any bit-string of length 2¹²⁸.

Let S be the set of A and all its neighbors. Here a neighbor is a string that differs from A in exactly one place.

Since there are 2¹²⁸ bits in the string, |S| = 2¹²⁸+1

The pigeonhole principle tells us that any hashing algorithm whose output is a 128-bit string must have at least one collision on the set S, because S has 2¹²⁸+1 elements but there are only 2¹²⁸ distinct outputs.

Since the Hamming distance between any two elements in S is at most 2, we have proven an upper bound.

Lower bound

We can prove a lower bound of 2 for a hash function that optimizes the minimum Hamming distance between collisions.

Intuition

Consider a hashing function that outputs the parity of an input string. This hashing function will not have any collisions on neighboring input strings. A hashing algorithm that optimizes the minimal distance between collisions will be at least as good.

Graph theory

With a bit of knowledge about graph coloring and bipartite graphs, we can create a slightly more formal proof.

Consider an undirected graph G . Its nodes are the binary input strings of some length n , and there is an edge between two nodes if and only if the input strings have a Hamming distance of 1. The color of the node will correspond to its hash.

Since the parity of a binary string is always different from its neighbors, this graph must be bipartite.

A bipartite graph can be colored with two colors such that no neighbors share a color. What this means is that a hashing function with at least one bit of output (two options), can avoid a collision between any neighboring input strings.
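
The two-coloring can be checked by brute force for small strings. This sketch assumes a string length of 4 (an arbitrary small choice) and uses parity as the color; the names are made up for the example.

```python
from itertools import product

def parity(bits: tuple) -> int:
    """Color of a node: 0 for an even number of set bits, 1 for odd."""
    return sum(bits) % 2

def neighbours(node: tuple):
    """Yield every node at Hamming distance 1 (the node's edges in G)."""
    for i in range(len(node)):
        yield node[:i] + (1 - node[i],) + node[i + 1:]

n = 4  # small string length so that all 2**n nodes can be enumerated
nodes = list(product((0, 1), repeat=n))

# A proper coloring means no edge joins two nodes of the same color.
proper = all(parity(u) != parity(v) for u in nodes for v in neighbours(u))
print(proper)  # True
```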

Conclusion

We have proven that for MD5, and any other hashing algorithm, there exist two input strings with a Hamming distance of at most 2, that will cause a collision.

For a hashing algorithm that maximizes this distance, we can prove a lower bound of 2.

@Nic Making an assumption "without loss of generality" is pretty common in math proofs or at least explanations (there's even a wiki page for it, weirdly enough). You can prove something for a special case and then show how to generalize it to all cases; your proof is just as valid as if you'd done it more generally, but might be easier to follow. — Voo, Jul 13 '19 at 10:06
"It can be genearlized by replacing 128 with n" ... goes on to use 128 the whole way. Nice. I mean, yes, AFAICT the proof holds, so +1, but I just find that a bit silly. Also, why doesn't that intuition work? Under our definition of neighbor (exactly one bit flipped) you can't maintain parity, so if maintaining parity is required to collide, then neighbors can't possibly collide. With your upper bound earlier, that's the proof. Or is that not mathematically rigorous enough? (Genuine question, there, I'm rather awful at proofs) — , Jul 10 '19 at 18:38

score 12 · Answer 5 · answered Jul 09 '19 at 04:58

An important aspect of cryptographic hash functions is that even the smallest difference in input usually results in different output. But given the unlimited input space compared to the limited output space of the cryptographic hash it is likely that sequences with only small differences (like a single bit) but the same hash value exist.

But for a more reliable statement and maybe some math behind it I recommend asking at crypto.stackexchange.com.

score 0 · Answer 6 · answered Jul 10 '19 at 00:03

Simple answer: the set of possible MD5 outputs is finite. Since an MD5 hash is 32 hex characters long, you could literally write out or calculate every combination. The input set, however, is infinite: there is no limit to the things that could be put into an MD5 hash. With an infinite input set and a finite output set, there must be overlap from different inputs.

7

This does not answer the question which is not about whether there are colliding inputs, but whether it can be determined how similar those inputs might be. – Xander Jul 10 '19 at 00:34

score 0 · Answer 7 · answered Jul 09 '19 at 18:42

MD5 and SHA-1 are badly broken functions. But you can think about an abstract good cryptographic hash function, and pretend it generates a different random number of some length for each different input, and model the collisions you'd expect that way.

The XOR of two random hashes is another random number of the same length. So you can generate a random number, the length of your hash function's output, by picking a string s and XORing hash(s followed by byte 0x00) with hash(s followed by byte 0x01).

When the number you get from that XOR is 0, you have a collision. Now, try all 2^128 16-byte strings as s, and do the XOR of two hashes as above. One of the 2^128 128-bit random numbers you get will be zero, more likely than not--I think the probability is (very close to) 1-1/e.
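
The 1-1/e estimate can be spelled out as follows, assuming the 2^128 XOR values behave like independent, uniformly random 128-bit numbers (which is the modelling assumption of this answer, not a proven property of any real hash):

```latex
P(\text{at least one zero})
  = 1 - \left(1 - 2^{-128}\right)^{2^{128}}
  \approx 1 - e^{-1}
  \approx 0.632
```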

If you get unlucky and don't get a collision, you try a few more times with 0x00 and 0x01 replaced by a different pair of suffixes that differ in one bit (e.g. 0x02 and 0x03, or multiple bytes when you run out of one-byte pairs). As you try more times, the chance you still don't get any collisions from a random-ish hash drops exponentially.

You can model it more precisely than that, and fill more details in. But I hope that's enough to intuitively suggest that a good hash will probably have a colliding pair of inputs that only differ by one bit and aren't much longer than the hash's output.
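
The search described above is out of reach for a 128-bit hash, but it can be tried at toy scale. The sketch below assumes a hypothetical 16-bit hash (the first two bytes of MD5) and 2-byte strings s, so that each suffix pair gives 2^16 candidate pairs. The function names are made up for this illustration.

```python
import hashlib
from itertools import product

def toy_hash(data: bytes) -> bytes:
    """Hypothetical 16-bit hash: the first two bytes of the MD5 digest."""
    return hashlib.md5(data).digest()[:2]

def find_one_bit_collision():
    """Look for s where s+suffix_a and s+suffix_b collide; the suffixes differ in one bit."""
    for low in range(0, 256, 2):  # suffix pairs (0x00, 0x01), (0x02, 0x03), ...
        suffix_a, suffix_b = bytes([low]), bytes([low + 1])
        for prefix in product(range(256), repeat=2):  # all 2**16 two-byte strings s
            s = bytes(prefix)
            if toy_hash(s + suffix_a) == toy_hash(s + suffix_b):
                return s + suffix_a, s + suffix_b
    return None  # possible in principle, but very unlikely under the random model

pair = find_one_bit_collision()
print(pair)
```

Each returned pair is two 3-byte inputs that differ only in the lowest bit of the last byte and share the same 16-bit toy hash. Nothing here says anything about where such a pair lies for full MD5.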

There isn't much you can do with that since you can't try 2^128 inputs to a hash; we set output lengths specifically to make those searches impossible. Fun to see that examples like that ought to exist out there, though.
