I came across some source code that loosely does the below in order to achieve a 32 bit hash.

The input string is passed through MD5 to get a 16-byte hash (as usual). The 16 bytes are then split into four 4-byte words, and these words are XORed ($\oplus$) together to get a final 4-byte word, which is treated as a 32-bit hash.

The application does not really care about security in the sense of pre-image resistance etc.; all it wants is collision resistance.

So, does the above guarantee no collisions (any better than the birthday bound)? Of course it depends on the input string's entropy etc., but given that we don't know beforehand what strings could be passed as input, how do we analyze such a scheme for collisions?

e-sushi
sashank
  • As for the last question, I guess we don't analyze such schemes, since $2^{32}$, let alone $2^{16}$, is not considered secure. MD5 can no longer be used for collision resistance: MD5 collisions can be created, and it doesn't take much imagination to see that any deterministic algorithm applied to just the MD5 output maps identical MD5 outputs to identical values, and will therefore also have collisions. – Maarten Bodewes Oct 06 '14 at 17:31
  • Sounds like overkill to me. If all you want is collision resistance, then why not use a simple algorithm like djb2? – r3mainer Oct 06 '14 at 18:34
  • @squeamishossifrage do you have any references on the collision resistance of djb2? If it has not been properly vetted by the cryptographic community, then I wouldn't use it in a case where we don't want collisions (that said, a work factor of $2^{16}$ isn't much, so I wouldn't use that either). – mikeazo Oct 06 '14 at 18:44
  • @mikeazo No, but the OP doesn't seem too bothered about cryptographic hashing. (32 bits is nowhere near adequate for this purpose.) – r3mainer Oct 06 '14 at 18:51

3 Answers

As far as I understand, the scheme is:

$$MD5(x) = a_1||a_2||a_3||a_4 \, \, \Longrightarrow \, \, H(x) = a_1 \oplus a_2 \oplus a_3 \oplus a_4,$$

with $a_i$ 4-byte/32-bit words.
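A minimal Python sketch of the scheme as described above. The function name `xor_fold_md5` is made up for illustration, and reading the words as big-endian is an assumption: the question does not say which byte order the original code uses (the choice only permutes the output bytes and does not change the collision behaviour).

```python
import hashlib
import struct


def xor_fold_md5(data: bytes) -> int:
    """Return H(x) = a1 ^ a2 ^ a3 ^ a4 for the four 32-bit words of MD5(x)."""
    digest = hashlib.md5(data).digest()  # 16 bytes
    # Assumption: interpret the digest as four big-endian unsigned 32-bit words.
    a1, a2, a3, a4 = struct.unpack(">4I", digest)
    return a1 ^ a2 ^ a3 ^ a4


if __name__ == "__main__":
    print(f"{xor_fold_md5(b'Hello'):08x}")
```

Any two inputs with the same MD5 digest necessarily map to the same 32-bit value here, since the fold is a deterministic function of the digest alone.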

Obviously you can't guarantee a unique 32-bit hash from an unbounded domain, due to the pigeonhole principle. Neither can you make finding collisions infeasible, since $2^{16}$ MD5 invocations take a few seconds, if that.

If MD5 were an ideal 128-bit hash, it wouldn't matter whether you take the first 4 bytes, XOR the 32-bit words, or do anything else: all of these would be ideal 32-bit hashes. However, it isn't, and you can find full MD5 collisions in about $2^{33}$ operations. That doesn't directly lead to an attack on this scheme, because it's slower than brute force, but it's possible it could be extended. There's a chance the XOR could make such an attack slightly harder or easier than truncation, but it really doesn't matter with a 32-bit hash, since brute force is sufficient.

Assuming you don't let an attacker control the inputs (since that would be totally broken), the question isn't about the collision resistance of MD5 but about its pseudorandomness. MD5 is still thought to behave like a PRF, in the sense that if the input strings are chosen "randomly", the output is uniformly random as well. (This is why HMAC-MD5 is still considered secure.) XOR doesn't affect that, so the 32-bit hash should behave like a PRF as well.

So, with inputs that aren't under the attacker's control, you should see collisions at the birthday bound. No sooner or later, other than what variance you expect with a 32-bit hash. However, there's no real benefit to the XOR, so you might as well just use the first 4-byte word.
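One way to check this empirically is to count how many random inputs it takes to hit the first collision, for both the XOR fold and plain truncation. The sketch below is illustrative only: the function names, the 16-byte random inputs, and the trial count are all assumptions, not taken from the code in the question.

```python
import hashlib
import os
import statistics


def md5_xor_fold(data: bytes) -> int:
    """XOR the four 32-bit words of MD5(data) (big-endian is an assumption)."""
    d = hashlib.md5(data).digest()
    out = 0
    for i in range(0, 16, 4):
        out ^= int.from_bytes(d[i:i + 4], "big")
    return out


def md5_truncate(data: bytes) -> int:
    """Keep only the first 32-bit word of MD5(data)."""
    return int.from_bytes(hashlib.md5(data).digest()[:4], "big")


def draws_until_collision(hash32) -> int:
    """Hash fresh random 16-byte inputs until a 32-bit output repeats."""
    seen = set()
    draws = 0
    while True:
        draws += 1
        h = hash32(os.urandom(16))
        if h in seen:
            return draws
        seen.add(h)


def mean_draws(hash32, trials: int = 20) -> float:
    """Average number of draws to the first collision over several trials."""
    return statistics.mean(draws_until_collision(hash32) for _ in range(trials))


if __name__ == "__main__":
    print("xor fold :", mean_draws(md5_xor_fold))
    print("truncate :", mean_draws(md5_truncate))
```

With only a handful of trials the averages scatter widely, so a single run says little; both variants should settle around the same birthday-bound figure as the number of trials grows.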

otus

You cannot escape the birthday bound - it will eat your lunch every time.

MD5 has a good uniform distribution, and so should your algorithm. Since it outputs 32-bit values, you should expect the first collision after around $\sqrt{\frac{\pi}{2}2^{32}} \approx 82137$ hashes.
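
As a sketch of where the $\frac{\pi}{2}$ factor comes from, assume an ideal hash with $N = 2^{32}$ equally likely outputs (the symbols $N$, $k$ and $T$ are introduced here for illustration). The probability that the first $k$ hashes are all distinct is approximately $e^{-k^2/(2N)}$, and the expected number of hashes $T$ until the first collision is the sum of these probabilities over $k$:

$$E[T] = \sum_{k \ge 0} \Pr[\text{no collision among the first } k] \approx \int_0^\infty e^{-k^2/(2N)}\,dk = \sqrt{\frac{\pi N}{2}} = \sqrt{\frac{\pi}{2}} \cdot 2^{16} \approx 82137.$$

The more commonly quoted $2^{16} = 65536$ is the same bound with the constant dropped. The median is slightly below the mean, at $\sqrt{2 \ln 2 \cdot N} \approx 77163$.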

user13741
  • Why exactly $\frac{\pi}{2}$? Never heard that before. – Nova Oct 07 '14 at 21:22
  • I have written a brute-force test for the above MD5 scheme in Java, and I get collisions much sooner than the 65536 ($2^{16}$) limit. With SHA-256 I am not getting the same. – sashank Oct 07 '14 at 21:57
  • @sashank that seems strange. Are you willing to share your code? I'd like to see what I get. – mikeazo Oct 07 '14 at 23:03
  • @mikeazo Not sure if you have noticed, but it is not for the entire MD5 output, only the tweaked version mentioned above. The code is here: http://pastebin.com/fZ1LR9as – sashank Oct 08 '14 at 07:54
  • @sashank, what's the average time by which you get collisions, if you change the "Hello" part? With a single prefix you expect a collision in the first 10000 only about 1% of the time. – otus Oct 08 '14 at 08:03
  • @otus Hardly a few nanoseconds, maybe. The program exits quickly, even if I remove the "Hello" part. You can try it yourself, it's easy. – sashank Oct 08 '14 at 08:21
  • @sashank, sorry, by "time" I meant the iteration count. Trying the equivalent in Python I see ~75k iterations on average. – otus Oct 08 '14 at 08:26
  • @otus 20934 is the minimum I could get for MD5; for SHA-256 it was more than 70k. Just pull the code into any IDE (Eclipse or IntelliJ) and run it, or even run it from the CLI. – sashank Oct 08 '14 at 08:53
  • @sashank, I really hate Java, but here goes. Average for MD5 89031 and SHA-256 78676. Well within reason. – otus Oct 08 '14 at 09:06
  • @sashank, One thing I noticed with your code is that since the string generation is deterministic, it always finds the same collisions. I changed it so it appends a random integer to "Hello" instead. Then you get some variation. I did this and ran the code 1000 times. That resulted in an average of around 57k iterations. If I run it more, it should get closer to the expected 65k. – mikeazo Oct 08 '14 at 11:55

Use something more collision resistant, like SHA... I can't find any hashes that are completely collision-proof, but SHA at least decreases the collisions...