Checking encoded strings for a hash collision in Python

Question

There is a common term used in cryptography called a hash collision. If I am reading the definition correctly on Wikipedia, this can occur if two different data values give rise to the same hash value.

Duplicate hash, different input:

text1 encoded = hash1 
text2 encoded = hash1

The first code block is a binary value with a hash obtained from the digest() function, which I found on a website. The section code block is what I modified, which is what I'm understanding is a hash collision. Notice that the second code block is checking if the hash is a duplicate but the original string is different.

Can anyone explain if my second code block is a hash collision and if not, why? And explain how the first and second code blocks differ in terms of the definition.

https://www.learnpythonwithrune.org/birthday-paradox-and-hash-function-collisions-by-example/

Code Block #1:

import hashlib
import os
collision = 0
for _ in range(1000):
    lookup_table = {}
    for _ in range(16):
        random_binary = os.urandom(16)
        result = hashlib.md5(random_binary).digest()
        result = result[:1]
        if result not in lookup_table:
            lookup_table[result] = random_binary
        else:
            collision += 1
            break
print("Number of collisions:", collision, "out of", 1000)

Code Block #2:

Codes 0 through 31 and 127 (decimal) are unprintable control characters. Code 32 (decimal) is a nonprinting spacing character. Codes 33 through 126 (decimal) are printable graphic characters.

string.ascii_lowercase + string.ascii_uppercase + string.ascii_letters + string.digits + string.punctuation + string.whitespace + string.printable

import hashlib
import os
import random
import string
collision = 0
total_attempts = 100000
lookup_table = {}
for _ in range(total_attempts):
str = ''.join(random.choice(string.printable) for i in range(3))
str_encode = str.encode('utf-8')
hash = hashlib.md5(str_encode).hexdigest()
hash = hash[:3]

if hash in lookup_table:
    if str not in lookup_table[hash]: # hash is the same; string is different
        collision += 1 
        print(lookup_table[hash] + '_' + hash)
        lookup_table[hash] = lookup_table[hash] + ';' + str
else:
    lookup_table[hash] = ';' + str    

print("Number of collisions:", collision, "out of", total_attempts)

The question twice makes a confusion between code and what it attempts to perform, in "first code block is a binary value with a hash" and in "second code block is a hash collision". — fgrieu, Mar 21 '23 at 13:37

fgrieu · Accepted Answer · 2023-03-22T08:59:16.247

1

A hash collision is the circumstance in which two distinct inputs of a hash function have the same hash. A colliding pair is two such distinct inputs. For examples of MD5 collisions and colliding pairs, see there.

The first code block finds and counts events that are mostly hash collisions for MD5 restricted to it's first byte, but could also be accidental collisions between the values returned by os.urandom(16) (the later are so extremely rare that it's negligible in the result). The experiment is restricted to collisions detected among 16 inputs, and is repeated 1000 times. The code seems written to illustrate a variant of the usual birthday problem.

The second code block (in version 7) can't find any hash collision, for two independent reasons:

It's attempted to find MD5 collisions in a naive way: by hashing random strings, but (by the aforementioned birthday problem) we would likely need in the order of $2^{64}$ hashes to find one MD5 collision in this way, and we only do $10^5=2^{16.6\ldots}$. Hint: shorten the hash (not the input of the hash) to it's first three bytes so that there are collisions to observe.
lookup_table never contains more than one element, for it's reset inside the loop; FIX THAT!

The code distinguish hash collisions from accidental collisions in the hashed strings.

Note: one simple way to avoid collisions among the inputs of the hash is to generate the inputs incrementally. This allows to simplify the code.

edited Mar 22 '23 at 08:59

answered Mar 21 '23 at 13:36

fgrieu

140,762
12
307
587

I think I understand the first issue. You're simply saying there isn't enough variation of characters based on the string length, correct? By shortening the length, and by maybe introducing all ASCII characters, there would be a better probability, correct? My intention wasn't to find a collision, but to understand when it happens. Can you take another look and see if my logic is correct now? – JustBeingHelpful Mar 21 '23 at 23:06
I'm using this to figure out duplicate hashes with DIFFERENT string input in an application, that should not exist. It'll probably never happen, but I'm just doing it as a precaution to check for them. What I learned today is that there are four ways to denote the number of variances for the string input. It's based on ordering the items and repeating items in each position. In our case, order matters and repeating items can happen with string input for a hash. https://math.stackexchange.com/a/4663863/37313 – JustBeingHelpful Mar 22 '23 at 06:33
The other answers in this question were not very comprehensive, so I got all combinations of the combinations, if that makes sense. :) – JustBeingHelpful Mar 22 '23 at 06:39
The issue in both bullets stands uncorrected and both are a show stopper. – fgrieu Mar 22 '23 at 06:53
Okay, I fixed the second issue. The concatenation of ";" + str was in the wrong code block. Now it's nested in the second IF. As for the first issue, I will truncate the hash to the first three characters. Is that correct? hash = hash[:3]. In my actual program, I will use a 2 dimensional dictionary, where the 1st dimension is hash and the second dimension is str. But I just kept it simple here and just printed the collisions. – JustBeingHelpful Mar 22 '23 at 07:09
No the second issue is not fixed. Again, lookup_table never contains more than one element, for it's reset inside the loop, rather than before as it should. That's a programming error, and the fact it was not fixed yet shows the question is more about programming than about anything else. Don't worry; it takes practice to build a mind model of how (here, sequential) computing works. There's another lesser programming errors: the separator ; occurs in string.printable and thus is not fully guaranteed to play it's intended role. Also: see the final note for faster/cleaner code. – fgrieu Mar 22 '23 at 08:47
If I set it before the IF conditional, I cannot append additional str values for the same hash key. By having it in the loop, we set a single str value to the hash key when the hash is not found as a key in the dictionary. And if we do find the key but a different str value, we append. I could have used the append function as an alternative. I see what you mean with the ";". When I have a 2D dictionary, that won't happen. – JustBeingHelpful Mar 22 '23 at 08:56
Your main problem is with where lookup_table = {} is in the second code block. It must be moved up one line, and left one level. As is, there can't be any collision detected. – fgrieu Mar 22 '23 at 09:43
Oh da! I never ran the code in the second block. I just wrote it based on the first code block. Thank you for your help on this. – JustBeingHelpful Mar 22 '23 at 09:50
I was reading through some of the answers you have on this site. You must be an extremely brilliant person! – JustBeingHelpful Mar 22 '23 at 15:33

Checking encoded strings for a hash collision in Python

1 Answers1