1

It is often suggested that long character columns can be hashed for indexing in databases, but the possibility of collisions is a problem for Unique indexes. Whilst I know that both MD5 and SHA-256 can (rarely) produce collisions, I was wondering how likely it is that both collide at the same time on the same pair of inputs. In other words, if I produce a Unique key by running both MD5 and SHA-2 against the same character field and combining the results, is that key collision-free?

user3621075
  • 113
  • 2
  • 2
    If you find a collision for SHA256 you will be famous. A collision is expected with about 50% probability only after roughly $2^{128}$ hashes. If you are still worried, just use a 512-bit hash like SHA-512. Also note that there are questions and answers about this on this site. – kelalaka Sep 03 '20 at 16:11
  • 1
    Perhaps you meant SHA-1 instead of SHA-256? If so, see https://crypto.stackexchange.com/questions/36988/how-hard-is-it-to-generate-a-simultaneous-md5-and-sha1-collision which asks about the effort needed to find a simultaneous collision... – poncho Sep 03 '20 at 18:07
  • 1
    related https://crypto.stackexchange.com/questions/270/guarding-against-cryptanalytic-breakthroughs-combining-multiple-hash-functions – Richie Frame Sep 03 '20 at 20:24
  • 1
    @kelalaka to be pedantic, the probability of a collision after $2^{128}$ (uniformly random) hashes is approximately $1 - e^{-k^2 / 2n}$ (with $k = 2^{128}, n = 2^{256}$), which is roughly $39\%$; the $e^{-k^2 / 2n}$ term on its own, slightly over $60\%$, is the probability of no collision. Of course, your heuristic is close enough for practical purposes. – Daniel Lubarov Sep 04 '20 at 06:58
  • @DanielLubarov Yes, if we want a better approximation. It was a comment written on mobile; anyway, I should have used $\approx$. – kelalaka Sep 04 '20 at 08:27
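
For anyone who wants to check the birthday estimate from the comments numerically, here is a minimal sketch. It assumes an ideal hash with uniformly random outputs; the function name `collision_probability` is made up for illustration and is not from any library.

```python
import math

def collision_probability(num_hashes: int, output_bits: int) -> float:
    """Birthday approximation: 1 - exp(-k^2 / (2n)) with n = 2**output_bits."""
    n = 2 ** output_bits
    # Python divides the big integers exactly and rounds once to a float.
    exponent = num_hashes * num_hashes / (2 * n)
    # -expm1(-x) equals 1 - exp(-x) and stays accurate when x is tiny.
    return -math.expm1(-exponent)

# k = 2^128 hashes of a 256-bit function: about 0.39
print(collision_probability(2 ** 128, 256))

# A trillion rows in a table: vanishingly small
print(collision_probability(10 ** 12, 256))
```

The second call is the one that matters for an index: at any realistic row count the chance-collision probability for a 256-bit hash is far below the probability of a hardware fault.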

1 Answer

3

Currently no collisions are known for SHA-256, so it can safely be used to hash long texts, and you can be confident that you won't hit a collision by chance.

Generally speaking, if we have two different hash functions $f$ and $g$, and you hash some input $x$ with both and use the concatenation $f(x)||g(x)$, the collision resistance of the result will be no worse than that of either function individually, and in some cases much better. If the functions are unrelated and you are only worried about chance collisions, the likelihood of a collision will be the product of the likelihoods of a collision in each function individually. However, this last statement will not hold in the face of an attacker exploiting weaknesses in these functions. Finding a collision in the concatenation is still at least as hard as in either function individually, but it may not be noticeably harder than finding a collision in just one of the two.
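
As a rough illustration of the concatenation $f(x)||g(x)$ with the two functions from the question, here is a minimal sketch using Python's standard `hashlib`. The helper name `combined_key` and the choice to encode the column value as UTF-8 are assumptions made for this example, not something the question specifies.

```python
import hashlib

def combined_key(text: str) -> bytes:
    """Return MD5(text) || SHA-256(text) as a single 48-byte key."""
    # Assumption: the character column is hashed as its UTF-8 bytes.
    data = text.encode("utf-8")
    md5_part = hashlib.md5(data).digest()        # 16 bytes
    sha256_part = hashlib.sha256(data).digest()  # 32 bytes
    return md5_part + sha256_part

key = combined_key("some long character column value")
print(len(key), key.hex())
```

Two inputs produce the same key only if they collide under MD5 and under SHA-256 at once. Against chance collisions the 32-byte SHA-256 part alone is already far more than enough, so the extra 16 bytes mostly cost index space.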

Other techniques of combining the functions, such as $f(g(x))$, are ill-advised and won't add security, since any collision of $g$ is also a collision in the composition.
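
The point about composition can be seen with a toy example. In this sketch `g` is deliberately weak (only the first byte of SHA-256, chosen here purely so that collisions are easy to find) and `f` is full SHA-256; both names and the helper `find_g_collision` are illustrative assumptions.

```python
import hashlib
from itertools import count

def g(data: bytes) -> bytes:
    # Toy weak hash: 8 bits of output, so a collision shows up within 257 tries.
    return hashlib.sha256(data).digest()[:1]

def f(data: bytes) -> bytes:
    # Stand-in for a strong hash.
    return hashlib.sha256(data).digest()

def find_g_collision() -> tuple[bytes, bytes]:
    """Brute-force two distinct messages with the same g value."""
    seen: dict[bytes, bytes] = {}
    for i in count():
        msg = str(i).encode()
        tag = g(msg)
        if tag in seen:
            return seen[tag], msg
        seen[tag] = msg

a, b = find_g_collision()

# f(g(x)) inherits the collision in g, however strong f is.
print(f(g(a)) == f(g(b)))          # True

# f(x) || g(x) does not collide here, because f(a) != f(b).
print(f(a) + g(a) == f(b) + g(b))  # False
```

The composition is only as collision resistant as the inner function, while the concatenation requires both functions to collide on the same pair of inputs.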

Meir Maor
  • 11,835
  • 1
  • 23
  • 54
  • If you need more speed than SHA256 with even more negligible collision risk consider Blake3 with 512-bit output. – SAI Peregrinus Sep 03 '20 at 17:30
  • @kelalaka A collision in $g$ directly implies a collision in $f\circ g$ – Mark Schultz-Wu Sep 04 '20 at 02:35
  • @Mark you gave one; I wanted it to be written in the answer by Meir Maor. – kelalaka Sep 04 '20 at 08:28
  • 1
    I have accepted this answer, thank you. Your last paragraph is self evident and the main body of your argument largely what I had thought may be the case. I am getting the impression that, for my use case, just using SHA-256 would be sufficient. – user3621075 Sep 04 '20 at 08:49
  • 1
    @user3621075 forget MD5, run away from it! – kelalaka Sep 04 '20 at 09:10
  • 2
    @SAI Peregrinus: Note that BLAKE3 carries 256 bits of state between blocks, regardless of the output length. So using a 512 bit output doesn't generally mean you'll get more collision resistance. – Jack O'Connor Sep 05 '20 at 17:11