How much can I assume hashes to be different for different contents?

Question

I know that some hashes, like MD5 or SHA-1, that were previously thought to be safe are now known to be vulnerable to collision attacks. But it is obvious that collisions exist for all hashes, given that the space of possible hashes is smaller than the space of possible contents. For example, if one considers all possible files whose size is smaller than or equal to the hash size, there must be some collisions.

However, I wonder if I can be sure that hashes will be different for “small enough” differences in contents. For example, for a given hash, can I assume that:

All contents whose size equals the hash size will have different hashes (so that if $H(m_1) ≠ H(m_2)$ then $H(H(m_1)) ≠ H(H(m_2))$)?
All contents smaller than $m$ bits/bytes will have different hashes?
All contents that differ by less than $m$ bits/bytes will have different hashes?
All contents that differ by less than $m$ consecutive bits/bytes will have different values?
Inserting less than $m$ bits/bytes within a content will change its hash?
Inserting less than $m$ bits/bytes at the end/beginning of a content will change its hash?
Anything else?

If there are such assumptions that are true, do they survive the hash being truncated?

I guess the answers to these questions depend heavily on the chosen hash function. I’m very interested in answers about hashes of the SHA-2 and SHA-3 families, but answers about other hash functions (even MD5 and SHA-1) are welcome as well.

@SteffenUllrich Point taken, is there a way I can request moving that question there, rather than duplicating it? — , Apr 15 '17 at 09:36
I've marked the question this way and if others do it to or if a moderator will do it it will be moved. — Steffen Ullrich, Apr 15 '17 at 09:40

score 6 · Answer 1 · edited Apr 16 '17 at 03:07

All contents whose size is = the hash size will have different hashes (so that if hash(file1) ≠ hash(file2) then hash(hash(file1)) ≠ hash(hash(file2)))?

No, but finding such a pair of inputs should be computationally infeasible for a secure hash.

All contents smaller than m bits/bytes will have different hashes?

That depends on the value of m. If m = 1 (bit or byte) then it will be true for any secure hash. If m is very large we get back into the situation that there must be identical hashes because of the pigeonhole principle.

All contents that differ by less than m bits/bytes will have different hashes?

No, because of the pigeonhole principle again; but finding a pair of messages that collide should be computationally infeasible for a secure hash.
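The pigeonhole argument can be made concrete with a toy hash. The sketch below is an illustration only: it assumes an 8-bit hash obtained by truncating SHA-256, and the names `toy_hash` and `colliding_neighbours` are hypothetical. It takes one 257-bit message and flips each of its bits in turn. That gives 257 neighbours but only 256 possible digests, so two neighbours must collide, and those two differ from each other in exactly 2 bits.

```python
import hashlib

N_BITS = 8                       # toy digest length (assumption: SHA-256 truncated to 8 bits)
MSG_BITS = 2 ** N_BITS + 1       # 257-bit messages, hence 257 one-bit neighbours
MSG_BYTES = (MSG_BITS + 7) // 8  # 33 bytes are enough to hold 257 bits


def toy_hash(value: int) -> int:
    """First byte of SHA-256 over the big-endian encoding of value."""
    data = value.to_bytes(MSG_BYTES, "big")
    return hashlib.sha256(data).digest()[0]


def colliding_neighbours(base: int = 0):
    """Find bit positions i != j such that flipping bit i or bit j of base
    yields the same toy hash."""
    seen = {}
    for i in range(MSG_BITS):
        h = toy_hash(base ^ (1 << i))
        if h in seen:
            return seen[h], i
        seen[h] = i
    # 257 inputs and only 256 possible digests: a collision always exists.
    raise AssertionError("unreachable by the pigeonhole principle")


i, j = colliding_neighbours()
m1, m2 = 1 << i, 1 << j
print(i, j, bin(m1 ^ m2).count("1"))  # the last value is the Hamming distance
```

With a real 256-bit digest the same counting argument needs messages longer than $2^{256}$ bits, which is why it says nothing practical about finding such a pair.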

All contents that differ by less than m consecutive bits/bytes will have different values?

See above.

Inserting less than m bits/bytes within a content will change its hash?

See above.

Inserting less than m bits/bytes at the end/beginning of a content will change its hash?

See above.

Anything else?

Basically it all comes down to the basic properties of secure hash functions.

If there are such assumptions that are true, do they survive the hash being truncated?

In general, truncating a secure hash of course reduces its security, but for collision attacks it should cost only 1 bit of security for each 2 bits removed (possibly more for other attacks, but those have a higher security level to begin with).

I guess the answers to these questions depend heavily on the chosen hash function. I’m very interested in answers about hashes of the SHA-2 and SHA-3 families, but answers about other hash functions (even MD5 and SHA-1) are welcome as well.
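The effect of truncation on collision resistance can be observed experimentally. The following is a minimal sketch, assuming SHA-256 truncated to its leading `n_bits` bits and decimal counters as messages (both are choices made for this illustration, as are the function names). A generic birthday search is expected to need on the order of $2^{n/2}$ hash evaluations for an $n$-bit output, so each 2 bits removed make the search roughly twice as cheap.

```python
import hashlib


def truncated_sha256(data: bytes, n_bits: int) -> int:
    """Leading n_bits of SHA-256(data), as an integer."""
    digest = int.from_bytes(hashlib.sha256(data).digest(), "big")
    return digest >> (256 - n_bits)


def birthday_collision(n_bits: int):
    """Hash the messages b"0", b"1", b"2", ... until two of them share a
    truncated digest. Returns (first message, second message, tries)."""
    seen = {}
    counter = 0
    while True:
        msg = str(counter).encode()
        h = truncated_sha256(msg, n_bits)
        if h in seen:
            return seen[h], msg, counter + 1
        seen[h] = msg
        counter += 1


for n_bits in (16, 24, 32):
    a, b, tries = birthday_collision(n_bits)
    print(n_bits, tries, a, b)
```

The number of tries printed for each size should grow by a factor of roughly 16 per 8 extra bits, not by 256.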

The answers above are for generic secure hash functions. MD5 / SHA-1 are obviously not considered secure anymore.

Detailing each and every security property of each and every secure hash and testing if it is vulnerable to attacks is way too broad for any answer.

While the first property is unusual, it can be obtained; see this. — fgrieu, Apr 16 '17 at 07:21
Yeah, I didn't venture in the realms of hashes based on number theory, I stand corrected. — Maarten Bodewes, Apr 16 '17 at 10:56

fgrieu · Answer 2 · 2017-04-22T13:37:41.533

5

Additions to Maarten Bodewes's answer:

It is possible to construct a hash (collision-resistant, preimage-resistant, and behaving mostly like a random function) with the property that

All contents whose size is = the hash size will have different hashes (so that if $H(m_1)\ne H(m_2)$ then $H(H(m_1))\ne H(H(m_2))$ )

One method is to start from a normal hash of $b$ bits, and special-case what happens when the message $m$ is exactly $b$ bits, where the hash is defined to be $P(m)$ with $P$ a fixed one-way permutation of $b$ bits. For large enough $b$, we can construct $P$ based on the discrete logarithm problem, and cycling.

Example with $b=2048$: let $p$ be the smallest prime at least ${\pi\over3}2^b$ with $q=(p-1)/2$ prime, and $g$ the smallest integer at least ${\sqrt5-1\over2}p$ with $g^q\not\equiv1\pmod p$; that is $p=\left\lceil{\pi\over3}2^b\right\rceil+3115515$ and $g=\left\lceil{\sqrt5-1\over2}p\right\rceil$

If the message $m$ is exactly $b$ bits long:
1. convert $m$ to an integer using big-endian convention, giving $x$;
2. let $x\gets(g^{x+1}-1)\bmod p$ and repeat until $x<2^b$;
3. convert $x$ to a $b$-bit bitstring using big-endian convention, giving $H(m)$.
Otherwise (message shorter or longer than $b$ bits), let $h\gets\operatorname{SHA-512}(m)$ and let $H(m)$ be $\operatorname{SHA-512}(h\|'0')\|\operatorname{SHA-512}(h\|'1')\|\operatorname{SHA-512}(h\|'2')\|\operatorname{SHA-512}(h\|'3')$

Given how $p$ and $g$ are chosen, $x\to g^x\bmod p$ is a permutation of the set $\{1,2,\dots,p-1\}$; it follows that step 2. implements a permutation of the set $\{0,1,\dots,2^b-1\}$; it follows that no two $b$-bit messages collide. Without proof: the best methods we have to find a collision or preimage involve breaking $\operatorname{SHA-512}$ or solving a hard discrete logarithm problem.
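Steps 1 to 3 above can be sketched with toy parameters. This is an illustration under loud assumptions, not the construction with $b=2048$: it uses $b=8$, the small safe prime $p=263$ (with $q=131$), the smallest generator found by search instead of the rule given for $g$, and a truncated SHA-512 for the other branch. None of these toy values offer any security; they only show that the cycle-walking step yields a permutation of the $b$-bit messages.

```python
import hashlib

B = 8              # toy hash size in bits (the answer uses b = 2048)
P = 263            # toy safe prime: P = 2*Q + 1 with Q = 131 prime, P > 2**B
Q = (P - 1) // 2


def find_generator(p: int, q: int) -> int:
    """Smallest g in [2, p-2] with g**q != 1 (mod p). For a safe prime p,
    such a g has order p - 1, so it generates {1, ..., p-1}."""
    for g in range(2, p - 1):
        if pow(g, q, p) != 1:
            return g
    raise ValueError("no generator found")


G = find_generator(P, Q)


def toy_hash(m: bytes) -> bytes:
    """Permutation branch for exactly-B-bit messages, plain hash otherwise."""
    if len(m) * 8 == B:
        x = int.from_bytes(m, "big")          # step 1
        while True:                           # step 2, with cycle walking
            x = pow(G, x + 1, P) - 1          # result lies in [0, P-2]
            if x < 2 ** B:
                return x.to_bytes(B // 8, "big")  # step 3
    # Toy substitute for the SHA-512 based expansion used in the answer.
    return hashlib.sha512(m).digest()[: B // 8]


digests = {toy_hash(bytes([i])) for i in range(2 ** B)}
print(G, len(digests))  # len(digests) equals 2**B when no two inputs collide
```

The loop always terminates: the map is a permutation of $\{0,\dots,p-2\}$, so walking from a value below $2^b$ must eventually return to a value below $2^b$.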

Other one-way permutations that allow reducing $b$ are discussed there.

From this, it is easy to construct a hash of $b$ bits so that all messages strictly less than $b$ bits will have distinct hashes; simply right-pad a message $m$ with a single 1 bit, then if the result is less than $b$-bit pad it with enough 0 bits to reach $b$ bits; then finally apply the hash defined above.
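The padding rule is injective, which is what makes the argument work. Here is a small sketch, assuming messages are represented as strings of `'0'` and `'1'` characters and a toy size of $b=8$ (the names `pad_to_b_bits` and `unpad` are hypothetical). It pads every message shorter than $b$ bits and checks that no two padded results coincide; the padded string would then be fed to the $b$-bit branch of the hash.

```python
from itertools import product

B = 8  # toy block size in bits (assumption for this illustration)


def pad_to_b_bits(bits: str, b: int = B) -> str:
    """Append a single 1 bit, then 0 bits until the result is b bits long."""
    if len(bits) >= b:
        raise ValueError("message must be strictly shorter than b bits")
    if any(c not in "01" for c in bits):
        raise ValueError("message must be a string of '0' and '1'")
    return (bits + "1").ljust(b, "0")


def unpad(padded: str) -> str:
    """Inverse of pad_to_b_bits: drop the last 1 bit and everything after it."""
    return padded[: padded.rindex("1")]


# Every bitstring of length 0 to B-1.
all_short = ["".join(t) for n in range(B) for t in product("01", repeat=n)]
padded = {pad_to_b_bits(s) for s in all_short}
print(len(all_short), len(padded))  # 255 255
```

The only $b$-bit string never produced is the all-zero one, since every padded message contains the appended 1 bit.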

edited Apr 22 '17 at 13:37

answered Apr 16 '17 at 07:11

If a person really needs that 1st property or that 2nd property, perhaps that person really wants a single-block cipher function, rather than a hash function: two different full-size blocks of plaintext, encrypted with the same secret key in ECB mode, will always encrypt to two different blocks of ciphertext. Also two different strictly-less-than-b blocks of plaintext, bit padded as you describe, encrypted with the same secret key, will always encrypt to two different blocks of ciphertext. – David Cary Apr 18 '17 at 15:18
My question was about the properties of “standard” hashes, not about designing a dedicated hash. However, that’s an interesting construction, so I upvoted your answer. However, isn’t there an evident second-preimage attack between $b$-bit messages and less-than-$b$-bit messages? Shouldn’t one use 2 different generators for both constructions? – user2233709 Apr 22 '17 at 09:46
@user2233709: no, there is no evident 2nd preimage attack between $b$-bit messages and less-than-$b$-bit messages. If you start with a less-than-$b$-bit message $m$, finding a $b$-bit message with the same hash $h=H(m)$ involves solving for $x$ the equation $g^x=h\bmod p$, and given the construction of $p$ and $g$ that's non-trivial. If you start with a $b$-bit message, the problem most probably has no solution at all; and when it has, finding one seems to involve a preimage attack against SHA-512. – fgrieu Apr 22 '17 at 11:15
@fgrieu Either my question was not clear or there’s something I fail to understand… If $m_1$ is a less-than-$b$-bit message, then your proposal is to right-pad it to a $b$-bit message $m_2$ and then convert it to a number $x$ and compute $g^x \textrm{ mod } p$. Then $m_1$ and $m_2$ have the same hash. Then, an attacker who knows either message can trivially derive the other one. Isn’t that a second-preimage attack (ability to find a message with the same hash as another known message)? – user2233709 Apr 22 '17 at 12:10
@user2233709: No, in the first part, messages less than $b$-bit do not use arithmetic modulo $p$. They do in the second part, but this loses the property $H(H(m_1))\ne H(H(m_2))$. I clarified my proposal. – fgrieu Apr 22 '17 at 13:36
@fgrieu Ok, so your proposal in the second part is to use the arithmetic construction only for less-than-$b$-bit messages, and use a “standard” hash for $b$-or-more-bit messages? Wouldn’t it be feasible to combine both using a first $g_1$ generator for less-than-$b$-bit messages and a second, different, $g_2$ generator for $b$-bit messages? – user2233709 Apr 22 '17 at 13:44
@user2233709: yes, you correctly summarize my second proposition. Yes we can do what you suggest, and then we demonstrably have no collision between exactly-$b$-bit messages, and no collision between less-than-$b$-bit messages. Further, $H(H(m_1))\ne H(H(m_2))$ when $m_1$ and $m_2$ are distinct and at most $b$-bit, except perhaps when exactly one of them is exactly $b$-bit. However we have a collision between any less-than-$b$-bit message and some particular exactly-$b$-bit message (it seems hard to exhibit one, especially if we also use different $p$ in the two cases). – fgrieu Apr 22 '17 at 14:00

Guut Boy · Answer 3 · 2017-04-18T19:16:04.993

TLDR: All but the second property cannot be assumed of general cryptographic hash functions. The first property could possibly hold for specific hash functions, but cannot generally be assumed. ~~The remainder are impossible for any hash function (given that the space of contents is larger than the space of possible hash values).~~

Below I explain in more detail.

All contents whose size is equal to the hash size will have different hashes (so that if $H(m_1)≠H(m_2)$ then $H(H(m_1))≠H(H(m_2))$ )?

It may be possible to specifically design a hash function to have this property, but I do not know of any commonly used function with this property.

However, generally you cannot give this guarantee. A hash function could easily be secure while having a collision between two messages with size equal to the hash function output. By collision resistance of a cryptographic hash function it would be hard to find such a collision though.

In fact it is very likely that there is such a collision. To see this consider that there are as many contents of this size as there are possible hash values. Thus given the set of contents of this size and just one additional content we are certain to have a collision within this set.
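This can be checked exhaustively on a scaled-down model. The sketch below assumes a 16-bit hash obtained by truncating SHA-256 (an illustrative stand-in, not a standard function) and hashes every 16-bit input. If the function behaves like a random function, only about $1-1/e\approx 63\%$ of the possible outputs should be reached, so many same-size inputs collide.

```python
import hashlib

B = 16  # toy hash size in bits (assumption: SHA-256 truncated to 16 bits)


def truncated_hash(m: bytes) -> bytes:
    """First B bits of SHA-256(m)."""
    return hashlib.sha256(m).digest()[: B // 8]


# All 2**B inputs whose size equals the hash size.
inputs = [i.to_bytes(B // 8, "big") for i in range(2 ** B)]
distinct = len({truncated_hash(m) for m in inputs})

# Number of distinct outputs, number of inputs, and their ratio.
print(distinct, 2 ** B, distinct / 2 ** B)
```

The same experiment is out of reach for a full 256-bit digest, so for real hash functions this remains a heuristic expectation.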

All contents smaller than $m$ bits/bytes will have different hashes?

For "small enough" values of $m$ this property is actually required for a hash function to be collision resistant. This is because for sufficiently small $m$ we could simply brute-force our way to a collision for any function that does not have this property.
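The brute-force check described here is easy to write down. As a minimal sketch (the function name `find_collision` and the 2-byte limit are choices made for this example), the code below hashes every message of at most 2 bytes with SHA-256 and reports the first colliding pair, if any.

```python
import hashlib
from itertools import product


def find_collision(max_len: int):
    """Hash every byte string of length 0..max_len with SHA-256 and return
    the first pair of distinct messages with equal digests, or None."""
    seen = {}
    for n in range(max_len + 1):
        for t in product(range(256), repeat=n):
            m = bytes(t)
            d = hashlib.sha256(m).digest()
            if d in seen:
                return seen[d], m
            seen[d] = m
    return None


print(find_collision(2))  # None: no collision among these 65793 messages
```

The search space grows by a factor of 256 per extra byte, so this only settles the question for very small sizes.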

All contents that differ by less than $m$ bits/bytes will have different hashes?

EDIT: as pointed out in the comments the following argument does not hold (hence the strike-through).

~~A hash function cannot have this property. To see this consider that all contents differ from some other contents by $m$ bits/bytes or less. In fact, we can go from any content $c$ to any other content $c'$ by $m$ bits/bytes increments. Thus this property actually implies a hash function without collisions, which is generally not possible.~~

All contents that differ by less than $m$ consecutive bits/bytes will have different values?

Inserting less than $m$ bits/bytes within a content will change its hash?

Inserting less than $m$ bits/bytes at the end/beginning of a content will change its hash?

~~For these questions the same argument as above holds. I.e., these properties all imply a hash function without collisions.~~

The last argument about going from any content $c$ to $c'$ is not correct, because it doesn't ask for collisions over multiple steps. For inputs of small lengths this works. And for $m=1$ it might actually work in general, I am not sure: Consider a message of length $2^n+1$ with $n$ denoting the output length of the hash in bits. Then there are $2^n+1$ other messages of the same length with Hamming distance $1$, and the pigeonhole principle says at least two have the same hash. And by design those two have a Hamming distance of $2$. But this doesn't contradict the case $m=1$. — tylo, Apr 18 '17 at 14:44
You are right, I must have made a mistake there. I will update the answer. — Guut Boy, Apr 18 '17 at 19:11

score 1 · Answer 4 · answered Apr 15 '17 at 16:19

No, you can't assume that hashes of messages with the same size as the hash are different when the messages are different. Hash algorithms like MD5, SHA-1 and SHA-256 process their input in fixed-size blocks (512 bits for all three, which is larger than the resulting hash). They also pad the message and append its length, so that distinct messages always give distinct padded inputs. This means that a message of the same length as the resulting hash is not digested on its own: it is processed together with its padding. Hash functions are also, in almost all cases, not bijective.
For SHA-3 (Keccak) this is slightly more complex because the algorithm uses a sponge construction, while the others mentioned all use a Merkle–Damgård construction. However, SHA-3 was also designed so that a hash reveals no usable information about the input message. Any special property could be exploited in attacks. Because of this, cryptologists generally want a hash algorithm to behave like a random oracle, a criterion under which every finalist of the SHA-3 competition (including Keccak, which became SHA-3) was evaluated. 1, 2

The same should be true for smaller messages. There may be algorithms which have this property, but the most common ones like MD5, SHA-1 and SHA-256 were not designed with this in mind. You could brute-force every small message to see if there are duplicates, but as long as the algorithm is still secure (MD5 and SHA-1 are NOT), with overwhelming probability you won't find any.

Changing parts of the message should always (with overwhelming probability) result in a different hash. Some algorithms, especially older ones like MD5, are susceptible to length extension attacks. 3 This does not mean that you can easily create a message which has the same hash as another one, but it can still cause security problems depending on your protocol. Note that the entries for the SHA-3 competition were required to have defenses against length extension attacks, and Keccak is not susceptible to them as far as we know.
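The claim that any change to the message changes the hash can be illustrated empirically. This sketch assumes SHA-256 and an arbitrary sample message (both picked for the example; `flip_bit` and `differing_bits` are hypothetical helpers). It flips each input bit in turn and counts how many of the 256 output bits change; for a hash that behaves like a random function, about half of them are expected to change each time.

```python
import hashlib


def flip_bit(data: bytes, index: int) -> bytes:
    """Copy of data with bit number index flipped
    (index 0 is the most significant bit of the first byte)."""
    out = bytearray(data)
    out[index // 8] ^= 0x80 >> (index % 8)
    return bytes(out)


def differing_bits(a: bytes, b: bytes) -> int:
    """Hamming distance between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


message = b"How much can I assume hashes to be different?"
base = hashlib.sha256(message).digest()

# One entry per input bit: number of digest bits changed by flipping it.
distances = [
    differing_bits(base, hashlib.sha256(flip_bit(message, i)).digest())
    for i in range(len(message) * 8)
]
print(min(distances), max(distances), sum(distances) / len(distances))
```

A minimum above zero means every single-bit change produced a different digest in this run; it is evidence of the expected behaviour, not a proof.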

Standard disclaimer: Please note that MD5 and SHA-1 are broken. Don't use them for anything anymore unless you are really, really sure that you know what you are doing. All these statements apply only to hash algorithms that are still secure. MD5 and SHA-1 are not secure.
