Is every output of a hash function possible?

Question

Is every output of a hash function (e.g. SHA1, MD5, etc) guaranteed to be possible, or, conversely, are there any output values that cannot possibly be created from any input? In other words, are hash functions surjective?

If so, what guarantees this? If not, is it possible to discover such impossible outputs via an attack faster than brute force?

Related (but not the same, as you don't have a limited input length): Is SHA-512 bijective when hashing a single 512-bit block? — Paŭlo Ebermann, May 23 '12 at 17:51
Duplicate on stackoverflow: Do cryptographic hash functions reach each possible value, e.g. are they surjective? — CodesInChaos, Dec 03 '15 at 07:06

fgrieu · Accepted Answer · 2024-02-08T09:47:35.660

For common hash functions designed before 2007, there is no proof that every output is reachable for some input (that is, no proof that the hash is surjective), but it is expected to be true. No general method better than brute force is known to check this, and brute force is entirely impractical.

By the coupon collector argument, it is expected to require $2^n\cdot(n\cdot\ln(2)+\gamma)+1/2+o(1)$ random values to reach all $n$-bit values picked at random, with $\gamma\approx 0.577216$. Translated to generic hashes, the number of distinct messages expected to be required to reach all output values is about $2^{134.5}$ for 128-bit (e.g. MD5), $2^{166.8}$ for 160-bit (e.g. SHA-1), $2^{263.5}$ for 256-bit (e.g. SHA-256), on the assumption that these hashes behave as random functions. This assumption is reasonable, as it is the design goal of a generic hash.
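As a rough numerical check of the figures above, the leading term $2^n\cdot(n\cdot\ln(2)+\gamma)$ can be evaluated directly. This is only an illustrative sketch: the function name `log2_expected_draws` is made up for this example, and the $1/2+o(1)$ correction is dropped because it is negligible at these sizes.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def log2_expected_draws(n_bits: int) -> float:
    """log2 of 2^n * (n*ln(2) + gamma), the leading term of the
    coupon collector expectation for an n-bit output space."""
    return n_bits + math.log2(n_bits * math.log(2) + EULER_GAMMA)

for n in (128, 160, 256):
    print(n, round(log2_expected_draws(n), 1))
# 128 134.5
# 160 166.8
# 256 263.5
```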

MD5, SHA-1 and SHA-2 are Merkle-Damgård hashes with a Davies-Meyer round function, a message block (at least) twice as large as the output, and length padding taking at most about a quarter of the block size. Under the assumption that the underlying block cipher is a Pseudo-Random Permutation, and by an adaptation of the above reasoning, it is very probable that all outputs are reached with a single message block, and even more probable that they are all reached with two message blocks. Anything else seems to require a severe defect of the underlying block cipher, and any such known defect seems considerably less severe.

SHA-3 is provably surjective thanks to its sponge structure. The proof is for impractically long input (see this answer). However, an argument similar to the above shows that a severe defect of the Keccak permutation would be required to have some output values unreachable much past the coupon collector's bound.

As stated by Jon Callas in another answer, it is possible to construct hash functions which demonstrably do not reach all their output values, and some of these are even computationally secure. One example is $\mathcal{H'}=\mathcal{H}(\mathcal{H}(m)|1)$ where $|$ is bitwise OR, and $\mathcal{H}$ is a common hash function. $\mathcal{H'}$ reaches markedly less than half of its output space, but is likely as fine as $\mathcal{H}$ by other experimental metrics, except for speed and for the expected effort to break collision resistance, which is slightly reduced (by <30%).
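The effect can be observed on a toy scale. The sketch below assumes a stand-in for $\mathcal{H}$ made of SHA-256 truncated to 16 bits (an arbitrary choice, small enough to enumerate), and the names `h`, `h_prime` and `BITS` are made up for this illustration. Since the outer call only ever sees odd 16-bit values, at most $2^{15}$ outputs can occur, and collisions among them push the reachable fraction below one half (roughly $1-e^{-1/2}\approx 0.39$ if the toy hash behaves like a random function).

```python
import hashlib

BITS = 16  # toy output size (assumption), small enough to enumerate

def h(data: bytes) -> int:
    """Toy stand-in for H: SHA-256 truncated to 16 bits."""
    return int.from_bytes(hashlib.sha256(data).digest()[:2], "big")

def h_prime(m: bytes) -> int:
    """H'(m) = H(H(m) | 1): the low bit of the inner hash is forced to 1."""
    return h((h(m) | 1).to_bytes(2, "big"))

# Every value H' can output is h(v) for some odd 16-bit v,
# so enumerating the odd v gives an upper bound on the image of H'.
reachable = {h(v.to_bytes(2, "big")) for v in range(1, 1 << BITS, 2)}
fraction = len(reachable) / (1 << BITS)
print(f"at most {fraction:.3f} of the output space is reachable")
```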

It's possible to construct secure hashes that demonstrably reach all their output values for practical input size, using a one-way permutation over an interval. The technique in Burt Kaliski's One-way permutations on elliptic curves (in Journal of Cryptology, 1991) allows constructing such permutations over intervals of size matching the threshold for collision resistance of a hash, e.g. $2^{256}$, using prime order curves in Weierstrass form with prime order twist. That can be extended to a 256-bit permutation, then a 256-bit pseudorandom permutation, then a surjective hash for input bounded by the output size (take a standard hash and, when its input is exactly the output size, replace it with the permutation).

Even H(H(m)) will likely not reach a lot of the output space. — CodesInChaos, May 30 '12 at 19:54
@fgrieu - No, it's demonstrable. H(m) maps {0,1}^infinity to {0,1}^N for an N-bit hash function. So, H(H(m)) at the outer function call takes {0,1}^N as input and maps it to {0,1}^N. Thus if we assume one single collision exists (which we expect with overwhelming probability for any hash function -- they are intended to act as random oracles, not as permutations), then by the pigeonhole principle the whole output space is not reachable. — dr jimbob, Jun 09 '13 at 07:47
@dr jimbob: the odds of a random oracle $R$ being such that $R(R(m))$ reaches all of the output space are overwhelmingly low. But one can't demonstrably (in the sense used in mathematics) go from that to a result for a concrete hash function $H$ such as SHA1 or MD5. — fgrieu, Jun 09 '13 at 10:55
@fgrieu Any one of these edits, I would have considered a good edit, but their combined effect is a bit spammy IMHO. — Meir Maor, Feb 05 '24 at 11:07
I'm not 100% certain what you mean by "$\mathcal{H'}=\mathcal{H}(\mathcal{H}(m)|1)$ where $|$ is bitwise OR", but if you mean that effectively the last bit of $\mathcal{H}(m)$ gets replaced by $1$, then that construction does not preserve collision resistance. — Maeher, Feb 06 '24 at 10:02
@Maeher: Yes I mean this. For common hash functions $\mathcal{H}$ (part of my statement, meaning they aim to be a PRF except for the length extension property), e.g. SHA-256, $\mathcal{H'}$ is collision resistant in a practical sense. The expected work is lowered by less than 30% (the best attack finds a collision on $\mathcal{H}$ truncated by one bit). — fgrieu, Feb 06 '24 at 11:27

score 10 · Answer 2 · answered May 29 '12 at 20:54

There is no general answer, because there's no general statement you can make about all hash functions. It depends on the hash function, and how it compresses.

If you found that this was true for a given hash function, that it didn't generate some outputs, then this would be a flaw. It is at least a distinguisher, and most likely is indicative of some larger flaw, but how large the flaw is depends on many, many things.

Consider this 512-bit hash function G = SHA512(MD5(M)).

It cannot generate all the 512-bit possible outputs, because its inputs are limited to the outputs of MD5. It will also collide with any M and M' that have an MD5 collision. But for other purposes, e.g. getting a key from a password with PBKDF2, it would work fine. Ish.

Jon

poncho · Answer 3 · 2024-02-07T20:38:23.987

Actually, it turns out that for all the SHA-3 hashes (SHA3-224, SHA3-256, SHA3-384, SHA3-512), all hash outputs are possible.

The proof relies on the following properties:

That the permutation used within SHA-3 is, in fact, a permutation
That the outputs of all the hashes fit within the rate (and it makes the proof slightly simpler if they are at least 1 byte shorter than the rate)
That SHA-3 allows unbounded input lengths.

To prove this, we give a way that, given a hash output, gives a string that hashes to that value. Now, this is not a violation of preimage security - that string is potentially (that is, almost always) extremely long; that is, so long that the time spent in computing and hashing it is far longer than simple brute force.

The method is straightforward. Given an $n$-byte hash target xxxxxxx, and calling the rate $r$ bytes (with $r > n$), we do:

First set the first $n$ bytes of the rate to xxxxxxx. This is easy; we just set the first $n$ bytes of the preimage to xxxxxxx.
Then, we consider the operation $\beta$ which is "xor the value 0x86 into the top byte of the rate, and then perform the Keccak permutation". It is easy to see that $\beta$ is itself a permutation. And any permutation of a finite set returns each value back to itself after a sufficient number of iterations, that is, $\beta^x( a ) = a$ for some finite $x \ge 1$.
Call the initial state (with xxxxxxx as the first bytes and zeros elsewhere) $a$; find the $x$ for which $\beta^x( a ) = a$.
Generate an image that is $xr-1$ bytes long. This image is all zeros, except for the initial $n$ bytes (which are the target hash value), and every byte at offset $-1 \bmod r$, which is 0x86.

When we give this image to SHA-3, it'll first set up the state to $a$, and then effectively apply $\beta$ $x-1$ times. Then, it'll apply the padding (which, because the length is $r-1 \bmod r$, consists of xoring the last byte of the rate with 0x86), and then apply the permutation one last time; this is another application of $\beta$, and so the final state is $\beta^x(a) = a$. It then reads the first $n$ bytes of the rate as the hash output; that is the value xxxxxxx.
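The construction can be tried on a toy sponge whose state is small enough for the cycle to be walked in full. Everything in the sketch below is an assumption made for illustration: a 20-bit state (2-byte rate, 4-bit capacity), a 1-byte output, and an ad hoc bijection `perm` standing in for the Keccak permutation. Only the byte-level padding rule (append 0x06, zero-fill, OR 0x80 into the last byte, which collapses to a single 0x86 byte when one byte is missing) follows SHA-3.

```python
STATE_BITS = 20                    # toy state: 16-bit rate + 4-bit capacity (assumption)
MASK = (1 << STATE_BITS) - 1
RATE = 2                           # rate r, in bytes
OUT = 1                            # output length n, in bytes (n < r)
PAD = 0x86                         # SHA-3 padding when exactly one pad byte is needed

def perm(s: int) -> int:
    """Ad hoc bijection on 20-bit states, standing in for Keccak-f."""
    for _ in range(3):
        s = (s * 0x9E375 + 0x12345) & MASK   # odd multiplier: invertible mod 2^20
        s ^= s >> 7                          # xorshift: invertible
    return s

def toy_hash(msg: bytes) -> bytes:
    """Toy sponge: SHA-3 style byte padding, absorb into the rate, squeeze OUT bytes."""
    q = RATE - len(msg) % RATE
    pad = bytes([PAD]) if q == 1 else bytes([0x06]) + bytes(q - 2) + bytes([0x80])
    data = msg + pad
    s = 0
    for i in range(0, len(data), RATE):
        s ^= int.from_bytes(data[i:i + RATE], "little")   # rate = low 16 bits
        s = perm(s)
    return (s & ((1 << 8 * OUT) - 1)).to_bytes(OUT, "little")

def beta(s: int) -> int:
    """XOR 0x86 into the top byte of the rate, then apply the permutation."""
    return perm(s ^ (PAD << 8 * (RATE - 1)))

def long_preimage(target: bytes) -> bytes:
    """Build the (long) message described above that hashes to `target`."""
    a = int.from_bytes(target, "little")   # state a: target in the first n bytes, zeros elsewhere
    x, s = 1, beta(a)
    while s != a:                          # walk the cycle of beta through a
        s = beta(s)
        x += 1
    msg = bytearray(x * RATE - 1)          # x*r - 1 bytes, all zero
    msg[:OUT] = target
    for i in range(RATE - 1, len(msg), RATE):
        msg[i] = PAD                       # 0x86 at every offset -1 mod r
    return bytes(msg)

target = b"\x42"
msg = long_preimage(target)
print(len(msg), toy_hash(msg) == target)
```

On the real SHA-3 state of 1600 bits the same walk is hopeless in practice, which is why the argument proves surjectivity without threatening preimage resistance.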

I now get it, thank you. And also that we can replace "unbounded input lengths" by some ridiculously large but finite value like $2^{1600}!\,r$ and then some, and thus that's a true proof. — fgrieu, Feb 07 '24 at 19:56
@fgrieu: actually, it's only $2^{1600}r$ (we don't care if $\beta^{x-1}$ operation is the identity, only that it returns one specific element $a$ back to where it was), and so the bound is much smaller... — poncho, Feb 07 '24 at 20:11
Ah yes, we don't need to run blindly a fixed number of times to cycle, we can stop when the permutation state actually cycles back to the desired state. Much smaller number of steps, if not small. — fgrieu, Feb 07 '24 at 20:28
