2

Assuming our domain and codomain both have 4 elements, the uniform distribution of inputs to outputs means the function is injective. It is widely agreed that randomly selecting a domain element results in log2(4) = 2 bits of entropy in the selection, and that because of being injective, the output of said function applied to this selection will also contain 2 bits of entropy.

If we double our domain element count to 8, then a random selection will have log2(8) = 3 bits of entropy. Because of the pigeonhole principle there are collisions, but we defined a uniform distribution, so each of the 4 codomain elements has two domain elements mapped to it.

In the second case, clearly the 3 bits of entropy of the input can't be preserved in the output, but why would fewer than two bits of entropy be preserved? Isn't it still just equiprobably selecting an element from the codomain, given the lack of bias as defined by the uniform distribution of collisions?

In both cases, is it not preserving the entropy of the input up to the bit length of the output, which is log2(4) = 2 bits to cover the 4 elements of the codomain?

Gratis

2 Answers

2

Assuming our domain and codomain both have 4 elements, the uniform distribution of inputs to outputs means the function is injective.

You seem to be using the standard technical term ‘uniform distribution’ in a confusing way. Normally the uniform distribution on a finite set $A$ means the probability distribution $P$ with $P(x) = 1/\#A$ for all $x \in A$, where $\#A$ is the number of elements in $A$.

But you haven't mentioned a probability distribution so far; you seem to be abusing the term ‘uniform distribution’ to mean a function $f\colon A \to B$ with the following property: There is a single number $n$ such that for every $y \in B$, the number of elements in the domain mapped to $y$ is $\#f^{-1}(y) = n$. (One might call such a function ‘balanced’, particularly if it's a boolean function—i.e., a function defined on bits whose output is a single bit—but this nomenclature is not standard like ‘uniform distribution’ is in probability theory.)
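
To make the distinction concrete, here is a minimal Python sketch of that 'balanced' property (the `is_balanced` name is mine, not standard terminology):

```python
from collections import Counter

def is_balanced(f, domain):
    """True if every attained output of f has the same number of preimages,
    i.e. the property the question calls 'uniform distribution'."""
    preimage_counts = Counter(f(x) for x in domain)
    return len(set(preimage_counts.values())) == 1

print(is_balanced(lambda x: x % 4, range(8)))      # True: each output has 2 preimages
print(is_balanced(lambda x: min(x, 1), range(8)))  # False: 0 has 1 preimage, 1 has 7
```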

It is widely agreed that randomly selecting a domain element results in log2(4) = 2 bits of entropy in the selection,

When you say ‘randomly selecting’, that doesn't specify what probability distribution you're randomly selecting by. But if the entropy of the selection is log2(4), then clearly you mean the uniform distribution on the domain. I recommend you specify a distribution whenever you talk about a random selection.

and that because of being injective, the output of said function applied to this selection will also contain 2 bits of entropy.

Yes, if $f$ is injective then $H[f(X)] = H[X]$ for all random variables $X$ with any probability distribution, including the uniform distribution.
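
As a quick numerical check, here is a minimal Python sketch (the helper names are my own) computing the entropy of a distribution and of its pushforward under a function:

```python
import math
from collections import defaultdict

def entropy(dist):
    """Shannon entropy in bits of a distribution given as {outcome: probability}."""
    return sum(-p * math.log2(p) for p in dist.values() if p > 0)

def pushforward(dist, f):
    """Distribution of f(X) when X is distributed according to dist."""
    out = defaultdict(float)
    for x, p in dist.items():
        out[f(x)] += p
    return dict(out)

# A non-uniform distribution on a 4-element domain, and an injective f:
X = {0: 0.5, 1: 0.25, 2: 0.125, 3: 0.125}
f = lambda x: x + 10
print(entropy(X), entropy(pushforward(X, f)))  # 1.75 1.75 -- entropy preserved
```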

If we double our domain element count to 8, then a random selection will have log2(8) = 3 bits of entropy.

Again, only if the random selection is uniformly distributed over the whole domain.

Because of the pigeonhole principle there are collisions, but we defined a uniform distribution, so each of the 4 codomain elements has two domain elements mapped to it.

Yes, if by ‘uniform distribution’ you mean a function whose preimage sets $f^{-1}(y) \subseteq A$ all have the same size $\#f^{-1}(y)$ for every element $y \in B$ in the image.

In the second case, clearly the 3 bits of entropy of the input can't be preserved in the output, but why would fewer than two bits of entropy be preserved? Isn't it still just equiprobably selecting an element from the codomain, given the lack of bias as defined by the uniform distribution of collisions?

Let's take a concrete example.

Define $f(x) = x \bmod 4$ on $\{0,1,2,\dotsc,15\}$. You can easily confirm that $f$ has the property you called ‘uniform distribution’—every element of the image $\{0,1,2,3\}$ has exactly four preimages. That is, under $f$, the following sets of inputs obviously collide:

  • $\{0,4,8,12\}$
  • $\{1,5,9,13\}$
  • $\{2,6,10,14\}$
  • $\{3,7,11,15\}$

Consider the following two probability distributions on the domain of $f$:

  • $P(x) = 1/4$ for $x \in \{0,1,2,3\}$, and zero otherwise.
  • $Q(x) = 1/4$ for $x \in \{0,4,8,12\}$, and zero otherwise.

Clearly $P$ and $Q$ have the same entropy—2 bits. What is the effect of $f$ on the entropy?

  • Let $X \sim P$. Then $f(X)$ has four possible outcomes each with equal probability 1/4, so the entropy is the same: $H[f(X)] = H[X] = 2\,\mathrm{bits}$.

  • Let $X \sim Q$. Then $f(X) = 0$ with probability 1. So $H[f(X)] = 0$.

Obviously neither $P$ nor $Q$ is the uniform distribution on the domain of $f$. If we define $U$ to be that distribution—that is, $U(x) = 1/16$ for each $x \in \{0,1,2,\dotsc,15\}$, and draw $X \sim U$—then sure, $H[f(X)] = 2\,\mathrm{bits}$, the maximum possible.
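
Reusing the `entropy` and `pushforward` helpers from the sketch above, the whole example can be checked numerically:

```python
f = lambda x: x % 4
P = {x: 1/4 for x in (0, 1, 2, 3)}    # one representative of each collision class
Q = {x: 1/4 for x in (0, 4, 8, 12)}   # all mass on a single collision class
U = {x: 1/16 for x in range(16)}      # uniform on the whole 16-element domain

for name, dist in (("P", P), ("Q", Q), ("U", U)):
    print(name, entropy(dist), "->", entropy(pushforward(dist, f)))
# P 2.0 -> 2.0
# Q 2.0 -> 0.0   (f(X) is the constant 0)
# U 4.0 -> 2.0
```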

Squeamish Ossifrage
  • Thanks, this response made a huge contribution to my understanding. Two things to add though. First, by uniform distribution of collisions I meant the first sentence here https://en.m.wikipedia.org/wiki/Discrete_uniform_distribution regarding collisions. Is this incorrect language in this context? Secondly, consider a diceware passphrase fed to a KDF: because the set of such passwords (of any N entropy bits) can never cover the hash domain, an attacker can look for biases like those in your example. Is identifying such domain biases different from a rainbow table attack? Does salt help? – Gratis Nov 15 '19 at 05:41
  • @Gratis Neither the first sentence—‘In probability theory and statistics, the discrete uniform distribution is a symmetric probability distribution whereby a finite number of values are equally likely to be observed; every one of n values has equal probability 1/n.’—nor any of the article mentions collisions, so it's not clear to me what you mean by ‘uniform distribution on collisions’, or how you connect that to whether a function is injective or not. That's why my best guess was that you meant the property of a function that I described, without reference to any probability at all. – Squeamish Ossifrage Nov 15 '19 at 06:08
  • @Gratis I don't know what you mean by ‘look for biases’ or ‘domain biases’, but: (1/2) We usually model a KDF as a uniform random function, meaning that the output is independent of the input (and of all outputs on distinct inputs) except insofar as you can use the output to test a guess about what the input is. We usually don't model any specific mathematical structure of the KDF like my $f$ example, beyond the ability to compute the KDF forward; generally, only a broken hash function has such exploitable structure. – Squeamish Ossifrage Nov 15 '19 at 06:17
  • @Gratis (2/2) Rainbow tables are a specific technique for using a hash function to conduct a random walk on a space like passwords that accelerates finding preimages under the hash function. There is an advantage to rainbow tables only when you can use the one random walk for one function to find many preimages, whether by precomputation that you reuse or in a parallel search on a batch. A distinct salt per user thwarts rainbow tables by effectively using a different function for each user. – Squeamish Ossifrage Nov 15 '19 at 06:19
  • I'm not very mathematically versed, so I struggle some to find the correct words. What I meant was "every codomain element has the same number of domain elements mapped to it, such that the function is injective if domain and codomain are the same size, and such that if the domain is larger, no skew is produced toward any codomain element, which requires domain % codomain = 0." By "look for domain biases" I meant "characterize the lack of uniform selection over the domain: realizing $X \sim Q$ and $Q(x) = 1/4$ for $x \in \{0,4,8,12\}$ and 0 otherwise, so fully biased to that domain subset." – Gratis Nov 15 '19 at 06:50
  • My question regarding rainbow tables was whether identifying such "domain biases," for example by knowing diceware was used to select an element from the domain (thereby knowing there is 0 probability of domain elements not containing diceware words), and then exploiting them, requires computing the function over the identified domain subset (such as P or Q from your example), such that being able to exploit this lack of uniform distribution (P or Q, not U) requires a computational step similar to forming a rainbow table -- but to identify whether the effect is like P or like Q, or otherwise. – Gratis Nov 15 '19 at 06:58
  • @Gratis So I guessed correctly what you meant, but ‘uniform distribution’ is definitely not the term for this property of a function; there's a term ‘balanced’ that is sometimes used to mean this but it's not very common since the property doesn't turn up much. (For example, SHA-256 is almost certainly not balanced.) I lost track of your last comment halfway through the syntax, sorry. You can certainly do a rainbow table random walk over the space of diceware passphrases. It's extremely unlikely that it would interact nontrivially with (say) SHA-256 but not MD5 like $Q$ interacts with $f$. – Squeamish Ossifrage Nov 15 '19 at 07:05
  • I suggest picking up an introduction to probability theory, or maybe a textbook on ‘discrete math’ which covers things like the basic set theory and discrete probability theory that are involved here—if nothing else, it will help with getting the standard terminology straight, which will help in formulating clear questions. – Squeamish Ossifrage Nov 15 '19 at 07:11
  • Are you sure that isn't correct terminology? I searched and found things indicating otherwise. For example, https://books.google.com/books?id=9NTSAgAAQBAJ&pg=PA67&lpg=PA67 seems to use the same language I did. "Suppose g(x) is an invertible function" and "because all the codomain elements have the same probability, the distribution of g(x) is uniform." Also https://www.rocq.inria.fr/secret/Pascale.Charpin/CCCF-euroc-00.pdf "functions which achieve perfect diffusion and perfect confusion (called bent functions) are not balanced; that means that they do not have a uniform output distribution." – Gratis Nov 15 '19 at 13:54
  • Also, https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-042j-mathematics-for-computer-science-fall-2010/readings/MIT6_042JF10_chap17.pdf says both "Notice that the name “random variable” is a misnomer; random variables are actually functions!" and "A random variable that takes on each possible value with the same probability is said to be uniform." under "Uniform Distributions." This comment chain is becoming long and off topic though (my fault), so I will appreciate your final response if you provide it, and then goodbye and thanks so much for your help. – Gratis Nov 15 '19 at 14:02
  • @Gratis The Google Books link does not work for me, but from what you quoted, it sounds like the assumption is that the input is uniformly distributed. That is also the scenario of the paper you cited: if $X$ is a random variable with uniform distribution, and $f$ is a balanced function, then $f(X)$ is a random variable with uniform distribution. – Squeamish Ossifrage Nov 15 '19 at 14:05
  • @Gratis ‘Random variable’ is a primitive concept in the language of probability theory that can be formalized in terms of the language of measure theory as follows: A probability space is a measure space $(\Omega,F,P)$ consisting of a sample space $\Omega$ (e.g., all possible observations of an experiment), a set $F$ of subsets of $\Omega$ called events, and a measure $P$ defined on $F$ such that $P(\Omega)=1$, $P(A\cup B)=P(A)+P(B)-P(A\cap B)$, etc. For an event $E \in F$ (e.g., the coin came up heads and the odometer read 100km), we call $P(E)$ the probability of the event $E$. – Squeamish Ossifrage Nov 15 '19 at 14:10
  • @Gratis In this formalization, ‘a random variable $X \in \{0,1\}$’ is defined to be a function $X\colon \Omega \to \{0,1\}$ from the sample space $\Omega$ (remember, a ‘sample’ here represents the combination of all outcomes from a run of an experiment: first coin toss, second coin toss, odometer reading, hygrometer reading, etc.) to the possible values of a part of a sample (e.g., $X$ represents the first coin toss outcome), and the formal notation $\Pr[X=0]$ means $P(X^{-1}(0))=P(\{\omega\in\Omega:X(\omega)=0\})$, i.e. the fraction of all outcomes $\omega$ where $X(\omega)=0$. – Squeamish Ossifrage Nov 15 '19 at 14:15
  • @Gratis Again in this formalization, when $X$ is a random variable and $f\colon A \to B$ is a function and we write $f(X)$, what that means is the random variable $Y(\omega)=f(X(\omega))$; then for $y\in B$, the notation $\Pr[f(X)=y]$ means $P(X^{-1}(f^{-1}(y)))=P(\{\omega\in\Omega:f(X(\omega))=y\})$, i.e. the fraction of all outcomes $\omega$ where $f(X(\omega))=y$. Now, if we have a random variable $X\in A$ with uniform distribution so $\Pr[X=x]=1/\#A$, and if the function $f\colon A\to B$ is balanced, then the random variable $f(X)\in B$ has uniform distribution: $\Pr[f(X)=y]=1/\#B$. – Squeamish Ossifrage Nov 15 '19 at 14:19
  • @Gratis The formalization can be pedagogically useful for explaining how to define the algebra of random variables to someone familiar with sets and functions, like writing an interpreter for an unfamiliar programming language in a familiar one to understand how it works. But in practical uses of the algebra of random variables, it's never necessary to define the full sample space $\Omega$, like $\{0,1\}\times\{0,1\}\times[0,10^9\,\mathrm{km}]\times\{1,2,3,4,5,6\}$ for the two coin tosses, odometer reading, and die roll. We just imagine extending it for every new kind of observation we make. – Squeamish Ossifrage Nov 15 '19 at 14:24
  • @Gratis Once you have a handle on the rules of the language of random variables, the formalization of a random variable as a function on the sample space can go away as pedagogical scaffolding (a toy rendering of it is sketched below). Side note: Some authors are not consistent about labeling the measure $P\colon F\to[0,1]$ defined on events separately from the formal notation $\Pr[X=0]:=P(\{\omega\in\Omega:X(\omega)=0\})$, and just use $P$ for it all. Sorry! Not my fault! (Also: in $P(A\mid B):=P(A\cap B)/P(B)$, the $\mid$ is part of the $P(\cdots)$ notation, not an operator on sets forming ‘$A\mid B$’ the way $A\cup B$ is. Sorry!) – Squeamish Ossifrage Nov 15 '19 at 14:31
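
To make the formalization in these comments concrete, here is a toy Python sketch (with a much smaller sample space than the coin/odometer example, and helper names of my own) treating random variables as functions on $\Omega$:

```python
import itertools
from fractions import Fraction

# Toy sample space Omega: two coin tosses and one die roll per experiment.
omega = list(itertools.product((0, 1), (0, 1), (1, 2, 3, 4, 5, 6)))
P = {w: Fraction(1, len(omega)) for w in omega}  # uniform measure, P(Omega) = 1

# Random variables, formalized as functions on the sample space.
X = lambda w: w[0]   # outcome of the first coin toss
D = lambda w: w[2]   # outcome of the die roll

def prob(event):
    """P(E) for an event E given as a predicate on sample points."""
    return sum(P[w] for w in omega if event(w))

print(prob(lambda w: X(w) == 0))                # Pr[X = 0] = 1/2
f = lambda b: 1 - b                             # a function applied to X
print(prob(lambda w: f(X(w)) == 1))             # Pr[f(X) = 1] = 1/2
print(prob(lambda w: X(w) == 0 and D(w) == 6))  # Pr[X = 0 and D = 6] = 1/12
```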
-1

Assuming our domain and codomain both have 4 elements, the uniform distribution of inputs to outputs means the function is injective.

No. This is not a cryptographic hash (what I call a pseudorandom function). Simplistically: Hash -> avalanche effect -> bin collisions -> 37% rate -> non injective -> codomain =/= domain.

You may be overthinking this. NIST considers a thingie called "The narrowest internal width". That means irrespective of what happens at the input, only the inner width matters. Call it the throat size. Or, any size bucket leaks at the same rate if the hole is the same size. The inner width is usually what's output. An $n$-width cryptographic hash function can only output $2^n (1 - \frac{1}{e})$ unique values. The input domain is irrelevant because it must pass through the $n$-bit-wide throat.

So if you conservatively round 37% to 50%, one bit of entropy disappears in a puff of XORs.
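
For what it's worth, the $1 - 1/e \approx 63\%$ figure can be checked empirically by modeling the hash as a uniform random function; this simulation sketch (parameters chosen small for speed) says nothing about any specific hash:

```python
import random

random.seed(1)
n = 16
size = 2 ** n  # number of possible n-bit outputs

# A uniform random function: an independent uniform output for each of the
# 2^n inputs. Count how many distinct outputs are actually attained.
image = {random.randrange(size) for _ in range(size)}
print(len(image) / size)  # ~0.632, i.e. about 37% of outputs are never hit
```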


Remember 'A' in my previous diagram. It cannot be output due to collisions.

Paul Uszak
  • ‘a cryptographic hash (what I call a pseudorandom function)’ PRF is a standard technical term in cryptography with a specific definition; ‘cryptographic hash’ is a broad term often meaning collision-resistant hash function or something modeled as a random oracle, but with many possible definitions depending on context, most of which are not PRFs. – Squeamish Ossifrage Nov 15 '19 at 03:06
  • ‘An $n$-width cryptographic hash function can only output $2^n (1 - \frac1e)$ unique values.’ This sentence is false. The quantity you named is an approximation to the expected number of outputs of a uniform random $n$-bit function on the set of all $n$-bit inputs, but there's no rule that, e.g., SHA-256 must attain exactly this number of outputs, and it would be rather surprising if SHA-256 on, say, $2n$-bit inputs didn't attain many more values; on $(n+k)$-bit inputs the expected number of outputs is roughly $2^n (1 - 1/e^{2^k})$, which rapidly approaches $2^n$ as $k$ grows (see the simulation sketch after these comments). – Squeamish Ossifrage Nov 15 '19 at 03:11
  • @SqueamishOssifrage Who are these comments for? – Paul Uszak Nov 15 '19 at 03:22
  • They're for you and anyone reading the answer. You can act on them to correct the mistakes in your answer; readers can act on them to see the mistakes. – Squeamish Ossifrage Nov 15 '19 at 03:26
  • ‘non injective -> codomain =/= domain’ This does not follow; perhaps you mean the image is smaller than the domain. The codomain of a function is a formal—and somewhat arbitrary—property of its definition. The image is the set of values that the function actually attains: $\{y \in B : \exists x.\, f(x) = y\}$. Any function $f\colon A \to B$ can be turned into another function $f'\colon A \to (B \cup B')$ with a larger codomain by defining $f'(x) = f(x)$ for all $x \in A$, but the image remains the same no matter how you artificially extend the codomain. – Squeamish Ossifrage Nov 15 '19 at 03:27
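
A quick simulation of the $2^n (1 - 1/e^{2^k})$ estimate mentioned in the comments above (again modeling the hash as a uniform random function, with a small $n$ for speed) shows the image fraction approaching 1 as $k$ grows:

```python
import math
import random

random.seed(0)
n = 12
size = 2 ** n  # number of possible n-bit outputs

for k in range(4):
    # Uniform random function evaluated on all 2^(n+k) inputs.
    image = {random.randrange(size) for _ in range(size * 2 ** k)}
    predicted = 1 - math.exp(-2 ** k)  # 1 - 1/e^(2^k)
    print(k, round(len(image) / size, 4), round(predicted, 4))
# k=0 ~0.632, k=1 ~0.865, k=2 ~0.982, k=3 ~0.9997
```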