
Let $H(h,k)$ be the expected output entropy of a random oracle $X:\left\{0,1\right\}^h \to \left\{0,1\right\}^k$ under uniformly random input, where $h$ does not necessarily equal $k$.

  1. Is it true that $\lim\limits_{h\to\infty}H(h,k) = k$ for constant $k$?
  2. If so, does the limit survive when the output grows with the input as $h = 2k$? In other words, does $\lim\limits_{h\to\infty} h - H(2h,h) = 0$ ? (For common hash functions like SHA-256/512, the input block size is twice the number of output bits.)

For #1, my hunch is that it is true simply by the Law of Large Numbers: the output distribution of $X$ concentrates toward the uniform distribution on $\left\{0,1\right\}^k$ as $h\to\infty$.

For #2, I am not so sure. A uniform distribution on $\left\{0,1\right\}^k$ (viewed as integers) has variance $\sigma^2 \sim 2^{2k} = 2^h$, which may "cancel out" this effect of the Law of Large Numbers as $h\to\infty$.


*Edit -- to calculate the expected entropy, I followed fgrieu's approach from "The effect of truncated hash on entropy", where fgrieu derived the expected entropy of a random $h=k$ oracle as $$H = h - \frac{1}{2^h}\sum_{j=1}^{2^h}n_jj\lg\left(j\right) \approx h - 0.8272453$$

However, for a more "general" random oracle (where possibly $h\ne k$), I obtained the expected counts $$E(n_j) = 2^k\binom{2^h}{j}\left(\frac{1}{2^k}\right)^j\left(1-\frac{1}{2^k}\right)^{2^h-j}$$ and so $$H = h - 2^{k-h}\sum_{j=1}^{2^h}\binom{2^h}{j}\left(\frac{1}{2^k}\right)^j\left(1-\frac{1}{2^k}\right)^{2^h-j}j\lg\left(j\right)$$ which does not seem to have a nice simple closed form or approximation like in the $h = k$ case there.
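For reference, here is a small Python sketch (my own, purely to sanity-check the formula numerically; it evaluates the binomial PMF in log space to avoid overflow for larger $h$):

```python
from math import lgamma, log, log1p, log2, exp

def expected_entropy(h: int, k: int) -> float:
    """E[H] for a random oracle {0,1}^h -> {0,1}^k, by direct summation."""
    N, M = 2 ** h, 2 ** k                    # domain and range sizes
    logp, log1mp = -k * log(2), log1p(-1.0 / M)
    s = 0.0
    for j in range(1, N + 1):                # j = preimage count of an output
        # binomial pmf Bin(N, 1/M) at j, computed in log space
        lpmf = (lgamma(N + 1) - lgamma(j + 1) - lgamma(N - j + 1)
                + j * logp + (N - j) * log1mp)
        s += exp(lpmf) * j * log2(j)
    return h - M / N * s                     # = h - 2^(k-h) * sum(...)

for h in range(4, 13):
    print(h, h - expected_entropy(h, h))     # approaches ~0.8272453
```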


*Edit #2 -- it seems Wolfram-Alpha also miscalculates the limits here. For example, in the simple case of a 1-bit random oracle $X:\{1,2,\ldots,n\}\to\{0,1\}$, the expected entropy would just be $$ H_n = -\frac{1}{2^n}\sum_{j=1}^{n-1}\binom{n}{j}\left[\left(\frac{j}{n}\right)\lg\left(\frac{j}{n}\right) + \left(1-\frac{j}{n}\right)\lg\left(1-\frac{j}{n}\right)\right]$$ (ignoring the two zero-entropy cases $j=0,n$, where $X$ maps everything to $0$ or everything to $1$).

But Wolfram-Alpha seems to suggest that $\lim\limits_{n\to\infty} H_n = 0$ (and even hangs when substituting $h=n$), when in fact $\lim\limits_{n\to\infty} H_n = 1$. This can be demonstrated by bounding $H_n$ from below with a (much simpler) sum, using the standard inequality $-x\lg x-(1-x)\lg(1-x)\ge 4x(1-x)$ on $[0,1]$: $$B_n = \frac{1}{2^n}\sum_{j=1}^{n-1}\binom{n}{j}\left[4\times\frac{j}{n}\left(1-\frac{j}{n}\right)\right] = 1 - \frac{1}{n}$$ such that $0 < B_n < H_n < 1$ for all $n > 1$, thus implying that $\lim\limits_{n\to\infty}H_n = 1$.
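A quick numerical check (my own sketch, using exact big-integer arithmetic for the binomial coefficients) confirms the squeeze $B_n < H_n < 1$:

```python
from math import comb, log2

def H_1bit(n: int) -> float:
    """Expected entropy of a random 1-bit oracle on n inputs (the sum above)."""
    two_n = 2 ** n
    s = 0.0
    for j in range(1, n):        # j = 0 and j = n are the zero-entropy cases
        x = j / n
        s += comb(n, j) / two_n * -(x * log2(x) + (1 - x) * log2(1 - x))
    return s

for n in (4, 16, 64, 256, 1024):
    print(n, 1 - 1 / n, H_1bit(n))   # B_n = 1 - 1/n  <  H_n  <  1
```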

ManRow
  • Unfortunately the linked approach flies in the face of a large body of work called the Leftover Hash Lemma, NIST's opinion, and that of commercial TRNG manufacturers. Imagine TRNGs that output $\ll 1$ bit/bit of entropy! Also, do not be led astray by the $h=2k$ thing. That's a direct result of the lemma I mentioned and only applies for 128-bit block sizes, e.g. MD5. – Paul Uszak Apr 20 '23 at 04:52
  • @Paul Uszak: I wouldn't say my approaches to TRNGs are entirely antagonistic to practice. But I do have things to say against some malpractice in TRNGs and their evaluation methods, including: taking as an argument of conformance that a black-box test of a conditioned RNG passes; overcomplexity (is it there to mask rigging?); prescribing unrealistically tight bounds for the monobit, poker, or similar tests (early FIPS 140, AIS31); using statistical approximations outside their applicability domain (AIS31); using references with errors (AIS31); not acknowledging or acting on reports of these (AIS31). – fgrieu Apr 20 '23 at 06:40
  • @fgrieu It sounds as if you've had some bad experiences with AIS31 :-( Yes, TRNG testing can be challenging, thus all the Qs herein. [ Flags wave & trumpets sound! ] I'm releasing ent3 soon, as an upgrade of the venerable ent test program. It specifically targets DIY TRNG makers in the 25kB to 1MB sample-size range. In addition to the 6 original tests, there are some additional ones, and all have accurate $p$ values. I hope you comment on it... – Paul Uszak Apr 21 '23 at 13:23

3 Answers


Notation and preliminaries

The random oracle implements a random function $X:\{0,1\}^h\to\{0,1\}^k$. For a given such $X$:

  • Let $j_i$ be the number of preimages in $\{0,1\}^h$ of a given element $i$ in $\{0,1\}^k$.

    We have $0\le j_i\le 2^h\,$ and $$\sum_{i\in\{0,1\}^k}j_i\ =\ 2^h\label{fgrieu1}\tag{1}$$

    The entropy at the output of $X$ for uniformly random input is $$\begin{align}H&=\sum_{i\in\{0,1\}^k}\frac{j_i}{2^h}\log_2\left(\frac{2^h}{j_i}\right)\label{fgrieu2}\tag{2}\\ &=2^{-h}\sum_{i\in\{0,1\}^k}j_i\,\bigl(h-\log_2(j_i)\bigr)\label{fgrieu3}\tag{3}\\ \end{align}$$

  • Let $n_j$ be the number of $i$ with $j_i=j$.

    We have $$\displaystyle\sum_{j=0}^{2^h}n_j\ =\ 2^k\label{fgrieu4}\tag{4}$$ and by rearranging the terms in $\ref{fgrieu3}$ we get $$H=2^{-h}\sum_{j=1}^{2^h}n_j\,j\,\bigl(h-\log_2(j)\bigr)\label{fgrieu5}\tag{5}$$ (a small worked example of these quantities follows after this list).
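To make the bookkeeping concrete, here is a toy example (added for illustration; it is not needed for the argument) computing $\ref{fgrieu2}$ and $\ref{fgrieu5}$ for one fixed $X:\{0,1\}^2\to\{0,1\}$, in Python:

```python
from collections import Counter
from math import log2

# One fixed toy function X: {0,1}^2 -> {0,1}  (h = 2, k = 1)
X = {0b00: 0, 0b01: 1, 0b10: 1, 0b11: 1}
h = 2

j = Counter(X.values())          # preimage counts j_i: here j_0 = 1, j_1 = 3

# entropy via equation (2)
H2 = sum(ji / 2**h * log2(2**h / ji) for ji in j.values())

# entropy via equation (5): n_j = number of outputs i with exactly j preimages
n = Counter(j.values())          # here n_1 = 1, n_3 = 1
H5 = 2**-h * sum(nj * jj * (h - log2(jj)) for jj, nj in n.items())

print(H2, H5)                    # both 0.8112781...; (2) and (5) agree
```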

The distribution of $j_i$ over the $2^{2^h k}$ different $X$ is binomial, with mean $E(j_i)=\mu=2^{h-k}$ and standard deviation $\sigma=\sqrt{2^{h-k}(1-2^{-k})}$.

The question is about the expected $H(h,k)=E(H)$ over the $2^{2^h k}$ different $X$. Expectation is linear, thus even though the $n_j$ are dependent, equations $\ref{fgrieu4}$ and $\ref{fgrieu5}$ rigorously yield $$\displaystyle\sum_{j=0}^{2^h}E(n_j)\ =\ 2^k\label{fgrieu6}\tag{6}$$ $$H(h,k)=2^{-h}\sum_{j=1}^{2^h}E(n_j)\,j\,\bigl(h-\log_2(j)\bigr)\label{fgrieu7}\tag{7}$$

This can lead to at least asymptotic formulas for $H(h,k)$. I used this approach there to derive $\lim\limits_{h\to\infty}h-H(h,h)=0.82724538915300508343173\ldots^+$ bit.
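That limit is also easy to reproduce empirically. The following Python sketch (mine, added for illustration) draws uniformly random functions rather than truncating a real hash, and averages the output entropy per equation $\ref{fgrieu2}$:

```python
import random
from collections import Counter
from math import log2

def sampled_H(h: int, k: int, trials: int = 100) -> float:
    """Monte Carlo estimate of E(H) over uniformly random X:{0,1}^h->{0,1}^k."""
    N, M = 2 ** h, 2 ** k
    acc = 0.0
    for _ in range(trials):
        j = Counter(random.randrange(M) for _ in range(N))      # preimage counts
        acc += sum(ji / N * log2(N / ji) for ji in j.values())  # equation (2)
    return acc / trials

print(16 - sampled_H(16, 16))    # ~0.827, matching the derived limit
```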

Answer

  1. Yes, for fixed $k$, $\lim\limits_{h\to\infty}H(h,k)=k$.

    Intuitive argument: as $h$ goes to infinity, the binomial distribution of the $j_i$ degenerates to a Gaussian with the same mean $\mu$ and standard deviation $\sigma$ as above, and $\lim\limits_{h\to\infty}\sigma/\mu=0$. This implies that in $\ref{fgrieu7}$, the only terms of the sum that matter are those with $j$ close to $\mu=2^{h-k}$. Replacing $j$ with $2^{h-k}$ in $\ref{fgrieu7}$, then applying $\ref{fgrieu6}$, we get $$\begin{align}\lim\limits_{h\to\infty}H(h,k)&=2^{-h}\sum_{j=1}^{2^h}E(n_j)\,2^{h-k}\,\bigl(h-(h-k)\bigr)\\ &=2^{-k}\,k\sum_{j=1}^{2^h}E(n_j)\\ &=k\\ \end{align}$$ I wish I knew how to properly justify "This implies".

  2. Thinking about rigorously proving $\lim\limits_{h\to\infty}h-H(2h,h)=0^+$.

    But don't hold your breath.

fgrieu
  • Have a look at the proof of the Leftover Hash Lemma. It has your proof and demonstrates that your (1) and (2) are correct. Not the 0.827 thing though, as $e^{h-h} =1$. – Paul Uszak Apr 24 '23 at 07:39
  • @Paul Uszak: Are you suggesting Leftover Hash Lemma, Revisited, or A Pseudorandom Generator from any One-way Function, or some other source? Note: that $\lim\limits_{h\to\infty}h-H(h,h)=0.827…$ is easily verified experimentally with usual hashes truncated to e.g. 16 to 35 bits, by determining the $n_j$ and applying (5) [which is slightly easier than applying (2)]. I report on such an experiment here and also link to a thesis that (among other things) did just that. – fgrieu Apr 24 '23 at 07:50

fgrieu has shown that $\lim\limits_{h\to\infty} H(h,k) = k$ for any fixed $k$, so I will try to address the second question, that is, whether or not $$\lim\limits_{h\to\infty} h - H(2h,h) = 0$$ In particular, if we let $n = 2^h$, then we are asking whether $$\lim\limits_{n\to\infty} \lg(n) - n\sum_{j=0}^{n^2}\binom{n^2}{j}\left(\frac{1}{n}\right)^j\left(1-\frac{1}{n}\right)^{n^2-j}\left[\left(\frac{j}{n^2}\right)\lg\left(\frac{n^2}{j}\right)\right] = 0$$ or equivalently, whether $$\lim\limits_{n\to\infty} \sum_{j=0}^{n^2}\binom{n^2}{j}\left(\frac{1}{n}\right)^j\left(1-\frac{1}{n}\right)^{n^2-j}\left[\lg(n)+\left(\frac{j}{n}\right)\lg\left(\frac{j}{n^2}\right)\right] = 0$$ Note that as $n\to\infty$, the quantity inside the sum $$\binom{n^2}{j}\left(\frac{1}{n}\right)^j\left(1-\frac{1}{n}\right)^{n^2-j} \approx \mathcal{N}(j \,\vert\, n,n-1)$$ where $\mathcal{N}(x \,\vert\, \mu,\sigma^2)$ is the PDF of the normal distribution with mean $\mu$ and variance $\sigma^2$.

We can plug this into Wolfram Alpha to obtain: $$\lim\limits_{n\to\infty} \sum_{j=0}^{n^2}\mathcal{N}(j \,\vert\, n,n-1)\times\left[\lg(n)+\left(\frac{j}{n}\right)\lg\left(\frac{j}{n^2}\right)\right] = 0$$
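We can also evaluate the exact sum numerically for moderate $n$, with no normal approximation at all (a small Python sketch of my own; the binomial PMF is computed in log space to avoid overflow):

```python
from math import lgamma, log, log1p, log2, exp

def gap(n: int) -> float:
    """lg(n) - H(2h,h) for n = 2^h, i.e. the sum above, by direct summation."""
    N, logp, log1mp = n * n, log(1.0 / n), log1p(-1.0 / n)
    s = exp(N * log1mp) * log2(n)            # j = 0 term (with 0·lg 0 := 0)
    for j in range(1, N + 1):
        lpmf = (lgamma(N + 1) - lgamma(j + 1) - lgamma(N - j + 1)
                + j * logp + (N - j) * log1mp)
        s += exp(lpmf) * (log2(n) + (j / n) * log2(j / N))
    return s

for h in range(2, 9):
    print(h, gap(2 ** h))                    # positive and shrinking toward 0
```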


**Edit -- using "Wolfram Alpha" to obtain an answer is not a very rigorous approach (as @kodlu has pointed out in the comments). So, I'll attempt a more rigorous solution below...

In particular, where $n = 2^h$ just as above, we can express $$ H(2h, h) = \sum_{j=0}^{n^2}\binom{n^2}{j}\left(\frac{1}{n}\right)^j\left(1-\frac{1}{n}\right)^{n^2-j}\left[\left(\frac{j}{n}\right)\lg\left(\frac{n^2}{j}\right)\right] $$ and observe that the quantity inside the sum $$\left[\left(\frac{j}{n}\right)\lg\left(\frac{n^2}{j}\right)\right]$$ is bounded from below by the slightly different quantity $$\frac{1}{\ln 2}\times\left[-2\left(\frac{j}{n}\right)^2 + \left(3+\ln n\right)\left(\frac{j}{n}\right) - 1\right]$$ (writing $x = j/n$, this is equivalent to $x\ln x \le 2x^2 - 3x + 1$, which holds for all $x > 0$, with equality at $x=1$), such that $$ H(2h,h) \ge \sum_{j=0}^{n^2}\binom{n^2}{j}\left(\frac{1}{n}\right)^j\left(1-\frac{1}{n}\right)^{n^2-j}\times\frac{1}{\ln 2}\left[-2\left(\frac{j}{n}\right)^2 + \left(3+\ln n\right)\left(\frac{j}{n}\right) - 1\right] \\ = \lg(n) - 2\left(\frac{n - 1}{n^2\ln 2}\right) = h - 2\left(\frac{2^h - 1}{4^h\ln 2}\right)$$ and so, since $H(2h,h) \le h$ always, $$0 \le \lim\limits_{h\to\infty} h - H(2h,h) \le \lim\limits_{h\to\infty} 2\left(\frac{2^h - 1}{4^h\ln 2}\right) = \boxed{0}$$
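To double-check this numerically (my own sketch): the pointwise inequality behind the bound, and how fast the closed-form upper bound on the gap shrinks:

```python
from math import log

# Pointwise inequality behind the bound:  x·ln(x) <= 2x^2 - 3x + 1  for x > 0
# (equality at x = 1); checked here on a dense grid.
assert all(x * log(x) <= 2 * x * x - 3 * x + 1 + 1e-12
           for x in (i / 1000 for i in range(1, 10001)))

# Resulting upper bound on h - H(2h,h):
for h in range(2, 16):
    n = 2 ** h
    print(h, 2 * (n - 1) / (n * n * log(2)))   # -> 0 exponentially fast
```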


**Edit #2 -- the same "lower-bound" approach can also be used to answer the first question, i.e. to demonstrate that $\lim\limits_{h\to\infty} H(h,k) = k$ for any fixed $k$!

In particular, where $n = 2^h$ just as above, and $m = 2^k$, we can express $$H(h,k) = m\sum_{j=0}^{n}\binom{n}{j}\left(\frac{1}{m}\right)^j\left(1-\frac{1}{m}\right)^{n-j}\left[\left(\frac{j}{n}\right)\lg\left(\frac{n}{j}\right)\right]$$ and observe that the quantity inside the sum $$\left[\left(\frac{j}{n}\right)\lg\left(\frac{n}{j}\right)\right]$$ is bounded from below by the slightly different quantity $$\frac{1}{\ln 2}\times\left(\frac{j}{n}\right)\left[-m\left(\frac{j}{n}\right) + \ln(m) + 1\right]$$ (this is just the classic inequality $\ln t \le t - 1$ applied to $t = m\,j/n$), such that $$ H(h,k) \ge m\sum_{j=0}^{n}\binom{n}{j}\left(\frac{1}{m}\right)^j\left(1-\frac{1}{m}\right)^{n-j}\times\frac{1}{\ln 2}\left(\frac{j}{n}\right)\left[-m\left(\frac{j}{n}\right) + \ln(m) + 1\right] \\ = \lg(m) - \frac{m-1}{n\ln 2} = k - \frac{2^k-1}{2^h\ln 2}$$ and so $$\lim\limits_{h\to\infty} H(h,k) \ge \lim\limits_{h\to\infty} \left(k - \frac{2^k-1}{2^h\ln 2}\right) = \boxed{k}$$ which, combined with $H(h,k) \le k$, gives the limit exactly.
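(Numerical sanity check, my own sketch: the reduction to $\ln t \le t - 1$, and the closed-form lower bound climbing to $k$.)

```python
from math import log

# The bound reduces (dividing by x = j/n > 0) to  ln(t) <= t - 1  with t = m·x:
assert all(log(t) <= t - 1 + 1e-12 for t in (i / 100 for i in range(1, 1001)))

# Closed-form lower bound  k - (2^k - 1)/(2^h ln 2)  ->  k  as h grows:
k = 8
for h in range(8, 29, 4):
    print(h, k - (2 ** k - 1) / (2 ** h * log(2)))
```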

ManRow
  • How do you rigorously justify the statement "quantity inside..."? And a demonstration by Wolfram Alpha is not rigorous either. – kodlu Apr 24 '23 at 03:57
  • @kodlu The "quantity inside" is a binomial PMF that, under certain conditions, can be accurately approximated by the normal PDF... – ManRow Apr 24 '23 at 09:32
  • @kodlu But yes -- getting an answer from "Wolfram Alpha" does not count as rigorous... – ManRow Apr 24 '23 at 23:53
  • I've modified my answer but the lack of a rigorous proof for the last line should be addressed if possible. – kodlu Apr 25 '23 at 13:58
  • @kodlu I've added a more rigorous proof to my answer here above** – ManRow Apr 26 '23 at 01:51
  • Sorry, I cannot follow the expression below the line "slightly different quantity". It seems you dropped a logarithm somewhere? – kodlu Apr 26 '23 at 03:19
  • @kodlu Good catch! Just fixed that expression there (for the lower bound on $H(2h,h)$) – ManRow Apr 27 '23 at 01:32
  • @kodlu Did that address your concerns? – ManRow Apr 29 '23 at 08:52
  • Can you explain how you obtain the expression right above "and so" at the end? – kodlu Apr 29 '23 at 12:34
  • @kodlu Sure! Notice that there are three terms in the bracketed expression $$\frac{1}{\ln 2}\times\left[-2\left(\frac{j}{n}\right)^2 + \left(3+\ln n\right)\left(\frac{j}{n}\right) - 1\right]$$ inside the sum (for the lower bound on $H(2h,h)$), each of which is just some power of $j$ (either 0, 1, or 2) multiplied by some coefficient. – ManRow Apr 29 '23 at 23:28
  • @kodlu Consequently, you can split the whole sum (for the lower bound on $H(2h,h)$) into three separate sums, and then easily obtain a closed-form expression for each one, simply by applying the following general formulae: $$\begin{align} \sum_{k=0}^n\binom{n}{k}p^k\left(1-p\right)^{n-k}\times k^0 &= 1 \\ \sum_{k=0}^n\binom{n}{k}p^k\left(1-p\right)^{n-k}\times k^1 &= np \\ \sum_{k=0}^n\binom{n}{k}p^k\left(1-p\right)^{n-k}\times k^2 &= np(p(n-1)+1) \end{align}$$ (and, of course, with the necessary substitutions!) – ManRow Apr 29 '23 at 23:29
  • Thanks for the explanation – kodlu Apr 29 '23 at 23:35

Edit: I was wrong below, as pointed out in the comments by @ManRow:

@ManRow, unfortunately you cannot use the normal approximation to the binomial for answering Question 2. This is because the usual rule of thumb (see these stats notes, scroll to the bottom) for using this approximation (based on the de Moivre–Laplace theorem, Wikipedia), which is $$ np(1-p)\geq 10,\quad \text{or} \quad np(1-p)\geq 5~(\text{less stringently}), $$ does not hold for $p=1/n.$

For the Poisson approximation, see David Pollard's notes at Yale here. We can replace $\mathrm{Bin}(n,p)$ by $\mathrm{Poi}(np)$, and Le Cam famously showed that $$ \sum_{k=0}^\infty |P(X=k)-P(Y=k)|\leq 4p $$ if $X\sim \mathrm{Bin}(n,p)$ and $Y\sim \mathrm{Poi}(np)$, which is amazing since you are summing over all natural numbers; recall that for us $p=1/n$.
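For the binomial that actually appears in this problem, $\mathrm{Bin}(n^2,1/n)$ versus $\mathrm{Poi}(n)$, the $L_1$ distance can be checked numerically (my own sketch; the $4p$ column is the bound quoted above, with $p=1/n$):

```python
from math import lgamma, log, log1p, exp

def l1_bin_poi(n: int) -> float:
    """Sum_k |Bin(n^2, 1/n)(k) - Poi(n)(k)|, computed term by term."""
    N, lam = n * n, float(n)
    logp, log1mp = log(1.0 / n), log1p(-1.0 / n)
    total = 0.0
    for k in range(N + 1):
        lb = (lgamma(N + 1) - lgamma(k + 1) - lgamma(N - k + 1)
              + k * logp + (N - k) * log1mp)        # log Bin(n^2, 1/n) pmf
        lp = -lam + k * log(lam) - lgamma(k + 1)    # log Poi(n) pmf
        total += abs(exp(lb) - exp(lp))
    return total

for n in (4, 8, 16, 32, 64):
    print(n, l1_bin_poi(n), 4 / n)   # L1 distance vs. the 4p bound
```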

Another technique we can use is so-called Poissonization. The balls-in-bins process is negatively correlated when you consider distinct bins: if a bin has more balls, another one has fewer. This is quite technical and delicate (Mitzenmacher and Upfal's Probability and Computing book discusses it, but I will skip the details). It means that if we pretend that the numbers of balls falling into distinct bins arrive at some fixed Poisson rate $np,$ the approximations we need still hold in a very strong sense.

Therefore, we can approximate the binomial distribution in this problem by the relevant Poisson distribution. Recall that we throw $m=n^2$ balls into $n$ bins, thus we have the rate $\lambda=mp=n^2(1/n)=n.$

The recent paper https://arxiv.org/pdf/1001.2897.pdf has tight upper and lower bounds on the entropy of the Poisson distribution. This essentially means that for any $\lambda>0,$ we have (bottom of page 2) the approximation to the Poisson entropy, in nats, $$ H(\lambda)=\frac{1}{2}\ln(2\pi \lambda)+\frac{1}{2} -\frac{1}{12\lambda}-\frac{1}{24 \lambda^2} +O(\lambda^{-3}), $$ and for $\lambda=n$ large ($n=2^h$) we obtain the approximation $$ H\approx \frac{1}{2} \ln(2 \pi n)+\frac{1}{2}\approx \frac{1}{2}\ln(n)\ \text{nats} = \frac{1}{2}\lg(n)\ \text{bits} = 0.5\,h $$ which does not quite get us to the desired $$ \lim\limits_{h\to\infty}h-H(2h,h)=0^+ $$ since we can only conclude that the limit is $0.5 h.$ Maybe this is the real answer.
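As a sanity check on that expansion (my own sketch; entropy computed in nats by direct summation of the Poisson PMF):

```python
from math import lgamma, log, exp, pi

def poisson_entropy(lam: float) -> float:
    """H(Poi(lam)) in nats, summed over a generous range of k."""
    kmax = int(lam + 40 * lam ** 0.5) + 10   # covers the tail amply
    s = 0.0
    for k in range(kmax):
        lp = -lam + k * log(lam) - lgamma(k + 1)
        p = exp(lp)
        if p > 0.0:
            s -= p * lp
    return s

for lam in (4.0, 16.0, 64.0, 256.0):
    approx = 0.5 * log(2 * pi * lam) + 0.5 - 1 / (12 * lam) - 1 / (24 * lam * lam)
    print(lam, poisson_entropy(lam), approx)   # agreement improves with lam
```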

kodlu
  • This seems like a misunderstanding of the variables involved. In the quantity below, $$\binom{\boxed{a^2}}{k}\left(\frac{1}{a}\right)^k\left(1-\frac{1}{a}\right)^{\boxed{a^2}-k}$$ we have that $n=\boxed{a^2}$ and $p = 1/a$, so therefore $$\mu = np = a^2 \times (1/a) = a, \qquad \sigma^2 = np(1-p) = a^2 \times (1/a)(1-1/a) = a-1,$$ both of which tend to infinity as $a\to\infty$. – ManRow Apr 26 '23 at 08:00
  • So, just by replacing $a$ with $n$, and $k$ with $j$, it seems the normal approximation for $$\binom{n^2}{j}\left(\frac{1}{n}\right)^j\left(1-\frac{1}{n}\right)^{n^2-j} = \mathcal{N}(j~\vert~ n, n-1)$$ (from inside the sum) may very well apply here, as $\mu = n$ and $\sigma^2 = n - 1$ easily satisfy the criteria for the normal approximation, with both $\mu\gg 10$ and $\sigma^2\gg 10$ as $n\to\infty$. – ManRow Apr 26 '23 at 08:01
  • The case you're referring to, with $\mu = 1$ and $\sigma^2 = 1-1/n$, applies to a different question answered by fgrieu regarding $\lim\limits_{h\to\infty}h - H(h,h)$, not the $\lim\limits_{h\to\infty}h - H(2h,h)$ case here... – ManRow Apr 26 '23 at 08:02