1

I want to transform a semi random input to a shorter, uniformly random bit string. Assuming there is enough entropy in the semi random input, can I use a collision resistant hash function to extract randomness?

Update
After the first answer pointed out that the body of the question is not correctly asking the intended question, I didn’t want to edit the question, so that the answer given wouldn’t be confusing for the other readers. But the linked question, that is said to be duplicate, definitely does not answer my question.

1 Answers1

1

It's asked different things in title and body, leading to radically different answers.

Can I use a cryptographic hash function such as sha256 for Randomness Extraction ?

Yes for standard cryptographic hash functions including SHA-256, any SHA-2 or SHA-3 function, Blake2b (even SHA-1 or MD5 if one does not care giving a poor first impression), and any of their truncation. They are suitable to transform a semi random input into a shorter, uniformly random bit string, assuming there is enough entropy in the semi random input, and assuming that this input is not prepared with knowledge of hash function. To illustrate why that last condition is required: if the input $x$ was filtered or tweaked so that $\operatorname{SHA-256}(x)$ has it's first bit $0$, then $\operatorname{SHA-256}(x)$ would be easily distinguishable from random.


I want to transform a semi random input to a shorter, uniformly random bit string. Assuming there is enough entropy in the semi random input, can I use a collision resistant hash function to extract randomness?

No, in general. Collision resistance is not a sufficient property. E.g. the function defined by $F(x)=\operatorname{SHA-256}(x)\mathbin\|\operatorname{SHA-256}(x)$ is just as collision-resistant as SHA-256 is, but is not suitable for randomness extraction as it's output is easily distinguishable from random.

Collision resistance is not a necessary property either. E.g. SHA-256 truncated to it's first 8 bytes is not collision-resistant, but is suitable for randomness extraction.


Security is met if the entropy extraction function is a random member of a computationally secure Pseudo-Random Function Family (PRF).

Again, it's critical that the "semi random input" is not somewhat biased with knowledge related to the function used for randomness extraction. For example: no matter how large and random most of the input is, if one input bit is prepared with knowledge of the rest of the input and of the otherwise ideal extraction function, then it's easy to get the first bit of the output equal to 0 with probability near 75%.


Update following comment:

I found some (direct academic source of the "yes") for HMAC, but if the hash function is not keyed, would this still be the case?

Yes for any standard cryptographic hash function, if it's usable to instantiate a PRF (when the kind of length-extension property SHA-256 has does not matter). But I'm afraid I can't point a source, because my "not somewhat biased with knowledge related to the function used for randomness extraction" condition is both necessary, and hard to formalize other than as without knowledge of the key of a PRF, and there's no key in a hash. Hence the formal statement using HMAC, which has a key input and is an archetypal PRF.

We can rightly argue that the constants in the definition of a hash like SHA-256 in practice (and proof of the Merkle-Damgård construction used in SHA-256) plays the role of a key. It's widely believed that SHA-256 passes the PRF experiment when we use the 32-byte IV as key; and that's even provable under an ideal cipher model for the cipher underlying the Davies-Meyer compression function. But that's not part of the design goals of SHA-256, nor a very standard result, and we need lesser hypothesis to prove that HMAC is a PRF.

fgrieu
  • 140,762
  • 12
  • 307
  • 587
  • Could you point me to a source for your ‘yes’ answer? That was also my intuition but, I couldn’t find any direct academic sources, I found some for HMAC but if the hash function is not keyed, would this still be the case? – ONUR EREN ARPACI Nov 23 '22 at 13:14
  • @ONUR EREN ARPACI: I can't point a source; see update. – fgrieu Nov 23 '22 at 14:53
  • @fgrieu, couldn't you make a case for a generic case that in the random oracle model we have an extractor using a hash function? So we could use lemma 2 of https://eprint.iacr.org/2010/264.pdf? As you point out, SHA2 is not really a good generic random oracle. Hence, we require HMAC that can be proven indefferentiable from a RO. – Marc Ilunga Nov 23 '22 at 15:24
  • @Marc Ilunga: yes, a random oracle can easily be proven to be a perfect random extractor function in the question's sense. But any hash is distinguishable from a random oracle when you know the hash's definition (just submit any particular message and compare the outcome to what the hash's definition imply). Thus we need a formalization of "the random input is not somewhat biased with knowledge related to the function used for randomness extraction", and the standard such formalization is a keyed PRF with key unknown to what generates the input of the random extractor. And hashes have no key. – fgrieu Nov 23 '22 at 16:54
  • @fgrieu, "just submit any particular message and compare the outcome to what the hash's definition imply", a true random function would be "vuneralbe" to this attack as well right? After all, even a random function is still deterministic for a fixed randomness. Correct me if I am wrong, but an indifferentiability (not indistinguishability) result on a given hash function would capture the fact that we could use it in place of a RO (modulo assumptions on the internals). – Marc Ilunga Nov 23 '22 at 17:13
  • @Marc Ilunga: yes, a true random function is vulnerable. Lemma 2 in your reference succinctly repels that with "and independent from H", then discusses why that's necessary and what that means in no less than two paragraphs below the lemma. It ends up with "model H as a family of random functions indexed by a key rather than as a single deterministic function", so we are essentially back to what HMAC with a fixed key does, only it's not named HMAC. – fgrieu Nov 23 '22 at 18:07
  • @fgrieu, yeah, that's a good point regarding independence. But if the keying material can be assumed to be independent, my question remains then: If H was shown indifferentiable from a RO, can't it then be a good extractor? With that said, in practice, I would always use a salted HKDF. – Marc Ilunga Nov 23 '22 at 21:19
  • Re: "Security is met if..." There is no concept of security in randomness extraction functions. Absolutely anything works that provides a uniform(ish) output. XOR, bit dropping, CRC16, Pearson, MD5, anything that you've written yourself, e.t.c. It's one of the counter examples of why you can't roll your own algorithm in cryptography. – Paul Uszak Nov 23 '22 at 21:37
  • @PaulUszak "Absolutely anything works that provides a uniform(ish) output", wouldn't this be a characterization of some notion of security? – Marc Ilunga Nov 23 '22 at 21:45
  • @PaulUszak: an example of "semi random input" that no amount of bit dropping will turn "to a shorter, uniformly random bit string" is bits from a biased coin toss. And if I know that the output is conditioned by MD5, I can arrange that for "semi random inputs" with all bits truly random and a single apparently random other that I craft, the MD5 output consistently fails a thorough ENT test. The condition for robust randomness extraction when only "assuming there is enough entropy in the semi random input" is far from trivial! – fgrieu Nov 23 '22 at 22:05
  • @MarcIlunga I think that Mr. Grieu was suggesting that a cryptographically secure, non invertible extraction function is required. That's not the case. TRNG security comes from the underlying raw entropy stream. Entropy $\equiv$ security. The extractor function only has to change a wiggly wobbly distribution into a uniform one. E.g. von Neumann creates a 100% secure TRNG if the input is IID. HMACs are unnecessary. – Paul Uszak Nov 23 '22 at 22:13
  • Not really that hard. Left over hash Lemma, Piling up Lemma and von Neumann. I was generalising on the bit dropping, but there are papers featuring chaotic lasers that just take the lower bits. Didn't know that you were a fan of ENT. I've developed ent2 :-) – Paul Uszak Nov 23 '22 at 22:24
  • Thank you for the discussion, this has been helpful. https://people.seas.harvard.edu/~salil/pseudorandomness/extractors.pdf In page 174 section 6.1.3, it says that you can’t have deterministic extractors for k-sources. And then there is this huge literature for deterministic extractors for specific source models (such as, https://link.springer.com/book/10.1007/978-3-642-14903-0 ) so even if the source is not adversarial (for example, radiation waves coming from sun), I am still not sure if using a specific cryptographic hash function solves the problem. – ONUR EREN ARPACI Nov 24 '22 at 08:37
  • 1
    I would like to add that many CSPRNG's (or, rather, the better named DRBG's) within NIST SP 800-90Ar1 are based on hash functions, see section 10.1: DRBG Mechanisms Based on Hash Functions. The hash function used is commonly the only cryptographic primitive used to create the function. So if you have a hash and don't need top performance, you can just create an official DRBG out of it, possibly including things as reseeding and whatnot. For top security, base it off of HMAC (i.e. a PRF, see the answer), one of the options. – Maarten Bodewes Nov 24 '22 at 14:16