Can I treat SHA-256 hashes as 64 fair dice rolls with numbers between 1 and 16?

Question

My understanding was that SHA-256 is pretty random or "random" enough.

I assumed that would mean that every character would behave like a 1 to 16 dice roll.

With this assumption, I would expect that you can model the probability of repeating characters as $16^x$. So a chain of $\texttt{FFF}$ or $\texttt{333}$ would have a chance of 1 to $16^3 (4096)$ and a chain of $\texttt{FFFF}$ a chance of 1 to $16^4 (65536)$.

But while generating a lot of hashes (with random UUIDs as seed) to confirm my assumption the numbers do not add up. For example, in a set of 100k hashes I already have more than 1k chains of 4 characters or more (while I was expecting between 1 and 2 chains).

So here I am trying to understand why my assumption was wrong in the first place.

Did I fundamentally misunderstood the randomness of SHA-256 hashes or is it something else?

Not a clear question since your experiment is not clear. See my SHA-1's experiment on the leading zeros. How do you have 1K chain? Note that we model SHA-256 as a Pseudo-Random Function we don't know it is one one not. — kelalaka, Mar 02 '22 at 22:07
Your model missing a point that in 64-hex output of SHA-256 you need to find the probability of a sequence of 4 characters anywhere. You cannot really model as each output hex as a single roll since they are not independent of the input... — kelalaka, Mar 02 '22 at 22:13

score 4 · Accepted Answer · edited Mar 03 '22 at 16:59

4

So a chain of $\texttt{FFF}$ or $\texttt{333}$ would have a chance of 1 to $16^3 (4096)$

Actually, a chance of three repeated nybbles (be it $\texttt{FFF}$ or $\texttt{333}$ or $\texttt{000}$) would be 1 in $16^2 (256)$ - that happens because there are $16^3$ equally likely values of those 3 nybbles, and 16 of those patterns are repeats - hence the probability of a repeat is ${16 \over 16^3} = {1 \over 16^2}$. If you specify that they must be $\texttt{FFF}$ (and so $\texttt{333}$ would not count), you'd then get $16^3$; however that's not what you're doing.

For example in a set of 100k hashes I already have over 1k chains of 4 characters or more

That's about right - in 100k hashes, there are roughly 6,000,000 places where a string of 4 repeated nybbles might occur; any one place has a probability of $16^{-3} = {1 \over 4096}$ of being a repeat - a simplistic computation gives about an expected 1,400 strings of repeats.

I say simplistic, because this straight-forward computation ignores overlapping strings - for example, a string of 5 repeated nybbles would count as a run, not 2 runs of 4. In addition, the probabilities involved with overlapping strings are not independent. While these effects reduce the expected total somewhat, I believe that the simplistic computation is good enough for a back-of-the-envelope estimate.

edited Mar 03 '22 at 16:59

Maarten Bodewes

92,551
13
161
313

answered Mar 02 '22 at 22:18

poncho

147,019
11
229
360

thank you very much! By pointing out the error in my assumption i was able to understand where the problem is and with this video https://www.youtube.com/watch?v=O4Qnsubo2tg i was able to understand how i have to adjust my function – braunbaer Mar 02 '22 at 22:41
tbh i am still a little confused on why a chance of 1/4096 does not mean on avgr of 100k / 4096 outcomes , because that would be ~24. – braunbaer Mar 02 '22 at 23:30
@braunbaer Because in a 64 char hex string, there are 61 possible positions where there can be a 4-hex string sequence. For each of those positions, the chances of the first char being the same as the next three chars is (1/16)^3 = (1/4096). Hence, quad-repeating-hex sequences per hash will be (1/4096 * 61) = 0.01489257812. Per 100k hashes, that's 0.01489257812 * 100k = 1489. – knaccc Mar 03 '22 at 02:01
@knaccc yes! that makes so much sense. So just to be clear, if we would work with a 4 char hex string we would have a "plain" chance of 1/4096 as there is only one possible position for a 4-hex string sequence or (1/4096 *1) to be clear – braunbaer Mar 03 '22 at 08:46
@braunbaer yes, exactly. 1/4096 chance of all hex chars being the same, which is another way of saying that the 2nd, 3rd and 4th char are all the same as the first. – knaccc Mar 03 '22 at 11:03
This Q/A made HNQ, so I edited the question to be representative - which means also updating the answer of course - hope you don't mind. – Maarten Bodewes Mar 03 '22 at 17:01

Can I treat SHA-256 hashes as 64 fair dice rolls with numbers between 1 and 16?

1 Answers1