My understanding was that SHA-256 is pretty random or "random" enough.

I assumed that would mean that each character behaves like a roll of a 16-sided die.

With this assumption, I would expect the probability of a run of $x$ repeated characters to be 1 in $16^x$. So a chain of $\texttt{FFF}$ or $\texttt{333}$ would have a chance of 1 in $16^3$ (4096) and a chain of $\texttt{FFFF}$ a chance of 1 in $16^4$ (65536).

But when I generated a lot of hashes (with random UUIDs as input) to confirm my assumption, the numbers did not add up. For example, in a set of 100k hashes I already have more than 1k chains of 4 characters or more (while I was expecting between 1 and 2 chains).

So here I am trying to understand why my assumption was wrong in the first place.

Did I fundamentally misunderstand the randomness of SHA-256 hashes, or is it something else?

Maarten Bodewes
braunbaer

1 Answer

> So a chain of $\texttt{FFF}$ or $\texttt{333}$ would have a chance of 1 to $16^3$ (4096)

Actually, the chance of three repeated nybbles (be it $\texttt{FFF}$, $\texttt{333}$ or $\texttt{000}$) is 1 in $16^2$ (256). There are $16^3$ equally likely values for those 3 nybbles, and 16 of those patterns are repeats, so the probability of a repeat is ${16 \over 16^3} = {1 \over 16^2}$. If you required the run to be specifically $\texttt{FFF}$ (so that $\texttt{333}$ would not count), you would get 1 in $16^3$; however, that's not what you're counting.
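The counting argument above can be checked by brute force. The following is a minimal sketch (the function name `repeat_probability` is made up for illustration) that enumerates every possible string of $k$ hex digits and counts those consisting of a single repeated digit:

```python
from itertools import product

HEX = "0123456789abcdef"

def repeat_probability(k):
    """Return (favourable, total): how many of the 16**k strings of k hex
    digits consist of one digit repeated k times."""
    total = 0
    favourable = 0
    for digits in product(HEX, repeat=k):
        total += 1
        if len(set(digits)) == 1:  # all k digits are identical
            favourable += 1
    return favourable, total

for k in (3, 4):
    fav, total = repeat_probability(k)
    print(f"k={k}: {fav} of {total} patterns repeat, i.e. 1 in {total // fav}")
```

For $k = 3$ this reports 16 of 4096 patterns (1 in 256), and for $k = 4$ it reports 16 of 65536 (1 in 4096).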

> For example in a set of 100k hashes I already have over 1k chains of 4 characters or more

That's about right. In 100k hashes of 64 hex characters each, there are roughly 6,100,000 places where a string of 4 repeated nybbles might start (61 per hash), and any one place has a probability of $16^{-3} = {1 \over 4096}$ of being a repeat. A simplistic computation therefore gives an expected count of about 1,500 strings of repeats.

I say simplistic because this straightforward computation ignores overlapping strings: for example, a string of 5 repeated nybbles would count as one run, not two runs of 4. In addition, the probabilities involved with overlapping strings are not independent. While these effects reduce the expected total somewhat, I believe that the simplistic computation is good enough for a back-of-the-envelope estimate.
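To see how much the overlap matters, here is a rough simulation sketch. It assumes the hashes are SHA-256 digests of the 16 raw bytes of random version-4 UUIDs (the question does not say exactly how the UUIDs were fed in), and all function names are illustrative. It counts both every starting position of 4 equal digits (overlaps included) and maximal runs (a run of 5 counts once), and compares them with the corresponding expectations. A maximal run starts either at position 0 or where the preceding digit differs, which happens with probability $15 \over 16$.

```python
import hashlib
import re
import uuid

HASH_LEN = 64  # hex digits in a SHA-256 digest
RUN_LEN = 4    # minimum run length being counted

# Matches one maximal run of RUN_LEN or more identical hex digits.
RUN_RE = re.compile(r"([0-9a-f])\1{%d,}" % (RUN_LEN - 1))

def count_windows(hex_digest):
    """Count starting positions whose next RUN_LEN digits are all equal
    (overlapping positions are counted separately)."""
    return sum(
        len(set(hex_digest[i:i + RUN_LEN])) == 1
        for i in range(len(hex_digest) - RUN_LEN + 1)
    )

def count_runs(hex_digest):
    """Count maximal runs of at least RUN_LEN equal digits."""
    return sum(1 for _ in RUN_RE.finditer(hex_digest))

def simulate(n_hashes):
    """Hash n_hashes random UUIDs; return (total windows, total runs)."""
    windows = runs = 0
    for _ in range(n_hashes):
        digest = hashlib.sha256(uuid.uuid4().bytes).hexdigest()
        windows += count_windows(digest)
        runs += count_runs(digest)
    return windows, runs

def expected(n_hashes):
    """Expected (windows, runs) if every hex digit is uniform and independent."""
    positions = HASH_LEN - RUN_LEN + 1  # 61 starting positions
    p = 16 ** -(RUN_LEN - 1)            # 1/4096 per position
    exp_windows = n_hashes * positions * p
    # Position 0 always starts a new run; each later position only does so
    # when the preceding digit differs (probability 15/16).
    exp_runs = n_hashes * p * (1 + (positions - 1) * 15 / 16)
    return exp_windows, exp_runs

if __name__ == "__main__":
    n = 100_000
    print("simulated (windows, runs):", simulate(n))
    print("expected  (windows, runs):", expected(n))
```

Under these assumptions the expected values for 100k hashes are about 1,489 windows and about 1,398 maximal runs, so the two ways of counting differ by only a few percent. The simulated counts vary from run to run.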

Maarten Bodewes
poncho
  • Thank you very much! By pointing out the error in my assumption, you helped me understand where the problem is, and with this video https://www.youtube.com/watch?v=O4Qnsubo2tg I was able to understand how I have to adjust my function. – braunbaer Mar 02 '22 at 22:41
  • tbh I am still a little confused about why a chance of 1/4096 does not mean an average of 100k / 4096 outcomes, because that would be ~24. – braunbaer Mar 02 '22 at 23:30
  • @braunbaer Because in a 64 char hex string, there are 61 possible positions where there can be a 4-hex string sequence. For each of those positions, the chances of the first char being the same as the next three chars is (1/16)^3 = (1/4096). Hence, quad-repeating-hex sequences per hash will be (1/4096 * 61) = 0.01489257812. Per 100k hashes, that's 0.01489257812 * 100k = 1489. – knaccc Mar 03 '22 at 02:01
  • @knaccc yes! That makes so much sense. So just to be clear, if we worked with a 4 char hex string we would have a "plain" chance of 1/4096, as there is only one possible position for a 4-hex string sequence, or (1/4096 * 1) to be clear. – braunbaer Mar 03 '22 at 08:46
  • @braunbaer yes, exactly. 1/4096 chance of all hex chars being the same, which is another way of saying that the 2nd, 3rd and 4th char are all the same as the first. – knaccc Mar 03 '22 at 11:03
  • This Q/A made HNQ, so I edited the question to be representative - which means also updating the answer of course - hope you don't mind. – Maarten Bodewes Mar 03 '22 at 17:01
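
The arithmetic in the comments above can be collected into one formula. As a sketch, using symbols that do not appear in the original thread ($N$ for the number of hashes, $n$ for the number of hex characters per string, $k$ for the run length) and counting every starting position, overlaps included:

$$E = N \cdot (n - k + 1) \cdot 16^{-(k-1)}$$

For $N = 100{,}000$, $n = 64$ and $k = 4$ this gives $100{,}000 \cdot 61 / 4096 \approx 1489$. For a 4-character string ($n = 4$) there is only one possible position, so it gives $100{,}000 / 4096 \approx 24$, which is the figure from the earlier comment.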