I read a question about using a 32-bit random number as a "unique" identifier. One of the answers asserted that a "rough guide" for estimating collision likelihood is that there's a 50% chance of a collision once about sqrt(n) numbers have been chosen. This would mean that for a 32-bit number you'd have roughly a 50% chance of a duplicate after 65536 selections (and it linked to a wiki page that seemed to support this). I wrote a program to convince myself - and it's sort of done the opposite.
Essentially, the program generates a fixed number of 32-bit random values and puts them in a binary tree. If a value already exists in the tree, it terminates.
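Roughly, it looks like the sketch below (this isn't my exact code; in particular, the way a 32-bit value is built out of two rand() calls here is just one plausible construction, not necessarily the one I used):

```c
/* Minimal sketch of the experiment, not the original code: generate COUNT
 * 32-bit values, insert each into an unbalanced binary search tree, and stop
 * as soon as a value is seen twice. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

struct node {
    uint32_t value;
    struct node *left, *right;
};

/* Insert value into the tree; return 1 if it was already present, 0 otherwise. */
static int insert(struct node **root, uint32_t value)
{
    while (*root) {
        if (value == (*root)->value)
            return 1;
        root = (value < (*root)->value) ? &(*root)->left : &(*root)->right;
    }
    *root = calloc(1, sizeof **root);
    (*root)->value = value;
    return 0;
}

/* One possible way to get 32 bits out of rand(); assumes RAND_MAX >= 0xFFFF
 * (true on glibc, but not guaranteed by the C standard). */
static uint32_t rand32(void)
{
    return ((uint32_t)(rand() & 0xFFFF) << 16) | (uint32_t)(rand() & 0xFFFF);
}

int main(void)
{
    const long count = 1000000;     /* the "fixed number" of values per run */
    struct node *root = NULL;

    srand((unsigned)time(NULL));    /* different seed each run */
    for (long i = 0; i < count; i++) {
        uint32_t v = rand32();
        if (insert(&root, v)) {
            fprintf(stderr, "duplicate after %ld draws: %u\n", i + 1, v);
            return 1;
        }
        printf("%u\n", v);          /* dump values so runs can be combined later */
    }
    fprintf(stderr, "no duplicate in %ld draws\n", count);
    return 0;                       /* tree intentionally not freed; process exits */
}
```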
I started with the "fixed number" being 70,000 - thinking that should hit the 50% threshold and I'd see it terminate every couple of runs. It didn't happen, so I upped the number to 700,000 - and it still didn't happen (no matter how many times I ran it).
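For reference, my expectation comes from the usual birthday-problem approximation (assuming I have the formula right): for k draws from N equally likely values, P(collision) ≈ 1 - exp(-k(k-1)/(2N)). A quick check for the counts above:

```c
/* Birthday-problem approximation: for k draws from N equally likely values,
 * P(at least one collision) ~= 1 - exp(-k*(k-1) / (2*N)). */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double N = 4294967296.0;  /* 2^32 possible 32-bit values */
    const double draws[] = { 65536.0, 70000.0, 700000.0, 1000000.0 };

    for (int i = 0; i < 4; i++) {
        double k = draws[i];
        double p = 1.0 - exp(-k * (k - 1.0) / (2.0 * N));
        printf("%9.0f draws -> collision probability ~%.6f\n", k, p);
    }
    return 0;
}
```

That works out to roughly 43% at 70,000 draws and effectively 100% at 700,000 or 1 million, which is why I expected nearly every large run to terminate early.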
It occurred to me that maybe it's a result of the RNG (I can see that a random number generator's output will eventually become cyclical, but I thought there were mechanisms under the hood that basically meant a number could occur more than once without any obvious order forming).
Anyway, I've upped the number per run to 1 million. I've also stored the output of multiple runs and combined them, looking for a duplicate (this is also sort of nice because each run has a different seed, which (I would have thought) might counteract sequence problems specific to the RNG). So far, I've generated over 14 million numbers and haven't detected a duplicate.
Does anyone have some thoughts about what might be going on?
- a) Is the birthday paradox estimate only good for much smaller numbers? (and/or is the original assumption that sqrt(n) would be a reasonable estimate flawed)
- b) Would something about how an RNG is implemented contribute to unexpected results that make collisions less likely?
- c) Any other thoughts.....
Obviously there could be an issue in my code (and I would otherwise consider that the most likely explanation), but it lists out all the numbers sorted, and then I'm using "sort -u" to combine multiple outputs into my list of 14 million (if there were a duplicate, this step should result in fewer than 14 million numbers). So even if the code has an error where any single run gave bad results, I'd sort of expect to be able to detect that from the broader data set.....
EDIT in case anyone cares: turns out it was an RNG problem (or more specifically, my choice of RNG (C rand)). I changed to a different RNG (random) and now I'm getting results more in line with expectations. Meanwhile, I've generated over 21 million numbers with no repeats yet....
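For completeness, the shape of the change is roughly the sketch below (random() returns 31 random bits per call, 0 to 2^31-1, so two calls get combined; this packing is just one possibility, not necessarily my exact construction):

```c
/* Sketch of the replacement: build a 32-bit value from random(), which
 * returns values in 0 .. 2^31-1 (31 random bits) per call. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static uint32_t random32(void)
{
    uint32_t hi = (uint32_t)random() & 0xFFFFu;     /* low 16 bits of one call */
    uint32_t lo = (uint32_t)random() & 0xFFFFu;     /* low 16 bits of another */
    return (hi << 16) | lo;
}

int main(void)
{
    srandom((unsigned)time(NULL));      /* different seed per run */
    for (int i = 0; i < 5; i++)
        printf("%u\n", random32());     /* plugs into the same tree-insert loop */
    return 0;
}
```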
Doc Brown - actually, I wasn't asking anyone to debug; I was more interested in whether the birthday paradox scales up - because it didn't seem to (admitting there may well be a problem in my code). And that seems to have been answered....
– user679560 Dec 06 '23 at 02:46