I read a question about using a 32-bit random number as a "unique" identifier. One of the answers asserted that a "rough guide" for estimating collision likelihood is that there's a 50% chance of a collision once about sqrt(n) numbers have been chosen. This would mean that for a 32-bit number you'd have roughly a 50% chance of a duplicate after 65536 selections (and it linked to a wiki page that seemed to support this). I wrote a program to convince myself - and it's sort of done the opposite.
Essentially, the program generates a fixed number of 32-bit random values and puts them in a binary tree. If a value already exists in the tree, it terminates.
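Roughly, it looks like the sketch below (this isn't my exact code; in particular, the way a 32-bit value is built out of two rand() calls here is just one plausible construction, not necessarily the one I used):

```c
/* Minimal sketch of the experiment, not the original code: generate COUNT
 * 32-bit values, insert each into an unbalanced binary search tree, and stop
 * as soon as a value is seen twice. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

struct node {
    uint32_t value;
    struct node *left, *right;
};

/* Insert value into the tree; return 1 if it was already present, 0 otherwise. */
static int insert(struct node **root, uint32_t value)
{
    while (*root) {
        if (value == (*root)->value)
            return 1;
        root = (value < (*root)->value) ? &(*root)->left : &(*root)->right;
    }
    *root = calloc(1, sizeof **root);
    (*root)->value = value;
    return 0;
}

/* One possible way to get 32 bits out of rand(); assumes RAND_MAX >= 0xFFFF
 * (true on glibc, but not guaranteed by the C standard). */
static uint32_t rand32(void)
{
    return ((uint32_t)(rand() & 0xFFFF) << 16) | (uint32_t)(rand() & 0xFFFF);
}

int main(void)
{
    const long count = 1000000;     /* the "fixed number" of values per run */
    struct node *root = NULL;

    srand((unsigned)time(NULL));    /* different seed each run */
    for (long i = 0; i < count; i++) {
        uint32_t v = rand32();
        if (insert(&root, v)) {
            fprintf(stderr, "duplicate after %ld draws: %u\n", i + 1, v);
            return 1;
        }
        printf("%u\n", v);          /* dump values so runs can be combined later */
    }
    fprintf(stderr, "no duplicate in %ld draws\n", count);
    return 0;                       /* tree intentionally not freed; process exits */
}
```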
I started with the "fixed number" being 70,000 - thinking that should hit the 50% threshold and I'd see it terminate every couple of runs. It didn't happen, so I upped the number to 700,000 - and it still didn't happen (no matter how many times I ran it).
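For reference, my expectation comes from the usual birthday-problem approximation (assuming I have the formula right): for k draws from N equally likely values, P(collision) ≈ 1 - exp(-k(k-1)/(2N)). A quick check for the counts above:

```c
/* Birthday-problem approximation: for k draws from N equally likely values,
 * P(at least one collision) ~= 1 - exp(-k*(k-1) / (2*N)). */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double N = 4294967296.0;  /* 2^32 possible 32-bit values */
    const double draws[] = { 65536.0, 70000.0, 700000.0, 1000000.0 };

    for (int i = 0; i < 4; i++) {
        double k = draws[i];
        double p = 1.0 - exp(-k * (k - 1.0) / (2.0 * N));
        printf("%9.0f draws -> collision probability ~%.6f\n", k, p);
    }
    return 0;
}
```

That works out to roughly 43% at 70,000 draws and effectively 100% at 700,000 or 1 million, which is why I expected nearly every large run to terminate early.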
It occurred to me that maybe it's a result of the RNG (I can see that a random number generator's output will eventually become cyclical, but I thought there were mechanisms under the hood that basically meant a number could occur more than once without any obvious order forming).
Anyway, I've upped the number per run to 1 million. I've also stored the output of multiple runs and combined them, looking for a duplicate (this is also sort of nice because each run has a different seed, which (I would have thought) might counteract sequence problems specific to the RNG). So far, I've generated over 14 million numbers and haven't detected a duplicate.
Does anyone have some thoughts about what might be going on?
- a) Is the birthday paradox estimate only good for much smaller numbers? (and/or is the original assumption that sqrt(n) would be a reasonable estimate flawed)
- b) Would something about how an RNG is implemented contribute to unexpected results that make collisions less likely?
- c) Any other thoughts.....
Obviously there could be an issue in my code (and I would otherwise consider that the most likely explanation), but it lists out all the numbers sorted, and then I'm using "sort -u" to combine multiple outputs into my list of 14 million (if there were a duplicate, this step should result in fewer than 14 million numbers). So even if the code has an error where any single run gave bad results, I'd sort of expect to be able to detect that from the broader data set.....
EDIT in case anyone cares: turns out it was an RNG problem (or more specifically, my choice of RNG (C rand)). I changed to a different RNG (random) and now I'm getting results more in line with expectations. Meanwhile, I've generated over 21 million numbers with no repeats yet....
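For completeness, the shape of the change is roughly the sketch below (random() returns 31 random bits per call, 0 to 2^31-1, so two calls get combined; this packing is just one possibility, not necessarily my exact construction):

```c
/* Sketch of the replacement: build a 32-bit value from random(), which
 * returns values in 0 .. 2^31-1 (31 random bits) per call. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static uint32_t random32(void)
{
    uint32_t hi = (uint32_t)random() & 0xFFFFu;     /* low 16 bits of one call */
    uint32_t lo = (uint32_t)random() & 0xFFFFu;     /* low 16 bits of another */
    return (hi << 16) | lo;
}

int main(void)
{
    srandom((unsigned)time(NULL));      /* different seed per run */
    for (int i = 0; i < 5; i++)
        printf("%u\n", random32());     /* plugs into the same tree-insert loop */
    return 0;
}
```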
Doc Brown - actually, I wasn't asking anyone to debug; I was more interested in whether the birthday paradox scales up - because it didn't seem to (admitting there may well be a problem in my code). And that seems to have been answered....
– user679560 Dec 06 '23 at 02:46