I am trying to implement lookup- or join-like functionality in code, which leads me to the following combinatorial problem that I have not managed to tackle properly:

Let's assume I have 50 million numbers, for simplicity the numbers from 1 to 50.000.000. Now I split those numbers into groups of size 10.000, so I end up with 5.000 groups. For simplicity, let's assume the numbers are sorted, so that the first group contains 1...10.000, the next one 10.001...20.000, and so on. Next, let's shuffle the numbers and create groups of size 10.000 again. Now we have a second set of groups with randomly distributed numbers.

Now let's look at a group from the first set (for example 1...10.000) and ask: over how many groups from the second set are those numbers distributed? The best case would be just one group, if the numbers 1...10.000 happen to end up in a single group. Unlikely, but possible. The worst case would be 5.000 groups.

I would like to ask this question for all groups from the first set (1...10.000, 10.001...20.000, ...) and calculate the average. I would like to know the expected value of this average value if the numbers in the second set are randomly distributed.

I have solid basics in probability and math in general, but the grouping of numbers makes it hard for me to wrap my head around the problem, even for a single group. The expected average value is even trickier, because the groups are not independent.

Is this a well-known problem with an existing solution somebody could point me to? Is it solvable at all? Any hint on how to approach this would be highly appreciated.

Achim

1 Answer

We have $G=5000$ groups of size $S=10000$.

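This is approximately equivalent to the following: we place $S=10000$ balls into $G=5000$ urns, at random. What is the expected number of nonempty urns? (The shuffle actually draws without replacement, but with 50 million numbers the difference is negligible.)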

As in here, letting $X_i=1$ iff the urn $i$ is nonempty, we have $$P(X_i=1)=E(X_i)=1-(1-1/G)^S \approx 1 - \exp(-S/G)$$

and the expected number of nonempty urns is $G ( 1 - \exp(-S/G)) \approx 4323.3$
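As a quick numerical check of the formula above, here is a minimal Python sketch that evaluates both the finite expression $G(1-(1-1/G)^S)$ and its exponential approximation. The function name `expected_nonempty_urns` is made up for this illustration.

```python
import math


def expected_nonempty_urns(num_urns: int, num_balls: int) -> tuple[float, float]:
    """Return (finite formula, exponential approximation) for the urn model."""
    # Probability that one particular urn receives none of the balls.
    p_empty = (1.0 - 1.0 / num_urns) ** num_balls
    finite = num_urns * (1.0 - p_empty)
    # Large-G approximation: (1 - 1/G)^S is close to exp(-S/G).
    approx = num_urns * (1.0 - math.exp(-num_balls / num_urns))
    return finite, approx


finite, approx = expected_nonempty_urns(num_urns=5000, num_balls=10000)
print(f"finite formula: {finite:.2f}, exponential approximation: {approx:.2f}")
```

The two values differ by well under one group, so the approximation is harmless at this scale.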

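If you average over all the groups, the expected value is the same: by linearity of expectation, the expectation of the average equals the average of the expectations, even though the groups are not independent.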
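The averaging claim can also be checked empirically. The following is only a rough simulation sketch: it uses much smaller sizes than in the question ($G=50$, $S=100$, chosen purely so that it runs quickly), actually shuffles the numbers as described in the question, and compares the observed average with the urn-model estimate. The helper name `average_groups_touched` is hypothetical.

```python
import random


def average_groups_touched(num_groups: int, group_size: int, rng: random.Random) -> float:
    """Shuffle once, then average the number of second-set groups hit per first-set group."""
    n = num_groups * group_size
    numbers = list(range(n))
    rng.shuffle(numbers)
    # second_group[x] = index of the second-set group that number x landed in.
    second_group = [0] * n
    for position, x in enumerate(numbers):
        second_group[x] = position // group_size
    total = 0
    for g in range(num_groups):
        first_group = range(g * group_size, (g + 1) * group_size)
        # Count the distinct second-set groups containing members of this first-set group.
        total += len({second_group[x] for x in first_group})
    return total / num_groups


rng = random.Random(0)  # fixed seed so that runs are repeatable
G, S, runs = 50, 100, 20  # assumed small sizes, not the sizes from the question
simulated = sum(average_groups_touched(G, S, rng) for _ in range(runs)) / runs
urn_estimate = G * (1 - (1 - 1 / G) ** S)
print(f"simulated average: {simulated:.2f}, urn-model estimate: {urn_estimate:.2f}")
```

At such small sizes a slight gap between the two numbers is expected, because the shuffle draws without replacement while the urn model treats the placements as independent.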

leonbloy