I have a challenging problem that I cannot find a way to solve without a Python simulation.

Given a dataset of size X (very large number), we want to select H entries from X without replacement and the order doesn't matter. Then, repeat this process N times.

Selecting H from X follows a uniform distribution.

How can I estimate the total number of entries that would be selected multiple times?

I know that for a very large X, selection without replacement is not going to be that different from selection with replacement, so I tried modeling it using the binomial distribution, but I cannot wrap my head around how to start the calculation.

  • Can you please clarify what exactly you mean by replacement? I seem not to understand that... – donaastor Dec 08 '22 at 22:12
  • @donaastor sure, by replacement I mean we can select the same number more than once by placing it back into the pool of options to select from – Mina Ashraf Dec 08 '22 at 22:14
  • Ok so you mean you pick $H$ elements but you don't remove them, you just "marked" them and then in next picking you are actually giving the next session of marks to all the elements including those already marked. And then you ask what is the estimate of the number of elements that are marked at least once in that process, right? And of course the probability for picking is constant $\frac{1}{X}$ for all elements all the time. Right? – donaastor Dec 08 '22 at 22:16
  • @donaastor yes, exactly – Mina Ashraf Dec 08 '22 at 22:17
  • I think you should take a look at Stirling numbers, especially those of the second kind. They won't solve your problem directly, but they will perhaps show you how you could deal with situations like this and derive your own recursion formulas. – donaastor Dec 08 '22 at 22:19
  • @donaastor if we let H equal 1, wouldn't this be a variation of the birthday problem? – Mina Ashraf Dec 08 '22 at 22:22
  • What If we let each person have "H" birthdays? – Mina Ashraf Dec 08 '22 at 22:23
  • The way I would attempt to make those formulas is to parametrize all situations: by $X$, $H$, number of this "marking" process, say number $M$ and the number of marked elements, say $E$. Then I would define $f(X,H,M,E)$ to be the number of ways to get to that state and then I would look for relations between close tuples, for example the relation between $f(X,H,M,E)$ and $f(X,H,M+1,E+k)$ (I gave this at random). I am sure you will be able to derive some recursion with this approach. – donaastor Dec 08 '22 at 22:25
  • $H=1$ is already very difficult to analyze... – donaastor Dec 08 '22 at 22:26
  • btw actually estimating the number is easy, it is just $(1-\frac{H}{X})^N$ – donaastor Dec 08 '22 at 22:29
  • For $N=2$ you can use the probability here. You may try to extend it for $N \gt 2$. – Fabius Wiesner Dec 08 '22 at 22:36

1 Answer

For each entry, the probability it is selected in a particular round is $H/X$, since each round picks $H$ of the $X$ entries uniformly at random. The rounds are independent, so over the course of $N$ rounds, the number of times that entry is selected follows a binomial distribution with parameters $N$ and $H/X$. In particular, the probability the entry is selected twice or more is $$ \text{probability of being selected twice or more } = 1-(1-p)^N-\binom N1p(1-p)^{N-1}\tag{$*$} $$ where $p=H/X$.

Finally, using linearity of expectation, the expected number of entries selected multiple times is the number of entries times the probability for each entry: $$ \text{expected # entries selected twice or more }=X\left[1-(1-p)^N-\binom N1p(1-p)^{N-1}\right] $$

Furthermore, let $q$ be the probability in $(*)$. The expected number of entries selected twice or more is $qX$, but what about the variance? We cannot use a binomial distribution, since the events of being selected twice or more are dependent. However, since these events are negatively correlated, the variance is bounded above by the variance of a binomial. That is, $$ \text{variance of # entries selected twice or more }\le q(1-q)X. $$

So now you know the mean, and have an upper bound for the spread.
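Since the question mentions simulation, here is a minimal Python sketch that evaluates the formula for the mean and the variance bound, and compares the mean against a small Monte Carlo run. The function names, the number of trials, and the seed are illustrative choices, not part of the derivation, and the simulation is only practical for moderate $X$.

```python
import random
from math import comb


def prob_selected_twice_or_more(X, H, N):
    """q from (*): P(Binomial(N, H/X) >= 2) for a single entry."""
    p = H / X
    return 1 - (1 - p) ** N - comb(N, 1) * p * (1 - p) ** (N - 1)


def expected_repeats(X, H, N):
    """Expected number of entries selected in two or more rounds: q * X."""
    return X * prob_selected_twice_or_more(X, H, N)


def variance_upper_bound(X, H, N):
    """Upper bound q(1 - q)X on the variance of that count."""
    q = prob_selected_twice_or_more(X, H, N)
    return X * q * (1 - q)


def simulate_repeats(X, H, N, rng):
    """One run: N rounds, each drawing H distinct entries out of X."""
    counts = [0] * X
    for _ in range(N):
        # random.sample draws without replacement within a round
        for i in rng.sample(range(X), H):
            counts[i] += 1
    return sum(1 for c in counts if c >= 2)


def average_simulated_repeats(X, H, N, trials=200, seed=0):
    """Average of simulate_repeats over several independent runs."""
    rng = random.Random(seed)
    total = sum(simulate_repeats(X, H, N, rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    X, H, N = 1000, 50, 10  # illustrative values only
    print("formula mean:      ", expected_repeats(X, H, N))
    print("variance bound:    ", variance_upper_bound(X, H, N))
    print("simulated average: ", average_simulated_repeats(X, H, N))
```

For very large $X$ only the formula functions are needed; the simulation is there as a sanity check on smaller instances.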

Mike Earnest