I have a set $Q$ that is composed of $n_Q$ integers: $Q = \{Q_1,Q_2,...,Q_{n_Q}\}=\{1, 2, ..., n_Q\}$

I now have $N_D$ subsets, each designated $D_i$ and of size $n_{D_i}$. Each subset consists of numbers drawn at random from $Q$ without replacement: $D_i = \{Q_a, Q_b, ..., Q_z\}$

What I'd like to find is the number of unique integers in the union of all the subsets. In other words, the number of unique integers in $D_T=D_1 \cup D_2 \cup ... \cup D_{N_D}$.

I'd ideally like to express this as a PMF, after which I could compute expected values, confidence intervals, etc.

I've seen some examples solving simpler problems, but I'm struggling with the recursion when I try to generalize this. Any insight would be appreciated!

David

1 Answer

(This pertains to the original question referring to $\,\text{card}\left(\bigcap_{i}\,D_i\right)$ instead of $\,\text{card}\left(\bigcup_{i}\,D_i\right)$.)

Although it's not the PMF, here's a derivation of the expected value that might be useful.

Let the set $Q$ have $m$ elements, and let $D_1,\dots,D_n$ be independent size-$k$ random subsets of $Q$ (each formed by sampling $k$ elements without replacement from $Q$). Let $X^i_1,\dots,X^i_m$ be the indicators for the sample that forms $D_i$; i.e.,

$$X^i_j=\begin{cases}1&\text{if the $j$th element of $Q$ is in the $i$th sample}\\ 0&\text{otherwise}.\end{cases}$$

Then

$$\begin{align}N &:= \text{card}\left(\bigcap_{i=1}^n\,D_i\right) \\ &=\ \sum_{j=1}^m\prod_{i=1}^nX^i_j\end{align}$$

and

$$\begin{align}E(N) &= \sum_{j=1}^mE\left(\prod_{i=1}^nX^i_j\right)\\[2ex] &=\sum_{j=1}^mP\left(\bigwedge_{i=1}^n(X^i_j=1)\right)\\[2ex] &=\sum_{j=1}^m\prod_{i=1}^nP(X^i_j=1)\quad\text{because the samples are independent}\\[2ex] &=m\,P(X^i_j=1)^n\quad\text{because $P(X^i_j=1)$ is the same for all $i$ and $j$}\\[2ex] &=m\,\left({k\over m}\right)^n. \end{align}$$
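As a sanity check on the formula above, here is a minimal Python sketch that compares $m\,(k/m)^n$ against an exact brute-force average over every possible choice of the $n$ subsets. The function names are hypothetical, chosen just for this illustration, and the enumeration is only feasible for small $m$, $k$, $n$, since it visits $\binom{m}{k}^n$ cases.

```python
from fractions import Fraction
from itertools import combinations, product


def expected_intersection_formula(m, k, n):
    """Closed form from the derivation: E(N) = m * (k/m)**n, as an exact fraction."""
    return m * Fraction(k, m) ** n


def expected_intersection_bruteforce(m, k, n):
    """Exact E(N) by averaging |D_1 ∩ ... ∩ D_n| over all equally likely outcomes.

    Each D_i is a uniformly random size-k subset of {0, ..., m-1}, and the
    D_i are independent, so every n-tuple of size-k subsets is equally likely.
    """
    subsets = [frozenset(c) for c in combinations(range(m), k)]
    total = 0
    count = 0
    for choice in product(subsets, repeat=n):
        common = frozenset.intersection(*choice)
        total += len(common)
        count += 1
    return Fraction(total, count)


if __name__ == "__main__":
    m, k, n = 5, 2, 3  # small illustrative values (an assumption, not from the question)
    print(expected_intersection_formula(m, k, n))
    print(expected_intersection_bruteforce(m, k, n))
```

Using `Fraction` keeps both results exact, so the two values can be compared with `==` rather than within a floating-point tolerance.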

r.e.s.