We can actually solve some special cases. Suppose we have $n$ types of
coupons which are drawn in packets of $q$ coupons, with no duplicates
in the packets. We derive closed forms for any packet size and
evaluate them for $q=2.$ Now with $T_{m,n,q}$ the number of ways of
drawing $m$ packets of $q$-subsets of $[n]$ so that all possible
values of $n$ are present we get for the probability of $m$ draws the
closed form
$$P[T=m] = {n\choose q}^{-m}
\sum_{k=1}^q {n\choose n-k} \times T_{m-1, n-k, q}
\times {n-k \choose q-k}
\\ = {n\choose q}^{-m}
\sum_{k=1}^q {n\choose k} \times T_{m-1, n-k, q}
\times {n-k \choose q-k}.$$
This is for $n\gt q$ since the process always halts at the first step
when $n=q.$ The sum variable $k$ is the count of the values missing
before the draw of the subset at position $m$ or alternatively of the
values that appear in draw $m$ for the first time.
To compute the terms in $T$ we have a simple version of the
computation at the following MSE
link and introduce
the generating function
$$[z^q] \prod_{l=1}^n (1+zA_l)$$
which generates the $q$-subsets so that
$$\left([z^q] \prod_{l=1}^n (1+zA_l)\right)^m$$
generates the $m$-sequences of $q$-subsets. We use inclusion-exclusion
to remove those terms where some of the $n$ terms are missing. The
nodes $P\subseteq A$ in the poset represent terms from the generating
function where the elements of $P$ are missing plus possibly some
more. This is evidently accomplished by setting the $A_l\in P$ to
zero. We set the remaining $A_l$ to one to obtain a count. The
contribution for a given $P$ is
$$[z^q] (1+z)^{n-|P|} = {n-|P|\choose q}.$$
Therefore inclusion-exclusion yields
$$\sum_{p=0}^n {n\choose p} (-1)^p {n-p\choose q}^m$$
which is
$$\bbox[5px,border:2px solid #00A000]{ T_{m,n,q} =
\sum_{p=0}^n {n\choose p} (-1)^{n-p} {p\choose q}^m.}$$
This is zero when $m=0$ and $n\ge 1.$ Now as a sanity check we should
have $T_{m,n,1} = {m\brace n} \times n!$ and indeed we obtain
$$\sum_{p=0}^n {n\choose p} (-1)^{n-p} {p\choose 1}^m
= \sum_{p=0}^n {n\choose p} (-1)^{n-p} p^m
= {m\brace n} \times n!$$
and the check goes through. The rest is as shown at the following MSE
link.
Next let us try to verify that we have a probability
distribution. We have
$$\sum_{m\ge 1} P[T=m] =
\sum_{k=1}^q {n\choose k} {n-k\choose q-k}
\sum_{p=0}^{n-k} {n-k\choose p} (-1)^{n-k-p}
\sum_{m\ge 1} {n\choose q}^{-m}
{p\choose q}^{m-1}
\\ =
\sum_{k=1}^q {n\choose k} {n-k\choose q-k} {n\choose q}^{-1}
\sum_{p=0}^{n-k} {n-k\choose p} (-1)^{n-k-p}
\frac{1}{1-{p\choose q}/{n\choose q}}
\\ =
\sum_{k=1}^q {n\choose k} {n-k\choose q-k}
\sum_{p=0}^{n-k} {n-k\choose p} (-1)^{n-k-p}
\left({n\choose q}-{p\choose q}\right)^{-1}.$$
Specializing to $q=2$ we get
$$2\sum_{k=1}^2 {n\choose k} {n-k\choose 2-k}
\sum_{p=0}^{n-k} {n-k\choose p} (-1)^{n-k-p}
\frac{1}{n(n-1)-p(p-1)}
\\ = 2n(n-1) \sum_{p=0}^{n-1} {n-1\choose p} (-1)^{n-1-p}
\frac{1}{n(n-1)-p(p-1)}
\\ + n(n-1) \sum_{p=0}^{n-2} {n-2\choose p} (-1)^{n-2-p}
\frac{1}{n(n-1)-p(p-1)}.$$
We evaluate the two sums by residues, using for the first sum
$$f(z) = \frac{(n-1)!}{n(n-1)-z(z-1)}
\prod_{p=0}^{n-1} \frac{1}{z-p}
= -\frac{(n-1)!}{(z-n)(z-(1-n))}
\prod_{p=0}^{n-1} \frac{1}{z-p}.$$
The residue at infinity is zero and the residues at $n$ and $1-n$ are
$$-\frac{(n-1)!}{2n-1} \frac{1}{n!}
- \frac{(n-1)!}{1-2n} \frac{(-1)^n (n-2)!}{(2n-2)!}.$$
We get for the second sum by the same technique
$$-\frac{(n-2)!}{2n-1} \frac{1}{n!}
- \frac{(n-2)!}{1-2n} \frac{(-1)^{n-1} (n-2)!}{(2n-3)!}.$$
Negate and add to get
$$n(n-1) \times \\
\left(\frac{1}{2n-1}\left(\frac{2}{n}+\frac{1}{n(n-1)}\right)
+ \frac{(n-2)!}{1-2n} \frac{(-1)^{n-1} (n-2)!}{(2n-3)!}
\left(1-2\frac{n-1}{2n-2}\right)\right)
\\ = \frac{1}{2n-1} (2n-2+1) = 1.$$
This confirms it being a probability distribution.
We now give a closed form for the expectation. We find
$$\sum_{m\ge 1} m P[T=m] \\ =
\sum_{k=1}^q {n\choose k} {n-k\choose q-k}
\sum_{p=0}^{n-k} {n-k\choose p} (-1)^{n-k-p}
\sum_{m\ge 1} m {n\choose q}^{-m}
{p\choose q}^{m-1}
\\ =
\sum_{k=1}^q {n\choose k} {n-k\choose q-k} {n\choose q}^{-1}
\sum_{p=0}^{n-k} {n-k\choose p} (-1)^{n-k-p}
\frac{1}{\left(1-{p\choose q}/{n\choose q}\right)^2}.$$
This is
$$\bbox[5px,border:2px solid #00A000]{
{n\choose q} \sum_{k=1}^q {n\choose k} {n-k\choose q-k}
\sum_{p=0}^{n-k} {n-k\choose p} (-1)^{n-k-p}
\left({n\choose q}-{p\choose q}\right)^{-2}.}$$
We obtain $nH_n$ when we evaluate this for $q=1$ which is a good check
but not exactly surprising since we have already seen this work at the
other link (the reader is invited to attempt this computation using
the above formula as a starting point, which is easier than what
follows). We now try for a closed form for $q=2$ and get
$$2n^2(n-1)^2 \sum_{p=0}^{n-1} {n-1\choose p}
(-1)^{n-1-p} \frac{1}{(n(n-1)-p(p-1))^2}
\\ + n^2(n-1)^2 \sum_{p=0}^{n-2} {n-2\choose p}
(-1)^{n-2-p} \frac{1}{(n(n-1)-p(p-1))^2}.$$
We use residues as before with the function
$$g(z) = \frac{(n-1)!}{(n(n-1)-z(z-1))^2}
\prod_{p=0}^{n-1} \frac{1}{z-p}
= \frac{(n-1)!}{(z-n)^2(z-(1-n))^2}
\prod_{p=0}^{n-1} \frac{1}{z-p}.$$
Note that
$$\left(\prod_{p=0}^{n-1}\frac{1}{z-p}\right)'
= -\prod_{p=0}^{n-1}\frac{1}{z-p}
\sum_{p=0}^{n-1} \frac{1}{z-p}$$
We get for the residue at $n$
$$(n-1)!
\left(-\frac{1}{(2n-1)^2} \frac{1}{n!} H_n
-\frac{2}{(2n-1)^3} \frac{1}{n!}\right)$$
The residue at $1-n$ yields
$$(n-1)!
\left(\frac{1}{(1-2n)^2}
\frac{(-1)^n(n-2)!}{(2n-2)!} (H_{2n-2} - H_{n-2})
\\ - \frac{2}{(1-2n)^3} \frac{(-1)^n(n-2)!}{(2n-2)!}
\right)$$
For the second sum we get for the residue at $n$
$$(n-2)!
\left(-\frac{1}{(2n-1)^2} \frac{1}{n!} (H_n-1)
-\frac{2}{(2n-1)^3} \frac{1}{n!}\right)$$
and for the one at $1-n$
$$(n-2)!
\left(\frac{1}{(1-2n)^2}
\frac{(-1)^{n-1}(n-2)!}{(2n-3)!} (H_{2n-3} - H_{n-2})
\\ - \frac{2}{(1-2n)^3} \frac{(-1)^{n-1}(n-2)!}{(2n-3)!}
\right)$$
Collecting everything we find (observe that the terms on $H_{2n-2}$
and on $H_{n-2}$ and the third term from the residues at $1-n$ cancel
the same way as in the computation of the probability that we saw
earlier)
$$-\frac{1}{(2n-1)^2} H_n \left(\frac{2}{n} + \frac{1}{n(n-1)}\right)
+ \frac{1}{(2n-1)^2} \frac{1}{n(n-1)}
\\- \frac{2}{(2n-1)^3} \left(\frac{2}{n}+\frac{1}{n(n-1)}\right)
\\ + (n-1)! \frac{2}{(1-2n)^2}
\frac{(-1)^n(n-2)!}{(2n-2)!} \frac{1}{2n-2}$$
This is
$$-\frac{1}{(2n-1)n(n-1)} H_n
- \frac{1}{(2n-1)^2} \frac{1}{n(n-1)}
\\ + (n-2)! \frac{1}{(1-2n)^2}
\frac{(-1)^n(n-2)!}{(2n-2)!}$$
Flip the sign and multiply by $n^2(n-1)^2$ to obtain the formula
$$\frac{n(n-1)}{2n-1} H_n
+ \frac{n(n-1)}{(2n-1)^2}
+ (-1)^{n-1} \frac{n^2(n-1)^2}{(2n-1)^2}
\frac{(n-2)!\times(n-2)!}{(2n-2)!}$$
We conclude with the closed form (an exact result) for the case of
packets containing two coupons which is
$$\bbox[5px,border:2px solid #00A000]{
\frac{n(n-1)}{2n-1} H_n
+ \frac{n(n-1)}{(2n-1)^2}
+ (-1)^{n-1} \frac{n^2}{(2n-1)^2} {2n-2\choose n-1}^{-1}.}$$
This attractive formula obviously motivates further research, possibly
into the case $q=3,$ which looks difficult, perhaps requiring a
computer algebra system during the simplifications. Observe that the
dominant asymptotic is $1/2\times n H_n$ which means $q=2$ is about
twice as fast as a single coupon so the effect of there being no
packets with duplicate coupons is negligible.
Here is some Maple code to help with further research into these
statistics. Not optimized to the limit e.g. the sets could be replaced
with hash tables, but sufficiently functional to verify all of the
above.
with(combinat);
ENUM :=
proc(m, n, q)
local ind, setct, d, data, seen, pos, res;
setct := binomial(n, q);
data := choose({seq(p, p=1..n)}, q);
res := 0;
for ind from setct^m to 2*setct^m-1 do
d := convert(ind, base, setct);
seen :=
`union`(seq(data[d[pos]+1], pos=1..m));
if nops(seen) = n then
res := res + 1;
fi;
od;
res;
end;
T := (m, n, q) ->
add(binomial(n,p)*(-1)^p*binomial(n-p,q)^m, p=0..n);
PM :=
proc(m, n, q)
local ind, setct, d, data, seen, pos, res;
setct := binomial(n, q);
data := choose({seq(p, p=1..n)}, q);
res := 0;
for ind from setct^m to 2*setct^m-1 do
d := convert(ind, base, setct);
seen :=
`union`(seq(data[d[pos]+1], pos=1..m-1));
if nops(seen) < n and
nops(seen union data[d[m]+1]) = n
then
res := res + 1;
fi;
od;
res;
end;
PMX := (m, n, q) ->
add(binomial(n,k)*T(m-1,n-k,q)*binomial(n-k,q-k),
k=1..q);
STATAPPROX :=
proc(n, q, f, mx)
option remember;
local res, m, setct;
res := 0; setct := binomial(n, q);
for m to mx do
res := res +
f(m)*PMX(m, n, q)/setct^m;
od;
res;
end;
EXPT := (n, q) ->
binomial(n, q)
*add(binomial(n,k)*binomial(n-k,q-k)
*add(binomial(n-k,p)*(-1)^(n-k-p)
*(binomial(n,q)-binomial(p,q))^(-2),
p=0..n-k), k=1..q);
EXPT2 := n ->
n*(n-1)*harmonic(n)/(2*n-1) + n*(n-1)/(2*n-1)^2
+ (-1)^(n-1)*n^2/(2*n-1)^2*binomial(2*n-2,n-1)^(-1);
The following simple C program can be used to verify the closed
form of the expectation by numerical simulation. These were in
excellent agreement to several significant figures on all values that
were examined. Compiled with the std=gnu99 option.
#include <stdlib.h>
#include <stdio.h>
#include <assert.h>
#include <time.h>
#include <string.h>
long choose(long n, long k)
{
long num = 1, denom = 1;
while(k > 0){
num *= n;
denom *= k;
n--; k--;
}
return num/denom;
}
void recurse(int n, int q, int *cdata, int *cposrf,
int nxt, int *sofar, int sf)
{
if(sf == q){
memcpy(cdata + q*(*cposrf), sofar, q*sizeof(int));
(*cposrf)++;
return;
}
if(nxt >= n) return;
recurse(n, q, cdata, cposrf, nxt+1, sofar, sf);
sofar[sf++] = nxt;
recurse(n, q, cdata, cposrf, nxt+1, sofar, sf);
sf--;
}
int main(int argc, char **argv)
{
int n = 6 , q = 3, trials = 1000;
if(argc >= 2){
n = atoi(argv[1]);
}
if(argc >= 3){
q = atoi(argv[2]);
}
if(argc >= 4){
trials = atoi(argv[3]);
}
assert(1 <= n);
assert(1 <= q && q < n);
assert(1 <= trials);
int allpackets = (int)choose(n, q);
int cdata[allpackets*q];
int choice[q], cpos = 0;
recurse(n, q, cdata, &cpos, 0, choice, 0);
int pidx;
for(pidx = 0; pidx < allpackets; pidx++){
int offs = pidx*q;
printf("%d", cdata[offs]);
for(int elidx = 1; elidx < q; elidx++)
printf(", %d", cdata[offs+elidx]);
printf("\n");
}
srand48(time(NULL));
long long data = 0;
for(int tind = 0; tind < trials; tind++){
int seen = 0; int steps = 0;
int dist[n];
for(int cind = 0; cind < n; cind++){
dist[cind] = 0;
}
while(seen < n){
int packet = drand48() * (double)allpackets;
int offs = packet * q;
steps++;
for(pidx = 0; pidx < q; pidx++){
int coupon = cdata[offs + pidx];
if(dist[coupon] == 0)
seen++;
dist[coupon]++;
}
}
data += steps;
}
long double expt = (long double)data/(long double)trials;
printf("[n = %d, q = %d, trials = %d]: %Le\n",
n, q, trials, expt);
exit(0);
}
Figured it was better phrased as a card question.
– Landon Bland Feb 19 '17 at 09:12