Coupon collector's variation with unequal probabilities and uneven number of items required

Question

I am not sure about this being a negative binomial distribution problem or a variation of Coupon collector's problem.

Here's the problem, Suppose, you want to build a house and I told you that you need certain type and certain number of items to build it. Let's consider you need

Bricks $- ~4$,

Cement $-~ 1$,

Metal $- ~1$,

Gravel $- \ 1$,

Wood $-\ 2$

Now, in order to get these items, you need to open a locker and every time you open it you get only one item. Also, know that the locker contains an item that you don't need at all - Feathers. The following are their probabilities

(Let's use their initials for the sake of simplicity)

B - 30%

C - 5%

F - 10%

G - 10%

M - 35%

W - 10%

The question here is what's the average number of times you will have to open the locker if you want to build a house?

The way I thought about solving this is by first multiplying the no of items required for a type by its expected number which is $\frac{1}{probability}$ for that item (for ex. $3.33$ times for Bricks multiplied by the number we want which is $4$ ) and then adding together for all the type of materials we want. I am not so good at this type of problems so please guide me.

This can be analyzed as a finite-state absorbing Markov chain with $120$ states. — saulspatz, Dec 13 '19 at 06:06

score 3 · Answer 1 · answered Dec 13 '19 at 07:00

3

As I mentioned in a comment, this can be analyzed as a finite-state absorbing Markov chain. The Wikipedia article explains the calculation in the section titled "Expected number of steps."

I wrote a python script to do this.

from itertools import product
import numpy as np

needs = [4,1,1,1,2]
probs = [0.3,0.05,0.1,0.35,0.1]

P = np.zeros((120,120))
index = { }
states = list(product(range(5),range(2),range(2),range(2),range(3)))
for idx, state in enumerate(states):
    index[state] = idx
    P[idx, idx] = .1
for idx, state in enumerate(states):
    for i in range(5):
        j = state[i] if state[i] >= needs[i] else state[i]+1
        target = tuple(state[k] if k != i else j for k in range(5))
        tgt = index[target]
        P[idx, tgt] += probs[i]

Q=P[:-1,:-1]
N = np.linalg.inv(np.eye(119)-Q)
t = N.sum(axis=1)
print(f'Expected number of lockers: {t[0]}')

This produced the output

Expected number of lockers: 31.374487169390555

answered Dec 13 '19 at 07:00

saulspatz

53,131

Thanks for the help @saulspatz. Actually, I already wrote a program in C++ and have it run for over 1000 times to get an average of about 31 times with +5/-5 here and there. My problem is that with code you can sort of get it easy with simulations but if you want to show it a formula or like actually solve it on paper, how would it be done and that's been eating my brain for a long time. I need to plot the same solution in Excel so I m looking for a paper solution so that I can turn it into Excel – Matunga Dec 13 '19 at 07:10
2

This isn't a simulation. It's an exact calculation, except for rounding in the numerical calculations. There is a formula, given in the Wikipedia article I cited, but it's not one you're going to be able to put into Excel, or at least, I don't think you can. – saulspatz Dec 13 '19 at 07:13
I will give it a try and see how it goes. Thank you once again. – Matunga Dec 13 '19 at 07:22

score 2 · Answer 2 · answered Dec 13 '19 at 18:00

One approach is to find the expected number of lockers by means of an exponential generating function (EGF). If you are not familiar with generating functions, you might be interested in some of the resources mentioned in the answer to this question: How can I learn about generating functions?

As a start, we will find the EGFs for the number of ways to find four bricks, one bag of cement, etc., weighted by their probabilities.

The EGF for four or more bricks is $$\begin{align} F_B(x) &= \frac{1}{4!} 0.3^4 x^4 + \frac{1}{5!} 0.3^5 x^5 + \frac{1}{6!} 0.3^6 x^6 + \dots \\ &= e^{.0.3 x} - 1 - 0.3 x - \frac{1}{2!} 0.3^2 x^2 - \frac{1}{3!} 0.3^3 x^3 \end{align}$$ The EGF for one or more bags of cement is $$\begin{align} F_C(x) &= 0.05 x + \frac{1}{2!} 0.05^2 x^2 + \frac{1}{3!} 0.05^3 x^3 + \dots \\ &= e^{0.05 x} - 1 \end{align}$$ The EGF for zero or more bags of feathers is $$\begin{align} F_F(x) &= 1 + 0.1 x + \frac{1}{2!} 0.1^2 x^2 + \dots \\ &= e^{0.1 x} \end{align}$$ (I hope you see the pattern by now, so I'm going to move a little more quickly...)

The EGF for one or more bags of gravel is $$F_G(x) = e^{0.1 x} - 1$$ for one or more metal $$F_M(x) = e^{0.35 x} -1$$ for two or more wood $$F_W(x) = e^{0.1 x} - 1 - 0.1 x$$ With these preliminaries out of the way, we have an easy way to find the EGF of the probability of a sequence of $n$ lockers (not necessarily the least such $n$) in which we have all the materials we need: $$g(x) = F_B(x) \cdot F_C(x) \cdot F_F(x) \cdot F_G(x) \cdot F_M(x) \cdot F_W(x)$$ That is, the coefficient of $(1/n!) \; x^n$ in $g(x)$ is the probability $p_n$ that we have a complete set of materials on or before opening the $n$th locker.

Now let's define $T_n$ as the number of the locker in which we first have a complete set of materials. Then $P(T_n > n) = 1-p_n$. Let's define $q_n = 1-p_n$; then the EGF of $q_n$ is $e^x - g(x)$. By a well-known theorem, $$E(T_n) = \sum_{n=0}^{\infty} P(T_n > n) = \sum_{n=0}^{\infty} q_n$$

We can find this sum from the EGF for $q_n$ by taking advantage of the identity $$\int_0^{\infty} x^n e^{-x} \; dx= n!$$ from which $$E(T_n) = \int_0^{\infty} e^{-x}(e^x - g(x)) \; dx$$ Numerical evaluation of this integral in Mathematica yields $$E(T_n) = 31.3745$$

this is so much helpful. I really appreciate your help. – Matunga Dec 13 '19 at 20:48 — Matunga, Dec 13 '19 at 20:48

score 1 · Answer 3 · answered Dec 13 '19 at 05:58

1

A careful analysis is hard because of the lack of symmetry. If one item is rare you can assume you have enough of all the rest by the time you get that one. Here both cement and wood need on average $20$ lockers to give you what you need.

Since both cement and wood are rare we will ignore the correlation that comes from the fact that if you get cement you can't get wood. Then if you open $n$ lockers the chance you have gotten cement is $1-0.95^n$. The chance you have gotten two wood or more is $1-0.9^n-n\cdot 0.1 \cdot 0.9^{n-1}$. The easy calculation is what is the $n$ that guarantees the product of these exceeds $0.5$, which will be close to the expected number of tries to get them both. We can ask Alpha and learn you cross the threshold at $n=24$

answered Dec 13 '19 at 05:58

Ross Millikan

374,822

Hi, Ross, thanks for the help. I am getting by what you mean about the rarity of items. Is it possible to form a mathematical function/ formula for this? I actually need to solve this in Excel so I am trying to think of a formula to relate with – Matunga Dec 13 '19 at 06:11
I have 2 more questions. 1) Why is only C and B considered here 2)Would F be also counted in the average even though we don't need it to build a house but there's a chance of it being drawn? – Matunga Dec 13 '19 at 06:26
I only considered C and W (not B) because they were the ones that would on average take the most lockers to find what you need. I said that. At the extreme you could have C and W like you said and A, needing only 1, that is the other 85%. Clearly you will have an A by the time you have a C and two Ws so we can ignore the waiting for A. Your case is not so extreme, so you could wait for one of the others longer than one C and two Ws. The Fs do not matter at all because they just change the chance of getting the other things and are reflected there. – Ross Millikan Dec 13 '19 at 06:40
Thanks again, I understood your reasoning. Would you be able to help to devise a formula on what you just explained? I am sorry if I am asking too much. What you said makes sense to me but its a little hard for me to figure out how to do it for bricks, gravel and then sort of bring all of them together to get that expected value – Matunga Dec 13 '19 at 06:58
Actually, the expectation is $31.37$ by a Markov chain calculation. – saulspatz Dec 13 '19 at 07:01

score 1 · Answer 4 · answered Dec 14 '19 at 00:40

Some great answers have already been provided. You can also do this using inclusion–exclusion.

There are five conditions to fulfill, namely to have obtained the required amount of each of the five materials. Denote by $N$ the number of lockers needed to fulfill all five conditions. Denote by $N_i$ the number of lockers needed to fulfill condition $i$, by $N_{ij}$ the number of lockers needed to fulfill at least one of conditions $i$ and $j$, and so on. Then by inclusion–exclusion

$$ P(N\gt n)=\sum_iP(N_i\gt n)-\sum_{\{i,j\}}P(N_{ij}\gt n)+\sum_{\{i,j,k\}}P(N_{ijk}\gt n)-\cdots\;. $$

Summing over $n$ yields the corresponding expression for the expectations:

$$ E[N]=\sum_iE[N_i]-\sum_{\{i,j\}}E[N_{ij}]+\sum_{\{i,j,k\}}E[N_{ijk}]-\cdots\;. $$

Denote by $m_i$ the amount of material $i$ required and by $p_i$ the probability to obtain material $i$ in a locker.

Then $E[N_i]=\frac{m_i}{p_i}$. Similarly, if $m_i=m_j=1$, then $E[N_{ij}]=\frac1{p_i+p_j}$, and if $m_i=m_j=m_k=1$, then $E[N_{ijk}]=\frac1{p_i+p_j+p_k}$.

If $m_i\gt1$ and $m_j=1$, then

$$ P(N_{ij}\gt n)=\sum_{l=0}^{m_i-1}\binom nlp_i^l(1-p_i-p_j)^{n-l} $$

and

\begin{eqnarray*} E[N_{ij}] &=& \sum_{n=0}^\infty P(N_{ij}\gt n) \\ &=& \sum_{n=0}^\infty\sum_{l=0}^{m_i-1}\binom nlp_i^l(1-p_i-p_j)^{n-l} \\ &=& \sum_{l=0}^{m_i-1}\left(\frac{p_i}{1-p_i-p_j}\right)^l\sum_{n=0}^\infty\binom nl(1-p_i-p_j)^n \\ &=& \sum_{l=0}^{m_i-1}\left(\frac{p_i}{1-p_i-p_j}\right)^l\frac{(1-p_i-p_j)^l}{(p_i+p_j)^{l+1}} \\ &=& \frac1{p_i+p_j}\sum_{l=0}^{m_i-1}\left(\frac{p_i}{p_i+p_j}\right)^l \\ &=& \frac1{p_j}\left(1-\left(\frac{p_i}{p_i+p_j}\right)^{m_i}\right)\;. \end{eqnarray*}

The calculation is essentially the same if we include more than one material of which we require only $1$, e.g. $j$ and $k$ with $m_j=m_k=1$, with $p_j+p_k$ taking the role of $p_j$ above:

$$ E[N_{ijk}]=\frac1{p_j+p_k}\left(1-\left(\frac{p_i}{p_i+p_j+p_k}\right)^{m_i}\right)\;. $$

Having in mind this way of including any number of conditions with requirement $1$, let's do $m_i\gt1$, $m_j\gt1$ with $m_k=1$ included right away, and we can set $p_k=0$ to get the result for just $m_i\gt1$, $m_j\gt1$ alone:

$$ P(T_{ijk}\gt n)=\sum_{l=0}^{m_i-1}\sum_{r=0}^{m_j-1}\binom nk\binom{n-k}rp_i^lp_j^r(1-p_i-p_j-p_k)^{n-l-r}\;, $$

and thus

\begin{eqnarray*} E[N_{ijk}] &=& \sum_{n=0}^\infty P(T_{ijk}\gt n) \\ &=& \sum_{n=0}^\infty\sum_{l=0}^{m_i-1}\sum_{r=0}^{m_j-1}\binom nk\binom{n-k}rp_i^lp_j^r(1-p_i-p_j-p_k)^{n-l-r} \\ &=& \sum_{l=0}^{m_i-1}\sum_{r=0}^{m_j-1}\frac{p_i^lp_j^r}{(1-p_i-p_j-p_k)^{l+r}}\sum_{n=0}^\infty\binom nk\binom{n-k}r(1-p_i-p_j-p_k)^n \\ &=& \sum_{l=0}^{m_i-1}\sum_{r=0}^{m_j-1}\frac{p_i^lp_j^r}{(1-p_i-p_j-p_k)^{l+r}} \binom{l+r}l\frac{(1-p_i-p_j-p_k)^{l+r}}{(p_i+p_j+p_k)^{l+r+1}} \\ &=& \frac1{p_i+p_j+p_k}\sum_{l=0}^{m_i-1}\sum_{r=0}^{m_j-1}\binom{l+r}l\left(\frac{p_i}{p_i+p_j+p_k}\right)^l\left(\frac{p_j}{p_i+p_j+p_k}\right)^r\;. \end{eqnarray*}

Now we have all the ingredients for our $31$-term inclusion–exclusion sum:

$$ E[N]= \frac4{\frac3{10}} +\frac1{\frac1{20}} +\frac1{\frac7{20}} +\frac1{\frac1{10}} +\frac2{\frac1{10}} -\frac1{\frac1{20}}\left(1-\left(\frac{\frac3{10}}{\frac3{10}+\frac1{20}}\right)^4\right) -\frac1{\frac7{20}}\left(1-\left(\frac{\frac3{10}}{\frac3{10}+\frac7{20}}\right)^4\right) -\frac1{\frac1{10}}\left(1-\left(\frac{\frac3{10}}{\frac3{10}+\frac1{10}}\right)^4\right) -\frac1{\frac1{20}}\left(1-\left(\frac{\frac1{10}}{\frac1{10}+\frac1{20}}\right)^2\right) -\frac1{\frac7{20}}\left(1-\left(\frac{\frac1{10}}{\frac1{10}+\frac7{20}}\right)^2\right) -\frac1{\frac1{10}}\left(1-\left(\frac{\frac1{10}}{\frac1{10}+\frac1{10}}\right)^2\right) -\frac1{\frac1{20}+\frac7{20}} -\frac1{\frac7{20}+\frac1{10}} -\frac1{\frac1{10}+\frac1{20}} +\frac1{\frac1{20}+\frac7{20}}\left(1-\left(\frac{\frac3{10}}{\frac3{10}+\frac1{20}+\frac7{20}}\right)^4\right) +\frac1{\frac7{20}+\frac1{10}}\left(1-\left(\frac{\frac3{10}}{\frac3{10}+\frac7{20}+\frac1{10}}\right)^4\right) +\frac1{\frac1{10}+\frac1{20}}\left(1-\left(\frac{\frac3{10}}{\frac3{10}+\frac1{10}+\frac1{20}}\right)^4\right) +\frac1{\frac1{20}+\frac7{20}}\left(1-\left(\frac{\frac1{10}}{\frac1{10}+\frac1{20}+\frac7{20}}\right)^2\right) +\frac1{\frac7{20}+\frac1{10}}\left(1-\left(\frac{\frac1{10}}{\frac1{10}+\frac7{20}+\frac1{10}}\right)^2\right) +\frac1{\frac1{10}+\frac1{20}}\left(1-\left(\frac{\frac1{10}}{\frac1{10}+\frac1{10}+\frac1{20}}\right)^2\right) +\frac1{\frac1{20}+\frac7{20}+\frac1{10}} -\frac1{\frac1{20}+\frac7{20}+\frac1{10}}\left(1-\left(\frac{\frac3{10}}{\frac3{10}+\frac1{20}+\frac7{20}+\frac1{10}}\right)^4\right) -\frac1{\frac1{20}+\frac7{20}+\frac1{10}}\left(1-\left(\frac{\frac1{10}}{\frac1{10}+\frac1{20}+\frac7{20}+\frac1{10}}\right)^2\right)\\ +\sum_{l=0}^3\sum_{r=0}^1\binom{l+r}r\left( -\frac1{\frac3{10}+\frac1{10}}\left(\frac{\frac3{10}}{\frac3{10}+\frac1{10}}\right)^l\left(\frac{\frac1{10}}{\frac3{10}+\frac1{10}}\right)^r +\frac1{\frac3{10}+\frac1{10}+\frac1{20}}\left(\frac{\frac3{10}}{\frac3{10}+\frac1{10}+\frac1{20}}\right)^l\left(\frac{\frac1{10}}{\frac3{10}+\frac1{10}+\frac1{20}}\right)^r\\ +\frac1{\frac3{10}+\frac1{10}+\frac7{20}}\left(\frac{\frac3{10}}{\frac3{10}+\frac1{10}+\frac7{20}}\right)^l\left(\frac{\frac1{10}}{\frac3{10}+\frac1{10}+\frac7{20}}\right)^r\\ +\frac1{\frac3{10}+\frac1{10}+\frac1{10}}\left(\frac{\frac3{10}}{\frac3{10}+\frac1{10}+\frac1{10}}\right)^l\left(\frac{\frac1{10}}{\frac3{10}+\frac1{10}+\frac1{10}}\right)^r\\ -\frac1{\frac3{10}+\frac1{10}+\frac1{20}+\frac7{20}}\left(\frac{\frac3{10}}{\frac3{10}+\frac1{10}+\frac1{20}+\frac7{20}}\right)^l\left(\frac{\frac1{10}}{\frac3{10}+\frac1{10}+\frac1{20}+\frac7{20}}\right)^r\\ -\frac1{\frac3{10}+\frac1{10}+\frac7{20}+\frac1{10}}\left(\frac{\frac3{10}}{\frac3{10}+\frac1{10}+\frac7{20}+\frac1{10}}\right)^l\left(\frac{\frac1{10}}{\frac3{10}+\frac1{10}+\frac7{20}+\frac1{10}}\right)^r\\ -\frac1{\frac3{10}+\frac1{10}+\frac1{10}+\frac1{20}}\left(\frac{\frac3{10}}{\frac3{10}+\frac1{10}+\frac1{10}+\frac1{20}}\right)^l\left(\frac{\frac1{10}}{\frac3{10}+\frac1{10}+\frac1{10}+\frac1{20}}\right)^r\\ +\frac1{\frac3{10}+\frac1{10}+\frac1{20}+\frac7{20}+\frac1{10}}\left(\frac{\frac3{10}}{\frac3{10}+\frac1{10}+\frac1{20}+\frac7{20}+\frac1{10}}\right)^l\left(\frac{\frac1{10}}{\frac3{10}+\frac1{10}+\frac1{20}+\frac7{20}+\frac1{10}}\right)^r \right) \\ = \frac{40}3 +20 +\frac{20}7 +10 +20 -20\left(1-\left(\frac67\right)^4\right) -\frac{20}7\left(1-\left(\frac6{13}\right)^4\right) -10\left(1-\left(\frac34\right)^4\right) -20\left(1-\left(\frac23\right)^2\right) -\frac{20}7\left(1-\left(\frac29\right)^2\right) -10\left(1-\left(\frac12\right)^2\right) -\frac52 -\frac{20}9 -\frac{20}3 +\frac52\left(1-\left(\frac37\right)^4\right) +\frac{20}9\left(1-\left(\frac25\right)^4\right) +\frac{20}3\left(1-\left(\frac23\right)^4\right) +\frac52\left(1-\left(\frac15\right)^2\right) +\frac{20}9\left(1-\left(\frac2{11}\right)^2\right) +\frac{20}3\left(1-\left(\frac25\right)^2\right) +2 -2\left(1-\left(\frac38\right)^4\right) -2\left(1-\left(\frac16\right)^2\right)\\ +\sum_{l=0}^3\sum_{r=0}^1\binom{l+r}r\left( -\frac52\left(\frac34\right)^l\left(\frac14\right)^r +\frac{20}9\left(\frac23\right)^l\left(\frac29\right)^r +\frac43\left(\frac25\right)^l\left(\frac2{15}\right)^r +2\left(\frac35\right)^l\left(\frac15\right)^r -\frac54\left(\frac38\right)^l\left(\frac18\right)^r -\frac{20}{17}\left(\frac6{17}\right)^l\left(\frac2{17}\right)^r -\frac{20}{11}\left(\frac6{11}\right)^l\left(\frac2{11}\right)^r +\frac{10}9\left(\frac13\right)^l\left(\frac19\right)^r \right) \\ = \frac{40}3 -20 -\frac{20}7 -10 +20 +20\left(\frac67\right)^4 +\frac{20}7\left(\frac6{13}\right)^4 +10\left(\frac34\right)^4 +20\left(\frac23\right)^2 +\frac{20}7\left(\frac29\right)^2 +10\left(\frac12\right)^2 +\frac52 +\frac{20}9 +\frac{20}3 -\frac52\left(\frac37\right)^4 -\frac{20}9\left(\frac25\right)^4 -\frac{20}3\left(\frac23\right)^4 -\frac52\left(\frac15\right)^2 -\frac{20}9\left(\frac2{11}\right)^2 -\frac{20}3\left(\frac25\right)^2 -2 +2\left(\frac38\right)^4 +2\left(\frac16\right)^2\\ -\frac{1345}{128} +\frac{16940}{2187} +\frac{14716}{5625} +\frac{3756}{625} -\frac{9555}{4096} -\frac{2984740}{1419857} -\frac{780580}{161051} +\frac{4180}{2187} \\[15pt] =\frac{36726346111860961183807819781}{1170579965612689097244979200} \\[15pt] \approx31.37448716939056356\;, $$

in agreement with the other answers.

Coupon collector's variation with unequal probabilities and uneven number of items required

4 Answers4

Linked