I have a line of $n$ spaces, initially empty. In each iteration, I choose one of the spaces at random and insert an element; if the chosen space is already occupied, the iteration changes nothing. Eventually, after enough iterations, all spaces will be full.
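For reference, the process described above can be sketched as a small simulation. This is only an illustration, not the code used for the question: it assumes each space is chosen uniformly at random, and the function names `iterations_to_fill` and `simulate` are made up for this example.

```python
import random


def iterations_to_fill(n, rng):
    """Count the random insertions needed until all n spaces are full."""
    filled = [False] * n
    remaining = n
    iterations = 0
    while remaining > 0:
        i = rng.randrange(n)  # assumption: uniform choice among the n spaces
        iterations += 1
        if not filled[i]:
            # The space was empty, so this iteration fills it.
            filled[i] = True
            remaining -= 1
    return iterations


def simulate(n, repetitions, seed=0):
    """Repeat the process and collect one iteration count per run."""
    rng = random.Random(seed)
    return [iterations_to_fill(n, rng) for _ in range(repetitions)]


samples = simulate(n=10, repetitions=10_000)
sample_mean = sum(samples) / len(samples)
print(min(samples), sample_mean)
```

The list `samples` is what a histogram of the number of iterations would be built from; the repetition count here is kept small so the sketch runs quickly.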

Then, to visualize the number of iterations needed to finish the process (fill all spaces), I repeat the process many times with a fixed number of spaces and obtain the following histogram: [histogram image not shown]

The above graph is the result of 1 million process repetitions with $n=10$, showing a distribution of the number of iterations for the specific size. As you can see, I tried to fit the histogram with a Gamma Distribution. My question is, how can I create an expression that outputs the parameters $k$ and $\theta$ of the gamma distribution as a function of $n$?

I also tried to fit the histogram with a Negative Binomial distribution, since the probability of success (inserting an element) in each iteration changes over time, but continuing with the Gamma distribution seems a better option because of its continuous nature.

joriki
Cardstdani
  • 1
    This is the distribution found in the Coupon Collector problem, whose PMF is given here for example. Agnostic of this fact, if you want to choose $k$ and $\theta$ to fit your histogram, one naïve idea might be to set $k\theta$ and $k\theta^2$ to your sample mean and sample variance respectively, and solve for $k$ and $\theta$ (method of moments). But it is possible that the gamma distribution is not a great fit. – angryavian Jan 02 '24 at 18:56
  • @angryavian I tried fitting with the method of moments, but it results in a slightly worse approximation using the Gamma distribution. – Cardstdani Jan 02 '24 at 19:41
  • @jean-claude-arbaut Yes, it's the method used in the question plot. The problem is to generalize these parameters as a function of $n$. – Cardstdani Jan 02 '24 at 21:57
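
The method-of-moments idea from the comments can be written directly as a function of $n$ if the sample moments are replaced by the known moments of the coupon collector's time, $E[T]=nH_n$ and $\operatorname{Var}(T)=n^2\sum_{j=1}^n j^{-2}-nH_n$. The sketch below does that; the function name `gamma_params_by_moments` is made up for this example, and it makes no claim that the resulting gamma distribution fits well (the asker reports above that moment matching gave a slightly worse fit).

```python
def gamma_params_by_moments(n):
    """Gamma shape k and scale theta matching the exact mean and variance of T."""
    harmonic = sum(1.0 / j for j in range(1, n + 1))
    sum_inv_sq = sum(1.0 / j**2 for j in range(1, n + 1))
    mean = n * harmonic                          # E[T] = n * H_n
    variance = n * n * sum_inv_sq - n * harmonic  # Var(T)
    # Solve k * theta = mean and k * theta**2 = variance.
    theta = variance / mean
    k = mean / theta
    return k, theta


k, theta = gamma_params_by_moments(10)
print(k, theta)
```

Matching sample moments instead would use the same last two lines with `mean` and `variance` taken from the simulated data.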

1 Answer

The coupon collector’s problem has a limiting distribution:

$$ P(T\lt n\log n+cn)\to_{n\to\infty}\mathrm e^{-\mathrm e^{-c}}\;. $$

Solving for $c$ yields an approximate distribution for given $n$:

$$n\log n+cn=t\quad\to\quad c=\frac tn-\log n\;,$$

so

\begin{eqnarray*} P(T\lt t) &\approx& \exp\left(-\exp\left(-\frac tn+\log n\right)\right) \\ &=& \exp\left(-n\exp\left(-\frac tn\right)\right)\;. \end{eqnarray*}

Differentiating with respect to $t$ yields the approximate probability

$$ P(T=t)\approx\exp\left(-\frac tn-n\exp\left(-\frac tn\right)\right)\;. $$

Here’s a plot for $n=10$.
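
As a rough numerical check of the approximation above, the following sketch compares it with the exact probabilities for $n=10$. The exact values are not computed with the formula from this answer; they come from the standard inclusion-exclusion expression $P(T\le t)=\sum_{k=0}^n(-1)^k\binom nk\left(1-\frac kn\right)^t$, and the function names are made up for this example.

```python
import math


def approx_pmf(t, n):
    """Approximation exp(-t/n - n*exp(-t/n)) from the limiting distribution."""
    return math.exp(-t / n - n * math.exp(-t / n))


def exact_cdf(t, n):
    """P(T <= t) by inclusion-exclusion over the spaces that are still empty."""
    return sum(
        (-1) ** k * math.comb(n, k) * (1 - k / n) ** t for k in range(n + 1)
    )


def exact_pmf(t, n):
    """P(T = t) as a difference of consecutive CDF values."""
    return exact_cdf(t, n) - exact_cdf(t - 1, n)


n = 10
for t in range(10, 61, 10):
    # Columns: t, exact probability, approximate probability.
    print(t, round(exact_pmf(t, n), 5), round(approx_pmf(t, n), 5))
```

Since the approximation comes from a limit in $n$, some visible deviation at $n=10$ is to be expected.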

joriki
  • It worked with the exponential, but I tried this formula involving the Stirling numbers of the second kind and the fit seemed to be perfect. – Cardstdani Jan 03 '24 at 08:48
  • 1
    @Cardstdani: Yes, that's the exact expression (note that that answer is by me :-) – I didn't mention that because it was already mentioned in the comments and you didn't respond to that, so I figured you were interested in fitting a curve rather than getting a complicated exact expression :-) – joriki Jan 03 '24 at 08:57
  • The reason I used an exact expression (despite it being a bit complex computationally) is that I'm now trying to fit a distribution to a process where I have $n\times n$ spaces and which finishes when there is a path from top to bottom of the matrix of spaces. At the moment I'm trying to calculate the proportion of arrangements with at least one valid path, as you can see here. Anyways, thanks for your answers. – Cardstdani Jan 03 '24 at 09:03