30

How can I derive the closed form of the Beta function, using only its definition as the normalizing constant of the Beta distribution and common-sense random experiments?

I'm pretty sure this is possible, but can't see how.

I can see that

$$\newcommand{\Beta}{\mathrm{Beta}}\sum_{a=0}^n {n \choose a} \Beta(a+1, n-a+1) = 1$$

because we can imagine that we are flipping a coin $n$ times. The $2^n$ distinct sequences of flips partition the probability space. Under a uniform prior on the coin's bias $p$, the Beta distribution with parameters $a+1$ and $n-a+1$ is the posterior over $p$ given the observation of $a$ heads and $n-a$ tails, and $\Beta(a+1, n-a+1) = \int_0^1 p^a (1-p)^{n-a}\,dp$ is its normalizing constant, which is also the marginal probability of any one such sequence. Since there are ${n \choose a}$ such sequences for each $a$, that explains the scaling factor, and the sum is unity because the sequences partition a probability space of total measure 1.
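(A quick numerical sanity check of this identity; a minimal sketch assuming SciPy is available, where `scipy.special.beta` is the Beta function and `scipy.special.comb` the binomial coefficient.)

```python
# Check that sum_{a=0}^{n} C(n, a) * B(a+1, n-a+1) = 1 for small n.
from scipy.special import beta, comb

for n in range(10):
    total = sum(comb(n, a) * beta(a + 1, n - a + 1) for a in range(n + 1))
    print(n, total)  # prints 1.0 (up to floating-point error) for every n
```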

What I can't figure out is why:

$${n \choose a} \Beta(a+1, n-a+1) = \frac{1}{n+1} \qquad \forall n \ge 0,\quad a \in \{0, \dots, n\}.$$

If we knew that, we could easily see that

$$\Beta(a + 1,n - a + 1) = \frac{1}{(n+1){n \choose a}} = \frac{a!(n-a)!}{(n+1)!}.$$
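(The conjectured identity is also easy to test without presupposing the closed form, by estimating $\int_0^1 t^a(1-t)^{n-a}\,dt$ with plain Monte Carlo; a minimal standard-library sketch, with `mc_beta_integral` an illustrative name.)

```python
# Monte Carlo check that C(n, a) * B(a+1, n-a+1) ≈ 1/(n+1).
import random
from math import comb

def mc_beta_integral(a, b, trials=200_000):
    """Estimate B(a+1, b+1) = ∫_0^1 t^a (1-t)^b dt by uniform sampling."""
    total = 0.0
    for _ in range(trials):
        t = random.random()
        total += t ** a * (1 - t) ** b
    return total / trials

n, a = 10, 3
print(comb(n, a) * mc_beta_integral(a, n - a))  # ≈ 1/11 ≈ 0.0909
```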

Neil G
  • 2,479
  • What is $a^\prime$ supposed to be? – J. M. ain't a mathematician Aug 28 '10 at 08:37
  • any integer in 0, ... , n – Neil G Aug 28 '10 at 08:41
  • I'm pretty sure the proper identity for $B(a,n-a)$ is $B(a,n-a)=\left((n-1){{n-2} \choose {a-1}}\right)^{-1}$ – J. M. ain't a mathematician Aug 28 '10 at 08:47
  • You're right. I'll fix my parametrization to match. I wonder why they defined the Beta distribution/function to require those "+1". – Neil G Aug 28 '10 at 08:52
  • I would assume it's to simplify a certain integral expression that can be shown to be equivalent to the beta function. – J. M. ain't a mathematician Aug 28 '10 at 09:06
  • I figured that the Beta function was invented as the normalization constant of the Beta distribution, but I guess it could have existed first. I wonder where it came from then? – Neil G Aug 28 '10 at 09:08
  • It's Euler who came up with the integral IIRC. Maybe somebody more knowledgeable in history can chime in. – J. M. ain't a mathematician Aug 28 '10 at 09:35
  • This comes from the function $\Gamma$. As you know, $\Gamma(n+1)=n!$, and I believe Legendre is the one who shifted the argument. Euler before him and others after him (such as Gauss) used the non-shifted version (and Gauss used the alternative notation $\pi(x)$ to denote the result), but for some reason the shifted version prevailed. – Did Aug 20 '11 at 17:00
  • "The Beta distribution with parameters $a$ and $n−a$ can be defined as the prior over the coin's bias probability $p$ given the observation of $a$ heads and $n−a$ tails." I'm not familiar with the language you're using here. Can you tell me what this means? – Eric Auld Apr 09 '14 at 20:24
  • @EricAuld: Suppose I give you a coin whose bias $p$ you don't know. You flip the coin $n$ times and get heads $a$ times. Then the likelihood induced on $p$ is a Beta distribution. You can work out the Beta distribution up to normalization by writing out the product of the Bernoulli mass function $p^x(1-p)^{1-x}$ for each realization $x_i \in \{0,1\}$ (where $\sum_i x_i = a$ and $\sum_i 1-x_i = n-a$). – Neil G Apr 09 '14 at 20:31
  • Looking up the wikipedia article on these got me somewhere, but not all the way. I'm not familiar with the term "likelihood induced". I'm thinking of Bayes' rule $$\mathbb{P}(q<x \mid a \text{ heads})=\frac{\mathbb{P}(a \text { heads} \mid q<x) \mathbb{P}(q<x)}{\mathbb{P}(a \text{ heads})},$$ but I'm not seeing how to fill in $\mathbb{P}(a \text{ heads} \mid q<x)$, for example (I only know for $q=x$). If you don't have time to answer, do you perhaps have a reference? – Eric Auld Apr 09 '14 at 22:54
  • @EricAuld: Bayes' rule is much easier to think about without the denominator, which is just a constant given your observations. So: $P(q\mid a) \propto P(a\mid q) P(q)$. (These terms are called posterior, likelihood, and prior respectively.) Then, you can fill it in: $\int_0^x f(q \mid a) dq \propto \int_0^x f(q)f(a \mid q)dq \implies f(q \mid a) \propto f(q)f(a \mid q)$. And, the likelihood $f(a \mid q)$ is just $q$ for heads and $1-q$ for tails. Feel free to ask a question on stats.stackexchange. There are a lot of very helpful people there. – Neil G Apr 10 '14 at 07:49
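(To make the grid computation described in the last comment concrete: a minimal sketch, with `posterior_on_grid` an illustrative name rather than any library function.)

```python
# Posterior over the coin's bias p after a heads in n flips, on a grid.
# With a uniform prior, the posterior is proportional to p^a (1-p)^(n-a),
# i.e., it is the Beta(a+1, n-a+1) distribution.
import numpy as np

def posterior_on_grid(a, n, points=1001):
    p = np.linspace(0.0, 1.0, points)
    unnorm = p ** a * (1 - p) ** (n - a)    # prior (constant) times likelihood
    return p, unnorm / np.trapz(unnorm, p)  # normalize so the density integrates to 1

p, density = posterior_on_grid(a=7, n=10)
print(p[density.argmax()])  # posterior mode is a/n = 0.7
```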

2 Answers

58

For non-negative integers $a, b$ and $t \in [0, 1]$, the expression $t^a (1 - t)^b$ is the probability that, when $a+b$ numbers are selected uniformly at random from $[0, 1]$, the first $a$ land in $[0, t]$ and the last $b$ land in $[t, 1]$. The integral $\int_0^{1} t^a (1 - t)^b\,dt$ is then the probability that, when $a+b+1$ numbers are selected, the next $a$ numbers land below the first and the last $b$ land above it; here $t$ plays the role of the value of the first number, which is integrated out.

It follows that ${a+b \choose b} \int_0^1 t^a (1 - t)^b\,dt$ is the probability that, among $a+b+1$ randomly selected numbers, some $a$ of the numbers after the first lie below the first and the remaining $b$ lie above it. But that is exactly the event that the first number is the $(a+1)^{\text{st}}$ smallest of all $a+b+1$ numbers, and by symmetry the first number is equally likely to occupy any of the $a+b+1$ ranks, so this probability is $\frac{1}{a+b+1}$. Hence

$${a+b \choose b}\int_0^1 t^a (1 - t)^b\,dt = \frac{1}{a+b+1}, \qquad \text{that is,} \qquad \int_0^1 t^a (1 - t)^b\,dt = \frac{a!\, b!}{(a+b+1)!},$$

as desired. I learned this proof through an exercise in a Putnam training seminar; the multidimensional generalization also works.
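(The rank argument is easy to test by simulation; a minimal standard-library sketch, with `first_has_rank` an illustrative name.)

```python
# Estimate P(first of a+b+1 uniforms is the (a+1)-st smallest); should be 1/(a+b+1).
import random

def first_has_rank(a, b, trials=200_000):
    hits = 0
    for _ in range(trials):
        xs = [random.random() for _ in range(a + b + 1)]
        # the first number is (a+1)-st smallest iff exactly a of the others lie below it
        hits += sum(x < xs[0] for x in xs[1:]) == a
    return hits / trials

print(first_has_rank(2, 3))  # ≈ 1/6 ≈ 0.1667
```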

Ragib Zaman
  • 35,127
Qiaochu Yuan
  • 419,620
  • I am not getting the interpretation of the integral $\int_{0}^{1} t^a(1-t)^b\,dt$ as randomly selecting $a+b+1$ reals. Can you explain, please? – tarit goswami Aug 15 '18 at 14:30
  • @taritgoswami this answer here https://math.stackexchange.com/a/1075791/405572 provides a rewriting of Qiaochu's proof that you might find helpful – D.R. Apr 04 '23 at 09:14
11

The multinomial generalization mentioned by Qiaochu is conceptually simple, but getting the details right is messy. The goal is to compute $$\int_0^1 \int_0^{1-t_1} \cdots \int_0^{1-t_1-\cdots-t_{k-2}} t_1^{n_1} t_2^{n_2} \cdots t_{k-1}^{n_{k-1}} t_k^{n_k} \, dt_{k-1} \cdots dt_1,$$ where $t_k = 1 - t_1 - \cdots - t_{k-1}$, for nonnegative integers $n_1, \ldots, n_k$.

Draw $N = k - 1 + \sum_{i=1}^{k} n_i$ numbers $X_1, \ldots, X_N$ independently from the uniform distribution on $[0,1]$; the first $k-1$ of them will serve as gap boundaries and the rest as gap contents. Let $E$ be the event that $X_1 < X_2 < \cdots < X_{k-1}$ and that, for each $j = 1, \ldots, k$, the $n_j$ numbers $X_{k + \sum_{i=1}^{j-1} n_i}$ through $X_{k - 1 + \sum_{i=1}^{j} n_i}$ all lie in the $j$-th gap, where the gaps are $(0, X_1), (X_1, X_2), \ldots, (X_{k-1}, 1)$.

Define a linear transformation $(X_1, \ldots, X_{k-1}) \mapsto (T_1, \ldots, T_{k-1})$ by $T_i = X_i - X_{i-1}$ for $i = 1, \ldots, k-1$, with the convention $X_0 = 0$. The determinant of this transformation is 1, so it is measure-preserving. Given values of $X_1$ through $X_{k-1}$, the conditional probability of $E$ is $$\mathbb{P}[E \mid (X_1, \ldots, X_{k-1}) = (x_1, \ldots, x_{k-1})] = \prod_{i = 1}^{k}(x_i - x_{i-1})^{n_i} \mathbf{1}_{\{x_i > x_{i-1}\}},$$ with the conventions $x_0 = 0$ and $x_k = 1$: each of the $n_i$ numbers assigned to the $i$-th gap falls into it independently with probability $x_i - x_{i-1}$. Marginalizing over the joint distribution of $(X_1, \ldots, X_{k-1})$, whose density is $1$ on the unit cube, gives $$\begin{aligned} \mathbb{P}[E] &= \int_{0}^1 \cdots \int_{0}^1 \prod_{i = 1}^{k}(x_i - x_{i-1})^{n_i} \mathbf{1}_{\{x_i > x_{i-1}\}} \, dx_{k-1} \cdots dx_{1} \\ &= \int_{0}^1 \int_{-t_1}^{1-t_1} \cdots \int_{-t_1 - \cdots - t_{k-2}}^{1 - t_1 - \cdots - t_{k-2}} \prod_{i = 1}^{k} t_i^{n_i} \mathbf{1}_{\{t_i > 0\}} \, dt_{k-1} \cdots dt_{1} \\ &= \int_0^1 \int_0^{1-t_1} \cdots \int_0^{1-t_1-\cdots-t_{k-2}} t_1^{n_1} \cdots t_{k-1}^{n_{k-1}} t_k^{n_k} \, dt_{k-1} \cdots dt_{1}, \end{aligned}$$ where $t_k = 1 - t_1 - \cdots - t_{k-1}$ as before and the second equality is the measure-preserving change of variables. So if we can compute $\mathbb{P}[E]$ combinatorially, we will have evaluated the desired integral.

Let $R_1, \ldots, R_N$ be the ranks that the numbers $X_1, \ldots, X_N$ would have if sorted in ascending order. (The numbers are all distinct with probability 1.) Since the numbers were drawn independently from a uniform distribution, $(R_1, \ldots, R_N)$ is a uniformly random permutation of the integers $1$ through $N$. Note that $E$ is exactly the event that $R_j = j + \sum_{i = 1}^j n_i$ for $j \in \{1, \ldots, k-1\}$ and that, for each $l \in \{1, \ldots, k\}$, $$R_j \in \{l + \sum_{i = 1}^{l-1} n_i, \ldots, l + \sum_{i=1}^{l}n_i - 1\}$$ for $$j \in \{k+\sum_{i = 1}^{l-1}n_i, \ldots, k + \sum_{i = 1}^{l}n_i - 1\}.$$ There are $n_1!\cdots n_k!$ permutations satisfying these conditions out of $N!$ possible permutations in total, so $$\mathbb{P}[E] = \frac{n_1!\cdots n_k!}{N!} = \frac{n_1!\cdots n_k!}{(\sum_{i=1}^{k}n_i+k-1)!}.$$
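(As a sanity check of the final formula, one can sample uniformly from the simplex via uniform spacings and estimate the integral by plain Monte Carlo; a minimal standard-library sketch, with `mc_simplex_integral` an illustrative name.)

```python
# Monte Carlo check of ∫_simplex t_1^{n_1} ... t_k^{n_k} dt = n_1!...n_k!/(Σn_i + k - 1)!.
import random
from math import factorial, prod

def mc_simplex_integral(ns, trials=500_000):
    k = len(ns)
    total = 0.0
    for _ in range(trials):
        # spacings of k-1 sorted uniforms are uniformly distributed on the simplex {Σ t_i = 1}
        cuts = sorted(random.random() for _ in range(k - 1))
        ts = [hi - lo for lo, hi in zip([0.0] + cuts, cuts + [1.0])]
        total += prod(t ** n for t, n in zip(ts, ns))
    # the integration region {t_i >= 0, t_1 + ... + t_{k-1} <= 1} has volume 1/(k-1)!
    return total / trials / factorial(k - 1)

ns = [1, 2, 0]
exact = prod(factorial(n) for n in ns) / factorial(sum(ns) + len(ns) - 1)
print(mc_simplex_integral(ns), exact)  # both ≈ 1/60 ≈ 0.0167
```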