
I am trying to understand, with only a basic mathematical background, how Student's $t$-distribution is a "natural" pdf to define. I am looking for a more accessible explanation than this post, or the daunting Biometrika paper by R. A. Fisher.

Background:

The central limit theorem states that if ${\textstyle X_{1},X_{2},\ldots,X_{n}}$ constitute a random sample of size ${\textstyle n}$ taken from a population with mean ${\textstyle \mu }$ and finite variance ${\textstyle \sigma ^{2}}$, and if ${\textstyle {\bar {X}}}$ is the sample mean, then the limiting form of the distribution of ${\textstyle Z=\left({\frac {{\bar {X}}-\mu }{\sigma /{\sqrt {n}}}}\right)}$ as ${\textstyle n\to \infty }$ is the standard normal distribution.

If $X_1, \ldots, X_n$ are iid random variables $\sim N(\mu,\sigma^2)$,

$$\frac{\bar{X}\,-\,\mu}{\sigma/\sqrt{n}} \sim N(0,1)$$

This is the basis of the Z-test, $Z=\frac{\bar{X}\,-\,\mu}{\sigma/\sqrt{n}}$

[Note that the preceding opening statement is now correct after reflecting @Ian and @Michael Hardy comments to the OP in reference to the CLT.]

If the standard deviation of the population, $\sigma$, is unknown, we can replace it by an estimate based on the sample, $S$, but then the expression (the one-sample $t$-test statistic) follows a $t$-distribution:

$$ t=\frac{\bar{X}\,-\,\mu}{S/\sqrt{n}}\sim t_{n-1}$$

with $$S=\sqrt{\frac{\sum(X_i-\bar X)^2}{n-1}}.$$
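To see this claim in action, here is a minimal simulation sketch (assuming numpy and scipy are available; the choices of $n$, $\mu$, $\sigma$ are arbitrary) that computes the one-sample $t$-statistic many times and compares it with $t_{n-1}$:

```python
# Minimal sketch: simulate the one-sample t statistic and compare it with t_{n-1}.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, mu, sigma, reps = 5, 10.0, 2.0, 100_000    # arbitrary illustrative choices

X = rng.normal(mu, sigma, size=(reps, n))
Xbar = X.mean(axis=1)
S = X.std(axis=1, ddof=1)                     # ddof=1 divides by n-1, matching S above
T = (Xbar - mu) / (S / np.sqrt(n))

# Kolmogorov-Smirnov comparison with t on n-1 degrees of freedom:
# expect a very small KS statistic (the sample is consistent with t_{n-1})
print(stats.kstest(T, stats.t(df=n - 1).cdf))
```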

Minimal manipulation of this expression for the $t$-statistic

$$\begin{align} \frac{\bar{X}\,-\,\mu}{S/\sqrt{n}} &= \frac{\bar{X}\,-\,\mu}{\frac{\sigma}{\sqrt{n}}} \frac{1}{\frac{S}{\sigma}}\\[2ex] &= Z\,\frac{1}{\frac{S}{\sigma}}\\[2ex] &= \frac{Z}{\sqrt{\frac{\color{blue}{\sum(X_i-\bar X)^2}}{(n-1)\,\color{blue}{\sigma^2}}}}\\[2ex] &\sim\frac{Z}{\sqrt{\frac{\color{blue}{\chi_{n-1}^2}}{n-1}}}\\[2ex] &\sim t_{n-1}\small \tag 1 \end{align}$$

will introduce the chi square distribution, $(\chi^2).$

The chi-square (with one degree of freedom) is the distribution that models $X^2$ when $X\sim N(0,1)$:

Let's say that $X \sim N(0,1)$, let $Y=X^2$, and find the density of $Y$ by using the $\text{cdf}$ method:

$$\Pr(Y \leq y) = \Pr(X^2 \leq y)= \Pr(-\sqrt{y} \leq X \leq \sqrt{y}).$$

We cannot integrate the density of the normal distribution in closed form, but we can express the cdf of $Y$:

$$ F_Y(y) = F_X(\sqrt{y})- F_X(-\sqrt{y}).$$ Taking the derivative of the cdf:

$$ f_Y(y)= F_X'(\sqrt{y})\,\frac{1}{2\sqrt{y}}+ F_X'(-\sqrt{y})\,\frac{1}{2\sqrt{y}}.$$

Since the normal $\text{pdf}$ is symmetric, $F_X'(-\sqrt{y})=F_X'(\sqrt{y})$, so the two terms combine:

$$ f_Y(y)= F_X'(\sqrt{y})\,\frac{1}{\sqrt{y}}.$$

Equating this to the $\text{pdf}$ of the normal (now the $x$ in the $\text{pdf}$ will be $\sqrt{y}$, to be plugged into the $e^{-\frac{x^2}{2}}$ part of the normal $\text{pdf}$), and remembering to include the $\frac{1}{\sqrt{y}}$ factor at the end:

$$\begin{align} f_Y(y) &= F_X'\left(\sqrt{y}\right)\,\frac{1}{\sqrt{y}}\\[2ex] &=\frac{1}{\sqrt{2\pi}}\,e^{-\frac{y}{2}}\, \frac{1}{\sqrt{y}}\\[2ex] &=\frac{1}{\sqrt{2\pi}}\,e^{-\frac{y}{2}}\, y^{\frac{1}{2}- 1} \end{align}$$

Comparing to the pdf of the chi square:

$$ f(x)= \frac{1}{2^{\nu/2}\Gamma(\frac{\nu}{2})}\,e^{-\frac{x}{2}}\,x^{\frac{\nu}{2}-1}$$

and, since $\Gamma(1/2)=\sqrt{\pi}$, so that $2^{1/2}\,\Gamma(1/2)=\sqrt{2\pi}$, for $\nu=1$ df we have derived exactly the $\text{pdf}$ of the chi-square.
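As a quick numerical sanity check (a sketch assuming numpy and scipy; it is not needed for the derivation), the density just obtained, $\frac{1}{\sqrt{2\pi y}}\,e^{-y/2}$, can be compared against scipy's $\chi^2_1$ pdf:

```python
# Sketch: the derived density exp(-y/2)/sqrt(2*pi*y) should coincide with
# the chi-square pdf on 1 degree of freedom.
import numpy as np
from scipy import stats

y = np.linspace(0.05, 10.0, 200)
derived = np.exp(-y / 2) / np.sqrt(2 * np.pi * y)
print(np.allclose(derived, stats.chi2(df=1).pdf(y)))   # expected: True
```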


In the case of the $t$-distribution, the chi-square is suitable for modeling the sum of squared normals, i.e. $\displaystyle \sum(X_i-\bar X)^2 $ in the set of Eq. $(1)$, a well-known property, derived here typically with $n$ degrees of freedom, but

why is it $n\,-\,1$ here, i.e. $\color{blue}{\chi^2_{n-1}}$ in eq. $(1)$?

I don't know how to explain how the $\sigma^2$ in $\frac{\sum(X_i-\bar X)^2}{\sigma^2}$ in equation $(1)$ gets "absorbed" into the $\chi_{n-1}^2$ part of Student's $t$ pdf.

So it boils down to understanding why

$$\frac 1 {\sigma^2} \left((X_1-\bar X)^2 + \cdots + (X_n - \bar X)^2 \right) \sim \chi^2_{n-1}.$$

After that the derivation of the pdf is not that daunting.
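A small simulation sketch of this very claim (again assuming numpy and scipy, with arbitrary parameter choices) shows that the scaled sum of squared deviations behaves like $\chi^2_{n-1}$ rather than $\chi^2_{n}$:

```python
# Sketch: sum((X_i - Xbar)^2)/sigma^2 is consistent with chi^2_{n-1}, not chi^2_n.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, mu, sigma, reps = 6, 3.0, 1.5, 100_000     # arbitrary illustrative choices

X = rng.normal(mu, sigma, size=(reps, n))
Q = ((X - X.mean(axis=1, keepdims=True)) ** 2).sum(axis=1) / sigma**2

print(stats.kstest(Q, stats.chi2(df=n - 1).cdf))  # not rejected: consistent with chi^2_{n-1}
print(stats.kstest(Q, stats.chi2(df=n).cdf))      # essentially zero p-value: not chi^2_n
```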


PS: The accepted answer below becomes clear after reading this Wikipedia entry. Also, a plot may be useful, showing the spherical cloud $(X_1,X_2,X_3)$ (in red) together with its orthogonal projection, $(X_1-\bar X,\ldots,X_n-\bar X)$, onto the $(n-1)$-dimensional subspace, which forms a plane (in blue) through the origin:

[Plot: the spherical cloud $(X_1,X_2,X_3)$ in red, with its orthogonal projection onto the plane through the origin in blue]

  • A correction before I continue reading: the fact that the rescaled sample mean of normal variables is N(0,1) is only loosely related to the central limit theorem. The fact that the mean and variance are what they are does not require normality, just independence. The fact that the result is normally distributed is because the normal distribution is stable, in the sense that a sum of independent normal variables is normal. The central limit theorem would be impossible without this fact, but this is still a much weaker fact than the central limit theorem itself. – Ian Aug 04 '15 at 17:51
  • That quantity exactly has N(0,1) distribution. If this were not the case, then the CLT could never hold. But this is not actually implied by the CLT, which only talks about asymptotics. By contrast your quantity is a finite sum with an exactly normal distribution (since you started with normal variables). – Ian Aug 04 '15 at 19:37
  • Not that I know of. The mean being what it is arises from linearity and the fact that the expected value of a constant equals the constant. The variance being what it is arises from noncorrelation of the independent cross terms. The normality is called the stability of the normal distribution. – Ian Aug 04 '15 at 19:57
  • The t-distribution can be derived from the normal distribution and the chi-square distribution. The resulting distribution is mathematically very similar to the normal distribution (though the 'physical' meanings are different), so why don't we just use the normal distribution to make things simpler? – Charlie Chang Sep 21 '20 at 11:17
  • @CharlieChang You can use any distribution you want. Reality is always going to be more complex. The t-distribution is subexponential or heavy-tailed, and hence, more apt to analyze smaller samples. – Antoni Parellada Sep 21 '20 at 13:03
  • I see, so in the CLT we have $A=$ (sample mean $-$ population mean)/(standard error $\sigma/\sqrt{n}$), which has a normal distribution; the $t$-distribution comes from replacing $\sigma$ with the sample standard deviation, namely from dividing $A$ by $S/\sigma=\sqrt{Y/(n-1)}$, where $Y$ has a chi-square distribution with $n-1$ degrees of freedom, and this 'distorts' the normal distribution of $A$. The bigger the degrees of freedom, the more the chi-square distribution tends to a narrow peak around $n-1$, and so the more similar $A/(S/\sigma)$'s distribution is to that of $A/\sqrt{(n-1)/(n-1)}$. – Charlie Chang Sep 27 '20 at 16:42

1 Answer


You wrote "which is an expression of the central limit theorem". That is not correct. You started with a normal distribution. The central limit theorem says that if you start with any distribution with finite variance, not assumed to be normal, the (suitably standardized) sum of i.i.d. copies will be approximately normally distributed.

Your main question seems to be: how does the chi-square distribution get involved?

You have $X_1,\ldots,X_n\sim\text{ i.i.d. } N(\mu,\sigma^2)$ and $\bar X = (X_1+\cdots+X_n)/n$. The proposition then is $$ \frac 1 {\sigma^2} \Big((X_1-\bar X)^2 + \cdots + (X_n - \bar X)^2 \Big) \sim \chi^2_{n-1}. $$

The random variables $(X_i - \bar X)/\sigma$ are not independent, but have covariance $-1/n$ between any two of them.

The standard deviation of each of them is not $1$, but $\sqrt{(n-1)/n\,{}}$.

There are not $n-1$ of them, but $n$.

But the distribution of the sum of their squares is asserted to be the same as if they were (1) independent, (2) each had standard deviation $1$, and (3) there were $n-1$ of them.
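These three statements are easy to verify empirically; the following sketch (assuming numpy, with arbitrary parameter choices) estimates the covariance matrix of the $(X_i-\bar X)/\sigma$:

```python
# Sketch: empirical covariance matrix of D_i = (X_i - Xbar)/sigma.
# Diagonal entries should be near (n-1)/n, off-diagonal entries near -1/n.
import numpy as np

rng = np.random.default_rng(2)
n, mu, sigma, reps = 4, 0.0, 3.0, 200_000     # arbitrary illustrative choices

X = rng.normal(mu, sigma, size=(reps, n))
D = (X - X.mean(axis=1, keepdims=True)) / sigma

print(np.round(np.cov(D, rowvar=False), 3))   # ~0.75 on the diagonal, ~-0.25 off diagonal
```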

This can be understood geometrically, as follows: The mapping $(X_1,\ldots,X_n)\mapsto (\bar X, \ldots, \bar X)$ is the orthogonal projection onto a $1$-dimensional subspace of $\mathbb R^n$, and its complementary mapping $(X_1,\ldots,X_n)\mapsto(X_1-\bar X,\ldots,X_n-\bar X)$ is the orthogonal projection onto the $(n-1)$-dimensional subspace defined by the constraint that the sum of the coordinates is $0$. Notice that the latter projection takes the mean vector $(\mu,\ldots,\mu)$ to $(0,\ldots,0)$.

The distribution of $(X_1,\ldots,X_n)$ is spherically symmetric about the mean vector, since it is $$ \text{constant}\cdot \exp\left( -\frac 1 2 \sum_{i=1}^n\left( \frac{x_i-\mu} \sigma \right)^2 \right)\,dx_1\,\cdots\,dx_n. $$

The distribution of an orthogonal projection of this random vector, which projection takes the mean vector to $0$, is spherically symmetric about $0$ in the lower-dimensional space onto which one projects. So let $(U_1,\ldots,U_{n-1})$ be the coordinates of $(X_1-\bar X,\ldots, X_n-\bar X)$ relative to some orthonormal basis of that $(n-1)$-dimensional subspace, and then you have $$ (X_1-\bar X)^2 + \cdots + (X_n-\bar X)^2 = U_1^2 + \cdots + U_{n-1}^2 \tag 1 $$ and $$ U_1,\ldots,U_{n-1}\sim\text{ i.i.d. }N(0,\sigma^2). $$

Therefore $(1)$ is distributed as $\sigma^2\chi^2_{n-1}$.
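The change of basis can also be carried out numerically. In the sketch below (assuming numpy and scipy), an orthonormal basis of the subspace $\{x:\sum_i x_i=0\}$ is obtained from the null space of the all-ones row vector; this particular construction is just one convenient choice, not something prescribed by the argument above:

```python
# Sketch: coordinates U of the residual vector in an orthonormal basis of the
# subspace {sum of coordinates = 0}; the sums of squares agree exactly and the
# U_k behave like i.i.d. N(0, sigma^2).
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(3)
n, mu, sigma, reps = 5, 7.0, 2.0, 200_000     # arbitrary illustrative choices

B = null_space(np.ones((1, n)))               # n x (n-1) matrix with orthonormal columns
X = rng.normal(mu, sigma, size=(reps, n))
R = X - X.mean(axis=1, keepdims=True)         # residual vectors, lying in the subspace
U = R @ B                                     # their n-1 coordinates in that basis

print(np.allclose((R**2).sum(axis=1), (U**2).sum(axis=1)))  # identical sums of squares
print(np.round(np.cov(U, rowvar=False), 2))   # ~ sigma^2 * I_{n-1}
```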

Also, notice that $\bar X$ is actually independent of $(X_1 - \bar X,\ldots,X_n-\bar X)$. That follows from the fact that they are jointly normally distributed and $\operatorname{cov}(\bar X, X_i-\bar X)=0$. That is also needed in order to conclude that you get a $t$-distribution.

PS: One can do this with matrices: There is an $n\times n$ matrix $Q$ for which $$ QX=Q\begin{bmatrix} X_1 \\ \vdots \\ X_n \end{bmatrix} = \begin{bmatrix} X_1 - \bar X \\ \vdots \\ X_n - \bar X \end{bmatrix}. $$ It satisfies $Q^2=Q^T=Q$. And $P=I-Q$ satisfies $P^2=P^T=P$, and $QP=PQ=0$. Then we have $$ \operatorname{cov}(PX, QX) = P\Big(\operatorname{cov}(X,X)\Big) Q^T = P(\sigma^2 I)Q^T = \sigma^2 PQ = 0 $$ and $$ QX \sim N\Big(0,\ Q\big(\sigma^2 I\big) Q^T\Big) = N(0,\sigma^2 Q). $$ (We get $0$ as the mean here because $Q$ times the mean vector is $0$.) Our denominator in the $t$-statistic is then $S/\sqrt n = \|QX\|/\sqrt{n(n-1)}$.
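To make this PS concrete: the matrix $Q$ with this property is the centering matrix $I - J/n$, where $J$ is the all-ones matrix. A short sketch (assuming numpy, with arbitrary parameter choices) verifies the stated identities and the zero cross-covariance:

```python
# Sketch: the centering matrix Q = I - J/n sends X to (X_1 - Xbar, ..., X_n - Xbar).
# Check Q^2 = Q^T = Q, PQ = 0 for P = I - Q, and that PX and QX are uncorrelated.
import numpy as np

rng = np.random.default_rng(4)
n, mu, sigma, reps = 5, 1.0, 2.0, 200_000     # arbitrary illustrative choices

Q = np.eye(n) - np.ones((n, n)) / n
P = np.eye(n) - Q

print(np.allclose(Q @ Q, Q), np.allclose(Q.T, Q), np.allclose(P @ Q, 0))

X = rng.normal(mu, sigma, size=(reps, n))
PX, QX = X @ P, X @ Q                          # P and Q are symmetric
cross = np.cov(np.hstack([PX, QX]), rowvar=False)[:n, n:]
print(np.round(cross, 2))                      # ~ zero matrix: PX and QX uncorrelated
```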

PS: Here's a somewhat different way of expressing it. The vector $(X_1,\ldots,X_n)$ has expected value $(\mu,\ldots,\mu)$. Let $(U_1,\ldots,U_{n-1},U_n)$ be the coordinates of the point $(X_1,\ldots,X_n)$ in a different coordinate system: the $n$th component $U_n$ is the position on an axis pointing from $(0,\ldots,0)$ in the direction of $(\mu,\ldots,\mu)$, and the other $U_k$, for $k=1,\ldots,n-1$, are components in directions at right angles to that one. In the first coordinate system, the projection of $(X_1,\ldots,X_n)$ onto the space in which the sum of the components is $0$ is $(X_1-\bar X,\ldots,X_n-\bar X)$. In the second coordinate system, the projection of $(U_1,\ldots,U_{n-1},U_n)$ onto that same space is $(U_1,\ldots,U_{n-1},0)$. Transforming $(\mu,\ldots,\mu)$ into the second coordinate system, we get $(0,\ldots,0,\mu\sqrt n)$, and the projection of that onto the aforementioned subspace is $(0,\ldots,0,0)$. Hence $(X_1-\bar X)^2+\cdots+(X_n-\bar X)^2 = U_1^2+\cdots+U_{n-1}^2$ and $U_1,\ldots,U_{n-1}\sim\text{ i.i.d. } N(0,\sigma^2)$.

  • It is correct that $\displaystyle \frac{\bar X-\mu}{\sigma/\sqrt n}\sim N(0,1)$. What is incorrect is that the central limit theorem is involved in reaching that conclusion. – Michael Hardy Aug 04 '15 at 17:59
  • I see it now... the $\sigma/\sqrt{n}$ made me think of the SD of the CLT... – Antoni Parellada Aug 04 '15 at 18:07
  • @MichaelHardy How can I prove that the covariance is $-1/n$? Also, do you want to get rid of your first paragraph since I erased the CLT mistake, and now it comes across as a non sequitur? – Antoni Parellada Aug 04 '15 at 18:11
  • @AntoniParellada Expand out $E[(X_i-\overline{X})(X_j-\overline{X})]$ by writing it as $E[(X_i-\mu+\mu-\overline{X})(X_j-\mu+\mu-\overline{X})]$. Then the result is $E[(X_i-\mu)(X_j-\mu)]+E[(X_i-\mu)(\mu-\overline{X})]+E[(X_j-\mu)(\mu-\overline{X})]+E[(\mu-\overline{X})^2]$. The first term is zero; the last term is $1/n^2$; the middle terms are where you have to do some work. – Ian Aug 04 '15 at 18:25
  • Sorry, I made an error: the last term is $1/n$. It will turn out that both middle terms are $-1/n$. This is closely related to the derivation of Bessel's correction. – Ian Aug 04 '15 at 18:31
  • @Ian Shouldn't the $E[(\mu - \bar X)^2]$ be zero? population mean - sample mean... – Antoni Parellada Aug 04 '15 at 18:44
  • @AntoniParellada : Yes. Fixed. ${}\qquad{}$ – Michael Hardy Aug 04 '15 at 18:54
  • @AntoniParellada With no square, you'd be right. But there is a square. So $E[(\mu-\overline{X})^2]=\text{Var}(\overline{X})$, which is not zero. – Ian Aug 04 '15 at 19:03
  • @MichaelHardy I think you mean "the variance is not 1, but ..." – Ian Aug 04 '15 at 20:18
  • I'll be back in a few hours and add some comments. The thing about the central limit theorem looks ok now. – Michael Hardy Aug 04 '15 at 20:32
  • Note to self: $var\left(\frac{ X_i-\bar X}{\sigma}\right)=\frac{1}{\sigma^2}\left( var(X_i) + var(\bar X) - 2 cov(X_i,\bar X)\right)=\frac{1}{\sigma^2}\left(\sigma^2+\frac{\sigma^2}{n}-2\frac{\sigma^2}{n}\right)=1-\frac{1}{n}=\frac{n-1}{n}.$ – Antoni Parellada Nov 15 '17 at 08:48
  • Note to self: Pertinent related post solving many prior comments here. – Antoni Parellada Nov 15 '17 at 21:21