8

I understand how to show mathematically that the sample variance (which divides by $n-1$) is an unbiased estimator of the population variance (which divides by $n$), and the mathematics has been shown many times here on Math.SE.

I am wondering, however, if there is an intuitive way to understand this result that I can use to explain to a layman why this is done. So far I have seen many derivations, but I haven't seen an elegant, intuitive explanation for the result.

Kenshin
  • 2,152
  • 1
    The most intuitive way I've seen standard deviation explained is as a minimization problem for the distance squared between data points and the mean. I think in that context, the $n-1$ factor comes out quite naturally. I can't quite recall the details (and I've likely messed them up a bit). It's been years since we talked about it in my stats class in high school. – Cameron Williams Apr 03 '15 at 01:41

5 Answers

4

At an elementary level it is possible to give a couple of "reasons" for dividing by $n - 1$. (At higher levels there are rationales that involve discussions about $n$-dimensional vector spaces, but let's not go there now.)

"Reason 1." Suppose you are finding the sample variance of the observations 2, 3, 1, 6. Then your computations might look like this:

          x   x - 3   square
        ----------------------
          2    -1       1
          3     0       0
          1    -2       4
          6     3       9
        ----------------------
    Tot  12     0      15
    Mean  3          Var = 15/3 = 5

If somehow one of the four rows between dashed lines got smudged and was unreadable, you would be able to reconstruct it from the rest of the information. (2 + 3 + 'smudge' + 6 = 12; what is 'smudge'? Etc.) So in some sense, given the structure of the computation you have only $n - 1 = 3$ rows that contain information. The jargon for that is you have "degrees of freedom $DF = n - 1$."
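
To make the reconstruction concrete, here is a minimal Python sketch (my own illustration, not part of the original computation; the variable names are invented for the example):

    # The four observations and the total from the table above.
    data = [2, 3, 1, 6]
    total = sum(data)            # 12, the "Tot" row

    # Suppose the row containing 1 got smudged: the total forces its value.
    known = [2, 3, 6]
    smudge = total - sum(known)
    print(smudge)                # 1

    # Given the mean (equivalently the total), only n - 1 rows carry free information.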

"Reason 2." If you divide by $n - 1$ in the definition of the sample variance $S^2$, then $E(S^2) = \sigma^2.$ In statistical terminology this means "$S^2$ is an unbiased estimator of $\sigma^2.$" If you divided by $n$ instead, then you would have an estimator of the population variance that is too small.

Note: Dividing by $n - 1$ is pretty much agreed upon, but reputable authors in statistics and probability have proposed $n$, $n + 1$, and even $n + 2$ as divisors--each giving a rationale aimed at a particular objective. None of these alternative denominators has received wide acceptance. But these discussions confirm that it is not a stupid question to ask why we use $n - 1.$
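
A quick simulation makes the effect of the divisor concrete (a rough sketch in Python/NumPy, my own addition; the normal population, sample size, and replication count are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    n, reps, sigma2 = 5, 100_000, 4.0               # samples of size 5 from N(0, 4)

    samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
    var_n   = samples.var(axis=1, ddof=0)           # divisor n
    var_nm1 = samples.var(axis=1, ddof=1)           # divisor n - 1

    print(var_n.mean())    # about (n-1)/n * sigma^2 = 3.2: biased low
    print(var_nm1.mean())  # about sigma^2 = 4.0: unbiased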

*Addendum* (Jan 25, '16): I have just read a letter by Jeffrey S. Rosenthal (U. Toronto) in the December '15 issue of the IMS Bulletin, arguing that in elementary statistics courses it is OK to use $n$ as the denominator of the sample variance. Briefly, his view is based mainly on arguments involving mean square error (MSE). For example, with normal data, the MSE for estimating $\sigma^2$ is minimized by the denominator $n + 1$ rather than $n - 1.$ (See his letter on page 9 for details.)

However, in more advanced courses (as in my Comment below), a penalty for changing from $n - 1$ would be minor confusion in getting confidence intervals for $\sigma^2$ and doing tests for $\sigma^2$ based on the sample variance, mainly because $\sum (X_i - \bar X)^2/\sigma^2 \sim Chisq(df = n - 1).$
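
To see that confidence-interval point in practice, here is a hedged sketch (my own code, using SciPy's chi-square quantiles and the data from Reason 1) of the interval for $\sigma^2$ implied by $\sum (X_i - \bar X)^2/\sigma^2 \sim Chisq(n - 1)$:

    import numpy as np
    from scipy.stats import chi2

    x = np.array([2.0, 3.0, 1.0, 6.0])   # the observations from Reason 1
    n = len(x)
    s2 = x.var(ddof=1)                   # sample variance with divisor n - 1

    alpha = 0.05                         # 95% confidence, assuming normal data
    lower = (n - 1) * s2 / chi2.ppf(1 - alpha / 2, df=n - 1)
    upper = (n - 1) * s2 / chi2.ppf(alpha / 2, df=n - 1)
    print(lower, upper)                  # interval estimate of sigma^2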

BruceET
  • 51,500
  • 1
    As far as I know, the textbooks never explain why it's important for the sample variance to be an unbiased estimator of the population variance, although nobody seems to mind the fact that the sample standard deviation is a biased estimator of the population standard deviation. Isn't the standard deviation more important/useful than the variance, anyway? – bof Jan 16 '16 at 09:05
  • 1
    @bof: The sample SD (call it $S$) can be made unbiased by multiplying by a constant, but that constant depends on $n$ and the population distribution. If you know $S$ from normal data then you can use the fact that $(n-1)S^2/\sigma^2 \sim Chisq(n-1)$ to get CIs for $\sigma^2$ and $\sigma.$ If $S^2$ were not defined with $n-1$ in its denominator, this distributional relationship would be a bit messier. "It's customary" may be the best, if less than satisfactory, answer to the original question. – BruceET Jan 16 '16 at 10:43
4

The sample variance is computed using deviations from the sample mean $\bar x$ instead of the population mean $\mu$, and this is the source of the bias.

When the squared deviations are accumulated, every deviation is measured from $\bar x$ rather than from $\mu$, i.e. it is shifted by $\bar x-\mu$, and since $\bar x$ is the value that minimizes the sum of squared deviations, the computed variance comes out too small.


Hint:

$$(x_i-\mu)^2=((x_i-\bar x)+(\bar x-\mu))^2=(x_i-\bar x)^2+2(x_i-\bar x)(\bar x-\mu)+(\bar x-\mu)^2.$$

When averaging over $i$, the double product in the middle vanishes because the deviations $x_i-\bar x$ average to zero, giving

$$\overline{(x_i-\mu)^2}=\overline{(x_i-\bar x)^2}+(\bar x-\mu)^2.$$

Taking expectations, the left side is $\sigma^2$ and the last term contributes $\sigma^2/n$, so with $s^2=\overline{(x_i-\bar x)^2}$ (the variance computed with divisor $n$),

$$E[s^2]=\sigma^2-\frac{\sigma^2}{n}.$$
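
Rearranging (my own follow-up, in the same notation as the hint),

$$E[s^2]=\frac{n-1}{n}\,\sigma^2,$$

so multiplying $s^2$ by $n/(n-1)$, i.e. dividing the sum of squared deviations by $n-1$ instead of $n$, exactly compensates for the deficit and yields an unbiased estimator.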

2

The sample mean is defined as $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$, which is quite intuitive. But the sample variance is $S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$. Where did the $n - 1$ come from?

To answer this question, we must go back to the definition of an unbiased estimator. An unbiased estimator is one whose expected value equals the quantity it is estimating. The sample mean is an unbiased estimator of the population mean $\mu$. To see why:

$$ E[\bar{X}] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{n}{n} \mu = \mu $$

Let us look at the expectation of the sample variance. Expanding the sum of squared deviations,

$$ S^2 = \frac{1}{n-1}\left( \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 \right), $$

so that, with the $X_i$ i.i.d.,

$$ E[S^2] = \frac{1}{n-1} \left( n E[X_1^2] - nE[\bar{X}^2] \right). $$

Notice that $\bar{X}$ is a random variable and not a constant, so the expectation $E[\bar{X}^2]$ exceeds $\mu^2$ by $Var(\bar{X})$. This is the reason behind the $n-1$.

$$E[S^2] = \frac{1}{n-1} \left( n (\mu^2 + \sigma^2) - n(\mu^2 + Var(\bar{X})) \right),$$

where

$$Var(\bar{X}) = Var\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \frac{1}{n^2} Var(X_i) = \frac{\sigma^2}{n}.$$

$$E[S^2] = \frac{1}{n-1} \left( n (\mu^2 + \sigma^2) - n(\mu^2 + \sigma^2/n) \right) = \frac{(n-1)\sigma^2}{n-1} = \sigma^2.$$

As you can see, if we had used $n$ instead of $n-1$ in the denominator, we would have obtained a biased estimate of the variance! With $n-1$, however, $S^2$ is an unbiased estimator.
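
As a side note (my own addition, not part of the derivation), numerical libraries let you choose either divisor explicitly; a small NumPy example:

    import numpy as np

    x = np.array([2.0, 3.0, 1.0, 6.0])

    # ddof ("delta degrees of freedom"): NumPy divides by n - ddof.
    print(x.var(ddof=0))   # divisor n:     3.75 (biased low on average)
    print(x.var(ddof=1))   # divisor n - 1: 5.0  (the unbiased S^2)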

Vivek
  • 83
0

It seems your question is related to the statistical concept of degrees of freedom. The sum of the deviations of $n$ observations from their sample mean must be zero, so if $n-1$ of the deviations are known, they completely determine the $n$th deviation. It is the squared deviations from the mean that are used to construct the sample variance, and hence we say that the sample variance has $n-1$ degrees of freedom. Box, Hunter, and Hunter give a more in-depth explanation, and I'm sure many other statistics texts do as well.
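
A tiny numeric check of that claim (my own sketch; any data set would do):

    # Deviations from the sample mean sum to zero, so the last one is
    # determined by the first n - 1.
    x = [2, 3, 1, 6]
    mean = sum(x) / len(x)
    dev = [xi - mean for xi in x]

    print(sum(dev))                  # 0.0 (up to rounding)
    print(-sum(dev[:-1]), dev[-1])   # the nth deviation equals minus the sum of the rest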

SWilliams
  • 639
0

I think of it as losing a degree of freedom to the estimation of the mean: to calculate the variance you first need to estimate the mean. This is essentially what Bruce is saying in his answer about the row that got smudged.

Katie
  • 13