
In school, we are taught the following: Suppose you collect some data $X: x_1, x_2, \ldots, x_n$. Regardless of the underlying distribution of $X$, you can always quantify the Standard Deviation of $X$ as:

$$ SD(X) = \sqrt{\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n}}$$

I am interested in learning where this formula comes from. Informally, I can infer that the above formula is a function of how much each $x_i$ differs from the average - but I am interested in the mathematical justification for why, in theory, this formula can be applied to any data regardless of the underlying probability distribution.

By doing some reading on this topic, here is what I have come up with so far:

  • Suppose you have a Random Variable $X$ with a probability distribution $f(x)$. Suppose you also have a finite set of random observations $x_1, x_2, \ldots, x_n$ drawn from this same $f(x)$. I am guessing (???) that regardless of what $f(x)$ is, the following statement is true:

$$ E(X) = \int x \, f(x) \, dx \approx \frac{\sum_{i=1}^{n} x_i}{n}$$

  • And by extension: suppose you have a Random Variable $X$ with a probability distribution $f(x)$. Suppose you also have a finite set of random observations $x_1, x_2, \ldots, x_n$ drawn from this same $f(x)$. I am guessing (???) that regardless of what $f(x)$ is, the following statement is also true (I include a quick simulation check of both guesses right after these bullets):

$$ E(X^2) = \int x^2 \, f(x) \, dx \approx \frac{\sum_{i=1}^{n} x_i^2}{n}$$
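
Here is that simulation sketch (an Exponential(1) distribution, chosen arbitrarily, for which $E(X) = 1$ and $E(X^2) = 2$):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.exponential(scale=1.0, size=n)  # draws from f(x) = e^(-x), x > 0

# Sample analogues of the two integrals above
print(x.mean())         # should be close to E(X)   = 1
print((x ** 2).mean())  # should be close to E(X^2) = 2
```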

As I understand it, these are useful results because they allow you to approximate quantities that depend on $f(x)$ without explicitly knowing what $f(x)$ is.

Using some basic algebra, I can see that:

$$ E(X^2) - (E(X))^2 \approx \frac{\sum x_i ^2 }{n} - \left[\frac{\sum x_i}{n}\right]^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}$$
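
(The second equality is pure algebra: writing $\bar{x} = \frac{\sum x_i}{n}$ and expanding the square,

$$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i^2 - 2\bar{x}\,x_i + \bar{x}^2\right) = \frac{\sum x_i^2}{n} - 2\bar{x}^2 + \bar{x}^2 = \frac{\sum x_i^2}{n} - \left[\frac{\sum x_i}{n}\right]^2\,,$$

while the first relation is only an approximation, inherited from the two approximations above.)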

And $E(X^2) - (E(X))^2$ is how we define the Variance of $X$, denoted $Var(X)$. Thus we can approximate the Standard Deviation of $X$ without relying on the underlying probability distribution of $X$, i.e. $f(x)$.

Can someone please tell me if my analysis is correct?

Thanks!

PS: Can we use the Law of Large Numbers to prove that

$$ \frac{1}{n} \sum_{i=1}^n x_i^k \;\longrightarrow\; \int_{\mathbb R} x^k \, f(x)\, \mathrm{d}x \quad \text{as } n \to \infty\,?$$
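
(My guess is that this is exactly the strong Law of Large Numbers applied to the i.i.d. variables $X_i^k$: provided $\int_{\mathbb R} |x|^k \, f(x)\, \mathrm{d}x < \infty$, it would give

$$\frac{1}{n} \sum_{i=1}^n x_i^k \;\xrightarrow{\text{a.s.}}\; E(X^k) = \int_{\mathbb R} x^k \, f(x)\, \mathrm{d}x\,,$$

but I would appreciate confirmation that this reasoning is valid.)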

stats_noob
  • Yes, it is just the plug in estimator. – Andrew Aug 02 '23 at 06:17
  • @ Andrew Zhang: thank you for your reply! Is the following statement true? $$ E(X) = \int x * f(x) dx \approx \frac{\sum x_i}{n}$$ – stats_noob Aug 02 '23 at 06:19
  • @DavidK The last sentence of your comment doesn't make sense. Also, there are different estimators for the standard deviation; the Bessel-corrected estimator is not somehow better than the plug-in estimator. – Andrew Aug 02 '23 at 06:19
  • Yes, that is just the law of large numbers @stats_noob. Another way is to view it as the plug in estimator for the sample mean – Andrew Aug 02 '23 at 06:21
  • If you are not familiar with the plug in estimator, the concept is this. Many quantities of interest are functionals of the distribution function $F$, for example the expectation operator $E$ can be viewed as a functional on the space of (nice) distribution functions $E: F\mapsto \int x\,dF(x)$. If you have an i.i.d. sample, you can estimate $F$ with the ECDF $\hat F_n$; this is justified by Glivenko-Cantelli. The plug in estimator is defined as using $\hat F_n$ to estimate said quantity of interest in place of $F$. – Andrew Aug 02 '23 at 06:27
  • @AndrewZhang I suppose what I'm missing is that I tend to think of bias and not maximum likelihood (and yes, both estimators actually are biased in this case). It does seem to be conventional, however, to use the sample s.d. estimator for samples. – David K Aug 02 '23 at 06:37
  • @AndrewZhang : Thank you for your reply! I am trying to see if there is a way to prove formulas such as: $$ E(X^2) = \int x^2 * f(x) dx \approx \frac{\sum x_i ^2 }{n}$$ – stats_noob Aug 02 '23 at 06:39
  • 1
    @stats_noob That is just the law of large numbers – Andrew Aug 02 '23 at 06:47
  • 1
    @stats_noob --- answered in the comments and in replies here(https://math.stackexchange.com/questions/4746246/why-can-we-replace-p-with-its-estimate-hatp-and-not-lose-the-normality-of/4746252#4746252) and https://math.stackexchange.com/questions/4742029/does-the-central-limit-theorem-work-for-a-single-sample/4745085#4745085 – Annika Aug 02 '23 at 21:00
  • And in this reply: https://math.stackexchange.com/a/4391692/949989 – Kurt G. Aug 04 '23 at 07:48
  • Historically I do not know, but my mentor convinced me it is the most practical, ergo the easiest to compute with some smart tricks. I forgot the details though. It also makes sense to use the square because it does not distinguish between negative and positive input: if a value is 2 below or above the average, its contribution to the standard deviation is the same. The absolute value is not differentiable and is numerically harder to work with. – mick Aug 09 '23 at 20:38
  • ...Also, why use 4th powers? That would be silly and distort things a lot. So we use squares, and we ONLY compare the values $X$ with the average. That does not leave us much choice in defining it, and it keeps the expression simple and short. – mick Aug 09 '23 at 20:41

2 Answers

1

The standard deviation is the simplest measure of how far the sample is from being constant: it is the normalized Euclidean distance from the sample to the space (line) of constant samples.

Note that the closest (in the Euclidean sense) constant $n$-dimensional vector $(t,t,\dots, t)$ to the sample vector $\boldsymbol X=(X_1, X_2, \dots, X_n)$ is the one given by $t=\overline{\boldsymbol X}$ (this is easily seen, for instance, by taking derivatives in $\sum_i(X_i-t)^2$ as a function of $t$). The distance is $\sqrt{\sum_i(X_i-\overline{\boldsymbol X})^2}$. This distance is normalized (dividing by $\sqrt n$) so that the distance between the constant vectors $(0,0,\dots,0)$ and $(1,1,\dots,1)$ is $1$ regardless of the sample size $n$.
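
For the minimization step mentioned above: differentiating with respect to $t$,

$$\frac{d}{dt}\sum_i(X_i-t)^2 = -2\sum_i(X_i-t)\,,$$

which vanishes precisely when $t=\frac1n\sum_i X_i=\overline{\boldsymbol X}$.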

[Edited three days later] Up to this line it was of course an answer to the question "where does the formula come from", thinking of the standard deviation as the sample s.d. without Bessel's correction. Let's denote it $S_u(\boldsymbol X)$. In order to put it in context with random variables, their expected value and the Strong Law of Large Numbers, the key point is that (as mentioned in the question text) the sample variance (i.e. the squared normalized distance) satisfies $S_u^2(\boldsymbol X) = \overline{{\boldsymbol X}^2}-(\overline{\boldsymbol X})^2$, and then for any random variable $X$ such that ${\rm E}X^2$ is finite it follows, by the LLN, that $S_u^2(\boldsymbol X)\underset{a.s.}\longrightarrow {\rm E}X^2-({\rm E}X)^2$.
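
A minimal numerical sketch of this almost sure convergence (using a Uniform(0,1) population, chosen only for illustration, for which ${\rm E}X^2-({\rm E}X)^2 = 1/12 \approx 0.0833$):

```python
import numpy as np

rng = np.random.default_rng(1)
for n in (10, 1_000, 100_000):
    x = rng.uniform(size=n)                 # sample of size n from Uniform(0, 1)
    s_u_sq = np.mean(x**2) - np.mean(x)**2  # S_u^2 = mean of squares minus squared mean
    print(n, s_u_sq)                        # approaches Var(X) = 1/12 as n grows
```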

Therefore, everything fits perfectly now if we consider the usual $L^2$ norm as the distance between random variables in any given probability space: the closest constant variable to $X$ is ${\rm E}X$, and the squared distance between $X$ and the space of constant variables is then ${\rm E}(X-{\rm E}X)^2$, which is called the variance of $X$, ${\rm Var}\, X$, and equals ${\rm E}X^2-({\rm E}X)^2$. The standard deviation of $X$ is finally defined as $\sigma_X = \sqrt{{\rm Var}\, X}$ (i.e. the $L^2$ distance between $X$ and ${\rm E}X$), and it follows that $$ S_u(\boldsymbol X)\underset{a.s.}\longrightarrow \sigma_X\,. $$

As a final note, this makes it clear why we have to normalize distances in ${\mathbb R}^n$ ($n$ is the sample size) if we want to use statistics to get estimates about a random variable.

0

There are two senses in which we can view the sample standard deviation formula.

  1. As the exact value of the standard deviation of the empirical distribution implied by the sample.
  2. As an estimator of the standard deviation of the distribution from which the sample was drawn.

In either case, the standard deviation is based on the second central moment $\mu_2$ of a distribution $F$ supported on domain $D$:

$$ \mu_2 = \int_D (x-\mu_1)^2dF(x)$$

Case 1: Exact moments of an empirical distribution

A sample of size $n$ will define a discrete distribution called the empirical distribution

$$F_n(x):= \frac1n \sum_{i=1}^n \mathbf{1}\{x_i \leq x\}$$

This is a perfectly valid probability distribution (nothing special about the fact that we chose the "jumps" based on sample values).

In this case, the second central moment of $F_n$ reduces to the (uncorrected) sample variance, whose square root is the sample standard deviation:

$$ \mu_2 = \int_D (x-\mu_1)^2\, dF_n(x) = \sum_{i=1}^n (x_i-\mu_1)^2\left(\frac1n\right) = \frac1n \sum_{i=1}^n (x_i-\bar x)^2,$$

since the first moment of $F_n$ is $\mu_1 = \bar x$.

So the empirical distribution qua distribution has central moments just like every other distribution (e.g., Normal, Poisson, etc).
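
As a short numerical sketch of Case 1 (the sample and distribution below are arbitrary, chosen only for illustration): treating each observation as carrying probability mass $1/n$ reproduces exactly the $1/n$-denominator variance and standard deviation.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=2.0, size=50)  # any finite sample defines an F_n

n = len(x)
w = np.full(n, 1.0 / n)                      # F_n puts probability mass 1/n on each observation

mu1 = np.sum(w * x)                          # first moment of F_n  (= sample mean)
mu2 = np.sum(w * (x - mu1) ** 2)             # second central moment of F_n

print(mu2, np.var(x))                        # identical: np.var divides by n (ddof=0) by default
print(np.sqrt(mu2), np.std(x))               # the corresponding standard deviations
```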

Case 2: Estimators of population moments

The second central moment is a specific example of a broader class of objects called statistical functionals, $S: \mathbb{P} \to \mathbb{R}$, which take a distribution as input and return a real number.

As mentioned by Andrew Zhang (and by myself in responses to another question of yours), the general principle used to justify using sample statistics as estimators of population properties is called the plug-in principle (addressed at length elsewhere). It says that, under some rather general conditions, sample statistics converge in probability to their associated population values:

$$S[F_n] \xrightarrow{p} S[F]$$

Fundamentally, this is supported by the Glivenko-Cantelli Theorem, while inference on the plug-in estimator usually appeals to the central limit theorem or other similar asymptotic results. An alternative approach is nonparametric, empirical-distribution-based inference using the DKW inequality.

In any case, the consistency of plug-in estimators is what allows us to justify using the second central moment of the empirical distribution $F_n$ as an estimator of the second central moment of the limiting distribution $F$ (i.e., the population/underlying distribution).
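
As a minimal simulation sketch (a Normal population with standard deviation 2, chosen only for illustration), both ingredients can be seen numerically: the sup-distance between $F_n$ and $F$ shrinks (Glivenko-Cantelli), and the plug-in standard deviation approaches the population value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_sd = 2.0

for n in (100, 10_000, 1_000_000):
    x = rng.normal(loc=0.0, scale=true_sd, size=n)

    # Kolmogorov-Smirnov distance: sup_x |F_n(x) - F(x)| between the ECDF and the true CDF
    xs = np.sort(x)
    F = stats.norm.cdf(xs, loc=0.0, scale=true_sd)
    d_plus = np.max(np.arange(1, n + 1) / n - F)
    d_minus = np.max(F - np.arange(0, n) / n)

    # Plug-in estimate of the standard deviation: sqrt of the second central moment of F_n
    sd_plug_in = np.sqrt(np.mean((x - x.mean()) ** 2))

    print(n, max(d_plus, d_minus), sd_plug_in)
```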

Annika