Why is variance squared?

Question

The mean absolute deviation is:

$$\dfrac{\sum_{i=1}^{n}|x_i-\bar x|}{n}$$

The variance is: $$\dfrac{\sum_{i=1}^{n}(x_i-\bar x)^2}{n-1}$$

So the mean deviation and the variance are measuring the same thing, yet variance requires squaring the difference. Why? Squaring always gives a non-negative value, but the absolute value is also a non-negative value.
Why isn't it $|x_i-\bar x|^2$, then? Squaring just enlarges, why do we need to do this?

A similar question is here, but mine is a little different.

Thanks.

One advantage is that you can take derivatives without worry. $f(x)=x^2$ is differentiable, but $g(x) = |x|$ isn't. — MPW, Mar 18 '14 at 21:56
Isn't your first question answered in the question you linked? If I understand the second question correctly, the answer might be simply noting that $|x_i - \bar{x}|^2 = (x_i-\bar{x})^2$. — JiK, Mar 18 '14 at 21:57
Variance/Standard Deviation also tend to be used because of the Central Limit Theorem — Mark Bennet, Mar 18 '14 at 22:03
This might be interesting: http://math.stackexchange.com/questions/700160/intuition-behind-variance-forumla — Cm7F7Bb, Mar 18 '14 at 22:03
The canonical answer should be this one: https://stats.stackexchange.com/questions/118/why-square-the-difference-instead-of-taking-the-absolute-value-in-standard-devia?rq=1 — leonbloy, Sep 24 '18 at 03:33

score 27 · Answer 1 · edited Sep 02 '16 at 17:01

A late answer, just for completeness with a different view on the thing.

You might look at your data as measured in a multidimensional space, where each subject is a dimension and each item is a vector in that space, pointing from the origin to the item's measurements across all subjects.
Additional remark: this view has a further nice flavour because it exposes the condition that the subjects are assumed independent of each other. That assumption is what makes the data space Euclidean; relaxing it requires changing the mathematics of the space, which then has correlated (or "oblique") axes.

Now the distance from one vector's arrowhead to another is just the formula for distance in Euclidean space, the square root of the sum of squared coordinate differences (from the Pythagorean theorem): $$d = \sqrt { (x_1-y_1)^2+(x_2-y_2)^2+ \cdots+(x_n-y_n)^2}$$ And the standard deviation is that value, normed by the number of subjects, if the mean vector is taken as the $y$-vector: $$\text{sdev} = \sqrt { {(x_1- \bar x)^2 +(x_2-\bar x)^2+ \cdots +(x_n-\bar x)^2 \over n} }$$
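A quick numerical check of this picture, as a minimal sketch. The data values are made up for illustration, and the comparison uses the population form (division by $n$), as in the formula above.

```python
import math
import statistics

# Hypothetical measurements, one coordinate per subject (illustrative values only).
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(x)

# The "mean vector": every coordinate equals the arithmetic mean of x.
mean_vector = [statistics.fmean(x)] * n

# Euclidean distance between the data vector and the mean vector.
distance = math.dist(x, mean_vector)

# Dividing the sum of squares by n inside the root equals dividing the distance by sqrt(n).
sdev_from_distance = distance / math.sqrt(n)

# Population standard deviation from the standard library, for comparison.
sdev_library = statistics.pstdev(x)

print(sdev_from_distance, sdev_library)
```

The two printed values should agree up to floating-point rounding.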

This postpones the problem to why Euclidean distances would be more adapted than others to the situation. — Did, Sep 19 '14 at 08:54
@Did: true. Anyway: for me this was something like "enlightenment" when I realized this way of viewing things. Perhaps it is possible to even enhance the aspects which give it a "natural meaning" - like identifying the resultant of (physical) forces with that of (for instance, psychometric) items in factor analysis, regression, et al. As I said in the first sentence: a completion/an additional view... — Gottfried Helms, Sep 19 '14 at 09:17

score 9 · Answer 2 · answered Mar 18 '14 at 22:02

They don't measure the same thing. To see this, think about physical units.

Suppose the value of $x$ is measured in seconds. For example, $n$ people do a 100-meter race and the values $x_i$ are how many seconds it took each one to finish.

The formula $|x_i - \bar x|$ measures the difference of two times, so it's also measured in seconds.

The mean absolute deviation is therefore an average of second-values, so it's also measured in seconds.

However, the formula $(x_i - \bar x)^2$ squares the difference of two times, so it's measured in seconds squared. The variance is therefore also in seconds squared. They don't belong to the same physical space of variables, so they measure different things.

The standard deviation, however (the square root of the variance) is again measured in seconds, so it measures something similar (at least, physically similar).

As for why we like the square-root-of-average-of-squares better than the average-of-absolute-values - the square has better mathematical properties, as shown in other answers and in the link you referred to (particularly Rich's answer).

Michael Hardy · Answer 3 · 2020-01-12T22:54:56.803

You say the variance is $\dfrac{\sum_{i=1}^ n(x_i-\bar x)^2}{n-1}$.

What if I told you the variance is $\dfrac{\sum_{i=1}^n(x_i-\bar x)^2} n$?

You can find both in textbooks. The fact is, dividing by $n-1$ rather than $n$ is properly done (if at all) ONLY when one is estimating the population variance by using a finite sample $x_1,\ldots,x_n$ that is not the whole population. If $x_1,\ldots,x_n$ is the whole population and each point is equally probable, then the variance of that population is given by the second expression above, not the first.
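The two divisors can be compared side by side. This is a small sketch with made-up data; Python's standard `statistics` module exposes both conventions as `pvariance` (divide by $n$) and `variance` (divide by $n-1$).

```python
import statistics

# Hypothetical data (illustrative values only).
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(x)

mean = sum(x) / n
sum_of_squares = sum((xi - mean) ** 2 for xi in x)

# Variance of the data treated as the whole population: divide by n.
population_variance = sum_of_squares / n

# Estimate of a larger population's variance from this sample: divide by n - 1.
sample_variance = sum_of_squares / (n - 1)

print(population_variance, statistics.pvariance(x))
print(sample_variance, statistics.variance(x))
```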

Now here's the important point:

\begin{align} & \operatorname{var}(X_1+\cdots+X_n) \\[8pt] = {} & \operatorname{var}(X_1) + \cdots + \operatorname{var}(X_n) \tag 1 \end{align} if $X_1,\ldots,X_n$ are independent random variables.

That does not work with mean absolute deviation. (It also does not work in the version with $n-1$ instead of $n$.)
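Both claims can be checked exactly on a tiny case. The sketch below assumes two independent fair coin tosses (the choice of distribution is an illustration, not part of the argument) and uses exact fractions, so no rounding is involved.

```python
from fractions import Fraction
from itertools import product

# Distribution of one fair coin toss: value -> probability (assumed for illustration).
coin = {0: Fraction(1, 2), 1: Fraction(1, 2)}

def mean(dist):
    return sum(v * p for v, p in dist.items())

def variance(dist):
    m = mean(dist)
    return sum(p * (v - m) ** 2 for v, p in dist.items())

def mean_abs_deviation(dist):
    m = mean(dist)
    return sum(p * abs(v - m) for v, p in dist.items())

def sum_of_independent(a, b):
    """Distribution of A + B when A and B are independent."""
    out = {}
    for (va, pa), (vb, pb) in product(a.items(), b.items()):
        out[va + vb] = out.get(va + vb, 0) + pa * pb
    return out

total = sum_of_independent(coin, coin)

# Variance adds up across independent summands.
print(variance(total), 2 * variance(coin))

# Mean absolute deviation does not.
print(mean_abs_deviation(total), 2 * mean_abs_deviation(coin))
```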

Now suppose $n=1800$ and each $X_i$ is the number of "heads" seen on the $i$th coin toss, so $X_i$ is either $0$ or $1$. Then the sum is the number of "heads" in $1800$ tosses. What is the probability that that number is at least $890$ but not more than $905$? To answer that, one approximates the distribution of the number of "heads" by the normal distribution with the same expected value and the same variance. Without the identity $(1)$, one would not know what that variance is! Abraham de Moivre discovered all this in the $18$th century. And that is why standard deviations rather than mean absolute deviations are used.
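The calculation described above can be sketched as follows. Two things here are assumptions not stated in the answer: the coin is taken to be fair ($p = 1/2$), and a continuity correction of $0.5$ is applied at each end of the interval, which is a common refinement when approximating a discrete count by a normal distribution. An exact binomial sum is included only as a sanity check.

```python
import math

n = 1800
p = 0.5  # assumption: a fair coin

# Expected value of the sum, and its variance via identity (1):
# n independent summands, each with variance p * (1 - p).
mean = n * p
variance = n * p * (1 - p)
sd = math.sqrt(variance)

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

lo, hi = 890, 905

# Normal approximation with continuity correction (an added refinement).
approx = normal_cdf((hi + 0.5 - mean) / sd) - normal_cdf((lo - 0.5 - mean) / sd)

# Exact binomial probability for comparison; this shortcut is valid only for p = 1/2.
exact = sum(math.comb(n, k) for k in range(lo, hi + 1)) / 2 ** n

print(variance, sd)
print(approx, exact)
```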

balaks · Answer 4 · 2016-09-02T00:37:33.097

There is a very simple explanation for this: it allows for the calculation of analytical solutions for many interesting problems.

As others have pointed out before, $x^2$ is differentiable, whereas $|x|$ is not. Hence, in problems where quadratic terms are present, one can differentiate them to find optimal solutions analytically.

On the other hand, with $|x|$, one often has to resort to numerical schemes to handle the absolute value. Another flip side to using quadratic terms is that the outliers (i.e. large and small $x$ values) have a much higher influence on the $x^2$ terms when compared to their influence on $|x|$. This may be good or bad depending on your application.

score 5 · Answer 5 · answered Mar 18 '14 at 22:10

If you don't have a preference for exactly how you measure deviation, then you should choose the measure that's easiest to compute with.

The standard deviation -- the square root of variance -- is rather nice for doing actual computations, because the variance has all sorts of nice properties. e.g. the function defining variance is everywhere differentiable (in fact, it's analytic), and is additive: i.e. $\operatorname{Var}(X+Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)$.

The identity $\operatorname{var}(X+Y) = \operatorname{var}(X) + \operatorname{var}(Y)$ does NOT work with the definition of "variance" given in the question. (See my posted answer.) $\qquad$ – Michael Hardy Sep 02 '16 at 17:03
Also, it works only when $X,Y$ are uncorrelated, so that should get mentioned here. – Michael Hardy Nov 16 '19 at 00:20

score 3 · Answer 6 · answered Mar 18 '14 at 21:59

Variance is, as you say, a measure of deviation. Or, rather, standard deviation (the square root of the variance) is a measure of deviation. So it's really standard deviation and average deviation you ought to compare.

The difference is the following: If $d_i = |x_i-\bar x|$ are the absolute deviations, then the average deviation is $$ \frac{d_1 + d_2 + \cdots + d_n}{n} $$ while the standard deviation is $$ \sqrt{\frac{d_1^2 + d_2^2 + \cdots + d_n^2}{n}} $$ The average deviation uses what is called the arithmetic mean, and the standard deviation uses what is called the quadratic mean. It is not very difficult to show that, as long as not all the $d_i$ are equal, the standard deviation is strictly larger.

So standard deviation is more affected by outliers than is the average deviation. That is really all there is to it.
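The comparison between the two means can be tried on small data sets. This is a sketch with invented values; both functions divide by $n$, as in the formulas above.

```python
import math

def average_deviation(x):
    """Arithmetic mean of the absolute deviations from the mean."""
    m = sum(x) / len(x)
    return sum(abs(xi - m) for xi in x) / len(x)

def standard_deviation(x):
    """Quadratic mean of the absolute deviations from the mean."""
    m = sum(x) / len(x)
    return math.sqrt(sum((xi - m) ** 2 for xi in x) / len(x))

# Hypothetical data sets (illustrative values only).
regular = [1.0, 2.0, 3.0, 4.0, 5.0]
with_outlier = [1.0, 2.0, 3.0, 4.0, 50.0]
equal_deviations = [0.0, 2.0]  # both points lie at distance 1 from the mean

for data in (regular, with_outlier, equal_deviations):
    print(data, average_deviation(data), standard_deviation(data))
```

In the first two data sets the deviations are not all equal, so the standard deviation comes out larger; in the last one the two measures coincide.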

score 2 · Answer 7 · answered Mar 18 '14 at 21:56

The variance and the mean absolute deviation don't measure the same thing. The mean absolute deviation and the standard deviation do (notice the similarity of their names).

The variance is convenient because it satisfies the property that the variance of independent random variables is the sum of the variances.

arsmath

In what sense do mean absolute deviation and standard deviation measure the same thing? – JiK Mar 18 '14 at 21:58
I assume the asker means it in the vague sense that they measure the "spread" of the random variable. For example, the "spread" of cX, where c is a constant and X is a random variable should be c times the spread of X. Obviously they are not exactly the same quantities, but the asker sounded to me like they were clear on this point. – arsmath Mar 18 '14 at 22:03

score 2 · Answer 8 · answered Mar 18 '14 at 21:58

First of all, $|\cdot|^2$ is exactly the same as $(\cdot)^2$ for real $x$. As you mentioned, the two measures have some similar characteristics, but for many optimization problems involving Gaussian densities, the optimum result is achieved by squaring. You might want to have a look at the Viterbi detector, for example, or, to give another example from estimation theory, the energy detector.

One can still use the sample absolute deviation instead of the sample variance and obtain very good performance, but for the examples I gave the result will NOT be optimum.

score 2 · Answer 9 · answered Mar 18 '14 at 22:00

A similar case arises in linear regression, where the "least squares method" is used instead of, for example, a (fictitious) "least absolute values method". In that case the reason is that squaring has better properties concerning the derivative (minimizing the variability).

Similar reasons apply in the case above; they have to do with estimating the bias (of the corresponding sample measure) or with other calculations, such as determining the distribution of a sample statistic. Moreover, squaring the absolute value is the same as squaring the value itself, i.e. $$|x_i-\bar x|^2=(x_i-\bar x)^2$$ so this alteration makes no difference at all.

justasking · Answer 10 · 2020-10-18T15:49:03.440

Because $\sum_{i=1}^n (x_i - \bar x) = \left(\sum_{i=1}^n x_i\right) - n\bar x = n\bar x - n\bar x = 0$; the average signed deviation from the mean must be zero, by definition (the sum of all the values equals the mean times the number of values). However, if you square, the values that are lower than the mean no longer contribute 'negatively' to the sum and cancel the positive ones. Thus, you get the sum of the squared distances (whose average you can then square-root to get back to the original units). You could also use an absolute value to do this.

score 0 · Answer 11 · answered May 11 '19 at 14:46

The formulas for variance and standard deviation fall out very nicely from the geometry of the statistics. This approach is not taught in any standard stats courses, but it should be. The answer above discussing subject space (versus variable space) is on the right track.

Almost all scatter plots we study are in variable space: for example, the x axis is "height" and the y axis is "weight." But a fundamentally different approach to plotting data is "subject space," where, for example, the x axis is Subject 1 and the y axis is Subject 2. Due to flipping things around like this, in subject space each variable is plotted as a vector. If the vector is centered, its length is proportional to its standard deviation. The length of the vector is calculated by the Pythagorean theorem, which uses squares; therefore, the formula for calculating the SD and variance uses squares or square roots.

By the way, the correlation between variables is the cosine of the angle between vectors (remember, in subject space each vector represents a variable), which leads directly to the formula for calculating the Pearson correlation coefficient. Calculation of P values is also explained nicely by this framework.

These concepts are described well in the books "Geometry of Multivariate Statistics" by Thomas Wickens and "Statistical Methods: The Geometric Approach" by David Saville and Graham Wood. Historically, Sir Ronald Fisher invented these concepts, and the field of parametric statistics, around 1910-1930. Although a genius, he was a poor teacher. As a result, no one really understood how he invented these concepts, or was able to teach them, for decades, until people like Wickens, Saville and Wood figured it out.

score 0 · Answer 12 · answered Oct 25 '22 at 14:37

The long and the short is that the squared deviation has a unique, easily obtainable minimizer (the arithmetic mean), and an inherent connection to the normal distribution. The absolute deviation, on the other hand, can admit multiple non-unique, potentially laborious to obtain minimizers (medians). For a simple illustration of this, observe that the set $\{0,1\}$ admits for a value $x$ the total absolute deviation ($L_1$ norm) $$|x-0|+|x-1|=\begin{cases}1-2x,&x\le0 \\1,&0<x\le1 \\2x-1,&1<x\end{cases}$$ which can be seen to be a piecewise linear/constant function minimized to $1$ by all $x$ in $[0,1]$. Instances with more points may be even more pathological and not admit a simple method of optimization. On the other hand, the total squared deviation ($L_2$ norm) of the same set would be $(x-0)^2+(x-1)^2=2x^{2}-2x+1$, a quadratic function with a unique minimizer of $x=0.5$, easily obtainable by setting its derivative to zero.
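The $\{0,1\}$ illustration above can be reproduced with a brute-force scan. This is only a sketch: the grid from $-1$ to $2$ in steps of $0.1$ is an arbitrary choice, and exact fractions are used so that ties between grid points are detected without rounding issues.

```python
from fractions import Fraction

points = [0, 1]

def total_absolute_deviation(x):
    return sum(abs(x - p) for p in points)

def total_squared_deviation(x):
    return sum((x - p) ** 2 for p in points)

# Candidate values from -1 to 2 in steps of 1/10 (arbitrary illustrative grid).
grid = [Fraction(k, 10) for k in range(-10, 21)]

min_abs = min(total_absolute_deviation(x) for x in grid)
abs_minimizers = [x for x in grid if total_absolute_deviation(x) == min_abs]

min_sq = min(total_squared_deviation(x) for x in grid)
sq_minimizers = [x for x in grid if total_squared_deviation(x) == min_sq]

# Every grid point in [0, 1] ties for the smallest total absolute deviation.
print(min_abs, [float(x) for x in abs_minimizers])

# Only the mean, 1/2, attains the smallest total squared deviation.
print(min_sq, [float(x) for x in sq_minimizers])
```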

The connection of squared deviations to the normal distribution is highly attractive, first for the distribution's ubiquitous applicability to real world phenomena (hence the name), for instance, for dispersed measurements taken from populations or for errors in measurements. Second, the connection is attractive due to the normal distribution's enormously convenient theoretical properties, for instance, since normal distributions are symmetric about their means, have easily obtainable centers and dispersions, are closed under summation, and so on. Furthermore, from a practical point of view, there is extensive theoretical groundwork already established for the normal distribution, which is opportune to lean on.

These characteristics can ultimately be seen as consequences of the various convenient mathematical properties of $x^2$ lacked by $|x|$, e.g. differentiability everywhere (facilitating minimization), that the set of quadratic functions are closed under summation (the sum of two quadratics is another quadratic), and so on.

So this is not to say that absolute deviations are not used or less applicable than squared deviations. On the contrary. Instead, they are, in many relevant ways, less convenient to apply.
