9

In regression analysis one finds a line that fits best by minimizing the sum of squared errors.

But why squared errors? Why not use the absolute value of the error?

It seems to me that with squared errors the outliers gain more weight. Why is that justified? And if it is justified to give the outliers more weight, then why give them exactly this weight? Why not, for example, take the least sum of exponential errors?


Edit: I am not so much interested in the fact that it might be easier to calculate. Rather the question is: does squaring the errors result in a better fitting line compared to using the absolute value of the error?

Furthermore I am looking for an answer in layman's terms that can enhance my intuitive understanding.

  • 1
    Squared error is like variance, which is easier to work with in most instances than absolute error, which is like standard deviation. The latter includes a square root. – jdods Jul 15 '16 at 18:09
  • 1
    Some of the related Qs on stats.SE: relevant Q1; relevant Q2; relevant Q3... try some searches for more – Glen_b Jul 17 '16 at 06:28
  • 1
    Also be aware of the Gauss-Markov Theorem: https://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem – 3x89g2 Jul 17 '16 at 08:43
  • Probably mostly because of a combination of tradition, simplicity and convenience. Least squares is a well-known algorithm; it is easy to learn and fast to compute, and it dates back to the days of Gauß in the 18th century (and was probably known even earlier). It was not until quite recently that fast algorithms appeared for other errors, like the popular sum of absolute values (the $\ell_1$ norm). – mathreadler Jul 31 '16 at 13:29

9 Answers

8

From a Bayesian point of view, this is equivalent to assuming that your data is generated by a line plus Gaussian noise, and finding the maximum-likelihood line under that assumption. Using absolute values instead means assuming that your noise has a pdf proportional to $e^{-|x|}$, which is substantially less natural than assuming Gaussian noise (e.g. Gaussian noise falls out of the central limit theorem).

Using the squared errors also makes the regression extremely easy to compute, which is probably a major practical factor. Most other functions of the error would result in something much more annoying to compute.
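Here is a rough numerical sketch of that equivalence (my own illustration, with made-up data, not part of the answer above): under Gaussian noise with known scale, the negative log-likelihood of any candidate line differs from half the sum of squared residuals only by a constant, so the maximum-likelihood line and the least-squares line coincide.

```python
import numpy as np

# Made-up data: a line plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=x.size)

def sum_sq_errors(a, b):
    """Sum of squared residuals of the line y = a + b*x."""
    return np.sum((y - (a + b * x)) ** 2)

def gaussian_neg_log_lik(a, b, sigma=1.0):
    """Negative log-likelihood of the same line under Gaussian noise."""
    r = y - (a + b * x)
    return 0.5 * np.sum((r / sigma) ** 2) + r.size * np.log(sigma * np.sqrt(2 * np.pi))

# The difference is the same constant for every candidate (a, b), so both
# objectives are minimized by the same line.
for a, b in [(1.0, 2.0), (0.0, 0.0), (5.0, -1.0)]:
    print(gaussian_neg_log_lik(a, b) - 0.5 * sum_sq_errors(a, b))
```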

Qiaochu Yuan
  • 419,620
5

Many insightful answers here.

I'd like to share something I came across a while ago that might help you with your edited question:

Edit: I am not so much interested in the fact that it might be easier to calculate. Rather the question is: does squaring the errors result in a better fitting line compared to using the absolute value of the error?

Furthermore I am looking for an answer in layman's terms that can enhance my intuitive understanding.

No, squaring the errors doesn't always result in a better fitting line.

Here's a figure comparing the best fit lines produced by L-1 regression and least squares regression on a dataset with outliers:

Click here for figure

As you've pointed out, outliers adversely affect least squares regression. Here's an instance where least squares regression gives a best fit line that "pans" towards outliers.

Full credit to: matlabdatamining.blogspot.sg/2007/10/l-1-linear-regression.html
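For readers who want to reproduce the effect without the figure, here is a small sketch of my own (not the blog's code; the data are invented) comparing the two fits on contaminated data:

```python
import numpy as np
from scipy.optimize import minimize

# Invented data: a clean linear trend with two gross outliers injected.
rng = np.random.default_rng(1)
x = np.arange(20, dtype=float)
y = 5.0 + 3.0 * x + rng.normal(scale=1.0, size=x.size)
y[[3, 15]] += 60.0

# Least squares (L2) fit has a closed form.
slope_l2, intercept_l2 = np.polyfit(x, y, 1)

# Least absolute deviations (L1) fit via a generic optimizer.
def l1_loss(params):
    a, b = params
    return np.sum(np.abs(y - (a + b * x)))

intercept_l1, slope_l1 = minimize(l1_loss, x0=[intercept_l2, slope_l2],
                                  method="Nelder-Mead").x

print("L2: intercept %.2f, slope %.2f" % (intercept_l2, slope_l2))
print("L1: intercept %.2f, slope %.2f" % (intercept_l1, slope_l1))
# Typically the L1 fit stays near the clean values (intercept 5, slope 3),
# while the L2 fit is pulled toward the outliers (here mostly in the intercept).
```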

3

You square the error terms because of the Pythagorean theorem x^2 + y^2 = z^2.

Consider just the 2-dimensional case.

The x and y correspond to error terms in each orthogonal dimension. But that hypotenuse z is the distance you really want to minimize.

Now minimizing the sum of the squares of x and y will also minimize the square root of the sum of the squares. So there is no need to take the final square root.

With a little thought you will see that this works as you add more x,y error terms to the mix. Minimizing

x1^2 + y1^2 + ... + xN^2 + yN^2

has the effect of also minimizing the overall sum of the distances (all those little hypotenuses)

sqrt(x1^2 + y1^2) + ... + sqrt(xN^2 + yN^2) = z1 + ... + zN

but is much simpler to calculate.

Make sense?

Ok, so what would happen if you took absolute values and minimized

|x1| + |y1| + ... + |xN| + |yN| ?

Instead of minimizing the sum of the distances, you would bias the resulting fit toward a slope of 1 or -1 and away from lines with slopes near 0 or infinity. Of course you can do that, but your resulting fit will be pulled toward a line with a slope of plus or minus 1 and away from the solution that minimizes those Pythagorean distances.

  • 1
    I disagree really, because typically in regression we consider only the vertical differences, not the possibility that there are also horizontal differences. So there isn't really "physical" geometry going on here. We're sort of projecting our physical geometry onto a nonphysical space. – Ian Jul 28 '16 at 21:32
  • Darn it! Ian is right. "Least squares" linear regression is based on vertical offsets not perpendicular offsets. So my geometric argument does not apply to this problem :-( – M. Cornwell Jul 28 '16 at 22:09
3

Basically, you can ask the same question in the much simpler setting of finding the "best" average of values $x_1,\ldots,x_n$, where I here use "average" in the general sense of a single value that represents them, such as the (arithmetic) mean, geometric mean, median, or $l_p$-mean (not sure if that's the right name).

For data that actually come from a normal distribution, the mean will be the most powerful estimator of the true mean. However, if the distribution is long-tailed (or has extreme values) the median will be more robust.

You can also use the $l_p$ norm and find the $l_p$-mean, $u$, that minimises $\sum_i |x_i-u|^p$ for any $p\ge1$. (For $p<1$ this need no longer be unique.) For $p=2$ we have the traditional square distance, while for $p=1$ we get the median (almost). I once found $p=1.5$ to behave well in terms of both power and robustness.
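If you want to experiment with this, here is a small sketch of my own (with invented data) that finds the $l_p$-mean numerically for a few values of $p$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Invented data with one extreme value.
x = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 50.0])

def lp_mean(x, p):
    """The value u minimising sum_i |x_i - u|^p, found numerically."""
    res = minimize_scalar(lambda u: np.sum(np.abs(x - u) ** p),
                          bounds=(x.min(), x.max()), method="bounded")
    return res.x

print("p=1   ->", lp_mean(x, 1.0), "(compare the median:", np.median(x), ")")
print("p=1.5 ->", lp_mean(x, 1.5))
print("p=2   ->", lp_mean(x, 2.0), "(compare the mean:", np.mean(x), ")")
# The extreme value drags the p=2 answer far more than the p=1 answer.
```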

So, switching from least square regression ($l_2$-norm) to using absolute distance ($l_1$-norm) corresponds to switching from mean to median. Which is better depends on the data, and also on the context of the analysis: what you are actually looking for.

The mean does have the advantage that it is an unbiased estimator of the true mean no matter what the underlying distribution is, but usually accuracy is more important than unbiasedness.

2

Minimizing the $\ell_2$-norm of the residual is certainly not always the best thing to do, for the reason you said: it puts too much weight on outliers. For that reason people often minimize the $\ell_1$-norm of the residual. The $\ell_1$-norm is much more robust against outliers. (The $\ell_1$-norm does not consider it to be a disaster if a few components of the residual are large.)

Other penalty functions can be useful also, such as the $\ell_\infty$-norm or the Huber penalty. This is discussed in more detail for example in chapter 6 of the book Convex Optimization by Boyd and Vandenberghe (which is free online). See example 6.2 ("robust regression") and the accompanying figure 6.5, for example.
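As a rough sketch of what a Huber fit looks like in practice (a toy example of my own, not taken from the book; the threshold delta and the data are arbitrary):

```python
import numpy as np
from scipy.optimize import minimize

# Toy contaminated data.
rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 40)
y = -1.0 + 4.0 * x + rng.normal(scale=0.1, size=x.size)
y[::13] += 3.0  # a few gross outliers

def huber(r, delta=0.5):
    """Huber penalty: quadratic for |r| <= delta, linear beyond that."""
    return np.where(np.abs(r) <= delta,
                    0.5 * r ** 2,
                    delta * (np.abs(r) - 0.5 * delta))

def objective(params):
    a, b = params
    return np.sum(huber(y - (a + b * x)))

a_hat, b_hat = minimize(objective, x0=[0.0, 0.0]).x
print("Huber fit: intercept %.2f, slope %.2f" % (a_hat, b_hat))
# The large residuals from the outliers are penalised only linearly,
# so they do not dominate the fit the way they would under least squares.
```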

littleO
  • 51,938
1

Your question seems to imply that least squares regression is the only method for fitting a linear model. As mentioned in other answers, there are other perfectly legitimate methods that can be used to fit a linear predictor. A common thread of these methods is that they are tractable, i.e. there are concrete steps that can be taken to find the actual solution (or rather, an approximation to the solution within some acceptable tolerance).

Tractability is not an intrinsic property of the method. What is tractable at any given time depends on the state of technological developments. If some day quantum computing becomes part of standard technological development, then the list of tractable methods will be greatly expanded.

In the times of Gauss and Euler, the list of tractable methods was far more limited than our current list, and least squares was a technological advance with lasting consequences.

A second important quality of whatever method one chooses is effectiveness. Gauss's use of least squares helped him make important, accurate predictions in the context of astronomical observations. I speculate that, faced with Gauss's success, researchers wanted to know how he did it, rather than why he did what he did.

A third feature of fitting methods that tilts the scale in favour of least squares is interpretability. We seek models to abstract patterns from observations, so that we can understand differences and make predictions. The theoretical framework of least squares provides guidance in model building. Lately I had the chance to apply the LASSO, which minimizes the sum of squared errors with a penalty on the sum of the absolute values of the parameters. The model was selected by cross-validation. None of the niceties regarding significance of the coefficients, which are at the core of least squares, are immediately available. For some time, much of observational science consisted of finding statistical significance in model parameters, because producing a model was the goal of data analysis. Of late, there has been increasing interest in using models to predict expected outcomes, which has resulted in the addition of various model-fitting methods to the analyst's toolbox.

Since I digressed, I will summarize my answer:
  • least squares regression is not the only method in use to fit linear models;
  • the methods in use are those that are tractable (can be implemented), effective (solve the problem at hand), and yield interpretable results (when the need arises).

As for the popularity of least squares:
  • from the mid-1700s to the time of widespread availability of computing machines, least squares regression was the state of the art in linear model fitting (disregard the objections of the Bayesians; they had conjugate pairs, but not until the late 20th century could they handle more general parameter priors);
  • least squares regression, when its assumptions are met, provides a framework that can be used for guidance in model building.

Now I'll digress again by addressing objections.

Objector: ...but least squares minimizes a function that is differentiable...
Answer 1: So? Convex minimization is well developed, and numerical methods are available.
Answer 2: It is 2016; enough with eighteenth-century technology.

Objector: ...but p-values, where art thou?
Answer: If you need p-values to publish, then use least squares. You can also use other methods of model fitting and estimate the distribution of parameter estimates through, for example, bootstrapping. If what you need are predictions, then you need not worry about p-values. Use statistical methods to ensure your models are stable and the results reproducible. The importance of p-values in the scientific literature has been overplayed, either by dishonesty or ignorance. The loss of significance or strength of relations in successive repetitions of many experiments is a well-documented fact, caused by p-value-significance-driven models.
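As a concrete sketch of the bootstrapping suggestion just made (a toy example with invented data; the L1 fit and the case-resampling scheme are just one possible choice):

```python
import numpy as np
from scipy.optimize import minimize

# Invented, heavy-tailed data.
rng = np.random.default_rng(3)
x = np.linspace(0.0, 5.0, 60)
y = 2.0 + 1.5 * x + rng.standard_t(df=3, size=x.size)

def l1_fit(x, y):
    """Intercept and slope minimising the sum of absolute residuals."""
    loss = lambda p: np.sum(np.abs(y - (p[0] + p[1] * x)))
    return minimize(loss, x0=[0.0, 0.0], method="Nelder-Mead").x

# Resample cases with replacement and refit to approximate the slope's distribution.
slopes = []
for _ in range(500):
    idx = rng.integers(0, x.size, size=x.size)
    slopes.append(l1_fit(x[idx], y[idx])[1])

print("bootstrap 95% interval for the slope:", np.percentile(slopes, [2.5, 97.5]))
```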

Objector: ...but all the hordes trained in least squares...?
Answer: (speechless)

Objector: ...but should we dispose of least squares in our model-building pursuits?
Answer: No. There is nothing intrinsically wrong with least squares. It applies when the hypotheses underlying the method hold (namely, Gaussian distribution of the residuals and i.i.d. observations), and in any case least squares gives the BLUE (best linear unbiased estimator), which is often all you need.

Hope that helps. Thanks for the question.

0

The formal arguments have already been given in Qiaochu Yuan's answer.

So, let me give you a small example for a linear regression of the following data points $$\left( \begin{array}{cc} x & y \\ 1 & 5 \\ 2 & 8 \\ 3 & 11 \\ 4 & 14 \\ 5 & 16 \\ 6 & 19 \\ 7 & 22 \\ 8 & 24 \\ 9 & 27 \\ 10 & 30 \end{array} \right)$$

Minimizing $$S_1=\sum_{i=1}^{10}(a+bx_i-y_i)^2$$ is just trivial (especially if you use matrix calculations). Setting the partial derivatives to $0$ and solving gives, in a single step, $$a=2.6, \qquad b=2.72727, \qquad S_1=0.7636.$$

Starting with the above values as initial estimates for the minimization of $$S_2=\sum_{i=1}^{10}|a+bx_i-y_i|,$$ the solver needed twelve iterations to arrive at $$a=2.5, \qquad b=2.75000, \qquad S_2=2.2500.$$ As you can see, the parameters are very close.

For the solutions, the first model would lead to $S_2=2.3636$ and the second model to $S_1=0.8125$. Again, very close but the effort is quite different.

The fact that we can differentiate $S_1$ with respect to $a$ and $b$ makes the difference.

There are many other objective functions you could consider, but all of them (except the classical least squares method) will require a nonlinear optimizer and "reasonable" starting values.
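If you prefer to check these numbers yourself, here is a short sketch of my own (using generic numerical tools, not the solver used above) that fits the same ten points both ways:

```python
import numpy as np
from scipy.optimize import minimize

x = np.arange(1.0, 11.0)
y = np.array([5, 8, 11, 14, 16, 19, 22, 24, 27, 30], dtype=float)

# Least squares: a single closed-form solve, no iteration required.
b_ls, a_ls = np.polyfit(x, y, 1)
S1 = np.sum((a_ls + b_ls * x - y) ** 2)

# Least absolute deviations: needs an iterative optimizer.
S2_of = lambda p: np.sum(np.abs(p[0] + p[1] * x - y))
a_l1, b_l1 = minimize(S2_of, x0=[a_ls, b_ls], method="Nelder-Mead").x

print("least squares : a=%.4f  b=%.5f  S1=%.4f" % (a_ls, b_ls, S1))
print("least abs dev : a=%.4f  b=%.5f  S2=%.4f" % (a_l1, b_l1, S2_of([a_l1, b_l1])))
# Expected to be close to a=2.6, b=2.72727, S1=0.7636 and a=2.5, b=2.75, S2=2.25.
```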

0

A compromise between practicality and hypotheses:

Pro (1): Given a set of points $(t_i,y_i)_{1\leq i\leq N}$, a standard least squares regression (for estimating $a$ and $b$ in $y=at+b$) is optimal under the hypotheses: (I) no errors in $t_i$; (II) uniform Gaussian errors in the $y_i$ measurements.

Contra (2): Often $t_i$ is affected by errors as well. Often errors are not Gaussian. Other error measures are nowadays relatively easy to work with on computers, even the uniform norm or various other convex error functions, and many are available in standard libraries.

Pro (3): Really easy to work with, and the estimates are unique (given observations with at least two distinct $t$-values). Note that with the sum of absolute values you may run into non-uniqueness, in particular for the $b$-estimate (essentially because $u\mapsto u^2$ is strictly convex and $u\mapsto |u|$ is not). More importantly: often you really have no clue what a good model for the errors looks like, so why not take the simplest!

Contra (4): But... if you do have a good idea for an error model and it's not too complicated, there is a case for abandoning LS regression!

Pro (5): Under the Gaussian hypothesis you may go quite a lot further, e.g. testing the hypothesis that $a=0$ by applying a $\chi^2$ test. This comes for free with this model; not much is available for other models (to my knowledge).
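As an illustration of Pro (5), here is a toy sketch of my own (the data are invented, and the library routine below reports the usual $t$-test for a zero slope rather than a $\chi^2$ test): under the Gaussian/least-squares model the zero-slope test comes essentially for free with the fit.

```python
import numpy as np
from scipy.stats import linregress

# Invented data with a weak trend.
rng = np.random.default_rng(4)
t = np.linspace(0.0, 1.0, 30)
y = 1.0 + 0.2 * t + rng.normal(scale=0.5, size=t.size)

res = linregress(t, y)
# The reported p-value tests the hypothesis of zero slope under Gaussian errors.
print("slope estimate: %.3f   p-value for slope = 0: %.3f" % (res.slope, res.pvalue))
```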

H. H. Rugh
  • 35,236
0

In classical statistics the basic tool for inference is the data. These come from an assumed model and provide us with information about the unknown features of the distribution that they came from. Every statistical procedure is in essence an optimisation problem: under specific criteria we try to find the (or an) optimal solution, i.e. a function of the data which, under these specific criteria (restrictions), gives us the best estimates of the unknown features (parameters) of the distribution. In linear regression the most common choice is the least squares criterion, in which one seeks a set of values that minimises the sum of squared distances (errors) of the data from their corresponding linear components.

Now, why is the error of choice the squared error?

An error is how far one's opinion is from the true state of nature. It is the "price" that one has to pay for making the wrong decision.

First, why not exponential errors? Imagine how disproportionate a "price" one has to pay. Think about the exponential function: it is asymmetric and grows or decays really fast, meaning that for reasonable positive deviations you pay a very heavy price but for reasonable negative ones almost nothing.

So the contest is between the absolute and the squared error. We don't want higher powers because the errors would be unfairly sensitive to moderately distant observations.

I would like to underline that this is a general approach and each of these errors has its merits depending on the particular set of data at hand.

There is the following fact: the quantity that minimises the sum of absolute errors is the median, whereas for squared errors it is the mean.

Usually the median is not so manageable distributionally, whereas the mean is. The reason behind this is the normal distribution, which, for good reason most of the time, is the assumed distribution that the data came from.
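A tiny numerical sketch of that fact (my own, with invented numbers): scanning candidate "centre" values shows that the sum of absolute errors bottoms out at the median and the sum of squared errors at the mean.

```python
import numpy as np

# Invented data with one extreme value.
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
grid = np.linspace(0.0, 110.0, 11001)               # candidate centre values

sq_loss  = ((x[:, None] - grid) ** 2).sum(axis=0)   # sum of squared errors
abs_loss = np.abs(x[:, None] - grid).sum(axis=0)    # sum of absolute errors

print(grid[sq_loss.argmin()],  np.mean(x))    # both 22.0 (the mean is dragged by 100)
print(grid[abs_loss.argmin()], np.median(x))  # both 3.0  (the median is not)
```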

When data are assumed to be normal, distributions of expressions involving squared deviations are readily available. Famous tests like the $t$ or $F$ test are derived from functions that involve squared distances.

In a regression setting, estimating the parameters by minimising the sum of squared errors provides you with:

1) The best linear estimator of the parameters.

2) An unbiased estimator of the parameters.

If, in addition, the errors are normal, one has:

3) The exact distribution of the LS estimator.

4) The exact distribution of the variance estimator.

5) The exact distribution of the residuals.

6) The ability to test analytically all of the hypotheses involving the unknown parameters of the model, and the construction of confidence intervals.

7) Consistency of the estimators for large samples.

etc etc

Use of the absolute error would not have provided such a remarkable "toolbox".

Imagine a statistician 30 years ago with no access to high-speed computers. Which type of error would s/he have chosen?

theoGR
  • 689