I understand the derivation of the normal-equation solution $\hat{x}=(A^TA)^{-1}A^Tb$, but I'm having trouble explicitly connecting it to least squares regression.
So suppose we have the system of equations $Ax=b$ with $A=\begin{bmatrix}1 & 1\\1 & 2\\1 &3\end{bmatrix}, x=\begin{bmatrix}C\\D\end{bmatrix}, b=\begin{bmatrix}1\\2\\2\end{bmatrix}$
Using $\hat{x}=(A^TA)^{-1}A^Tb$, we get $C=\frac{2}{3}, D=\frac{1}{2}$. But this is also equivalent to minimizing the sum of squared errors: $e^2_1+e^2_2+e^2_3 = (C+D-1)^2+(C+2D-2)^2+(C+3D-2)^2$.
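(Writing out the normal equations for this example as a sanity check of those numbers, my own arithmetic so worth double-checking:
$$A^TA=\begin{bmatrix}3 & 6\\6 & 14\end{bmatrix},\qquad A^Tb=\begin{bmatrix}5\\11\end{bmatrix},$$
so $\hat{x}=(A^TA)^{-1}A^Tb$ amounts to solving $3C+6D=5$ and $6C+14D=11$, which gives $D=\tfrac{1}{2}$ and $C=\tfrac{2}{3}$.)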
I know the linear algebra approach finds the point in the column space of $A$ closest to $b$, i.e. it minimizes the distance between $b$ and that plane, but I'm having trouble understanding why this minimizes the *squared* distance. My intuition tells me it should minimize the absolute distance, but I know that's wrong because absolute-distance solutions can be non-unique.
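(For instance, if I try to fit a single constant $c$ to the two data points $0$ and $1$, the sum of absolute deviations $|c|+|c-1|$ equals $1$ for every $c\in[0,1]$, so any such $c$ is a minimizer, whereas the sum of squares $c^2+(c-1)^2$ has the unique minimizer $c=\tfrac{1}{2}$ — if I have that right.)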
Why is this so? Any help would be greatly appreciated. Thanks!