What is the rigorous geometric proof that the projection onto a line gives the minimum least squares error?

Question

I was trying to understand via a geometric argument why the projection (call it $p$) of $b$ onto the line through $a$ minimizes the squared norm $\| e \|^2_2 = \| b - p \|^2_2$ of the error vector $e = b - p$, i.e. I want to solve:

$$ \min_{x \in \mathbb R} { \| b - xa \|^2_{2} } $$

with a geometric argument.

Obviously this is very easy to solve with calculus because it's a 1D calculus problem (take the derivative with respect to $x$ and set it equal to zero to get $x=\frac{a^\top b}{a^\top a}$). However, in MIT's 18.06 course, Gilbert Strang keeps emphasizing that we want to choose $x$ s.t.

$$e \perp a$$

In other words, it seems we should use orthogonality $ \langle e,a \rangle = \langle b-xa,a \rangle =0$ to solve the problem. This seems to be such a concise way of expressing the solution (only alluding to geometry via orthogonality) that it feels there must be some clean (elegant?) way to derive the solution using only the orthogonality fact (or maybe some other geometric trick).

I started by expressing the error vector with dot/inner products to see where we might be able to insert the property that $e^\top a = (b-xa)^\top a =0$. So I proceeded:

$$ \| e\|^2_2 = \|b - xa \|^2_2 = (b-xa)^\top (b-xa) = (xa-b)^\top xa + (xa-b)^\top(-b) $$

I could have required $(xa-b)^\top xa = 0 $ to remove a whole term above but it wasn't obvious to me why that would be optimal.

Somehow as I drew triangles and moved the error vector around it seemed to insist to choose the perpendicular vector which made me feel either (generalized) Pythagoras theorem or the law of cosines would be useful. Since the angle that has to be 90 is the one opposite to the vector $b$ (being projected) I used that vector as the reference for the law of cosines:

$$ \|b\|^2_2 = \|e\|^2_2 + \|xa\|^2_2 - 2 \langle xa, e \rangle$$

$$ \|b\|^2_2 = \|e\|^2_2 + \|xa\|^2_2 - 2 \| xa \|_2 \| e \|_2 \cos \theta(xa,e)$$

re-arranging terms to leave the error term $\|e\|^2_2$ as the subject leads to:

$$ \| e \|^2_2 = \|b\|^2_2 - \|xa\|^2_2 + 2 \langle xa,e \rangle $$

Setting $\langle xa,e \rangle = 0$ definitely helps the above expression decrease (which seems to be the right direction, because $\langle a,e \rangle = 0 \implies \langle xa,e \rangle = 0$, since $e$ will be orthogonal to any multiple of $a$). However, it seems to me that if we choose a different angle such that $\cos \theta(p,e) = -1$, or equivalently $\langle p,e \rangle = -\|p\|_2 \|e\|_2$, that would lead to a greater decrease in the error than choosing a perpendicular one, which seems really unintuitive to me. Is the choice of $-1$ better? If I draw things by hand it seems clear that perpendicularity is optimal, so it seems my maths here is tricking me somehow...

Essentially, right now it seems I have something (maybe) useful, but I am unable to convince myself (or provide a rigorous proof) that the above is optimal. Furthermore, it doesn't seem like a very clean solution, and I am concerned that I am missing a very obvious/clear argument for why the optimal solution is $ a \perp e $ (this also seems like a standard textbook thing, but I can't find a rigorous proof of it in Strang's book nor in any of the other ones I have). What am I missing? There must be a better solution or at least a proof that this is right.

After reflecting on my question for a bit, I think what I am really looking for is a way to conclude that the Pythagorean theorem is sort of the "optimal" way to deduce that the error side of the triangle is as short as it can be. It just feels there must be something that leads us from the law of cosines to the Pythagorean theorem to your proof. Or something like that. It seems maybe my question is more basic and shouldn't require vectors (or knowing about bases etc.), just geometry (maybe how $\cos \theta = 0$ or $u^\top v =0$). Any more advanced knowledge of vectors should be unnecessary.

Note that the calculus solution is very simple:

The squared norm of the error vector $e$ is:

$$\|e\|^2_2 = (b - xa)^\top (b - xa) = b^\top b - 2xa^\top b + x^2a^\top a$$

To minimize $\|e\|^2_2$, we take its derivative with respect to $x$ and set it to zero: $$\frac{d}{dx}\|e\|^2_2 = -2a^\top b + 2xa^\top a = 0 \implies a^\top b = x a^\top a$$

Thus:

$$x = \frac{a^\top b}{a^\top a}$$
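As a quick numerical sanity check of this formula, here is a minimal Python sketch. The vectors `a` and `b` and the helper names `dot`, `project_onto_line` and `squared_error` are made up for illustration and are not part of any library. It only checks one example; it is not a proof.

```python
# Minimal sketch: project b onto the line spanned by a (example vectors are made up).

def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def project_onto_line(a, b):
    """Return (x, p, e) where p = x*a is the projection of b onto span{a}
    and e = b - p is the error vector. Assumes a is not the zero vector."""
    x = dot(a, b) / dot(a, a)              # x = a^T b / a^T a
    p = [x * ai for ai in a]
    e = [bi - pi for bi, pi in zip(b, p)]
    return x, p, e

def squared_error(a, b, t):
    """||b - t*a||^2 for an arbitrary scalar t."""
    r = [bi - t * ai for ai, bi in zip(a, b)]
    return dot(r, r)

a = [1.0, 2.0, 2.0]
b = [3.0, 0.0, 3.0]
x, p, e = project_onto_line(a, b)

print("x =", x)
print("e . a =", dot(e, a))                # should be (numerically) zero
for shift in (-1.0, -0.1, 0.1, 1.0):
    # Moving away from x in either direction should not reduce the error.
    print(shift, squared_error(a, b, x + shift) - squared_error(a, b, x))
```

For these particular vectors, $a^\top b = a^\top a = 9$, so $x = 1$ and $e = (2, -2, 1)$, which is orthogonal to $a$; every printed difference in the loop is positive.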

Note I did see:

Why is minimizing least squares equivalent to finding the projection matrix $\hat{x}=A^Tb(A^TA)^{-1}$?

but it doesn't provide the argument I am looking for...

After reflecting on my question for a bit, I think what I am really looking for is a way to conclude that the Pythagorean theorem is sort of the "optimal" way to conclude that the error side of the triangle is as short as it can be. It just feels there must be something that leads us from the law of cosines to the Pythagorean theorem to your proof. Or something like that. It seems maybe my question is more basic and shouldn't require vectors (or knowing about bases etc.), just geometry (maybe how $\cos \theta = 0$ or $u^\top v =0$). Any more advanced knowledge of vectors should be unnecessary. – Charlie Parker, Jul 17 '17 at 15:52

littleO · Answer 1 · 2017-07-17T22:46:10.823

6

Let $V$ be a subspace of $\mathbb R^n$ and let $b \in \mathbb R^n$ be a point that does not belong to $V$. Decompose $b$ as $b = b_1 + b_2$ where $b_1 \in V$ and $b_2 \in V^\perp$. I claim that $b_1$ is the point in $V$ which is closest to $b$. Proof: If $y$ is any other point of $V$, then $b - y = (b - b_1) + (b_1 - y)$. (Visualize this.) Because the vector $b - b_1 = b_2$ lies in $V^\perp$ and $b_1 - y$ lies in $V$, the two are orthogonal, so the Pythagorean theorem implies that $$\|b - y \|^2 = \| b - b_1 \|^2 + \| b_1 - y \|^2 \geq \|b - b_1 \|^2.$$ Since $y \neq b_1$, the inequality is strict, so $b_1$ is closer to $b$ than $y$ is.
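For the line case in the question, the same argument can be written out explicitly. This is a sketch under the assumptions that $V$ is the line spanned by $a \neq 0$, that $x = \frac{a^\top b}{a^\top a}$ so that $e = b - xa$ satisfies $\langle e, a \rangle = 0$, and that $t$ (a symbol introduced here only for illustration) is any other scalar. Expanding the squared norm with the inner product gives

$$ \|b - ta\|^2 = \|(b - xa) + (x - t)a\|^2 = \|e\|^2 + 2(x - t)\langle e, a \rangle + (x - t)^2 \|a\|^2 = \|e\|^2 + (x - t)^2 \|a\|^2 \geq \|e\|^2. $$

The cross term vanishes exactly because $e \perp a$, and the remaining term $(x - t)^2 \|a\|^2$ is zero only when $t = x$. Note that in this form the argument uses only the bilinearity of the inner product, so the Pythagorean identity falls out of the expansion.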

In a least squares problem, we are given a matrix $A$ and a vector $b$ which does not belong to the range of $A$ (so $Ax = b$ has no solution), and we want to find the closest vector to $b$ in the range of $A$. Let $\hat b$ be the closest point to $b$ in the range of $A$. Note that $\hat b = Ax$ for some $x$. By the above discussion, the vector $b - Ax$ is orthogonal to the range of $A$. In particular, $b - Ax$ is orthogonal to each column of $A$. A concise way to state this fact is $$ A^T(Ax - b) = 0. $$ This system of equations is called the "normal equations" and by solving it we find the vector $x$ such that $\hat b = Ax$ is as close as possible to $b$.
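To make the normal equations concrete, here is a minimal Python sketch for a matrix $A$ with two columns. The data and the helper names `dot` and `solve_normal_equations_2col` are illustrative assumptions, not something from the discussion above. The sketch assumes the columns are linearly independent and solves the $2 \times 2$ system with Cramer's rule, which is fine for a toy example but not how one would do this numerically in general.

```python
# Minimal sketch: solve the normal equations A^T A x = A^T b for a two-column A.
# The data and helper names below are made up for illustration.

def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def solve_normal_equations_2col(col1, col2, b):
    """Return (x1, x2) minimizing ||b - x1*col1 - x2*col2||^2.

    Assumes col1 and col2 are linearly independent, so A^T A is invertible.
    """
    # Entries of the 2x2 Gram matrix A^T A and of the right-hand side A^T b.
    g11, g12, g22 = dot(col1, col1), dot(col1, col2), dot(col2, col2)
    r1, r2 = dot(col1, b), dot(col2, b)
    det = g11 * g22 - g12 * g12
    if det == 0:
        raise ValueError("columns are linearly dependent")
    # Cramer's rule for the 2x2 system.
    x1 = (r1 * g22 - g12 * r2) / det
    x2 = (g11 * r2 - g12 * r1) / det
    return x1, x2

col1 = [1.0, 1.0, 1.0]
col2 = [0.0, 1.0, 2.0]
b = [6.0, 0.0, 0.0]          # not in the span of the columns

x1, x2 = solve_normal_equations_2col(col1, col2, b)
residual = [bi - x1 * c1 - x2 * c2 for bi, c1, c2 in zip(b, col1, col2)]

print("x =", (x1, x2))
print("residual . col1 =", dot(residual, col1))   # should be (numerically) zero
print("residual . col2 =", dot(residual, col2))   # should be (numerically) zero
```

For this data the solution is $x = (5, -3)$ and the residual is $(1, -2, 1)$, which is orthogonal to both columns, as the normal equations require.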

Edit: Here's a picture I drew that gives a simpler explanation.

edited Jul 17 '17 at 22:46

answered Jul 17 '17 at 05:03

littleO


I think hidden in your answer is the assumption my question is trying to address directly, which is why orthogonality is optimal (which is normal, because sometimes things are so intuitively obvious that it's odd to require a proof for them). What I mean is that in your answer you go right off the bat and assume $b_1$, $b_2$ are the "best" or only way to decompose $b$ if they are orthogonal, and then somehow the Pythagorean theorem is the optimal way to deal with things. What if I used the law of cosines, and from there deduced explicitly how orthogonality $a^Tb = 0$ derives the Pythagorean theorem? – Charlie Parker Jul 17 '17 at 15:46
I think what my question is really looking for in the end (which I didn't realize until you answered) is why the Pythagorean theorem is sort of the "optimal" way to deduce that the error side of the triangle is as short as it can be. It just feels there must be something that leads us from the law of cosines to the Pythagorean theorem to your proof. Or something like that. It seems maybe my question is more basic and shouldn't require vectors, just geometry, and it seems your answer requires some knowledge of vectors that I feel is unnecessary. Though I do thank you for your answer. – Charlie Parker Jul 17 '17 at 15:48

@CharlieParker You're right, I was proving something a little more general than what you asked about. I drew a picture that might answer your question more directly. We could avoid mentioning vectors entirely by saying $d(b,y)^2 = d(b,p)^2 + d(p,y)^2$. – littleO Jul 17 '17 at 22:47
Wow, that was super intuitive! I liked it a lot. I appreciate your addition of the picture. But now that I reflect on it, it seems that it's due to the triangle inequality, no? I can't see how orthogonality is explicitly connected without knowing Pythagoras' theorem a priori. – Charlie Parker Jul 19 '17 at 05:43
