Why is solving $A^T Ax = A^T b$ equivalent to saying that $Ax$ is the point in the range of $A$ closest to $b$?
Can somebody please explain in detail? Thanks.
Short and very formal answer: if you want to minimize $\|Ax-b\|_2^2$, you should search for $x$ such that $$\nabla\|Ax-b\|^2_2 = 0,$$ and since $\|Ax-b\|_2^2$ is convex in $x$, any critical point is a global minimizer. But $$0 = \nabla\|Ax-b\|_2^2 = 2A^T(Ax-b) \,\,\,\, \Longleftrightarrow \,\,\,\, A^TAx = A^Tb.$$
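To see this gradient formula in action, here is a small numerical sketch (with an arbitrary random $A$, $b$, $x$ of my own choosing) that checks $\nabla\|Ax-b\|_2^2 = 2A^T(Ax-b)$ against central finite differences:

```python
import numpy as np

# Arbitrary small example (random A, b, x), just to check the gradient formula.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)
x = rng.standard_normal(3)

f = lambda x: np.sum((A @ x - b) ** 2)  # f(x) = ||Ax - b||_2^2

analytic = 2 * A.T @ (A @ x - b)        # claimed gradient
eps = 1e-6
numeric = np.array([                    # central finite differences
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(analytic, numeric, atol=1e-5))  # True
```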
HINT
To begin with, notice that
\begin{align*} A^{T}Ax = A^{T}b \Longleftrightarrow A^{T}(b - Ax) = 0 \Longleftrightarrow (b - Ax)\perp\mathcal{C}(A) \end{align*}
Therefore $Ax$ is the orthogonal projection of $b$ onto $\mathcal{C}(A)$. Can you take it from here?
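The orthogonality $(b - Ax)\perp\mathcal{C}(A)$ is easy to check numerically; here is a quick sketch with a random full-column-rank $A$ (my own toy data, not from the question):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 3))  # tall matrix, full column rank a.s.
b = rng.standard_normal(6)

# Solve the normal equations A^T A x = A^T b.
x = np.linalg.solve(A.T @ A, A.T @ b)

# The residual b - Ax is orthogonal to every column of A, i.e. to C(A).
residual = b - A @ x
print(np.allclose(A.T @ residual, 0))  # True
```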
In the Euclidean norm the distance is $$ d(Ax, b) = \Vert Ax-b\Vert_2 $$ We are looking for an extremum of $d$ with respect to the choice of $x$, so we need the partial derivatives with respect to the coordinates $x_k$ to vanish (assuming $d \neq 0$, so that the square root is differentiable): $$ \begin{align} 0 &= \partial_k d(Ax, b) \\ &= \frac{\partial}{\partial x_k} \left( \sum_i \left( \sum_j a_{ij}x_j-b_i \right)^2 \right)^{1/2} \\ &= \frac{1}{2 d(Ax,b)} \sum_i 2 \left( \sum_j a_{ij}x_j-b_i \right) \sum_j a_{ij} \delta_{jk} \\ &= \frac{1}{d(Ax,b)} \sum_i a_{ik} \left( \sum_j a_{ij}x_j-b_i \right) \\ &= \frac{1}{d(Ax,b)}\left( A^T (A x - b) \right)_k \end{align} $$ Thus we need a solution $x$ of $A^TAx = A^Tb$.
If $Ax$ is the point in $R(A)$ which is as close as possible to $b$, then the residual $r = b - Ax$ is orthogonal to $R(A)$. But the "four subspaces" theorem, which is emphasized in Gilbert Strang's books, tells us that $R(A)^\perp = N(A^T)$. Thus, $$ A^T (b - Ax) = 0 \implies A^T Ax = A^T b.$$
Alternatively, we can minimize $f(x)=(1/2) \| Ax - b \|^2$ by setting the gradient equal to $0$. By the multivariable chain rule we have $$ f'(x) = (Ax - b)^T A. $$ It follows that $$ \nabla f(x) = f'(x)^T = A^T (Ax - b). $$ So, setting the gradient equal to $0$, we obtain $$ A^T(Ax - b) = 0 \implies A^T Ax = A^T b. $$
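As a final sanity check (a sketch with random data of my own, not part of the question), solving the normal equations should agree with a generic least-squares solver such as NumPy's `np.linalg.lstsq`, which minimizes $\|Ax - b\|_2$ directly:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 4))  # tall matrix, full column rank a.s.
b = rng.standard_normal(8)

x_normal = np.linalg.solve(A.T @ A, A.T @ b)     # normal equations
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)  # direct least-squares minimizer
print(np.allclose(x_normal, x_lstsq))  # True
```

In practice one prefers `lstsq` (or a QR factorization) over forming $A^TA$ explicitly, since squaring the matrix squares its condition number.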