Why is solving $A^T Ax = A^T b$ equivalent to saying that $Ax$ is the point in the range of $A$ closest to $b$?
Can somebody please explain in detail? Thanks.
Short and very formal answer: if you want to minimize $\|Ax-b\|_2^2$, you should search for $x$ such that $$\nabla\|Ax-b\|^2_2 = 0,$$ and since $\|Ax-b\|_2^2$ is convex in $x$, any critical point is a global minimizer. But $$0 = \nabla\|Ax-b\|_2^2 = 2A^T(Ax-b) \,\,\,\, \Longleftrightarrow \,\,\,\, A^TAx = A^Tb.$$
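To see this gradient formula in action, here is a small numerical sketch (with an arbitrary random $A$, $b$, $x$ of my own choosing) that checks $\nabla\|Ax-b\|_2^2 = 2A^T(Ax-b)$ against central finite differences:

```python
import numpy as np

# Arbitrary small example (random A, b, x), just to check the gradient formula.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)
x = rng.standard_normal(3)

f = lambda x: np.sum((A @ x - b) ** 2)  # f(x) = ||Ax - b||_2^2

analytic = 2 * A.T @ (A @ x - b)        # claimed gradient
eps = 1e-6
numeric = np.array([                    # central finite differences
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(analytic, numeric, atol=1e-5))  # True
```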
HINT
To begin with, notice that
\begin{align*} A^{T}Ax = A^{T}b \Longleftrightarrow A^{T}(b - Ax) = 0 \Longleftrightarrow (b - Ax)\perp\mathcal{C}(A) \end{align*}
Therefore $Ax$ is the orthogonal projection of $b$ onto $\mathcal{C}(A)$. Can you take it from here?
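The orthogonality $(b - Ax)\perp\mathcal{C}(A)$ is easy to check numerically; here is a quick sketch with a random full-column-rank $A$ (my own toy data, not from the question):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 3))  # tall matrix, full column rank a.s.
b = rng.standard_normal(6)

# Solve the normal equations A^T A x = A^T b.
x = np.linalg.solve(A.T @ A, A.T @ b)

# The residual b - Ax is orthogonal to every column of A, i.e. to C(A).
residual = b - A @ x
print(np.allclose(A.T @ residual, 0))  # True
```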
In the Euclidean norm the distance is $$ d(Ax, b) = \Vert Ax-b\Vert_2 $$ We are looking for an extremum of $d$ with respect to the choice of $x$, so we need the partial derivatives with respect to the coordinates $x_k$ to vanish (assuming $d \neq 0$, so that the square root is differentiable): $$ \begin{align} 0 &= \partial_k d(Ax, b) \\ &= \frac{\partial}{\partial x_k} \left( \sum_i \left( \sum_j a_{ij}x_j-b_i \right)^2 \right)^{1/2} \\ &= \frac{1}{2 d(Ax,b)} \sum_i 2 \left( \sum_j a_{ij}x_j-b_i \right) \sum_j a_{ij} \delta_{jk} \\ &= \frac{1}{d(Ax,b)} \sum_i a_{ik} \left( \sum_j a_{ij}x_j-b_i \right) \\ &= \frac{1}{d(Ax,b)}\left( A^T (A x - b) \right)_k \end{align} $$ Thus we need a solution $x$ of $A^TAx = A^Tb$.
If $Ax$ is the point in $R(A)$ which is as close as possible to $b$, then the residual $r = b - Ax$ is orthogonal to $R(A)$. But the "four subspaces" theorem, which is emphasized in Gilbert Strang's books, tells us that $R(A)^\perp = N(A^T)$. Thus, $$ A^T (b - Ax) = 0 \implies A^T Ax = A^T b.$$
Alternatively, we can minimize $f(x)=(1/2) \| Ax - b \|^2$ by setting the gradient equal to $0$. By the multivariable chain rule we have $$ f'(x) = (Ax - b)^T A. $$ It follows that $$ \nabla f(x) = f'(x)^T = A^T (Ax - b). $$ So, setting the gradient equal to $0$, we obtain $$ A^T(Ax - b) = 0 \implies A^T Ax = A^T b. $$
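As a final sanity check (a sketch with random data of my own, not part of the question), solving the normal equations should agree with a generic least-squares solver such as NumPy's `np.linalg.lstsq`, which minimizes $\|Ax - b\|_2$ directly:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 4))  # tall matrix, full column rank a.s.
b = rng.standard_normal(8)

x_normal = np.linalg.solve(A.T @ A, A.T @ b)     # normal equations
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)  # direct least-squares minimizer
print(np.allclose(x_normal, x_lstsq))  # True
```

In practice one prefers `lstsq` (or a QR factorization) over forming $A^TA$ explicitly, since squaring the matrix squares its condition number.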