I understand the derivation of the normal-equation solution $\hat{x}=(A^TA)^{-1}A^Tb$, but I'm having trouble explicitly connecting it to least squares regression.
So suppose we have the system of equations $Ax=b$ with $A=\begin{bmatrix}1 & 1\\1 & 2\\1 &3\end{bmatrix}, x=\begin{bmatrix}C\\D\end{bmatrix}, b=\begin{bmatrix}1\\2\\2\end{bmatrix}$
Using $\hat{x}=(A^TA)^{-1}A^Tb$, we get $C=\frac{2}{3}, D=\frac{1}{2}$. But this is also equivalent to minimizing the sum of squared errors: $e^2_1+e^2_2+e^2_3 = (C+D-1)^2+(C+2D-2)^2+(C+3D-2)^2$.
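(Writing out the normal equations for this example as a sanity check of those numbers, my own arithmetic so worth double-checking:
$$A^TA=\begin{bmatrix}3 & 6\\6 & 14\end{bmatrix},\qquad A^Tb=\begin{bmatrix}5\\11\end{bmatrix},$$
so $\hat{x}=(A^TA)^{-1}A^Tb$ amounts to solving $3C+6D=5$ and $6C+14D=11$, which gives $D=\tfrac{1}{2}$ and $C=\tfrac{2}{3}$.)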
I know the linear algebra approach finds the point in the column space of $A$ closest to $b$, i.e. it minimizes the distance between $b$ and that plane, but I'm having trouble understanding why this minimizes the *squared* distance. My intuition tells me it should minimize the absolute distance, but I know that's wrong because absolute-distance solutions can be non-unique.
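(For instance, if I try to fit a single constant $c$ to the two data points $0$ and $1$, the sum of absolute deviations $|c|+|c-1|$ equals $1$ for every $c\in[0,1]$, so any such $c$ is a minimizer, whereas the sum of squares $c^2+(c-1)^2$ has the unique minimizer $c=\tfrac{1}{2}$ — if I have that right.)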
Why is this so? Any help would be greatly appreciated. Thanks!