
The statement is "This least squares problem can be solved efficiently when $A$ is of full rank. We can prove that the best vector has to satisfy the equation $(A^{T}A)x = A^{T}b$."

Can anybody explain this?

2 Answers


Calculus background: Suppose that $\mathcal{M}$ is a linear subspace of $\mathbb{R}^{n}$, and suppose that $b$ is a vector in $\mathbb{R}^{n}$. For example, maybe $\mathcal{M}$ is a line or plane through the origin in $\mathbb{R}^{3}$, and $b$ is a point in $\mathbb{R}^{3}$. The closest-point projection and the orthogonal projection of $b$ onto $\mathcal{M}$ are the same (unique) point $m$, so you can trade least-squares distance problems for algebraic orthogonality conditions. This is true in every finite-dimensional space. That is, the distance function $$ d(m')=|b-m'| $$ achieves its minimum at $m \in \mathcal{M}$ iff the residual $b-m$ is orthogonal to everything in $\mathcal{M}$; i.e., $$ (b-m)\cdot m'=0\;\;\;\text{for all } m'\in \mathcal{M}. $$ This is taught in Calculus: given $b \in \mathbb{R}^{n}$, one finds the closest point to $b$ on a line or a plane through the origin by orthogonally projecting $b$ onto that line or plane. That is the finite-dimensional version of the least-squares principle.
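
To make this concrete, here is a small numerical sketch in Python/NumPy (the line, the point $b$, and the trial values of $t$ are all made up for illustration): it projects $b$ orthogonally onto a line through the origin and checks both the orthogonality condition and the closest-point property.

```python
import numpy as np

# A line through the origin in R^3, spanned by u (arbitrary choices for illustration)
u = np.array([1.0, 2.0, 2.0])
b = np.array([3.0, -1.0, 4.0])

# Orthogonal projection of b onto the line: m = (b.u / u.u) u
m = (b @ u) / (u @ u) * u

# The residual b - m is orthogonal to the line ...
print((b - m) @ u)                      # ~ 0

# ... and m is the closest point on the line: |b - m| <= |b - t*u| for every t tried
for t in np.linspace(-3, 3, 13):
    assert np.linalg.norm(b - m) <= np.linalg.norm(b - t * u) + 1e-12
```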

Your Least Squares Problem: Given a fixed $b\in\mathbb{R}^{n}$, you want to minimize $d(x)=|Ax-b|$. If $A$ is not invertible, that problem may not have a unique solution. However, if you let $\mathcal{M}$ be the range of $A$ (which is a subspace), then there is a unique $m \in \mathcal{M}$ such that $$ |b-m| \le |b-Ax|\;\;\;\text{for all } x \in \mathbb{R}^{n}. $$ If $A$ has full rank (so that $A$ is one-to-one), there is a unique $y$ such that $m=Ay$, and that unique $y$ is determined by the orthogonality conditions $$ (b-Ay)\cdot Ax = 0 \;\;\;\text{for all } x \in \mathbb{R}^{n}. $$ Since $(b-Ay)\cdot Ax = (A^{T}b-A^{T}Ay)\cdot x$, this is equivalent to $$ (A^{T}b-A^{T}Ay)\cdot x = 0\;\;\;\text{for all } x \in \mathbb{R}^{n}. $$ Because this must hold for all such $x$, it is equivalent to $$ A^{T}b-A^{T}Ay=0. $$
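
As a sanity check (a sketch only, with a randomly generated tall matrix $A$ and vector $b$), one can solve $A^{T}Ay=A^{T}b$ numerically, confirm that the residual $b-Ay$ is orthogonal to the range of $A$, and compare against a library least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))         # tall matrix; full column rank with probability 1
b = rng.standard_normal(6)

# Solve the normal equations A^T A y = A^T b
y = np.linalg.solve(A.T @ A, A.T @ b)

# The residual is orthogonal to every column of A, hence to all of range(A)
print(A.T @ (b - A @ y))                # ~ [0, 0, 0]

# Same answer as a library least-squares solver
y_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(y, y_lstsq))          # True
```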

Disintegrating By Parts

From Kalman Filtering: Theory and Practice by Grewal and Andrews:

We are given a rectangular matrix $A$ which is $m$ by $n,$ with $$ m \geq n. $$

Then we wish to minimize $$ \parallel Ax-b \parallel^2 = (Ax-b)^T (Ax-b) = x^T A^T A x - 2 x^T A^T b + b^T b. $$

You need to know that the transpose of a 1 by 1 matrix is itself, so $$ x^T A^T b = (x^T A^T b)^T = b^T A x, $$ which is why the two cross terms combine into $-2x^T A^T b$ above.
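
A quick numerical check of the expansion and of this transpose fact (a sketch with randomly generated $A$, $x$, $b$ of made-up sizes):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))
x = rng.standard_normal(3)
b = rng.standard_normal(5)

# || Ax - b ||^2 equals the expanded quadratic form
lhs = np.linalg.norm(A @ x - b) ** 2
rhs = x @ A.T @ A @ x - 2 * x @ A.T @ b + b @ b
print(np.isclose(lhs, rhs))             # True

# The 1-by-1 transpose fact behind combining the cross terms
print(np.isclose(x @ A.T @ b, b @ A @ x))   # True
```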

Viewing this as a multivariable function of the entries of $x,$ its gradient, written as a column vector, is $$ 2 A^T A x - 2 A^T b = 2 (A^T A x - A^T b). $$
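
One way to convince yourself of this gradient formula is a finite-difference check (again only a sketch; the matrix, vectors, and step size $h$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)
x = rng.standard_normal(3)

f = lambda z: np.linalg.norm(A @ z - b) ** 2
grad_formula = 2 * (A.T @ A @ x - A.T @ b)

# Central finite differences, one coordinate of x at a time
h = 1e-6
grad_fd = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(3)])

print(np.allclose(grad_formula, grad_fd, atol=1e-4))   # True
```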

Note that $A^T A$ is $n$ by $n,$ that being the smaller of the two dimensions $m,n.$ If $A$ is full rank, meaning rank $n,$ then $A^T A$ is not just positive semidefinite, it is actually positive definite, so the quadratic above has a unique minimizer, found by setting the gradient to zero. Either way, $$ A^T Ax = A^T b $$ is often called the "normal equation" (or "normal equations") of the least squares problem.
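
To illustrate the definiteness claim (a sketch with a randomly generated $A$): the eigenvalues of $A^T A$ are all strictly positive when $A$ has full column rank, and one of them collapses to zero once a column is made linearly dependent.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 3))         # full column rank with probability 1
P = A.T @ A                             # symmetric, n by n

print(np.linalg.eigvalsh(P))            # all strictly positive => positive definite

# A rank-deficient A gives only a positive *semi*definite P
A_def = A.copy()
A_def[:, 2] = A_def[:, 0] + A_def[:, 1] # force a linearly dependent column
print(np.linalg.eigvalsh(A_def.T @ A_def))   # smallest eigenvalue ~ 0
```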

Writing $$ P = A^T A, \; \; \; v = A^T b, $$ our system is now $$ P x = v. $$ I suppose the claim about solving this efficiently comes down to the fact that $P$ is symmetric positive definite, so a Cholesky decomposition of $P$ can be found quickly, and from it the solution $$ x = P^{-1} v $$ is cheap to compute. In actual practice, I'm not sure they actually form $P^{-1};$ there are other ways (for instance, back-substitution with the Cholesky factors, or a QR factorization of $A$).
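
Here is how the Cholesky route might look in code (a sketch using SciPy's cho_factor/cho_solve on made-up data; production code often prefers a QR factorization of $A$ instead of forming $A^T A$, for better numerical behavior):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(4)
A = rng.standard_normal((8, 4))
b = rng.standard_normal(8)

P = A.T @ A                             # symmetric positive definite (A has full rank)
v = A.T @ b

# Solve P x = v via Cholesky, without ever forming P^{-1} explicitly
c, low = cho_factor(P)
x_chol = cho_solve((c, low), v)

# Agrees with a library least-squares solver
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_chol, x_lstsq))     # True
```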

Will Jagy