Section 2.9, "The Moore-Penrose Pseudoinverse", of the textbook Deep Learning by Goodfellow, Bengio, and Courville says the following:
Matrix inversion is not defined for matrices that are not square. Suppose we want to make a left-inverse $\mathbf{B}$ of a matrix $\mathbf{A}$ so that we can solve a linear equation
$$\mathbf{A} \mathbf{x} = \mathbf{y} \tag{2.44}$$
by left-multiplying each side to obtain
$$\mathbf{x} = \mathbf{B} \mathbf{y}. \tag{2.45}$$
Depending on the structure of the problem, it may not be possible to design a unique mapping from $\mathbf{A}$ to $\mathbf{B}$.
If $\mathbf{A}$ is taller than it is wide, then it is possible for this equation to have no solution. If $\mathbf{A}$ is wider than it is tall, then there could be multiple possible solutions. The Moore-Penrose pseudoinverse enables us to make some headway in these cases. The pseudoinverse of $\mathbf{A}$ is defined as a matrix
$$\mathbf{A}^+ = \lim_{\alpha \searrow 0}(\mathbf{A}^T \mathbf{A} + \alpha \mathbf{I})^{-1} \mathbf{A}^T. \tag{2.46}$$
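For concreteness, a quick NumPy sanity check of this limit definition (not from the book; the random matrix and the chosen values of $\alpha$ are just for illustration) might look like this:

```python
# As alpha shrinks, the regularized expression (A^T A + alpha I)^{-1} A^T
# should approach the pseudoinverse, here taken from np.linalg.pinv.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))          # tall matrix: more rows than columns

for alpha in [1e-1, 1e-4, 1e-8]:
    approx = np.linalg.inv(A.T @ A + alpha * np.eye(3)) @ A.T
    print(alpha, np.max(np.abs(approx - np.linalg.pinv(A))))
# The printed error should shrink toward zero as alpha decreases.
```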
Practical algorithms for computing the pseudoinverse are based not on this definition, but rather on the formula
$$\mathbf{A}^+ = \mathbf{V} \mathbf{D}^+ \mathbf{U}^T, \tag{2.47}$$
where $\mathbf{U}$, $\mathbf{D}$ and $\mathbf{V}$ are the singular value decomposition of $\mathbf{A}$, and the pseudoinverse $\mathbf{D}^+$ of a diagonal matrix $\mathbf{D}$ is obtained by taking the reciprocal of its nonzero elements then taking the transpose of the resulting matrix.
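A minimal sketch of formula (2.47) in NumPy (my own illustration, assuming the random matrix has no zero singular values) that compares the result against np.linalg.pinv:

```python
# Build A^+ from the SVD: invert the nonzero singular values and
# transpose the diagonal factor, then form V D^+ U^T.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 5))                  # wide matrix

U, s, Vt = np.linalg.svd(A)                      # A = U @ D @ Vt, D is 3x5
D_plus = np.zeros((A.shape[1], A.shape[0]))      # shape of D^T, i.e. 5x3
D_plus[:len(s), :len(s)] = np.diag(1.0 / s)      # reciprocals of the nonzero singular values
A_plus = Vt.T @ D_plus @ U.T                     # V D^+ U^T

print(np.allclose(A_plus, np.linalg.pinv(A)))    # expected: True
```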
When $\mathbf{A}$ has more columns than rows, then solving a linear equation using the pseudoinverse provides one of the many possible solutions. Specifically, it provides the solution $\mathbf{x} = \mathbf{A}^+ \mathbf{y}$ with minimal Euclidean norm $\vert \vert \mathbf{x} \vert \vert_2$ among all possible solutions.
When $\mathbf{A}$ has more rows than columns, it is possible for there to be no solution. In this case, using the pseudoinverse gives us the $\mathbf{x}$ for which $\mathbf{A} \mathbf{x}$ is as close as possible to $\mathbf{y}$ in terms of Euclidean norm $\vert \vert \mathbf{A} \mathbf{x} − \mathbf{y} \vert \vert_2$.
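For concreteness, here is a small numerical illustration of these two claims (not from the book, just a NumPy sketch with random matrices):

```python
# Wide A: x = A^+ y solves Ax = y and has the smallest norm among solutions.
# Tall A: x = A^+ y minimizes ||Ax - y||_2.
import numpy as np

rng = np.random.default_rng(2)

# Wide case: Ax = y has infinitely many solutions.
A = rng.standard_normal((3, 5))
y = rng.standard_normal(3)
x = np.linalg.pinv(A) @ y
null_vec = np.linalg.svd(A)[2][-1]                 # a vector with A @ null_vec ~ 0
other = x + 0.5 * null_vec                         # another exact solution of Ax = y
print(np.allclose(A @ other, y))                   # True: still a solution
print(np.linalg.norm(x) < np.linalg.norm(other))   # True: pinv solution has smaller norm

# Tall case: Ax = y typically has no exact solution.
B = rng.standard_normal((5, 3))
z = rng.standard_normal(5)
x_ls = np.linalg.pinv(B) @ z
perturbed = x_ls + 0.1 * rng.standard_normal(3)
print(np.linalg.norm(B @ x_ls - z) < np.linalg.norm(B @ perturbed - z))  # True
```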
It's this last part that I'm wondering about:
When $\mathbf{A}$ has more columns than rows, then solving a linear equation using the pseudoinverse provides one of the many possible solutions. Specifically, it provides the solution $\mathbf{x} = \mathbf{A}^+ \mathbf{y}$ with minimal Euclidean norm $\vert \vert \mathbf{x} \vert \vert_2$ among all possible solutions.
When $\mathbf{A}$ has more rows than columns, it is possible for there to be no solution. In this case, using the pseudoinverse gives us the $\mathbf{x}$ for which $\mathbf{A} \mathbf{x}$ is as close as possible to $\mathbf{y}$ in terms of Euclidean norm $\vert \vert \mathbf{A} \mathbf{x} − \mathbf{y} \vert \vert_2$.
What I found confusing here is that the Euclidean norms $\vert \vert \mathbf{x} \vert \vert_2$ and $\vert \vert \mathbf{A} \mathbf{x} − \mathbf{y} \vert \vert_2$ seemingly come out of nowhere. Prior to this section, there is no discussion of the Euclidean norm -- only of the mechanics of the Moore-Penrose Pseudoinverse. And the authors then just assert this part without explanation.
So I am left wondering the following:
Why is it that, when $\mathbf{A}$ has more columns than rows, then using the pseudoinverse gives us the solution $\mathbf{x} = \mathbf{A}^+ \mathbf{y}$ with minimal Euclidean norm $\vert \vert \mathbf{x} \vert \vert_2$ among all possible solutions?
Why is it that, when $\mathbf{A}$ has more rows than columns, then using the pseudoinverse gives us the $\mathbf{x}$ for which $\mathbf{A} \mathbf{x}$ is as close as possible to $\mathbf{y}$ in terms of Euclidean norm $\vert \vert \mathbf{A} \mathbf{x} − \mathbf{y} \vert \vert_2$?
And what are the mechanics involved here?
I would greatly appreciate it if people would please take the time to clarify this.
Consider the least-squares problem
$$\min\limits_{x} \frac{1}{2}\|Ax-y\|^2.$$
Obviously the solution to this problem will be a vector $x$ such that $Ax$ is as close to $y$ as possible.
Let $x^*$ be a solution; the optimality condition is
$$0 = A^* (Ax^* - y),$$
where $A^*$ is the adjoint of $A$ (the transpose $A^T$ for real matrices). Naively solving this for $x^*$, we find
$$x^* = (A^*A)^{-1}A^* y.$$
Compare this with your pseudoinverse. Try to use the same idea to explore the other scenario.
– Jürgen Sukumaran Jan 28 '20 at 12:29
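To connect the comment's derivation with the book's formula, here is a small NumPy check (my own sketch, not part of the original comment) that the normal-equations solution coincides with the pseudoinverse solution for a tall, full-rank matrix:

```python
# For a tall full-rank A, solving the normal equations A^T A x = A^T y
# gives the same x as x = A^+ y, i.e. the least-squares solution.
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 3))                  # more rows than columns
y = rng.standard_normal(6)

x_normal = np.linalg.solve(A.T @ A, A.T @ y)     # x = (A^T A)^{-1} A^T y
x_pinv = np.linalg.pinv(A) @ y                   # x = A^+ y

print(np.allclose(x_normal, x_pinv))             # expected: True
```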