
Section 2.9 The Moore-Penrose Pseudoinverse of the textbook Deep Learning by Goodfellow, Bengio, and Courville, says the following:

Matrix inversion is not defined for matrices that are not square. Suppose we want to make a left-inverse $\mathbf{B}$ of a matrix $\mathbf{A}$ so that we can solve a linear equation

$$\mathbf{A} \mathbf{x} = \mathbf{y} \tag{2.44}$$

by left-multiplying each side to obtain

$$\mathbf{x} = \mathbf{B} \mathbf{y}. \tag{2.45}$$

Depending on the structure of the problem, it may not be possible to design a unique mapping from $\mathbf{A}$ to $\mathbf{B}$.

If $\mathbf{A}$ is taller than it is wide, then it is possible for this equation to have no solution. If $\mathbf{A}$ is wider than it is tall, then there could be multiple possible solutions. The Moore-Penrose pseudoinverse enables us to make some headway in these cases. The pseudoinverse of $\mathbf{A}$ is defined as a matrix

$$\mathbf{A}^+ = \lim_{\alpha \searrow 0}(\mathbf{A}^T \mathbf{A} + \alpha \mathbf{I} )^{-1} \mathbf{A}^T. \tag{2.46}$$

Practical algorithms for computing the pseudoinverse are based not on this definition, but rather on the formula

$$\mathbf{A}^+ = \mathbf{V} \mathbf{D}^+ \mathbf{U}^T, \tag{2.47}$$

where $\mathbf{U}$, $\mathbf{D}$ and $\mathbf{V}$ are the singular value decomposition of $\mathbf{A}$, and the pseudoinverse $\mathbf{D}^+$ of a diagonal matrix $\mathbf{D}$ is obtained by taking the reciprocal of its nonzero elements then taking the transpose of the resulting matrix.
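
(As a quick sanity check, mine rather than the book's: the two formulas can be compared numerically. This is only a sketch and assumes NumPy; the matrix size and the small finite `alpha` standing in for the limit are arbitrary choices.)

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))               # a wide matrix, just as an example

# Formula (2.47): A+ = V D+ U^T, where D+ takes reciprocals of the nonzero
# singular values and transposes the result.
U, s, Vt = np.linalg.svd(A)                   # A = U @ D @ Vt, with D 3x5 "diagonal"
D_plus = np.zeros((A.shape[1], A.shape[0]))   # 5x3
D_plus[:len(s), :len(s)] = np.diag(1.0 / s)   # assumes all singular values are nonzero here
A_plus_svd = Vt.T @ D_plus @ U.T

# Formula (2.46), with a small finite alpha instead of the exact limit.
alpha = 1e-8
A_plus_limit = np.linalg.inv(A.T @ A + alpha * np.eye(A.shape[1])) @ A.T

print(np.abs(A_plus_svd - np.linalg.pinv(A)).max())    # machine-precision agreement
print(np.abs(A_plus_limit - np.linalg.pinv(A)).max())  # small, and shrinks as alpha -> 0
```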

When $\mathbf{A}$ has more columns than rows, then solving a linear equation using the pseudoinverse provides one of the many possible solutions. Specifically, it provides the solution $\mathbf{x} = \mathbf{A}^+ \mathbf{y}$ with minimal Euclidean norm $\vert \vert \mathbf{x} \vert \vert_2$ among all possible solutions.

When $\mathbf{A}$ has more rows than columns, it is possible for there to be no solution. In this case, using the pseudoinverse gives us the $\mathbf{x}$ for which $\mathbf{A} \mathbf{x}$ is as close as possible to $\mathbf{y}$ in terms of Euclidean norm $\vert \vert \mathbf{A} \mathbf{x} − \mathbf{y} \vert \vert_2$.
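
(Again my own aside, not the book's: both claims can be checked numerically with NumPy; `np.linalg.lstsq` is only used here as an independent least-squares reference.)

```python
import numpy as np

rng = np.random.default_rng(1)

# Wide case: A x = y has infinitely many solutions (A has full row rank here
# with probability 1), and A+ y should be the shortest one.
A = rng.standard_normal((3, 5))
y = rng.standard_normal(3)
x_pinv = np.linalg.pinv(A) @ y

U, s, Vt = np.linalg.svd(A)
N = Vt[len(s):].T                 # columns span the null space of A
for _ in range(5):
    x_other = x_pinv + N @ rng.standard_normal(N.shape[1])    # another solution
    assert np.allclose(A @ x_other, y)                        # it still solves A x = y
    assert np.linalg.norm(x_other) >= np.linalg.norm(x_pinv)  # but is never shorter

# Tall case: A x = y generally has no solution, and A+ y should be the
# least-squares x, i.e. the one returned by np.linalg.lstsq.
B = rng.standard_normal((6, 3))
z = rng.standard_normal(6)
x_ls, *_ = np.linalg.lstsq(B, z, rcond=None)
print(np.allclose(np.linalg.pinv(B) @ z, x_ls))               # should print True
```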

It's this last part that I'm wondering about:

When $\mathbf{A}$ has more columns than rows, then solving a linear equation using the pseudoinverse provides one of the many possible solutions. Specifically, it provides the solution $\mathbf{x} = \mathbf{A}^+ \mathbf{y}$ with minimal Euclidean norm $\vert \vert \mathbf{x} \vert \vert_2$ among all possible solutions.

When $\mathbf{A}$ has more rows than columns, it is possible for there to be no solution. In this case, using the pseudoinverse gives us the $\mathbf{x}$ for which $\mathbf{A} \mathbf{x}$ is as close as possible to $\mathbf{y}$ in terms of Euclidean norm $\vert \vert \mathbf{A} \mathbf{x} − \mathbf{y} \vert \vert_2$.

What I found confusing here is that the Euclidean norms $\vert \vert \mathbf{x} \vert \vert_2$ and $\vert \vert \mathbf{A} \mathbf{x} − \mathbf{y} \vert \vert_2$ seemingly come out of nowhere. Prior to this section, there is no discussion of the Euclidean norm -- only of the mechanics of the Moore-Penrose Pseudoinverse. And the authors then just assert this part without explanation.

So I am left wondering the following:

  1. Why is it that, when $\mathbf{A}$ has more columns than rows, then using the pseudoinverse gives us the solution $\mathbf{x} = \mathbf{A}^+ \mathbf{y}$ with minimal Euclidean norm $\vert \vert \mathbf{x} \vert \vert_2$ among all possible solutions?

  2. Why is it that, when $\mathbf{A}$ has more rows than columns, then using the pseudoinverse gives us the $\mathbf{x}$ for which $\mathbf{A} \mathbf{x}$ is as close as possible to $\mathbf{y}$ in terms of Euclidean norm $\vert \vert \mathbf{A} \mathbf{x} − \mathbf{y} \vert \vert_2$?

And what are the mechanics involved here?

I would greatly appreciate it if people would please take the time to clarify this.

Ayush
The Pointer
  • Your questions 1. and 2. are the same. – Jürgen Sukumaran Jan 28 '20 at 12:24
  • @TSF Good catch. Thanks for that! – The Pointer Jan 28 '20 at 12:26
  • Consider the optimization problem,

    $\min\limits_{x} \frac{1}{2}\|Ax-y\|^2$

    Obviously the solution to this problem will be a vector $x$ such $Ax$ is as close to $y$ as possible.

    Let $x^*$ be a solution, the optimality conditions are,

    $0 = A^* (Ax^* - y)$

    where $A^*$ is the adjoint of $A$. Naively solving this for $x^*$ we find,

    $x^* = (A^*A)^{-1}(A^* y)$. Compare this with your pseudoinverse. Try to use the same idea to explore the other scenario.

    – Jürgen Sukumaran Jan 28 '20 at 12:29
  • I just want to point out that it's not the only place in linear algebra where the Euclidean norm pops out "magically". For example, if you look at eigenvalues (which is purely a "linear" concept), they have a variational characterization in terms of the Euclidean norm (e.g. the top eigenvalue is the maximum of $\| Ax \|_2$ on the unit Euclidean sphere). So it's not really surprising to see other extremal properties involving the Euclidean norm coming out of linear algebra. – md5 Jan 28 '20 at 12:38
  • @TSF Ahh, what you've said makes sense, but I don't understand why the adjoint is in there? Could you perhaps post a more-comprehensive, full answer? – The Pointer Jan 28 '20 at 12:42
  • @md5 That's interesting. Thanks for the information. This seems odd, since, in linear algebra, we don't necessarily have to work in Euclidean space, right? I wonder if it still "pops out" if we're working in other spaces? What do you think? – The Pointer Jan 28 '20 at 12:44
  • The adjoint is there because I took the gradient of the function $f(x) = \frac{1}{2}\|Ax-y\|^2$. Anyway, there is a problem with naively trying what I had written above, because $A^*A$ might not be invertible, i.e. it might have some eigenvalues which are $0$. To amend this, we can add a little cushion to the eigenvalues in the form of $\alpha I$ and see what happens as the cushion disappears (see the numerical sketch after this comment thread). That's basically the idea behind taking the limit in the definition of the pseudoinverse. This is not very rigorous, obviously, which is why I wanted to keep it to a comment. – Jürgen Sukumaran Jan 28 '20 at 12:56
  • @ThePointer: you're right, it makes sense only if $A$ is hermitian. I think you can see it that way: even if you can define these concepts (pseudo-inverse, eigenvalues, etc.) in more general settings, the fact that you're doing linear algebra on a space that has a lot of structure brings new interpretations of these simple objects – md5 Jan 28 '20 at 13:00
  • @md5 Hmm, so are you saying that the Euclidean norm "pops out" only if $A$ is hermitian? – The Pointer Jan 28 '20 at 13:02
  • @TSF Hmm, ok. Thanks for the information, anyway! Some of this goes above my head, so guess I'll have to hope for a more comprehensive, hand-holding-type answer. If I don't get anything within the next 2 days, I'll add a bounty, since this seems quite interesting! – The Pointer Jan 28 '20 at 13:06
  • The book is misleading if not wrong. The equation $Ax=y$ can have no solution or infinitely many solutions. This has nothing to do with whether $A$ has more columns than rows or more rows than columns. E.g. (1) if $A=0$ and $y\ne0$, $Ax=y$ has no solution, regardless of the size of $A$; (2) if $y=(1,0,\ldots,0)^T$, $A$ is a matrix with at least two columns and the only nonzero entry of $A$ is a $1$ at the $(1,1)$-th position, then $Ax=y$ has infinitely many solutions (of the form $x=(1,\ast,\cdots,\ast)^T$), regardless of the size of $A$. – user1551 Jan 28 '20 at 14:14
  • @user1551 Then what would be a more accurate explanation? – The Pointer Jan 28 '20 at 14:49
  • @ThePointer This is explained in virtually every introductory linear algebra textbook. $y=Ax$ has a solution if and only if $y$ lies inside the column space of $A$. And whenever it has a solution, it has infinitely many solutions if and only if the kernel/null space of $A$ is non-trivial. That $A$ has more columns than rows is a sufficient condition but not a necessary condition for the null space of $A$ to be non-trivial. Even if the null space is non-trivial, that doesn't mean $y=Ax$ has infinitely many solutions, because $y$ may not lie inside the column space of $A$ in the first place. – user1551 Jan 28 '20 at 16:38
  • @user1551 I see. Thanks for the clarification. – The Pointer Jan 28 '20 at 16:47
  • Actually, it is a very good question to ask why the Euclidean norm shows up out of nowhere. The reason is that the Euclidean norm was lurking in the background the moment we used $A^T$ in the definition of the pseudoinverse. Transposition and the Euclidean norm are intimately linked by the fact that $\|x\|_2^2=x^Tx$. If you were working in a vector space in which the natural inner product was not $x^Ty$ but some other inner product, say $\langle x,y\rangle = x^TGy$ for some positive definite matrix $G$, then the pseudoinverse would not be the right tool to use. –  Feb 06 '20 at 11:17
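
(A small numerical sketch of the gradient computation mentioned in the comments above; this is my own check, assuming NumPy, and the sizes are arbitrary. It verifies that the gradient of $f(x) = \frac{1}{2}\|Ax-y\|^2$ is $A^T(Ax-y)$, which is exactly where the transpose/adjoint comes from.)

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))
y = rng.standard_normal(6)
x = rng.standard_normal(4)

def f(v):
    """The least-squares objective f(v) = 1/2 * ||A v - y||_2^2."""
    r = A @ v - y
    return 0.5 * r @ r

grad_analytic = A.T @ (A @ x - y)   # the formula from the comments

# Central finite differences, one coordinate at a time.
eps = 1e-6
grad_numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                         for e in np.eye(len(x))])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))   # should print True
```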

3 Answers


Eqn. (2.46) proposes to look at the minimizer $x_\alpha$ of the functional $$J_\alpha(x) := \|A x - y\|^2 + \alpha \|x\|^2.$$ For any finite $\alpha > 0$, the functional is strictly convex and has a unique minimizer $x_\alpha$; it is the smallest among those $x$ that produce the same residual magnitude $\|A x - y\|$. Minimizing with respect to $x$ gives $x_\alpha = (A^\top A + \alpha I)^{-1} A^\top y$. To see this, write the norm $\|\cdot\|^2$ in terms of the scalar product $\langle \cdot, \cdot \rangle$.

Ad 1. Suppose $A x = y$ has a solution $x^*$. The set of solutions is the convex set $(x^* + \ker A)$. So, there is only one solution that has minimal norm: the orthogonal projection of $0$ onto that set. As $\alpha \searrow 0$, the residual term becomes more important, and $A x = y$ is eventually enforced. Therefore, $x_0 := \lim_{\alpha \searrow 0} x_\alpha$ is the minimal-norm solution of $A x = y$.

Ad 2. If $A x = y$ has no solution, the residual $\|A x - y\|$ still has a minimum, and that is what the limit $\alpha \searrow 0$ selects.
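
A numerical illustration of the argument above (a sketch only, assuming NumPy; `x_alpha` is just a helper name for the minimizer $(A^\top A + \alpha I)^{-1} A^\top y$): as $\alpha \searrow 0$ it approaches $A^+ y$, in both the wide and the tall case.

```python
import numpy as np

rng = np.random.default_rng(3)

def x_alpha(A, y, alpha):
    """Minimizer of J_alpha(x) = ||A x - y||^2 + alpha * ||x||^2."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + alpha * np.eye(n), A.T @ y)

# Wide case: the limit is the minimum-norm solution A+ y.
A = rng.standard_normal((3, 5))
y = rng.standard_normal(3)
for alpha in (1e-2, 1e-4, 1e-6, 1e-8):
    print(alpha, np.linalg.norm(x_alpha(A, y, alpha) - np.linalg.pinv(A) @ y))

# Tall case: the limit is the least-squares solution.
B = rng.standard_normal((6, 3))
z = rng.standard_normal(6)
x_ls, *_ = np.linalg.lstsq(B, z, rcond=None)
for alpha in (1e-2, 1e-4, 1e-6, 1e-8):
    print(alpha, np.linalg.norm(x_alpha(B, z, alpha) - x_ls))

# Both printed distances shrink toward zero as alpha decreases.
```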

user66081
  • Doing the calculation for the convex set in "Ad 1." gives a resulting matrix with values but not a "set of solutions" from my understanding. Does the calculation only produce another one of the potential solutions or a range between what is the Euclidean norm and the point at which conditions are broken for the $Ax=y$ equation to be true? – Hendrix13 Dec 19 '22 at 10:09

Let $x$ be $A^+y$.

  1. Let me begin with the second point. For all $z$, we have: \begin{align} \lVert Az-y \rVert_2^2 &= \lVert Ax-y \rVert_2^2 + \lVert A(z-x) \rVert_2^2 + 2 (z-x)^TA^T(Ax-y)\\ & \geq \lVert Ax-y \rVert_2^2 + 2 (z-x)^TA^T(Ax-y) \end{align} Moreover, because $(AA^+)^T = AA^+$ and $AA^+A = A$, $$ A^T(Ax-y) = A^TAA^+y - A^Ty = ((AA^+)A)^Ty - A^Ty = A^Ty - A^Ty = 0$$ Thus, for all $z$, $\lVert Az-y \rVert_2^2 \geq \lVert Ax-y \rVert_2^2$, that is to say, $Ax = AA^+y$ is as close as possible to $y$ in terms of the Euclidean norm $\lVert Ax-y\rVert_2$.

  2. Now, let us suppose that there exists $z$ such that $Az=y$. According to the first point, $\lVert Ax-y\rVert_2 \leq \lVert Az-y\rVert_2 = 0$, so $x$ is a solution. Moreover, for every solution $z$, $$ \lVert z \rVert_2^2=\lVert x \rVert_2^2 + \lVert z-x \rVert_2^2 + 2x^T(z-x)$$ Yet, because $A^+Ax=x$, $(A^+A)^T=A^+A$ and $A^+Az=A^+y=x$, $$x^T(x-z) = (A^+Ax)^T(x-z) = x^TA^+A(x-z) = x^T(A^+Ax-A^+Az) = x^T(x-x)=0$$ Thus, $\lVert z \rVert_2^2 \geq \lVert x \rVert_2^2$, that is to say, $x$ is the solution with minimal Euclidean norm.
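
The identities this proof relies on can also be checked numerically (my own sketch, assuming NumPy; the matrix sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)

# Point 1 (tall case): with x = A+ y, the residual is orthogonal to the columns
# of A, i.e. A^T (A x - y) = 0, using (A A+)^T = A A+.
A = rng.standard_normal((6, 3))
y = rng.standard_normal(6)
x = np.linalg.pinv(A) @ y
P = A @ np.linalg.pinv(A)
print(np.allclose(P.T, P))                 # (A A+)^T = A A+
print(np.allclose(A.T @ (A @ x - y), 0))   # A^T (A x - y) = 0

# Point 2 (wide case): for any solution z of A x = y, x = A+ y is orthogonal
# to z - x, which is why it has the smallest norm.
B = rng.standard_normal((3, 5))
z0 = rng.standard_normal(5)
yb = B @ z0                                # consistent right-hand side; z0 is one solution
xb = np.linalg.pinv(B) @ yb
print(np.allclose(xb @ (z0 - xb), 0))      # orthogonality x^T (z - x) = 0
print(np.linalg.norm(xb) <= np.linalg.norm(z0))   # hence ||x|| <= ||z||
```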

Hyperplane
  • Can you please explain how you concluded that $$\begin{align} \lVert Az-b \rVert_2^2 &= \lVert Ax-b \rVert_2^2 + \lVert A(z-x) \rVert_2^2 + 2 (z-x)^TA^T(Az-b) \end{align}$$? – The Pointer Feb 10 '20 at 04:31
  • My bad, last term should be $2 (z-x)^TA^T(Ax-b)$. I just edited the answer. – Etienne dM Feb 10 '20 at 13:28
  • I didn't even notice that. I'm not understanding how you got the entire result $$\begin{align} \lVert Az-b \rVert_2^2 &= \lVert Ax-b \rVert_2^2 + \lVert A(z-x) \rVert_2^2 + 2 (z-x)^TA^T(Ax-b) \end{align}$$ Can you please explain all of this? – The Pointer Mar 11 '20 at 16:12
  • @ThePointer It's because $$ \lVert Az-b\rVert_2^2 = \lVert A(z-x) +Ax-b\rVert_2^2 = \lVert A(z-x)\rVert_2^2 + \lVert Ax-b\rVert_2^2 + 2(z-x)^TA^T(Ax-b) $$ Using the fact that $$ \lVert x+y \rVert_2^2 = \lVert x \rVert_2^2 + \lVert y \rVert_2^2+ 2x^T y$$. – Etienne dM Mar 12 '20 at 17:11
  • Thanks for taking the time to clarify. Out of curiosity, what is the property $\vert \vert x + y \vert \vert_2^2 = \vert\vert x \vert\vert_2^2 + \vert\vert y \vert\vert_2^2 + 2x^T y$ called? Where does it come from? – The Pointer Mar 12 '20 at 21:59
  • It comes from the fact that, by definition, $\lVert x \rVert_2^2 = x^Tx$, and from the bilinearity of the inner product: $$(x+y)^T(x+y) = x^Tx+y^Ty + 2x^Ty$$

    When $x^Ty = 0$, this equality is a generalization of the Pythagorean theorem for inner product spaces.

    – Etienne dM Mar 14 '20 at 06:22
  • I don't understand how any of what you wrote in 2) implies that "$A^+ b$ is as close as possible to $y$ in terms of the Euclidean norm $\lVert Ax-b\rVert_2$". – The Pointer Aug 03 '20 at 11:05
  • See our discussion here https://math.stackexchange.com/q/3778673/356308 . Just in trying to understand the first part of this proof, there seem to be numerous errors that prevent it from making sense. I'm guessing that this answer will require significant rewriting in order to make the proofs valid (hence the edits). Even so, it seems to be the most sensible answer posted so far, which is why it's probably worth salvaging with edits. – The Pointer Aug 03 '20 at 17:40

The answer to your first question follows easily by writing down the SVD of $A$ and of $A^+$. When $A_{m\times n}$ has more columns than rows ($n>m$), it can be written as $$A=UDV^T$$ where $U_{m\times m}$ and $V_{n\times n}$ are orthogonal and $D_{m\times n}$ is diagonal. The Moore-Penrose pseudoinverse can be defined as $$A^+=VD^+U^T$$ where $D^+_{n\times m}$ is such that $$D^+D=\begin{bmatrix}I_{k\times k}&0_{k\times (n-k)}\\0_{(n-k)\times k}&0_{(n-k)\times (n-k)}\end{bmatrix}$$ where $k\le m$ is the number of nonzero singular values of $A$. (For example, if $D=\begin{bmatrix}2&0&0&0\\0&3&0&0\\0&0&0&0\end{bmatrix}$, then $k=2$ and $D^+=\begin{bmatrix}{1\over2}&0&0\\0&{1\over3}&0\\0&0&0\\0&0&0\end{bmatrix}$, which leads to $D^+D=\begin{bmatrix}1&0&0&0\\0&1&0&0\\0&0&0&0\\0&0&0&0\end{bmatrix}$.) The system then reduces to $$D^+DV^Tx=D^+U^Ty.$$ Since $||V^Tx||_2=||x||_2$ (an orthogonal change of coordinates is an isometry), defining $w\triangleq V^Tx$ lets us write $$D^+Dw=D^+U^Ty,$$ which constrains only the first $k$ entries of $w$ (only the first $k$ rows of $D^+D$ are nonzero); the remaining $n-k$ entries of $w$ are free. Choosing them equal to zero therefore gives $w$ (and correspondingly $x$) the least possible $2$-norm, since $||x||_2^2=\sum_{i=1}^{n}|x_i|^2$.
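
A short numerical version of this argument (a sketch assuming NumPy; the sizes and the tolerance on the singular values are arbitrary choices of mine): solve for the first $k$ entries of $w=V^Tx$, set the free entries to zero, and recover exactly $A^+y$.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 3, 5                       # wide matrix, n > m
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)

U, s, Vt = np.linalg.svd(A)       # A = U D V^T with D an m-by-n "diagonal" matrix
k = int(np.sum(s > 1e-12))        # number of nonzero singular values

# In the rotated coordinates w = V^T x, the system only constrains the first k
# entries: w_i = (U^T y)_i / s_i.  The remaining n - k entries are free.
w = np.zeros(n)
w[:k] = (U.T @ y)[:k] / s[:k]

x = Vt.T @ w                      # free entries set to zero -> minimum-norm x
print(np.allclose(x, np.linalg.pinv(A) @ y))   # should print True
```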

To Be Updated...

Mostafa Ayaz
  • Can you please explain a bit more about what the matrix $D^+_{n\times m}$ is? Also, can you please elaborate on how you got that $D^+DV^Tx=D^+U^Ty$? – The Pointer Feb 10 '20 at 04:43
  • Sure. Given an $m\times n$ diagonal matrix $D$ (i.e. one for which $a_{ij}=0$ when $i\ne j$), $D^+_{n\times m}$ is another diagonal matrix whose diagonal entries are the reciprocals of the corresponding entries of $D$, except for those that are zero. More formally, $$D=\begin{bmatrix}A_{k\times k}&0_{k\times (n-k)}\\0_{(m-k)\times k}&0_{(m-k)\times (n-k)}\end{bmatrix}\implies D^+=\begin{bmatrix}A^{-1}_{k\times k}&0_{k\times (m-k)}\\0_{(n-k)\times k}&0_{(n-k)\times (m-k)}\end{bmatrix}$$ (see also the short sketch after this comment thread). – Mostafa Ayaz Feb 10 '20 at 08:30
  • For your second question, starting from $$Ax=y$$ we obtain the following: $$UDV^Tx=y\implies DV^Tx=U^Ty\implies D^+DV^Tx=D^+U^Ty$$ since $U^TU=UU^T=I$. – Mostafa Ayaz Feb 10 '20 at 08:30
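
To make the construction of $D^+$ above concrete, here is a short sketch (assuming NumPy; `pseudo_diag` is a made-up helper name) that reproduces the $3\times 4$ example from the answer:

```python
import numpy as np

# The 3x4 "diagonal" D from the example in the answer above.
D = np.array([[2.0, 0, 0, 0],
              [0, 3.0, 0, 0],
              [0, 0, 0, 0]])

def pseudo_diag(D, tol=1e-12):
    """Pseudoinverse of a rectangular diagonal matrix: take reciprocals of the
    nonzero diagonal entries, leave the zero ones alone, then transpose."""
    D_plus = np.zeros_like(D)
    d = np.diagonal(D).copy()
    nz = np.abs(d) > tol
    d[nz] = 1.0 / d[nz]
    np.fill_diagonal(D_plus, d)
    return D_plus.T

D_plus = pseudo_diag(D)
print(D_plus)                                   # 4x3 with diagonal entries 1/2, 1/3, 0
print(np.allclose(D_plus, np.linalg.pinv(D)))   # should print True
print(D_plus @ D)                               # 4x4: identity block of size k = 2, zeros elsewhere
```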