In least-squares, when solving the normal equation, we calculate the inverse of ${\bf X}^\top {\bf X}$. What if this matrix is not invertible?
-
Can you give an example? – 5xum May 13 '15 at 11:52
-
@5xum - An example? I'm not that deep into linear algebra. What I'm wondering is: what if there is a matrix $X$ for which the inverse of $X^\top X$ cannot be computed, i.e. $X^\top X$ is not invertible? That would mean we cannot apply the least-squares normal-equation formula to solve it, right? – Sarath R Nair May 13 '15 at 12:08
-
If $X^TX$ is not invertible, the least squares solution is not unique; there are infinitely many solutions minimizing the (Euclidean) norm of the residual. However, among them there is always a unique solution having the minimal norm, which is provided by the Moore-Penrose pseudoinverse. – Algebraic Pavel May 13 '15 at 12:16
-
@AlgebraicPavel - Exactly. This is what I was looking for. So the positive definiteness property will also come into consideration here, right? – Sarath R Nair May 13 '15 at 12:24
-
I'm not really sure what you mean. – Algebraic Pavel May 13 '15 at 12:41
2 Answers
Start with the linear system $$ \mathbf{A}x = b $$ with $$ \mathbf{A} \in \mathbb{C}^{m\times n}_{\rho}, \quad x \in \mathbb{C}^{n}, \quad b \in \mathbb{C}^{m}, $$ where $\rho$ denotes the rank of $\mathbf{A}$. If the data vector $b$ is in the image of $\mathbf{A}$, then there is a solution vector $x$ such that $$ \mathbf{A}x - b = 0. $$ If the data vector $b$ has a component outside the image of $\mathbf{A}$ (a left nullspace component), we cannot get an exact answer, and instead ask for the best answer.
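For a concrete illustration, take $$ \mathbf{A} = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}, \qquad b = \begin{bmatrix} 1 \\ 1 \end{bmatrix}. $$ The image of $\mathbf{A}$ is spanned by $(1,0)^{\top}$, so the residual $\mathbf{A}x - b$ always retains the component $(0,1)^{\top}$ and can never vanish; the best we can do is make its norm as small as possible.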
Choosing the $2$-norm, the least squares minimizers are defined as $$ x_{LS} = \left\{ x\in\mathbb{C}^{n} \colon \lVert \mathbf{A} x - b \rVert_{2}^{2} \text{ is minimized} \right\}. $$ Every matrix has a singular value decomposition $$ \mathbf{A} = \mathbf{U}\, \Sigma\, \mathbf{V}^{*}, $$ which allows us to express the Moore-Penrose pseudoinverse $$ \mathbf{A}^{\dagger} = \mathbf{V}\, \Sigma^{\dagger}\, \mathbf{U}^{*}, $$ which in turn can be used to pose the general solution to the least squares problem: $$ x_{LS} = \mathbf{A}^{\dagger} b + \left( \mathbf{I}_{n} - \mathbf{A}^{\dagger}\mathbf{A} \right) y, \quad y\in\mathbb{C}^{n}. $$ The geometry of the solution is discussed in Is the unique least norm solution to Ax=b the orthogonal projection of b onto R(A)?
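Here is a minimal NumPy sketch of the statement above (just an illustrative check with a made-up rank-deficient matrix): the pseudoinverse solution $\mathbf{A}^{\dagger}b$ attains the minimal residual, every member of the family $\mathbf{A}^{\dagger} b + (\mathbf{I} - \mathbf{A}^{\dagger}\mathbf{A})y$ attains the same residual, and the pseudoinverse solution has the smallest norm among them.

```python
import numpy as np

# Rank-deficient A (rank 1), so A^T A is singular and the classical inverse fails.
A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])
b = np.array([1.0, 0.0, 2.0])

# Moore-Penrose pseudoinverse (computed internally from the SVD).
A_pinv = np.linalg.pinv(A)
x_min_norm = A_pinv @ b                      # minimum-norm least squares solution

# General solution x = A^+ b + (I - A^+ A) y for an arbitrary y.
y = np.array([5.0, -3.0])
x_general = x_min_norm + (np.eye(2) - A_pinv @ A) @ y

# Same (minimal) residual for both, but the pseudoinverse solution is shortest.
print(np.linalg.norm(A @ x_min_norm - b), np.linalg.norm(A @ x_general - b))
print(np.linalg.norm(x_min_norm) <= np.linalg.norm(x_general))   # True
```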
More on the SVD here: How does the SVD solve the least squares problem?
When the classical inverse exists, it coincides with the pseudoinverse. In this case there are no nullspace components, $\Sigma=\mathbf{S}$, and we can see that $$ \begin{align} % \mathbf{A}\mathbf{A}^{-1} &= \mathbf{A}\mathbf{A}^{\dagger} = \left( \mathbf{U}\, \mathbf{S}\, \mathbf{V}^{*}\right) \left( \mathbf{V}\, \mathbf{S}^{-1}\, \mathbf{U}^{*}\right) = \mathbf{I} \\ % \mathbf{A}^{-1}\mathbf{A} &= \mathbf{A}^{\dagger}\mathbf{A} = \left( \mathbf{V}\, \mathbf{S}^{-1}\, \mathbf{U}^{*}\right) \left( \mathbf{U}\, \mathbf{S}\, \mathbf{V}^{*}\right) = \mathbf{I} % \end{align} $$
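A quick numerical check of these identities (an illustrative sketch assuming NumPy; the random $4\times 4$ matrix is almost surely invertible):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))              # generically invertible

# Build the pseudoinverse directly from the SVD: A^+ = V S^{-1} U^*.
U, s, Vh = np.linalg.svd(A)
A_dagger = Vh.conj().T @ np.diag(1.0 / s) @ U.conj().T

print(np.allclose(A_dagger, np.linalg.inv(A)))    # pseudoinverse = classical inverse
print(np.allclose(A @ A_dagger, np.eye(4)))       # A A^+ = I
print(np.allclose(A_dagger @ A, np.eye(4)))       # A^+ A = I
```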
The normal equations solution is discussed here: Difference between orthogonal projection and least squares solution
Additional discussion here: Least squares and pseudo-inverse

Here is a birds-eye view of the linear regression problem.
You assume that $y$ is a vector in $\mathbf{R}^n$ representing the observations of $n$ subjects; you also have a matrix $X$ of size $(n,p),$ where $p$ is the number of "features" or "explanatory variables." Each row of $X$ corresponds to the $p$ features of a given subject, and the rows of $X$ are ordered in the same way as the entries of $y$: the interpretation is that the $i$th entry of $y$ is the criterion observed for the $i$th subject, and the $i$th row of $X$ contains the $p$ features of this subject. Additionally, it is usually assumed that $y$ is "hard" to obtain while $X$ is easy (for example, $y$ could be blood sugar, which requires a blood extraction and laboratory analysis, while $X$ may be several features such as age, weight, height, income, etc.). This assumption only makes the model sensible from an applied perspective; it bears no mathematical meaning.
The idea behind linear regression is then to view $y$ as a vector in $\mathbf{R}^n$ and to seek the "best description of $y$ using the features of $X$", which can be interpreted as orthogonally projecting $y$ onto $V_X = \mathrm{im}(X)$ (the image, span or column space of $X$). A well-known result of linear algebra is that given a vector subspace $V$ of $\mathbf{R}^n$ and a vector $y,$ there exists one and only one vector $v \in V$ such that $y = v + (y - v)$ with $y - v \perp V.$ Applying this result to $V_X,$ we can describe $y$ in a unique way as $y = y_X + (y - y_X),$ with $y_X \in V_X$ and $y_X^\perp := y - y_X \perp V_X.$ Then $y_X$ is known as the "linear regression of $y$ onto $X.$" (It turns out that $y_X$ is the orthogonal projection of $y$ onto the space spanned by the columns of $X.$)

This description of $y$ depends only on the vector space structure together with the orthogonal structure, not on the particular choice of $X.$ More explicitly, if $X_1$ and $X_2$ are two matrices of measured features such that $\mathrm{im}(X_1) = \mathrm{im}(X_2) = V_X,$ then the regression of $y$ onto $X_1$ and the regression of $y$ onto $X_2$ are the same and coincide with the regression of $y$ onto $X.$ Finally, when $X$ has full column rank (so that $X^\intercal X$ is invertible), it can be shown that the orthogonal projector onto $V_X$ is $X(X^\intercal X)^{-1} X^\intercal.$ In fact, let me prove this for you.
Theorem. Let $X$ be a matrix with full column rank (i.e., with linearly independent columns, so that $X^\intercal X$ is invertible). Then the orthogonal projector onto $V_X$ is the matrix $X(X^\intercal X)^{-1} X^\intercal.$
Proof. Suppose first that $X$ has orthonormal columns; this means that $X^\intercal X = I_p.$ Call the columns of $X$ $x_1, \ldots, x_p,$ and complete them to an orthonormal basis $x_1, \ldots, x_n$ of $\mathbf{R}^n.$ Then every vector in $\mathbf{R}^n$ can be written as $$ y = (x_1^\intercal y) x_1 + \ldots + (x_n^\intercal y) x_n, $$ and therefore $y_X = (x_1^\intercal y) x_1 + \ldots + (x_p^\intercal y) x_p = X v,$ where $v = [x_1^\intercal y, \ldots, x_p^\intercal y]^\intercal = X^\intercal y.$ Thus the formula $y_X = XX^\intercal y$ emerges.
Suppose now that $X$ merely has full column rank. The Gram–Schmidt orthonormalisation process shows that $XT = X_0,$ where $X_0$ is a matrix with orthonormal columns and $T$ is a square invertible matrix of order $p.$ Then $y_X = X_0X_0^\intercal y = XTT^\intercal X^\intercal y.$ But $X^\intercal X = (X_0 T^{-1})^\intercal (X_0 T^{-1}) = T^{-\intercal} X_0^\intercal X_0 T^{-1} = T^{-\intercal} T^{-1},$ and after taking inverses, $(X^\intercal X)^{-1} = TT^\intercal,$ so $y_X = X(X^\intercal X)^{-1}X^\intercal y,$ showing the desired result. QED
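To see the two steps of the proof numerically, here is a small sketch (assuming NumPy, with a randomly generated full-column-rank $X$); the QR factorisation plays the role of the Gram–Schmidt step, its $Q$ factor being the $X_0$ of the proof:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 3
X = rng.standard_normal((n, p))   # full column rank with probability 1
y = rng.standard_normal(n)

# Gram-Schmidt step: X = Q R, with Q = X_0 (orthonormal columns) and T = R^{-1}.
Q, _ = np.linalg.qr(X)

proj_orthonormal = Q @ Q.T @ y                              # X_0 X_0^T y
proj_normal_eq = X @ np.linalg.solve(X.T @ X, X.T @ y)      # X (X^T X)^{-1} X^T y

print(np.allclose(proj_orthonormal, proj_normal_eq))        # True
print(np.allclose(X.T @ (y - proj_normal_eq), np.zeros(p))) # residual orthogonal to im(X)
```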
Note. It can be shown that if $X$ does not have full column rank, we still have the formula $X(X^\intercal X)^{-} X^\intercal$ for the projector, where $M^{-}$ denotes a generalised inverse of $M$ (the Moore–Penrose pseudoinverse being the favoured choice). A generalised inverse $M^{-}$ is not uniquely defined, since there are multiple generalised inverses; however, $X(X^\intercal X)^{-}X^\intercal$ does not depend on the choice of generalised inverse, and it depends on $X$ only through its column space.
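As an illustration of the note (a sketch using the Moore–Penrose choice of generalised inverse via NumPy's `pinv`), for a rank-deficient $X$ the matrix $X(X^\intercal X)^{-} X^\intercal$ still behaves as the orthogonal projector onto $\mathrm{im}(X)$:

```python
import numpy as np

# Rank-deficient X: the third column is the sum of the first two (rank 2).
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0],
              [2.0, 0.0, 2.0]])

# X^T X is singular, so use a generalised inverse (here: Moore-Penrose).
P = X @ np.linalg.pinv(X.T @ X) @ X.T

# P is idempotent, symmetric, and fixes every column of X,
# i.e. it is the orthogonal projector onto im(X).
print(np.allclose(P @ P, P), np.allclose(P, P.T), np.allclose(P @ X, X))
```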
