In linear regression under the hypothesis $y = \theta^T x$, we want to minimize the sum-of-squares cost $J(\theta) = \frac{1}{2}\sum_i \left(y^{(i)} - \theta^T x^{(i)}\right)^2$. Setting the gradient of $J(\theta)$ to zero and solving gives the normal equation $\theta = \left(X^TX\right)^{-1}X^TY$. But what if the inverse of $X^TX$ does not exist? Are there any theoretical results that guarantee the existence of the inverse? Or can we use a pseudo-inverse to get a good approximation?
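For concreteness, here is a minimal numpy sketch of the situation I am asking about (the data are made up; the third column of $X$ is the sum of the first two, so $X^TX$ is singular):

```python
import numpy as np

# Toy data (made up for illustration): column 3 of X equals column 1 + column 2,
# so X^T X is rank deficient and has no inverse.
X = np.array([[1., 2., 3.],
              [2., 0., 2.],
              [0., 1., 1.],
              [1., 1., 2.],
              [3., 1., 4.]])
Y = np.array([1., 2., 0.5, 1.5, 3.])

print(np.linalg.matrix_rank(X))                 # 2 < 3, so (X^T X)^{-1} does not exist

# The Moore-Penrose pseudoinverse (computed via the SVD) still yields a
# minimizer of J(theta); here it coincides with numpy's least-squares solver.
theta = np.linalg.pinv(X.T @ X) @ X.T @ Y
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(theta, theta_lstsq))          # True
```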
- Yes, you can use the Moore-Penrose pseudoinverse for the case of $\mathbf X$ not having full column rank. A good textbook on numerical linear algebra ought to have something on this. The usual route involves the use of the singular value decomposition. – J. M. ain't a mathematician Feb 03 '12 at 14:24
1 Answer
As in J.M.'s comment, there is no guarantee that the inverse exists, but $\theta = (X^TX)^{-}X^TY$ for any pseudoinverse $(X^TX)^{-}$ will minimize $J(\theta)$. The minimizer is not unique in $\theta$, since different pseudoinverses result in different $\theta$'s; the set of solutions is an affine space whose dimension equals the column-rank deficiency of $X$ (equivalently, the dimension of the null space of $X$). If $X$ has full column rank then the minimizer is unique. If instead of $\theta^T x_i$ you only look at $\hat Y = (\hat y_1, \ldots, \hat y_n)$ with the restriction that $\hat Y$ lies in the column space of $X$, then $\hat Y = X(X^TX)^{-}X^TY$ is the unique minimizer (and is invariant to the choice of pseudoinverse).
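A quick numerical illustration of this (toy data assumed, not from the question): the design below has a one-dimensional null space, so adding a null-space vector to one minimizer gives another $\theta$ with the same cost and the same fitted values.

```python
import numpy as np

# Made-up rank-deficient design: column 3 = column 1 + column 2,
# so the null space of X is spanned by (1, 1, -1).
X = np.array([[1., 2., 3.],
              [2., 0., 2.],
              [0., 1., 1.],
              [1., 1., 2.],
              [3., 1., 4.]])
Y = np.array([1., 2., 0.5, 1.5, 3.])

def J(theta):
    return 0.5 * np.sum((Y - X @ theta) ** 2)

theta1 = np.linalg.pinv(X) @ Y                 # minimum-norm minimizer
theta2 = theta1 + np.array([1., 1., -1.])      # add a null-space vector of X

print(np.allclose(J(theta1), J(theta2)))       # True: both minimize J
print(np.allclose(X @ theta1, X @ theta2))     # True: identical fitted values
print(np.allclose(theta1, theta2))             # False: theta itself is not unique
```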
As far as inference goes, different values of $\theta$ that minimize $J$ all lead to the same inference in a well-defined sense: in particular, the estimates of identifiable parameters stay the same, and the predicted values that the regression generates are the same. The statistical interpretation of $X$ not having full column rank is that some predictor can be predicted perfectly from the others, so it is impossible to identify the effect of some subset of the predictors. Usually this is due to a structural relationship between the predictors, and it may or may not be useful to throw out some of the predictors and make $X$ full column rank so that identifiability issues don't cloud things.
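A small sketch of that last point, using the same assumed toy data: dropping the structurally redundant predictor restores full column rank and a unique $\theta$, while the fitted values are unchanged because the column space is unchanged.

```python
import numpy as np

# Same toy example: the third predictor is a structural combination of the
# first two (col3 = col1 + col2), so its separate effect is unidentifiable.
X = np.array([[1., 2., 3.],
              [2., 0., 2.],
              [0., 1., 1.],
              [1., 1., 2.],
              [3., 1., 4.]])
Y = np.array([1., 2., 0.5, 1.5, 3.])

# Dropping the redundant column gives a full-column-rank design, so
# (X_r^T X_r)^{-1} exists and theta is uniquely identified.
X_r = X[:, :2]
theta_r = np.linalg.solve(X_r.T @ X_r, X_r.T @ Y)

# The column spaces of X and X_r coincide, so the fitted values agree.
theta_full = np.linalg.pinv(X) @ Y
print(np.allclose(X_r @ theta_r, X @ theta_full))   # True
```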
