
Let $X \in \mathbb{R}^{m\times n}$ with $m>n$. We want to solve $y=X\beta$ in the least-squares sense, with $\hat\beta$ the least-squares estimator. The least-squares solution $\hat\beta = (X^TX)^{-1}X^Ty$ can be obtained using a QR decomposition of $X$ or an $LU$ decomposition of $X^TX$. The aim is to compare these two approaches.

I noticed that we can use Cholesky decomposition instead of $LU$, since $X^TX$ is symmetric and positive definite.

Using $LU$ we have:

$\hat\beta = (X^TX)^{-1}X^Ty=(LU)^{-1}X^Ty$, where $X^TX=LU$. First compute $a=X^Ty$, which costs $O(2nm)$ flops; then solve $Lb=a$ by forward substitution at cost $\sum_{k=1}^{n} (2k-1)$, and finally solve $U\hat\beta=b$ by back substitution at the same cost $\sum_{k=1}^{n} (2k-1)$.
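
For concreteness, here is a minimal NumPy/SciPy sketch of this normal-equations route (the sizes and data are made up purely for illustration); it also shows the Cholesky variant mentioned above:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve, cho_factor, cho_solve

# Made-up sizes and data, just to illustrate the normal-equations route (m > n).
rng = np.random.default_rng(0)
m, n = 100, 5
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)

A = X.T @ X          # n-by-n Gram matrix (this product itself costs O(m n^2))
a = X.T @ y          # costs O(2 n m) flops

# LU route: factor A, then one forward and one backward triangular solve.
beta_lu = lu_solve(lu_factor(A), a)

# Cholesky route: A is symmetric positive (semi)definite, so A = R^T R works too.
beta_chol = cho_solve(cho_factor(A), a)

print(np.allclose(beta_lu, beta_chol))  # both solve the same normal equations
```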

I did not count the cost of computing the $LU$ factorization itself.

Using the (thin) $QR$ decomposition $X=QR$, with $Q$ having orthonormal columns and $R$ upper triangular, we have: $\hat\beta = (X^TX)^{-1}X^Ty=(R^TQ^TQR)^{-1}R^TQ^Ty=(R^TR)^{-1}R^TQ^Ty=R^{-1}Q^Ty$, where we compute $a=Q^Ty$ at cost $O(n^2)$ and then solve $R\hat\beta=a$ at cost $\sum_{k=1}^{n} (2k-1)$.
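
A corresponding sketch of the QR route (again with made-up data), checked against NumPy's built-in least-squares solver:

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
m, n = 100, 5
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)

# Thin QR: Q is m-by-n with orthonormal columns, R is n-by-n upper triangular.
Q, R = np.linalg.qr(X, mode="reduced")

a = Q.T @ y                                    # apply Q^T to y
beta_qr = solve_triangular(R, a, lower=False)  # back substitution

# Agrees with the reference least-squares solution.
print(np.allclose(beta_qr, np.linalg.lstsq(X, y, rcond=None)[0]))
```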

Comparing the decompositions: it seems that QR is much better than LU. I think the cost of computing the QR decomposition is higher than that of LU, which is why we might prefer LU. On the other hand, if we are already given the decompositions, we should use QR.

$SVD$: Is there any advantage to using the SVD here?

GRS
  • why does it seem that QR is much better? $\sum_{k=1}^{n} (2k-1) = \mathcal{O}(n^2)$. So both are $\mathcal{O}(n^2)$ – Paul Dec 02 '16 at 13:48
  • The exact number must be bigger, since $2nm>n^2$ and we are adding two sums for $LU$ instead of one – GRS Dec 02 '16 at 13:51
  • @user251257 I think it's the opposite. LU is twice as expensive: $O(2nm)+2O(n^2)$ as opposed to QR: $O(n^2)+O(n^2)$ – GRS Dec 02 '16 at 14:00
  • Which size is $Q$? – user251257 Dec 02 '16 at 14:02
  • Also the way you come up with $R^{-1} Q^T y$ isn't sound. – user251257 Dec 02 '16 at 14:03
  • I do understand your argument now. Sorry. In your case, $LU$ is indeed about $n^2/2$ more expensive. However, computing $QR$ costs about 3 times the cost of $LU$... and the costs are cubic in both cases and shouldn't be ignored. – user251257 Dec 02 '16 at 14:07
  • I'm assuming $Q$ is $m$ by $m$ and $R$ is $m$ by $n$, is this correct? – GRS Dec 02 '16 at 14:07
  • @GRS how do you invert $R$ then? – user251257 Dec 02 '16 at 14:08
  • I'm sorry, I didn't see the $\mathcal{O}(mn)$. By the way terms like "twice as expensive" do not make a lot of sense when using big-O notation since $\mathcal{O}(n) = \mathcal{O}(2n) = \mathcal{O}(n) + \mathcal{O}(n)$ – Paul Dec 02 '16 at 14:09
  • Then $Q$ must be $m$ by $n$ and $R$ is $n$ by $n$. But I don't count the costs involved in computing inverses, so I don't see why I need to decompose at all; I could have just used the matrix $X$ and would get similar costs. – GRS Dec 02 '16 at 14:10
  • @GRS $X$ doesn't have an inverse, do it... – user251257 Dec 02 '16 at 14:11
  • But $X^TX$ does, which I would compute similarly to computing $R^{-1}$ – GRS Dec 02 '16 at 14:12
  • What do you think is the cost to compute $X^T X$ and its inverse? – user251257 Dec 02 '16 at 14:13
  • $X^TX$ is $O(2n^2m)$, but the inverse should cost more than $R^{-1}$; I don't know by how much. I'd assume it's the same order – GRS Dec 02 '16 at 14:16
  • @GRS inverting a full matrix of order $n$ is cubic in $n$ without sophisticated divide-and-conquer algorithms. – user251257 Dec 02 '16 at 14:39
  • This would imply that both of them are the same: the total would be cubic both with $QR$ and without, which makes me wonder why we should decompose in the first place. – GRS Dec 02 '16 at 14:41
  • @GRS we rarely need the inverse; computing the inverse may be unstable for an ill-conditioned matrix, and it is usually computed using a decomposition. – user251257 Dec 02 '16 at 16:33
  • You should really look at a numerical linear algebra book like Trefethen and Bau; The numerical stability is often of key concern when doing least squares. It is a big reason why you use things like SVD. – Batman Dec 02 '16 at 23:24
  • Small correction: $X^tX$ is not necessarily positive definite, only positive semidefinite. – flawr Mar 10 '17 at 22:26
  • Don't ignore the cost of computing $X^T X$ for the $LU$ case, which can be quite expensive. For $QR$, you don't need to compute $X^T X$; you can directly decompose $X = QR$ and solve $R \beta = Q^T y$. I've never heard of anyone using $LU$ for a least-squares problem in general. If you go to the trouble of computing $X^T X$, then you might as well use a Cholesky decomposition at that point. If you are worried about a singular matrix, you can use a modified Cholesky decomposition. Generally people prefer $QR$ for modest-sized systems due to numerical stability. – vibe Aug 23 '18 at 03:57

2 Answers

6

Matrix Computations

Golub and Van Loan, 3e

$\S$ 5.7.1, p. 270

Comparing flop counts for operations on $n\times n$ matrices:

$\frac{2}{3}n^{3}\ $ $\qquad$ Gaussian elimination

$\frac{4}{3}n^{3}\ $ $\qquad$ Householder orthogonalization

$2n^{3}$ $\qquad \ \ $ Modified Gram-Schmidt

$\frac{8}{3}n^{3}\ $ $\qquad$ Bidiagonalization

$12n^{3}$ $\qquad$ Singular Value Decomposition

Three reasons to choose orthogonalization to solve square systems:

  1. Flop counts exaggerate the Gaussian elimination advantage. When memory traffic and vectorization overheads are considered, the $\mathbf{Q}\mathbf{R}$ factorization is comparable in efficiency.

  2. Orthogonalization methods have guaranteed stability; there is no "growth factor" to worry about as in Gaussian elimination (see the numerical sketch after this list).

  3. In cases of ill-conditioning, the orthogonalization methods give an added measure of reliability. $\mathbf{Q}\mathbf{R}$ with condition estimate is very dependable and, of course, SVD is unsurpassed when it comes to producing a meaningful solution to a nearly singular system.
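
As a rough numerical illustration of points 2 and 3, here is a minimal NumPy sketch (the deliberately ill-conditioned Vandermonde matrix and sizes are made up for this example, not taken from the book): forming $X^TX$ squares the condition number, whereas QR works on $X$ directly.

```python
import numpy as np

# Deliberately ill-conditioned design matrix (Vandermonde with nearly
# collinear columns); data are made up purely for illustration.
m, n = 50, 12
t = np.linspace(0.0, 1.0, m)
X = np.vander(t, n, increasing=True)
beta_true = np.ones(n)
y = X @ beta_true

# Normal equations: cond(X^T X) = cond(X)^2, so accuracy suffers.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# QR works on X directly, so only cond(X) enters the error bound.
Q, R = np.linalg.qr(X, mode="reduced")
beta_qr = np.linalg.solve(R, Q.T @ y)

print(np.linalg.cond(X), np.linalg.cond(X.T @ X))
print(np.linalg.norm(beta_normal - beta_true))  # typically far from zero
print(np.linalg.norm(beta_qr - beta_true))      # typically much smaller
```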

dantopa
5

Your reasoning at the top is really odd. The LU decomposition is twice as fast as the standard QR decomposition and it will solve most systems. There are, however, pathologically ill-conditioned systems and non-square systems; that is where you would use QR or the SVD. The main reason for the SVD is that it lets you be selective about the condition number.

There are many other decompositions. The Cholesky decomposition is twice as fast as the LU decomposition, but it applies only to Hermitian positive definite matrices. All of this also neglects any sparsity of the matrix.
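
As a minimal sketch of what "being selective about the condition number" can look like in practice (NumPy, with made-up data and tolerance), one can discard singular values below a cutoff when forming the pseudoinverse:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 8
X = rng.standard_normal((m, n))
X[:, -1] = X[:, 0] + 1e-10 * rng.standard_normal(m)  # nearly dependent column
y = rng.standard_normal(m)

# Thin SVD: X = U diag(s) V^T.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Treat singular values below a chosen relative cutoff as zero.
tol = 1e-8 * s[0]
s_inv = np.zeros_like(s)
s_inv[s > tol] = 1.0 / s[s > tol]

beta_svd = Vt.T @ (s_inv * (U.T @ y))  # truncated pseudoinverse applied to y

# np.linalg.lstsq applies essentially the same cutoff via its rcond argument.
beta_lstsq = np.linalg.lstsq(X, y, rcond=1e-8)[0]
print(np.linalg.norm(beta_svd - beta_lstsq))  # should be tiny
```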