
Let $X \in \mathbb{R}^{m\times n}$ with $m>n$. We want to solve $y=X\beta$ in the least-squares sense, with $\hat\beta$ the least-squares estimator. The least-squares solution $\hat\beta = (X^TX)^{-1}X^Ty$ can be obtained using a QR decomposition of $X$ or an $LU$ decomposition of $X^TX$. The aim is to compare these two approaches.

I noticed that we can use Cholesky decomposition instead of $LU$, since $X^TX$ is symmetric and positive definite.

Using $LU$ we have:

$\hat\beta = (X^TX)^{-1}X^Ty=(LU)^{-1}X^Ty$, where $X^TX=LU$. First compute $a=X^Ty$, which costs $O(2nm)$ flops; then solve $Lb=a$ by forward substitution at cost $\sum_{k=1}^{n} (2k-1)$, and finally solve $U\hat\beta=b$ by back substitution at the same cost $\sum_{k=1}^{n} (2k-1)$.
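
For concreteness, here is a minimal NumPy/SciPy sketch of this normal-equations route (the sizes and data are made up purely for illustration); it also shows the Cholesky variant mentioned above:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve, cho_factor, cho_solve

# Made-up sizes and data, just to illustrate the normal-equations route (m > n).
rng = np.random.default_rng(0)
m, n = 100, 5
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)

A = X.T @ X          # n-by-n Gram matrix (this product itself costs O(m n^2))
a = X.T @ y          # costs O(2 n m) flops

# LU route: factor A, then one forward and one backward triangular solve.
beta_lu = lu_solve(lu_factor(A), a)

# Cholesky route: A is symmetric positive (semi)definite, so A = R^T R works too.
beta_chol = cho_solve(cho_factor(A), a)

print(np.allclose(beta_lu, beta_chol))  # both solve the same normal equations
```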

I did not count the cost of computing the $LU$ factorization itself.

Using the (thin) $QR$ decomposition $X=QR$, with $Q$ having orthonormal columns and $R$ upper triangular, we have: $\hat\beta = (X^TX)^{-1}X^Ty=(R^TQ^TQR)^{-1}R^TQ^Ty=(R^TR)^{-1}R^TQ^Ty=R^{-1}Q^Ty$, where we compute $a=Q^Ty$ at cost $O(n^2)$ and then solve $R\hat\beta=a$ at cost $\sum_{k=1}^{n} (2k-1)$.
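
A corresponding sketch of the QR route (again with made-up data), checked against NumPy's built-in least-squares solver:

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
m, n = 100, 5
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)

# Thin QR: Q is m-by-n with orthonormal columns, R is n-by-n upper triangular.
Q, R = np.linalg.qr(X, mode="reduced")

a = Q.T @ y                                    # apply Q^T to y
beta_qr = solve_triangular(R, a, lower=False)  # back substitution

# Agrees with the reference least-squares solution.
print(np.allclose(beta_qr, np.linalg.lstsq(X, y, rcond=None)[0]))
```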

Comparing the decompositions: it seems that QR is much better than LU. I think the cost of computing the QR decomposition is higher than that of LU, which is why we might prefer LU. On the other hand, if we are already given the decompositions, we should use QR.

$SVD$: Is there any advantage to using the SVD here?

GRS
  • why does it seem that QR is much better? $\sum_{k=1}^{n} (2k-1) = \mathcal{O}(n^2)$. So both are $\mathcal{O}(n^2)$ – Paul Dec 02 '16 at 13:48
  • The exact number must be bigger, since $2nm>n^2$ and we are adding two sums for $LU$ instead of one – GRS Dec 02 '16 at 13:51
  • @user251257 I think it's the opposite. LU is twice as expensive: $O(2nm)+2O(n^2)$ as opposed to QR: $O(n^2)+O(n^2)$ – GRS Dec 02 '16 at 14:00
  • Which size is $Q$? – user251257 Dec 02 '16 at 14:02
  • Also the way you come up with $R^{-1} Q^T y$ isn't sound. – user251257 Dec 02 '16 at 14:03
  • I do understand your argument now. Sorry. In your case, $LU$ is indeed about $n^2/2$ more expensive. However, computing $QR$ costs about 3 times the cost of $LU$... and the costs are cubic in both cases and shouldn't be ignored. – user251257 Dec 02 '16 at 14:07
  • I'm assuming $Q$ is $m$ by $m$ and $R$ is $m$ by $n$, is this correct? – GRS Dec 02 '16 at 14:07
  • @GRS how do you invert $R$ then? – user251257 Dec 02 '16 at 14:08
  • I'm sorry, I didn't see the $\mathcal{O}(mn)$. By the way terms like "twice as expensive" do not make a lot of sense when using big-O notation since $\mathcal{O}(n) = \mathcal{O}(2n) = \mathcal{O}(n) + \mathcal{O}(n)$ – Paul Dec 02 '16 at 14:09
  • Then $Q$ must be $m$ by $n$ and $R$ is $n$ by $n$. But I don't count the costs involved in computing inverses, so I don't see why I need to decompose at all; I could have just used the matrix $X$ and would get similar costs. – GRS Dec 02 '16 at 14:10
  • @GRS $X$ doesn't have an inverse, do it... – user251257 Dec 02 '16 at 14:11
  • But $X^TX$ does, which I would compute similarly to computing $R^{-1}$ – GRS Dec 02 '16 at 14:12
  • What do you think is the cost to compute $X^T X$ and its inverse? – user251257 Dec 02 '16 at 14:13
  • $X^TX$ is $O(2n^2m)$, but the inverse should cost more than $R^{-1}$; I don't know by how much. I'd assume it's the same order – GRS Dec 02 '16 at 14:16
  • @GRS inverting a full matrix of order $n$ is cubic in $n$ without sophisticated divide-and-conquer algorithms. – user251257 Dec 02 '16 at 14:39
  • This would imply that both of them are the same: the total would be cubic both with $QR$ and without, which makes me wonder why we should decompose in the first place. – GRS Dec 02 '16 at 14:41
  • @GRS we rarely need the inverse; computing the inverse may be unstable for an ill-conditioned matrix, and it is usually computed using a decomposition. – user251257 Dec 02 '16 at 16:33
  • You should really look at a numerical linear algebra book like Trefethen and Bau; The numerical stability is often of key concern when doing least squares. It is a big reason why you use things like SVD. – Batman Dec 02 '16 at 23:24
  • Small correction: $X^tX$ is not necessarily positive definite, only positive semidefinite. – flawr Mar 10 '17 at 22:26
  • Don't ignore the cost of computing $X^T X$ for the $LU$ case, which can be quite expensive. For $QR$, you don't need to compute $X^T X$; you can directly decompose $X = QR$ and solve $R \beta = Q^T y$. I've never heard of anyone using $LU$ for a least-squares problem in general. If you go to the trouble of computing $X^T X$, then you might as well use a Cholesky decomposition at that point. If you are worried about a singular matrix, you can use a modified Cholesky decomposition. Generally people prefer $QR$ for modest-sized systems due to numerical stability. – vibe Aug 23 '18 at 03:57

2 Answers

6

Matrix Computations

Golub and Van Loan, 3e

$\S$ 5.7.1, p. 270

Comparing flop counts for operations on $n\times n$ matrices:

$\frac{2}{3}n^{3}\ $ $\qquad$ Gaussian elimination

$\frac{4}{3}n^{3}\ $ $\qquad$ Householder orthogonalization

$2n^{3}$ $\qquad \ \ $ Modified Gram-Schmidt

$\frac{8}{3}n^{3}\ $ $\qquad$ Bidiagonalization

$12n^{3}$ $\qquad$ Singular Value Decomposition

Three reasons to choose orthogonalization to solve square systems:

  1. Flop counts exaggerate the Gaussian elimination advantage. When memory traffic and vectorization overheads are considered, the $\mathbf{Q}\mathbf{R}$ factorization is comparable in efficiency.

  2. Orthogonalization methods have guaranteed stability; there is no "growth factor" to worry about as in Gaussian elimination (see the numerical sketch after this list).

  3. In cases of ill-conditioning, the orthogonalization methods give an added measure of reliability. $\mathbf{Q}\mathbf{R}$ with condition estimate is very dependable and, of course, SVD is unsurpassed when it comes to producing a meaningful solution to a nearly singular system.
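
As a rough numerical illustration of points 2 and 3, here is a minimal NumPy sketch (the deliberately ill-conditioned Vandermonde matrix and sizes are made up for this example, not taken from the book): forming $X^TX$ squares the condition number, whereas QR works on $X$ directly.

```python
import numpy as np

# Deliberately ill-conditioned design matrix (Vandermonde with nearly
# collinear columns); data are made up purely for illustration.
m, n = 50, 12
t = np.linspace(0.0, 1.0, m)
X = np.vander(t, n, increasing=True)
beta_true = np.ones(n)
y = X @ beta_true

# Normal equations: cond(X^T X) = cond(X)^2, so accuracy suffers.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# QR works on X directly, so only cond(X) enters the error bound.
Q, R = np.linalg.qr(X, mode="reduced")
beta_qr = np.linalg.solve(R, Q.T @ y)

print(np.linalg.cond(X), np.linalg.cond(X.T @ X))
print(np.linalg.norm(beta_normal - beta_true))  # typically far from zero
print(np.linalg.norm(beta_qr - beta_true))      # typically much smaller
```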

dantopa
5

Your reasoning at the top is really odd. The LU decomposition is twice as fast as the standard QR decomposition and it will solve most systems. There are, however, pathologically ill-conditioned systems and non-square systems; that is where you would use QR or the SVD. The main reason for the SVD is that it lets you be selective about the condition number.

There are many other decompositions. The Cholesky decomposition is twice as fast as the LU decomposition, but it applies only to Hermitian positive definite matrices. All of this also neglects any sparsity of the matrix.
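
As a minimal sketch of what "being selective about the condition number" can look like in practice (NumPy, with made-up data and tolerance), one can discard singular values below a cutoff when forming the pseudoinverse:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 8
X = rng.standard_normal((m, n))
X[:, -1] = X[:, 0] + 1e-10 * rng.standard_normal(m)  # nearly dependent column
y = rng.standard_normal(m)

# Thin SVD: X = U diag(s) V^T.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Treat singular values below a chosen relative cutoff as zero.
tol = 1e-8 * s[0]
s_inv = np.zeros_like(s)
s_inv[s > tol] = 1.0 / s[s > tol]

beta_svd = Vt.T @ (s_inv * (U.T @ y))  # truncated pseudoinverse applied to y

# np.linalg.lstsq applies essentially the same cutoff via its rcond argument.
beta_lstsq = np.linalg.lstsq(X, y, rcond=1e-8)[0]
print(np.linalg.norm(beta_svd - beta_lstsq))  # should be tiny
```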