SVD is the operation of decomposing a matrix $M$ into the matrix product $M=U\Lambda V$, where $U$ and $V$ are unitary matrices ($U^TU=UU^T=I$, etc.), and $\Lambda$ is diagonal.
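Here is a minimal numpy sketch checking this definition on an arbitrary random matrix (the matrix and variable names are mine; `np.linalg.svd` returns the diagonal of $\Lambda$ as a vector):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 3))                        # an arbitrary 5x3 matrix

U, s, Vt = np.linalg.svd(M, full_matrices=False)   # thin SVD
Lam = np.diag(s)                                   # Λ as a diagonal matrix

assert np.allclose(M, U @ Lam @ Vt)                # M = U Λ V
assert np.allclose(U.T @ U, np.eye(3))             # columns of U are orthonormal
assert np.allclose(Vt @ Vt.T, np.eye(3))           # rows of V are orthonormal
```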
So the formal answer is:
- No, there are no assumptions about the distribution of $M$. It does not need to be random at all; it is just a matrix of any nature.
- The condition $n>m$, where $(n, m)$ is the shape of $M$, is not necessary. Indeed, if you have $n<m$, you can transpose the equation $M=U\Lambda V$ to get $M^T=V^T\Lambda U^T$, and it will also be a valid SVD decomposition (a quick numpy check is sketched below).
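A minimal sketch of that point (the random matrix and its shape are arbitrary, chosen only for illustration): the singular values of $M$ and $M^T$ coincide, so nothing special is needed for the $n<m$ case.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(3, 7))                   # a "wide" matrix: n = 3 < m = 7

s_M  = np.linalg.svd(M,   compute_uv=False)   # singular values of M
s_Mt = np.linalg.svd(M.T, compute_uv=False)   # singular values of M^T

assert np.allclose(s_M, s_Mt)                 # identical either way
```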
Now the less formal part.
In data mining, SVD has two main applications: computations (matrix inversion, least squares, etc.) and dimensionality reduction (e.g. compressing the user-item matrix in recommender systems). For computations, only the algebraic properties of SVD (shown above) matter.
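As an illustration of the computational use (a hedged sketch; `X`, `y` and the coefficients are made-up example data), the least-squares solution can be obtained from the pseudoinverse built out of the SVD, with no distributional assumptions at all:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))                            # design matrix
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.normal(size=100)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
beta = Vt.T @ ((U.T @ y) / s)                            # X^+ y, the least-squares solution

assert np.allclose(beta, np.linalg.lstsq(X, y, rcond=None)[0])
```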
For dimensionality reduction, truncated SVD is used: the small elements of $\Lambda$ are discarded, and only the $k$ largest are kept. This operation is equivalent to finding the $k$-dimensional hyperplane such that the projection of the data ($M$) onto this hyperplane is closest (in Euclidean distance) to the original data.
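A minimal numpy sketch of the truncation (the matrix and $k=2$ are arbitrary; by the Eckart-Young theorem the rank-$k$ reconstruction below is the closest rank-$k$ matrix to $M$ in Frobenius norm):

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.normal(size=(50, 10))
k = 2                                            # number of singular values to keep

U, s, Vt = np.linalg.svd(M, full_matrices=False)
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]      # projection of the rows of M onto the
                                                 # k-dimensional hyperplane spanned by Vt[:k]
scores = M @ Vt[:k, :].T                         # the compressed k-dimensional representation
```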
In the latter case, the analyst would like the decomposition to generalize well to unseen data, and this problem can be formulated in probabilistic language. If we state that $M$ consists of $n$ IID $m$-dimensional random variables, then it turns out that SVD works best if they are jointly normal. That's because:
a. a multivariate normal distribution is indeed concentrated near a hyperplane (whenever some directions have much smaller variance than the rest), and
b. for a multivariate normal, Euclidean distance is tightly connected with probability density. Therefore truncated SVD (or PCA, which is mathematically identical) can be viewed as maximum likelihood estimation of a multivariate normal distribution with $k$ independent components. For more details, see the article by Bishop.
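For concreteness, here is a sketch of the latent-variable model behind that view (the notation is mine, loosely following Tipping & Bishop's probabilistic PCA): each row $x_i$ of $M$ is modelled as

$$x_i = W z_i + \mu + \varepsilon_i, \qquad z_i \sim \mathcal{N}(0, I_k), \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2 I_m),$$

so that marginally $x_i \sim \mathcal{N}(\mu,\ W W^T + \sigma^2 I_m)$. The maximum-likelihood $W$ spans the subspace of the top $k$ eigenvectors of the sample covariance, i.e. the top $k$ right singular vectors of the centered data, which is exactly the subspace returned by truncated SVD/PCA; as $\sigma^2 \to 0$, fitting this model reduces to the least-squares projection described above.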