
The dataset that I am experimenting with is in the form of a table with columns userid and itemid. If there is a row for a given user and a given item, that means the user accessed the item (like in an online store). I am trying to cluster similar items based on this data. If a pair of items is accessed together often, then the items are similar.
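
For concreteness, here is a minimal sketch of how the binary item-by-user matrix is built from that table (shown in Python/SciPy rather than the Mahout/Octave code I actually ran; the toy data and variable names are purely illustrative):

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Illustrative input: one row per (userid, itemid) access event.
df = pd.DataFrame({
    "userid": ["u1", "u1", "u2", "u3", "u3", "u3"],
    "itemid": ["a",  "b",  "a",  "b",  "c",  "a"],
})

# Map raw ids to contiguous integer indices.
users = {u: i for i, u in enumerate(df["userid"].unique())}
items = {m: i for i, m in enumerate(df["itemid"].unique())}

rows = df["itemid"].map(items).to_numpy()
cols = df["userid"].map(users).to_numpy()

# Binary item-by-user matrix: entry (i, u) is 1 iff user u accessed item i.
X = csr_matrix((np.ones(len(df)), (rows, cols)),
               shape=(len(items), len(users)))
X.data[:] = 1  # collapse repeated accesses to a single 1
```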

Because this is a high-dimensionality case (the numbers of users and items will be in the tens of thousands), I think I am justified in trying SVD as a pre-clustering step and then doing some classical clustering. When I tried this I got poor clustering results compared with simple hierarchical clustering. Items that weren't very similar were being bucketed together in one dimension, while other available dimensions went unused. The results weren't completely random, but they were definitely worse than the output from the hierarchical clustering. I attempted the SVD step with Mahout and Octave and the results were similar. For the hierarchical clustering I used the Jaccard measure.
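
Roughly, the two pipelines I compared look like this (again a sketch in Python rather than the actual Mahout/Octave runs; `X` is the binary item-by-user matrix from above, and the 50 components / 20 clusters are placeholder values, not the parameters I actually used):

```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

n_clusters = 20  # placeholder value

# Pipeline 1: SVD as a pre-clustering step, then k-means in the reduced space.
svd = TruncatedSVD(n_components=50, random_state=0)
X_reduced = svd.fit_transform(X)          # items x 50
svd_labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(X_reduced)

# Pipeline 2: plain hierarchical clustering on Jaccard distances
# between the binary item vectors (this is what worked better for me).
jaccard = pdist(X.toarray().astype(bool), metric="jaccard")
Z = linkage(jaccard, method="average")
hier_labels = fcluster(Z, t=n_clusters, criterion="maxclust")
```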

At this point I am starting to doubt the notion of SVD as a way to reduce dimensionality. Do you think that SVD cannot be used effectively in this case (and why?) or do you think that I made some mistake along the way?

rbk
  • Do you use every userid and every itemid as a separate dimension? – ffriend Oct 02 '14 at 06:42
  • When comparing items the number of dimensions is the number of users. Think of items as vectors of 1's and 0's where each dimension corresponds to some user and the value is 1 if the user accessed the item and 0 otherwise. – rbk Oct 02 '14 at 13:14
  • The real question here is: does standard PCA work well with binary variables? And I'd say that the answer is "no". Binary variables are not continuous (you can't get .73 in your user-item matrix) but categorical (only 0s and 1s are allowed, "yes" or "no" and nothing in between). IIRC, MCA is a standard analogue of PCA for categorical data. Though, my personal approach would be to use RBMs, which can also handle non-linearities. – ffriend Oct 02 '14 at 21:14
  • It is possible that there are algorithms more suited to categorical data than SVD. However, SVD is often cited as the tool for dimensionality reduction in the context of latent semantic analysis. See the response by buruzaemon, for example. In that case the SVD is applied to the term incidence matrix, which is also made of 1's and 0's. Also, some form of matrix factorization (maybe SVD with some regularization) was successfully used in the Netflix competition to predict user ratings for movies (the ratings are integers 1-5). Therefore, I don't think it is easy to write SVD off in the discrete case. – rbk Oct 03 '14 at 00:55
  • Re using RBMs: Aren't RBMs for classification (supervised learning)? I am interested in clustering (unsupervised learning). Anyway, RBMs are probably off limits for me because they are not implemented in Mahout and I cannot use anything licensed under the GPL for legal reasons. – rbk Oct 03 '14 at 00:59
  • 1. In LSA, SVD is not used for dimensionality reduction, but for finding semantic concepts. If you want dimensionality reduction, use PCA (or MCA). 2. For continuous variables it makes sense to talk about real values (e.g. .73) and an order of values; for categorical variables it does not. For Netflix it makes sense to say "user likes X as much as 4.5" and "a rating of 4.5 is higher than 3.88". For the values "man" and "woman" (even represented as 1 and 0) there's nothing between them (what would a value of .73 mean?) and no specific order ("man" > "woman" or vice versa?). Decide for yourself which case is yours. – ffriend Oct 03 '14 at 15:40
  • RBMs may also be used for data compression/dimensionality reduction; see the relevant sections of my earlier answer. – ffriend Oct 03 '14 at 15:44
  • If every item is assigned some semantic concepts, then that can be used for clustering. I believe that my case is analogous to both latent semantic analysis and the Netflix dataset, because in all of these examples, in the absence of data, the default value in the matrix is zero. For Netflix: a zero doesn't mean that the user gave that movie a zero rating; it means that he didn't give any rating. Therefore the Netflix rating is also a categorical value. In fact, I would experiment with the Netflix dataset the same way, by mapping all ratings to 1 and the absence of a rating to 0. – rbk Oct 05 '14 at 15:31
  • Re RBMs: This is interesting. Thank you for sharing. – rbk Oct 05 '14 at 15:32
  • In Netflix, 0 is just an encoding for NA (not available) values. It's not a continuous or categorical variable; it's just a missing datum and should be treated separately. E.g. you want your model to predict values from 1 to 5, but not zero: that would be nonsense (how would you even interpret it? predicting that the user hasn't given a rating?). So your case is somewhat similar to LSA, but definitely not to the Netflix dataset. – ffriend Oct 05 '14 at 22:05
  • I agree it isn't very appropriate to mix NAs with ratings. However, the dataset is sparse; no user rates all movies. If we want to apply SVD to the dataset and exclude NA values, we would either be applying it to an empty set or narrowing the dataset down drastically to a subset of users and a subset of movies such that every movie is rated by every user. So NAs have to be assigned some value to perform SVD at all. In this paper (http://research.cs.queensu.ca/TechReports/Reports/2006-527.pdf) the authors treated NAs as no-opinion with a score of 3 (see page 11). Sounds like it worked for them. – rbk Oct 07 '14 at 00:10
  • Often people use the mean of all ratings for a specific film. E.g. if a movie got ratings (2, 4, 2) from 3 users, all other user ratings (which are NAs in the dataset) are assigned (2+4+2)/3 ≈ 2.67 (see the sketch after these comments). – ffriend Oct 07 '14 at 06:21
  • If a user bothered to rate an item, that means the user is somewhat interested in it. I assume that users don't access items at random, but do so according to their personal tastes. Missing data is interpreted as negative feedback. Why would it be more appropriate to assign the mean rating over all users? The mean would be an estimator only conditional on the user providing some rating, because that is what characterizes the sample space. It seems that the mean is appropriate for purposes different from mine, such as modeling the rating function under the assumption that a rating is given. – rbk Oct 07 '14 at 23:28
  • It's up to your assumptions. There are ways to assess performance of both models, so you can always check what assumption was closer to the truth. – ffriend Oct 08 '14 at 08:37
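
A minimal sketch of the two imputation choices debated in the comments above (per-movie mean vs. treating a missing rating as 0 / negative feedback), assuming a small dense ratings matrix with NaN for missing entries; all names and numbers are illustrative:

```python
import numpy as np

# Toy ratings matrix: rows = users, columns = movies, NaN = no rating given.
R = np.array([
    [2.0,    5.0,    np.nan],
    [4.0,    np.nan, 3.0],
    [2.0,    1.0,    np.nan],
    [np.nan, 4.0,    5.0],
])

# Option from the comments: fill each movie's missing entries with the mean
# of the ratings it did receive, e.g. (2 + 4 + 2) / 3 for the first movie.
movie_means = np.nanmean(R, axis=0)
R_mean_filled = np.where(np.isnan(R), movie_means, R)

# Option assumed in the question: treat a missing rating as 0,
# i.e. interpret "no rating" as negative feedback.
R_zero_filled = np.nan_to_num(R, nan=0.0)
```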