
The dataset that I am experimenting with is in the form of a table with columns userid and itemid. If there is a row for a given user and a given item, that means the user accessed the item (like in an online store). I am trying to cluster similar items based on this data. If a pair of items is accessed together often, then the items are similar.
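
For concreteness, here is a minimal sketch of how the binary item-by-user matrix is built from that table (shown in Python/SciPy rather than the Mahout/Octave code I actually ran; the toy data and variable names are purely illustrative):

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Illustrative input: one row per (userid, itemid) access event.
df = pd.DataFrame({
    "userid": ["u1", "u1", "u2", "u3", "u3", "u3"],
    "itemid": ["a",  "b",  "a",  "b",  "c",  "a"],
})

# Map raw ids to contiguous integer indices.
users = {u: i for i, u in enumerate(df["userid"].unique())}
items = {m: i for i, m in enumerate(df["itemid"].unique())}

rows = df["itemid"].map(items).to_numpy()
cols = df["userid"].map(users).to_numpy()

# Binary item-by-user matrix: entry (i, u) is 1 iff user u accessed item i.
X = csr_matrix((np.ones(len(df)), (rows, cols)),
               shape=(len(items), len(users)))
X.data[:] = 1  # collapse repeated accesses to a single 1
```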

Because this is a high-dimensionality case (the numbers of users and items will be in the tens of thousands), I think I am justified in trying SVD as a pre-clustering step and then doing some classical clustering. When I tried this I got poor clustering results compared with simple hierarchical clustering. Items that weren't very similar were being bucketed together in one dimension, while other available dimensions went unused. The results weren't completely random, but they were definitely worse than the output from the hierarchical clustering. I attempted the SVD step with Mahout and Octave and the results were similar. For the hierarchical clustering I used the Jaccard measure.
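
Roughly, the two pipelines I compared look like this (again a sketch in Python rather than the actual Mahout/Octave runs; `X` is the binary item-by-user matrix from above, and the 50 components / 20 clusters are placeholder values, not the parameters I actually used):

```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

n_clusters = 20  # placeholder value

# Pipeline 1: SVD as a pre-clustering step, then k-means in the reduced space.
svd = TruncatedSVD(n_components=50, random_state=0)
X_reduced = svd.fit_transform(X)          # items x 50
svd_labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(X_reduced)

# Pipeline 2: plain hierarchical clustering on Jaccard distances
# between the binary item vectors (this is what worked better for me).
jaccard = pdist(X.toarray().astype(bool), metric="jaccard")
Z = linkage(jaccard, method="average")
hier_labels = fcluster(Z, t=n_clusters, criterion="maxclust")
```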

At this point I am starting to doubt the notion of SVD as a way to reduce dimensionality. Do you think that SVD cannot be used effectively in this case (and why?) or do you think that I made some mistake along the way?

rbk
  • Do you use every userid and every itemid as a separate dimension? – ffriend Oct 02 '14 at 06:42
  • When comparing items the number of dimensions is the number of users. Think of items as vectors of 1's and 0's where each dimension corresponds to some user and the value is 1 if the user accessed the item and 0 otherwise. – rbk Oct 02 '14 at 13:14
  • The real question here is: does standard PCA work well with binary variables? And I'd say that the answer is "no". Binary variables are not continuous (you can't get .73 in your user-item matrix) but categorical (only 0s and 1s are allowed, "yes" or "no" and nothing in between). IIRC, MCA is a standard analogue of PCA for categorical data. Though, my personal approach would be to use RBMs, which can also handle non-linearities. – ffriend Oct 02 '14 at 21:14
  • It is possible that there are algorithms more suited to categorical data than SVD. However, SVD is often cited as the tool for dimensionality reduction in the context of latent semantic analysis. See the response by buruzaemon, for example. In that case the SVD is applied to the term incidence matrix, which is also made of 1's and 0's. Also, some form of matrix factorization (maybe SVD with some regularization) was successfully used in the Netflix competition to predict user ratings for movies (the ratings are integers 1-5). Therefore, I don't think it is easy to write SVD off in the discrete case. – rbk Oct 03 '14 at 00:55
  • Re using RBMs: Aren't RBMs for classification (supervised learning)? I am interested in clustering (unsupervised learning). Anyway, RBMs are probably off limits for me because they are not implemented in Mahout and I cannot use anything licensed under the GPL for legal reasons. – rbk Oct 03 '14 at 00:59
  • 1. In LSA, SVD is not used for dimensionality reduction, but for finding semantic concepts. If you want dimensionality reduction, use PCA (or MCA). 2. For continuous variables it makes sense to talk about real values (e.g. .73) and an order of values; for categorical variables it does not. For Netflix it makes sense to say "user likes X as much as 4.5" and "a rating of 4.5 is higher than 3.88". For the values "man" and "woman" (even represented as 1 and 0) there's nothing between them (what would a value of .73 mean?) and no specific order ("man" > "woman" or vice versa?). Decide for yourself which case is yours. – ffriend Oct 03 '14 at 15:40
  • RBMs may also be used for data compression/dimensionality reduction; see the relevant sections of my earlier answer. – ffriend Oct 03 '14 at 15:44
  • If every item is assigned some semantic concepts, then that can be used for clustering. I believe that my case is analogous to both latent semantic analysis and the Netflix dataset, because in all of these examples, in the absence of data, the default value in the matrix is zero. For Netflix: a zero doesn't mean that the user gave that movie a zero rating; it means that he didn't give any rating. Therefore the Netflix rating is also a categorical value. In fact, I would experiment with the Netflix dataset the same way, by mapping all ratings to 1 and the absence of a rating to 0. – rbk Oct 05 '14 at 15:31
  • Re RBMs: This is interesting. Thank you for sharing. – rbk Oct 05 '14 at 15:32
  • In Netflix, 0 is just an encoding for NA (not available) values. It's not a continuous or categorical variable; it's just a missing datum and should be treated separately. E.g. you want your model to predict values from 1 to 5, but not zero: that would be nonsense (how would you even interpret it? predicting that the user hasn't given a rating?). So your case is somewhat similar to LSA, but definitely not to the Netflix dataset. – ffriend Oct 05 '14 at 22:05
  • I agree it isn't very appropriate to mix NAs with ratings. However, the dataset is sparse; no user rates all movies. If we want to apply SVD to the dataset and exclude NA values, we would either be applying it to an empty set or narrowing the dataset down drastically to a subset of users and a subset of movies such that every movie is rated by every user. So NAs have to be assigned some value to perform SVD at all. In this paper (http://research.cs.queensu.ca/TechReports/Reports/2006-527.pdf) the authors treated NAs as no-opinion with a score of 3 (see page 11). Sounds like it worked for them. – rbk Oct 07 '14 at 00:10
  • Often people use the mean of all ratings for a specific film. E.g. if a movie got ratings (2, 4, 2) from 3 users, all other user ratings (which are NAs in the dataset) are assigned (2+4+2)/3 ≈ 2.67 (see the sketch after these comments). – ffriend Oct 07 '14 at 06:21
  • If a user bothered to rate an item, that means the user is somewhat interested in it. I assume that users don't access items at random, but do so according to their personal tastes. Missing data is interpreted as negative feedback. Why would it be more appropriate to assign the mean rating over all users? The mean would be an estimator only conditional on the user providing some rating, because that is what characterizes the sample space. It seems that the mean is appropriate for purposes different from mine, such as modeling the rating function under the assumption that a rating is given. – rbk Oct 07 '14 at 23:28
  • It's up to your assumptions. There are ways to assess performance of both models, so you can always check what assumption was closer to the truth. – ffriend Oct 08 '14 at 08:37
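
A minimal sketch of the two imputation choices debated in the comments above (per-movie mean vs. treating a missing rating as 0 / negative feedback), assuming a small dense ratings matrix with NaN for missing entries; all names and numbers are illustrative:

```python
import numpy as np

# Toy ratings matrix: rows = users, columns = movies, NaN = no rating given.
R = np.array([
    [2.0,    5.0,    np.nan],
    [4.0,    np.nan, 3.0],
    [2.0,    1.0,    np.nan],
    [np.nan, 4.0,    5.0],
])

# Option from the comments: fill each movie's missing entries with the mean
# of the ratings it did receive, e.g. (2 + 4 + 2) / 3 for the first movie.
movie_means = np.nanmean(R, axis=0)
R_mean_filled = np.where(np.isnan(R), movie_means, R)

# Option assumed in the question: treat a missing rating as 0,
# i.e. interpret "no rating" as negative feedback.
R_zero_filled = np.nan_to_num(R, nan=0.0)
```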