
I am trying to do clustering with a bunch (24) of categorical features. I have done some research and found many people recommending something like K-Modes. I tried running K-Modes on my data, and the best run had a cost of 27069.0, which seems pretty high.
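
Roughly the kind of run I mean, for reference (a minimal sketch using the kmodes package; the cluster count and init settings here are illustrative, not tuned):

    import numpy as np
    from kmodes.kmodes import KModes

    # Toy stand-in for my real (n_samples, 24) array of categorical values
    X = np.array([["P", "A", "x"],
                  ["O", "B", "y"],
                  ["C", "A", "x"],
                  ["T", "B", "z"]])

    km = KModes(n_clusters=2, init="Huang", n_init=5, verbose=0)
    labels = km.fit_predict(X)
    print(km.cost_)  # this is the "cost" figure I am quoting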

Some of my features have only a few values, such as P, O, C, T, so I thought I could encode those. But others have many distinct values. Any tips on a clustering algorithm or some other approach? I would like to use Python.

EDIT: What about computing Gower distances on the data and then running K-Means on that?
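
Something like this is what I am picturing (a sketch; gower and scikit-learn-extra are third-party packages, and it uses k-medoids rather than k-means proper, since k-means needs feature vectors and cannot take a precomputed distance matrix):

    import pandas as pd
    import gower
    from sklearn_extra.cluster import KMedoids

    # Toy categorical frame standing in for my real data
    df = pd.DataFrame({"f1": ["P", "O", "C", "T"],
                       "f2": ["A", "A", "B", "B"]})

    # Pairwise Gower distances; handles categorical (and mixed) columns
    dm = gower.gower_matrix(df)

    # Cluster on the precomputed distance matrix
    labels = KMedoids(n_clusters=2, metric="precomputed",
                      random_state=0).fit_predict(dm)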

formicaman
  • https://datascience.stackexchange.com/questions/22/k-means-clustering-for-mixed-numeric-and-categorical-data Why does your cost seem high? There is no a priori meaning to the distance anyway. – Sean Owen Mar 16 '20 at 13:55

1 Answer


You can one-hot encode all your features first. You will then be faced with a sparse feature space. To resolve this, you can use an autoencoder to map those values into a lower-dimensional, denser space, and then run a clustering method such as k-means on the encoded representation.
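
A minimal sketch of that pipeline (the bottleneck width, epochs, and cluster count below are illustrative, not tuned):

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.cluster import KMeans
    from tensorflow import keras

    # Toy (n_samples, n_features) array of category labels
    X = np.array([["P", "A"], ["O", "B"], ["C", "A"], ["T", "B"]])

    # 1) One-hot encode; .toarray() densifies the sparse output for Keras
    X_hot = OneHotEncoder(handle_unknown="ignore").fit_transform(X).toarray()
    n = X_hot.shape[1]

    # 2) Autoencoder: compress the sparse one-hot space into a dense code
    inp = keras.Input(shape=(n,))
    code = keras.layers.Dense(8, activation="relu")(inp)     # bottleneck
    out = keras.layers.Dense(n, activation="sigmoid")(code)  # reconstruction
    autoencoder = keras.Model(inp, out)
    encoder = keras.Model(inp, code)
    autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
    autoencoder.fit(X_hot, X_hot, epochs=50, batch_size=16, verbose=0)

    # 3) Cluster in the learned low-dimensional space
    X_dense = encoder.predict(X_hot, verbose=0)
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(X_dense)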

OmG
  • Thanks! Any recommendations on how to do the auto-encoding part? – formicaman Mar 16 '20 at 18:48
  • What if I use OneHotEncoder and then my resulting matrix is not sparse? – formicaman Mar 18 '20 at 17:09
  • @formicaman As you described the data, it will be sparse after the encoding. For the autoencoding, use an autoencoder : ) – OmG Mar 18 '20 at 17:22
  • I used 'scipy.sparse.issparse(X_encoded)' and it returned 'False'. But I am looking at https://blog.keras.io/building-autoencoders-in-keras.html for auto-encoding. – formicaman Mar 18 '20 at 17:47
  • @formicaman Oops! You are on the wrong track! scipy.sparse.issparse checks the type of the variable, not whether the matrix is conceptually sparse! – OmG Mar 18 '20 at 18:08
  • For autoencoding, the idea is to have 3 layers: an input layer consisting of all of your features, then a layer with a smaller number of units, and then a third layer that tries to recreate the input. After training this model, you take the representation of your data in the 2nd layer, which is a "compressed" form. Honestly, that is overkill in my opinion; why not just do PCA or SVD if you are concerned with sparsity? – Corey Levinson Apr 15 '20 at 21:53
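
For completeness, the PCA/SVD alternative mentioned in the last comment is only a few lines; scikit-learn's TruncatedSVD works directly on the sparse one-hot matrix without densifying it (the component count below is illustrative):

    import numpy as np
    from sklearn.decomposition import TruncatedSVD
    from sklearn.preprocessing import OneHotEncoder

    X = np.array([["P", "A"], ["O", "B"], ["C", "A"], ["T", "B"]])  # toy data

    X_hot = OneHotEncoder(handle_unknown="ignore").fit_transform(X)  # stays sparse
    # n_components must be smaller than the number of one-hot columns
    X_reduced = TruncatedSVD(n_components=2).fit_transform(X_hot)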