7

I am clustering data with numeric and categorical variables. To process the categorical variables for the cluster model, I create dummy variables. However, I feel like this results in a higher importance for these dummy variables because multiple dummy variables represent one categorical variable.

For example, I have a categorical variable Airport that will result in multiple dummy variables: LAX, JFK, MIA and BOS. Now suppose I also have a numeric Temperature variable. I also scale all variables to be between 0 and 1. Now my Airport variable seems to be 4 times more important than the Temperature variable, and the clusters will be mostly based on the Airport variable.

My problem is that I want all variables to have the same importance. Is there a way to do this? I was thinking of scaling the variables in a different way but I don't know how to scale them in order to give them the same importance.

Eva

3 Answers

8

You cannot really use k-means clustering if your data contains categorical variables, since k-means uses Euclidean distance, which does not make much sense for categorical variables. Check out the answers to this similar question.

You can use the following rules for clustering with k-means or one of its derivatives:

If your data contains only metric variables:

Scale the data and use k-means (R) (Python).

If your data contains only categorical variables:

Use k-modes (R) (Python).

If your data contains categorical and metric variables:

Scale the metric variables and use k-prototypes (R) (Python).
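To make the mixed case more concrete, here is a minimal sketch of the kind of dissimilarity k-prototypes is built on: squared Euclidean distance on the numeric part plus a weighted count of mismatches on the categorical part. The function name `mixed_dissimilarity` and the example values are assumptions for illustration, not the implementation of any of the linked packages. The weight `gamma` controls how much a categorical mismatch counts relative to the numeric variables.

```python
def mixed_dissimilarity(a_num, a_cat, b_num, b_cat, gamma=1.0):
    """Squared Euclidean distance on the numeric values plus
    gamma times the number of categorical mismatches."""
    numeric = sum((x - y) ** 2 for x, y in zip(a_num, b_num))
    mismatches = sum(1 for x, y in zip(a_cat, b_cat) if x != y)
    return numeric + gamma * mismatches


# Two observations with a scaled Temperature and an Airport (hypothetical values).
d_same_airport = mixed_dissimilarity([0.25], ["LAX"], [0.75], ["LAX"])
d_diff_airport = mixed_dissimilarity([0.25], ["LAX"], [0.75], ["JFK"])

print(d_same_airport)  # numeric part only
print(d_diff_airport)  # numeric part plus one mismatch weighted by gamma
```

With this formulation a categorical variable contributes one term no matter how many levels it has, so no dummy variables are needed.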

georg-un
3

Clearly the objective function uses a sum over the features.

So if you want to increase the importance of a feature, scale it accordingly. If you scale a feature by 2, its squared differences grow by a factor of 4, so you have increased its weight.
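As a rough illustration of this scaling argument, the sketch below uses the Airport dummies from the question. The helper names `one_hot` and `sq_dist` are hypothetical. Two different airports differ in exactly two dummy columns, so their squared distance is 2 regardless of the number of airports. Multiplying each dummy by 1/sqrt(2) brings that down to about 1, the same maximum as a numeric variable scaled to [0, 1].

```python
import math

AIRPORTS = ["LAX", "JFK", "MIA", "BOS"]


def one_hot(airport, weight=1.0):
    """Dummy-encode an airport, multiplying the active column by `weight`."""
    return [weight if airport == name else 0.0 for name in AIRPORTS]


def sq_dist(u, v):
    """Squared Euclidean distance, the per-point term in the k-means objective."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Unweighted dummies: two columns differ, each by 1.
raw = sq_dist(one_hot("LAX"), one_hot("JFK"))

# Dummies scaled by 1/sqrt(2): the same two columns differ, each by 1/sqrt(2).
w = 1 / math.sqrt(2)
scaled = sq_dist(one_hot("LAX", w), one_hot("JFK", w))

print(raw, scaled)
```

Whether equal maximum contributions amount to "the same importance" is a judgment call, and the caveat about one-hot variables in k-means still applies.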

However, I would just not use k-means for one-hot variables. The mean is meant for continuous variables, and minimizing the sum of squares on a one-hot variable has weird semantics.

Has QUIT--Anony-Mousse
3

You cannot use the k-means clustering algorithm if your data contains categorical variables; k-modes is suitable for clustering categorical data. However, there are several algorithms for clustering mixed data, which are variations or modifications of the basic ones. Please check the following paper:

"Survey of State-of-the-Art Mixed Data Clustering Algorithms", Amir Ahmad and Shehroz S. Khan, 2019.