
After reading a bit about how categorical data can be handled in clustering, I came to the conclusion that most posts do not distinguish between nominal scale data (e.g. colour: red, green, blue) and ordinal scale data (e.g. size: small, medium, large). However, on the ordinal scale the distances between the items do make sense, even if they are not necessarily the same between all items.

My questions:

  1. Can I simply convert ordinal scale data to numeric scale without causing much trouble in clustering? I think yes, for the above reasons, but I would be pleased if you could confirm.
  2. Nominal scale data, where the distances between the items make no sense, would be harder to capture. The easiest way I have found, if there are not too many items on the scale, is to break down the scale and add a variable for each item. E.g. originally we have colour: red, green, blue, and we make the variables colour_red, colour_green, and colour_blue, each of which takes the value 0 or 1. See the post from Jordan A on K-Means clustering for mixed numeric and categorical data. It seems to me a valid numeric scale of type ratio, as it has a non-arbitrary zero value expressing the complete absence of something and 1 expressing its presence. Do you have experience with this in clustering? Is this a valid approach?
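
To make the dummy-variable idea in point 2 concrete, here is a minimal sketch in plain Python. The helper names `one_hot` and `euclidean` are made up for illustration and do not come from any library. It assumes the category list is known in advance, and it shows that after this encoding every pair of distinct colours ends up at the same Euclidean distance, so no ordering is implied.

```python
import math

def one_hot(value, categories):
    """Return a 0/1 indicator list with one slot per category (hypothetical helper)."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length numeric lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

colours = ["red", "green", "blue"]

# colour_red, colour_green, colour_blue indicators for each colour
encoded = {c: one_hot(c, colours) for c in colours}

# Distances between any two different colours are identical
d_red_green = euclidean(encoded["red"], encoded["green"])
d_red_blue = euclidean(encoded["red"], encoded["blue"])
d_green_blue = euclidean(encoded["green"], encoded["blue"])
```

The equal pairwise distances are what one would want for a nominal variable: red is no "closer" to green than to blue.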

Otherwise I know I should use e.g. kproto for mixed data sets or kmodes for plain nominal data sets. Thank you for your responses.

Tamas

1 Answer

  1. Yes, it causes almost no trouble. The only caution is that integer codes impose regular spacing between the levels when the true spacing may be irregular. For example, with shirt sizes coded 1, 2 and 3, size "3" is not three times bigger than size "1".

  2. Creating dummy variables is a valid approach whenever your dataset contains categorical variables, not only for clustering but for almost any model you could build.
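
The spacing caution in point 1 can be illustrated with a small sketch in plain Python. The rank codes 1, 2, 3 follow the shirt-size example above; the centimetre values in `size_cm` are invented purely for illustration and are not real garment measurements. The helper `gap` is likewise hypothetical.

```python
# Integer rank codes: these implicitly assume equal steps between levels.
size_rank = {"small": 1, "medium": 2, "large": 3}

# Hypothetical alternative coding with irregular spacing
# (made-up widths in cm, only to illustrate the point).
size_cm = {"small": 46, "medium": 51, "large": 61}

def gap(codes, a, b):
    """Absolute difference between the numeric codes of two levels."""
    return abs(codes[a] - codes[b])

# With rank codes, small->medium and medium->large look equally far apart.
equal_steps = gap(size_rank, "small", "medium") == gap(size_rank, "medium", "large")

# With the invented cm codes, medium->large is twice the small->medium gap.
ratio = gap(size_cm, "medium", "large") / gap(size_cm, "small", "medium")
```

A distance-based clustering method sees only these numeric gaps, so if the real spacing is known it may be worth encoding it directly rather than relying on ranks.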