Clustering, Mixed Data Set with Ordinal and Nominal Scale Data

Question

After reading a bit about how categorical data can be handled in clustering, I came to the conclusion that most posts do not distinguish between nominal scale data, e.g. colour: red, green, blue, and ordinal scale data, e.g. size: small, medium, large. However, the distances between the items do make sense on an ordinal scale, even if they are not necessarily the same between all items.

My questions:

Can I simply convert ordinal scale data to a numeric scale without causing much trouble in clustering? I think yes, for the above reasons, but I would be pleased if you could confirm.
Nominal scale data, where the distances between the items make no sense, would be harder to capture. The easiest way I have found, if there are not too many items on the scale, is to break down the scale and add a variable for each item. E.g. originally we have colour: red, green, blue, and we make the variables colour_red, colour_green, and colour_blue, each of which can take the value 0 or 1. See the post from Jordan A on K-Means clustering for mixed numeric and categorical data. It seems to me a valid numeric scale of type ratio, as it has a non-arbitrary zero value expressing the complete absence of something and 1 expressing the presence of something. Do you have experience with this in clustering? Is this a valid approach?
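To make the dummy-variable idea concrete, here is a minimal sketch in plain Python. The helper name `one_hot` and the variable names are made up for illustration and do not come from any library. It also shows one property of this encoding: every pair of distinct colours ends up the same Euclidean distance apart, so no artificial order is introduced.

```python
from math import dist

def one_hot(value, categories):
    """Return a 0/1 indicator list with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]

# Column order corresponds to colour_red, colour_green, colour_blue.
colours = ["red", "green", "blue"]
encoded = {c: one_hot(c, colours) for c in colours}

# Euclidean distances between distinct colours are all equal.
d_red_green = dist(encoded["red"], encoded["green"])
d_red_blue = dist(encoded["red"], encoded["blue"])
d_green_blue = dist(encoded["green"], encoded["blue"])
```

Whether equal distances between all categories is the behaviour you want depends on the data; the sketch only illustrates what the encoding does.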

Otherwise I know I should use e.g. kproto (Kproto) for mixed data sets or kmodes (Kmodes) for plain, nominal data sets. Thank you for your responses.

score 1 · Accepted Answer · answered May 15 '19 at 12:44

Yes, it causes almost no trouble. The one thing to watch for is that integer codes imply evenly spaced levels, while the real spacing may be irregular. For example, with shirt sizes coded 1, 2 and 3, size "3" is not three times bigger than size "1".
Creating dummy variables is a perfectly valid approach when you have categorical variables in your dataset, not only for clustering but for almost every model you could construct.
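As a rough sketch of the caution about irregular ordinality, the snippet below encodes the same sizes twice: once with evenly spaced integer codes and once with a hand-chosen irregular mapping. The function name `encode_ordinal` and the irregular scores are hypothetical and invented for illustration; in practice the scores would have to come from domain knowledge.

```python
def encode_ordinal(values, level_scores):
    """Map each ordinal label to the numeric score assigned to its level."""
    return [level_scores[v] for v in values]

sizes = ["small", "large", "medium", "small"]

# Regular coding: assumes the gap small->medium equals the gap medium->large.
regular = {"small": 1, "medium": 2, "large": 3}

# Irregular coding (made-up numbers): "large" sits much further from "medium".
irregular = {"small": 1.0, "medium": 1.5, "large": 3.0}

sizes_regular = encode_ordinal(sizes, regular)
sizes_irregular = encode_ordinal(sizes, irregular)

# Gap between adjacent levels under each coding.
gaps_regular = (regular["medium"] - regular["small"],
                regular["large"] - regular["medium"])
gaps_irregular = (irregular["medium"] - irregular["small"],
                  irregular["large"] - irregular["medium"])
```

Both codings preserve the order of the levels; they differ only in how far apart a distance-based clustering method will see them.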
