Kmeans on mixed dataset with high level for categ

Question

My retail dataset has 50,000 records with three numeric attributes and two categorical attributes, Time and ID. Each categorical attribute has more than 20,000 levels; example values are 1/11/2011 11:54 and 1TD10051 respectively.

How do I run k-means on this dataset? Won't converting the categorical attributes to binary indicators give a very sparse dataset?

How should I proceed?

Is time really a categorical attribute ? Perhaps day of the week or month might be? — image_doctor, Jun 30 '15 at 10:46
Value under time attribute appearing as 1/11/2011 11.54 and showing as Factor/W 20823 labels — joy, Jun 30 '15 at 12:03
My meaning was to suggest that interpreting time as a category, is probably not useful in a machine learning context. — image_doctor, Jun 30 '15 at 12:37

score 2 · Answer 1 · edited Apr 13 '17 at 12:50

There are several variations of k-means for mixed datasets, such as k-modes (for categorical attributes) and k-prototypes (for a mix of numeric and categorical attributes).
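To make the idea concrete, here is a minimal sketch of the kind of mixed dissimilarity that k-prototypes uses: squared Euclidean distance on the numeric attributes plus a weighted count of categorical mismatches. The function name, the weight `gamma`, and the sample records are assumptions for illustration only, not part of any particular library.

```python
def mixed_dissimilarity(a_num, a_cat, b_num, b_cat, gamma=1.0):
    """k-prototypes style dissimilarity between two records.

    a_num, b_num: numeric attribute values (ideally standardized first)
    a_cat, b_cat: categorical attribute values
    gamma: hypothetical weight balancing the categorical part
    """
    # Squared Euclidean distance over the numeric attributes
    numeric_part = sum((x - y) ** 2 for x, y in zip(a_num, b_num))
    # Simple matching: count the categorical attributes that differ
    categorical_part = sum(1 for x, y in zip(a_cat, b_cat) if x != y)
    return numeric_part + gamma * categorical_part


# Made-up records: two numeric values and one categorical value each
d = mixed_dissimilarity([0.0, 1.0], ["Mon"], [1.0, 1.0], ["Tue"], gamma=0.5)
```

Note that when a categorical attribute has tens of thousands of nearly unique levels, almost every pair of records mismatches on it, so this term is likely to add close to a constant and little clustering signal.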

It has been discussed already.

Community

answered Jun 30 '15 at 06:44

IgorS

score 1 · Answer 2 · edited May 23 '17 at 12:38

Let's break this down...

You have 3 numerical attributes. Great: standardize each one by subtracting its mean and dividing by its standard deviation. When attributes are measured on different scales, you generally need to standardize before clustering in multiple dimensions; otherwise the Euclidean distance that k-means relies on is dominated by whichever attribute has the largest numeric range, and your clustering won't make much sense.
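A minimal sketch of that standardization step, using only the Python standard library. The three columns and their values are made up for illustration; in practice they would be your three numeric attributes.

```python
from statistics import mean, pstdev


def standardize(column):
    """Return z-scores: subtract the mean, divide by the standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation
    if sigma == 0:
        # A constant column carries no information; map it to zeros
        return [0.0 for _ in column]
    return [(v - mu) / sigma for v in column]


# Hypothetical records with three numeric attributes on very different scales
rows = [(2.0, 10.0, 100.0), (4.0, 20.0, 300.0), (6.0, 30.0, 500.0)]

columns = list(zip(*rows))                      # one tuple per attribute
scaled_columns = [standardize(col) for col in columns]
scaled_rows = list(zip(*scaled_columns))        # back to one tuple per record
```

After this step every attribute has mean 0 and standard deviation 1, so no single attribute dominates the distance computation.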

You have 2 categorical attributes. One, however, is time, which doesn't seem very categorical. Figure out how to turn time into a numerical value; I suggest using a Unix timestamp. Don't just blindly call it a categorical feature. Your CSV reader didn't know how to treat the colon and the slash, so it called the column a factor, but you can convert it pretty easily and quickly.
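As a rough sketch of that conversion: the format string below assumes the values are day/month/year (so 1/11/2011 is 1 November 2011) and that the times are in UTC. Neither is stated in the question, so adjust both if your data differs, for example to "%m/%d/%Y %H:%M" for month-first dates.

```python
from datetime import datetime, timezone


def to_unix_timestamp(text, fmt="%d/%m/%Y %H:%M"):
    """Parse a timestamp string and return seconds since the Unix epoch.

    Assumes day-first dates and UTC; both are assumptions about the data.
    """
    parsed = datetime.strptime(text, fmt)
    return parsed.replace(tzinfo=timezone.utc).timestamp()


ts = to_unix_timestamp("1/11/2011 11:54")
```

The result is a plain number, so it can be standardized and used like any other numeric attribute.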

Next, I would suggest first clustering the 3 numerical attributes on their own. Then I would do some feature extraction from the time data. You should be able to extract day of week, week of month, day of month, etc. Any of these could help reveal some sort of signal in your data. I would also suggest plotting your numerical data as a function of these extracted features, which will give you insight.
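The feature extraction could look something like the sketch below. It again assumes day-first dates, and "week of month" is defined here simply as days 1 to 7 being week 1, days 8 to 14 week 2, and so on; that definition and the feature names are choices made for illustration.

```python
from datetime import datetime


def time_features(text, fmt="%d/%m/%Y %H:%M"):
    """Derive low-cardinality features from a timestamp string."""
    dt = datetime.strptime(text, fmt)
    return {
        "day_of_week": dt.weekday(),             # 0 = Monday ... 6 = Sunday
        "day_of_month": dt.day,
        "week_of_month": (dt.day - 1) // 7 + 1,  # days 1-7 -> 1, 8-14 -> 2, ...
        "month": dt.month,
        "hour": dt.hour,
        "is_weekend": dt.weekday() >= 5,         # Saturday or Sunday
    }


features = time_features("1/11/2011 11:54")
```

Each of these features has at most a few dozen distinct values, in contrast to the 20,000+ levels of the raw timestamp.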

Also, think about the value that customer ID might or might not add to the problem. You might be able to remove the letters, and the remaining number could tell you when the customer first shopped at the store; or the ID could just contain useless information that will cloud your clustering.
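If you want to try this, here is a small sketch that separates the letters from the digits in an ID such as 1TD10051. Whether either part carries any meaning is unknown and depends on how the IDs were issued; the function name is hypothetical.

```python
import re


def split_id(raw_id):
    """Split an ID into its letters and its digits (as an integer, or None)."""
    letters = "".join(re.findall(r"[A-Za-z]+", raw_id))
    digits = re.sub(r"\D", "", raw_id)  # drop every non-digit character
    number = int(digits) if digits else None
    return letters, number


prefix, number = split_id("1TD10051")
```

Plotting the numeric part against the timestamp is one quick way to check whether IDs were assigned in time order before deciding to keep or drop the attribute.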

Finally, think about why you are clustering your data and what insights you want to recover. This will guide you in how to proceed. Do you want to separate predominantly weekend shoppers from predominantly weekday shoppers? Do you want to broadly classify shoppers as one of ten different types and try to ascertain the behavior of each of those types?

Hope this helps!
