My retail dataset has 50,000 records with three numeric attributes and two categorical attributes, Time and ID. Both categorical attributes have more than 20,000 levels; sample values are 1/11/2011 11:54 and 1TD10051, respectively.

How do I run k-means on this dataset? Won't converting the categorical attributes to binary indicators give a very sparse dataset?

How should I proceed?

IgorS
joy

2 Answers

There are plenty of k-means variants for mixed datasets: k-modes, k-prototypes, etc.

It has been discussed already.
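
To make the idea behind these variants concrete, here is a minimal sketch of the kind of mixed dissimilarity k-prototypes uses: squared Euclidean distance on the numeric attributes plus a weighted count of mismatches on the categorical ones. The function name, the sample values, and the gamma weight are illustrative assumptions, not something taken from this thread.

```python
def mixed_dissimilarity(numeric_a, numeric_b, categorical_a, categorical_b, gamma=1.0):
    """Squared Euclidean distance on numeric parts plus gamma * categorical mismatches."""
    squared_euclidean = sum((a - b) ** 2 for a, b in zip(numeric_a, numeric_b))
    mismatches = sum(a != b for a, b in zip(categorical_a, categorical_b))
    return squared_euclidean + gamma * mismatches


# Hypothetical records: three standardized numeric values and two categorical values.
d = mixed_dissimilarity(
    [0.5, -1.0, 2.0], [0.0, -1.0, 1.0],
    ["Tue", "morning"], ["Sat", "morning"],
    gamma=0.5,
)
```

The gamma weight decides how much one categorical mismatch counts relative to a unit of squared numeric distance, so it only makes sense once the numeric attributes are on comparable scales.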

IgorS

Let's break this down...

You have 3 numerical attributes. Great... standardize them by subtracting each attribute's mean and dividing by its standard deviation. You should almost always standardize when you cluster in multiple dimensions, or your clustering won't make much sense: k-means relies on Euclidean distance, so an attribute measured on a large scale would otherwise dominate the distance simply because of its units.
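
As a rough illustration of that standardization step, here is a small sketch using only the Python standard library. The column names and values are made up for the example, and using the population standard deviation is one reasonable choice rather than the only one.

```python
from statistics import mean, pstdev


def standardize(column):
    """Return z-scores: (value - mean) / standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation (an assumption)
    if sigma == 0:
        # A constant column carries no information for clustering.
        return [0.0 for _ in column]
    return [(value - mu) / sigma for value in column]


# Hypothetical columns on very different scales.
quantity = [1, 2, 3, 4, 5]
unit_price = [100.0, 250.0, 400.0, 550.0, 700.0]

z_quantity = standardize(quantity)
z_price = standardize(unit_price)
```

After this step both columns have mean 0 and standard deviation 1, so neither dominates the distance just because of its units.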

You have 2 categorical attributes. One, however, is time, which doesn't seem very categorical. Figure out how to turn time into a numerical value; I suggest a Unix timestamp. Don't just blindly call it a categorical feature. Your CSV reader didn't know how to treat the colon and the slash, so it called the column a factor, but you can convert it easily and quickly.
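
A minimal sketch of that conversion in Python follows. Two things are assumptions here: that the strings are month-first (the sample value 1/11/2011 does not settle whether it is 11 January or 1 November), and that the times can be treated as UTC.

```python
from datetime import datetime, timezone

# Assumed format: month/day/year hour:minute.
# Use "%d/%m/%Y %H:%M" instead if the data turns out to be day-first.
TIME_FORMAT = "%m/%d/%Y %H:%M"


def to_unix_seconds(text):
    """Parse a timestamp string and return seconds since 1970-01-01 UTC."""
    parsed = datetime.strptime(text, TIME_FORMAT)
    # Assumption: interpret the naive timestamp as UTC so results are reproducible.
    return parsed.replace(tzinfo=timezone.utc).timestamp()


stamp = to_unix_seconds("1/11/2011 11:54")
```

The resulting number can be standardized and clustered like any other numeric attribute.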

Now I would suggest first clustering the 3 numerical attributes on their own. Then I would do some feature extraction from the time data. You should be able to extract day of week, week of month, day of month, etc. These could all be useful for spotting some sort of signal in your data. I would also suggest plotting your numerical data as a function of these extracted features, which will give you insight.
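
The feature extraction could look something like the sketch below. The month-first format is an assumption, and "week of month" is defined here in the simplest way (days 1 to 7 are week 1); other definitions are equally valid.

```python
from datetime import datetime

TIME_FORMAT = "%m/%d/%Y %H:%M"  # assumed month-first; adjust if day-first


def calendar_features(text):
    """Derive simple calendar features from one timestamp string."""
    parsed = datetime.strptime(text, TIME_FORMAT)
    return {
        "day_of_week": parsed.weekday(),             # Monday = 0 ... Sunday = 6
        "day_of_month": parsed.day,
        "week_of_month": (parsed.day - 1) // 7 + 1,  # days 1-7 -> 1, 8-14 -> 2, ...
        "hour": parsed.hour,
        "is_weekend": parsed.weekday() >= 5,
    }


features = calendar_features("1/11/2011 11:54")
```

Each of these features has only a handful of levels, which avoids the 20,000-level problem of treating the raw timestamp as a category.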

Also, think about the value that customer ID might or might not add to the problem. You might be able to remove the letters and it could tell you when they first shopped at the store, or it could just contain useless information that will cloud your clustering.
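
If you want to test that idea, a quick sketch of stripping the letters is below. Whether the remaining digits mean anything (for example, sign-up order) is purely an assumption you would have to check against the data.

```python
import re


def split_id(customer_id):
    """Separate the letters from the digits of an ID such as '1TD10051'."""
    letters = "".join(re.findall(r"[A-Za-z]", customer_id))
    digits = re.sub(r"\D", "", customer_id)  # drop everything that is not a digit
    number = int(digits) if digits else None
    return letters, number


letters, number = split_id("1TD10051")
```

Plotting the numeric part against the first purchase time per customer is one way to see whether the ID carries any ordering information.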

Finally, think about why you are clustering your data and what insights you want to recover. This will guide you in how to proceed. Do you want to separate predominantly weekend shoppers from predominantly weekday shoppers? Do you want to broadly classify shoppers as one of ten different types and try to ascertain the behavior of each of those types?
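
For the weekend-versus-weekday question, one possible starting point is a per-customer weekend share, sketched below. The records are invented for illustration (only the first ID and timestamp come from the question), and the month-first format is again an assumption.

```python
from collections import defaultdict
from datetime import datetime

TIME_FORMAT = "%m/%d/%Y %H:%M"  # assumed month-first

# Hypothetical (customer ID, purchase time) records.
records = [
    ("1TD10051", "1/11/2011 11:54"),  # Tuesday
    ("1TD10051", "1/15/2011 10:05"),  # Saturday
    ("1TD10052", "1/16/2011 16:30"),  # Sunday
    ("1TD10052", "1/22/2011 09:15"),  # Saturday
]


def weekend_share(rows):
    """Return, per customer, the fraction of purchases made on Saturday or Sunday."""
    counts = defaultdict(lambda: [0, 0])  # id -> [weekend purchases, all purchases]
    for customer_id, text in rows:
        is_weekend = datetime.strptime(text, TIME_FORMAT).weekday() >= 5
        counts[customer_id][0] += int(is_weekend)
        counts[customer_id][1] += 1
    return {cid: weekend / total for cid, (weekend, total) in counts.items()}


shares = weekend_share(records)
```

Aggregating to one row per customer like this also turns the ID from a 20,000-level category into a simple grouping key, and the resulting share is an ordinary numeric feature.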

Hope this helps!

AN6U5