
I have a labelled dataset that I want to cluster with scikit-learn k-means, and I am trying to work out how to evaluate the result. The label column is named "Classes".

I don't want the labels to interfere with the clustering so I drop the numeric label (range 1-7) and run fit_transform().

In the result, the clustering has assigned its own cluster number to each row, exposed in the labels_ attribute.

So now I have an original dataset with the labels, and a clustered dataset with cluster numbers (range 0-6) attached to each row. But the cluster numbers are not mappable to the original labels. E.g. "Classes 6" is cluster 2, "Classes 5" is cluster 4, etc.

How can you calculate cluster purity when the cluster numbers returned in labels_ have no direct mapping to the original labels in the training data?

Ethan
Bryon
  • Wait, but your labels do not figure into this at all. k-means is not predicting a label to begin with. How are you expecting to compare cluster assignments to labels? Are you trying to figure out whether each cluster generally contains a single label? – Sean Owen Apr 18 '22 at 02:19
  • Hi Steve. Yes, you are right: they don't factor into the clustering. I think I worked it out. This works because fit() does not change the order of the input data. Take the cluster assignments from the labels_ attribute and pair them with the original labels (hereafter called originals), giving an n×2 DataFrame. Then, for each cluster, find the most frequent original and count how many rows have it. Sum those counts and divide by the total number of rows. If every cluster contains only a single original, you get 1. – Bryon Apr 18 '22 at 02:28
  • I’ll add my code here shortly – Bryon Apr 18 '22 at 02:30
  • I don't follow what you're trying to evaluate or what this metric would measure. You can simply calculate the 'entropy' of the labels in the clusters as a metric of purity, but, not clear if that's what you mean – Sean Owen Apr 19 '22 at 16:38
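
The procedure described in the comments above (pair each row's cluster number with its original label, take the majority label per cluster, and divide the summed majority counts by the number of rows) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `cluster_purity` and the toy data are hypothetical, and it relies on the cluster assignments being in the same row order as the original labels.

```python
from collections import Counter, defaultdict


def cluster_purity(true_labels, cluster_ids):
    """Return (purity, majority_label_per_cluster).

    Hypothetical helper: assumes both sequences are in the same row order.
    """
    if len(true_labels) != len(cluster_ids):
        raise ValueError("true_labels and cluster_ids must have equal length")
    if len(true_labels) == 0:
        raise ValueError("inputs must not be empty")

    # For each cluster, count how often each original label occurs in it.
    counts = defaultdict(Counter)
    for label, cluster in zip(true_labels, cluster_ids):
        counts[cluster][label] += 1

    # The most frequent original label in each cluster.
    majority = {c: ctr.most_common(1)[0][0] for c, ctr in counts.items()}

    # Purity: rows that carry their cluster's majority label, over all rows.
    matched = sum(ctr.most_common(1)[0][1] for ctr in counts.values())
    return matched / len(true_labels), majority


# Toy data (made up for illustration): original "Classes" values and
# the cluster numbers that a clustering run might have assigned.
classes = [6, 6, 6, 5, 5, 1, 1, 1]
clusters = [2, 2, 4, 4, 4, 0, 0, 0]

purity, majority = cluster_purity(classes, clusters)
print(purity)    # 0.875
print(majority)  # {2: 6, 4: 5, 0: 1}
```

With scikit-learn, the two arguments would presumably be the dropped "Classes" column and the fitted model's labels_ attribute. Note that purity by itself rewards having many small clusters, so it is most meaningful when the number of clusters is fixed in advance, as it is here with seven.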

0 Answers