I have a labelled dataset that I want to cluster with scikit-learn k-means. The label column is named "Classes".
I don't want the labels to interfere with the clustering, so I drop the numeric label column (range 1-7) and run fit_transform().
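For reference, here is a minimal sketch of that setup. It is not my real code: the toy data from make_blobs, the feature column names f0 to f3, and the n_init and random_state settings are all assumptions made only so the example runs on its own. Only the "Classes" column name, the 1-7 label range, and the call to fit_transform() match what I described.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy stand-in for the real data (assumption): 7 groups, 4 numeric features
X, y = make_blobs(n_samples=210, centers=7, n_features=4, random_state=0)
df = pd.DataFrame(X, columns=["f0", "f1", "f2", "f3"])
df["Classes"] = y + 1  # numeric labels in the range 1-7

# Drop the label column so it cannot influence the clustering
features = df.drop(columns=["Classes"])

kmeans = KMeans(n_clusters=7, n_init=10, random_state=0)
# fit_transform() returns each row's distance to every cluster centre
distances = kmeans.fit_transform(features)
# The cluster index assigned to each row (0-6) is stored in labels_
clusters = kmeans.labels_
```

Note that fit_transform() itself returns distances to the cluster centres; the cluster assignments come from the labels_ attribute after fitting.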
When I get the result, the clustering has assigned its own labels_ value to each row, indicating the cluster that row belongs to.
So now I have an original dataset with the labels, and a clustered dataset with cluster numbers (range 0-6) attached to each row. But the cluster numbers are not mappable to the original labels. E.g. "Classes 6" is cluster 2, "Classes 5" is cluster 4, etc.
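To make the mismatch concrete, here is a small sketch that cross-tabulates cluster numbers against the original labels. The values are made up for illustration (they are not my real output), and the variable names classes, clusters, and table are hypothetical.

```python
import pandas as pd

# Made-up values, only to show the shape of the problem
classes = pd.Series([6, 6, 6, 5, 5, 5, 1, 1], name="Classes")
clusters = pd.Series([2, 2, 4, 4, 4, 4, 0, 0], name="cluster")

# Rows are cluster numbers, columns are original labels, cells are row counts
table = pd.crosstab(clusters, classes)
print(table)
```

In this toy table most of "Classes 6" sits in cluster 2 and all of "Classes 5" sits in cluster 4, so the cluster numbers carry no direct relationship to the label values.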
How can you calculate cluster purity when the cluster numbers returned in labels_ have no way to map to the original labels in the training data?