Quick start using Python and sklearn KMeans?

Question

I started tinkering with sklearn KMeans last night out of curiosity, with the goal of clustering users into groups to see what kinds of user groups I can derive. I am lost when it comes to plotting the results, as most examples have nice (x, y) coordinates. For example, the iris data set has petal width and petal length. From my experimentation, I don't seem to have any pair of columns that plots nicely. Is this assumption correct? Does anyone have tips, pointers, or learning resources they could offer?

import pandas as pd
import pprint
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.preprocessing import LabelEncoder
from sklearn.cluster import KMeans
from collections import defaultdict
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')

I normalized the data because it had a wide variance. Again, I am not sure whether this is the right thing to do:

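# `data` is assumed to hold the label-encoded columns shown in the sample data below.
# normalize(..., axis=0) rescales each column (feature) to unit L2 norm.
X = np.array(normalize(data, axis=0, copy=False))

kmeans = KMeans(n_clusters=3)
pred = kmeans.fit_predict(X)
labels = kmeans.labels_
cent = kmeans.cluster_centers_

# Column 4 is dept_name and column 6 is UserCNT.
plt.scatter(X[:, 4], X[:, 6])
plt.scatter(cent[:, 4], cent[:, 6], marker="x", s=150, linewidths=5, zorder=10)
plt.ylabel('Count')
plt.xlabel('Department')
plt.show()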

Any pointers are appreciated, I will include sample data below. Thanks!

Sample Data:

emp_type,title,work_country,director_userid,dept_name,business_unit_name,UserCNT
0,9,7,29,20,2,2
0,13,7,8,14,6,5
0,4,3,56,29,8,3
0,15,3,36,32,2,3
0,4,3,32,16,2,0
0,4,1,40,13,6,0
0,4,3,62,12,4,1
0,13,7,61,5,13,4
2,1,3,70,35,15,2
0,4,3,64,4,13,0
2,1,3,43,43,2,0
0,13,7,50,17,16,0
2,1,3,31,26,2,1
2,1,3,65,58,17,0
0,4,3,57,63,12,0
2,1,6,7,45,18,2
2,1,3,43,42,2,0
1,1,7,65,58,17,0
2,1,3,32,16,2,0
2,1,3,29,20,2,0
0,4,0,50,17,16,2
0,5,3,20,23,9,0
0,9,3,32,16,2,2
0,4,3,5,51,12,0
2,1,7,51,53,7,0
0,13,7,37,55,12,0
2,1,4,19,62,13,0

Example clustering using entire data set:

Look at this, In[35], at the end - http://nbviewer.jupyter.org/github/rasbt/python-machine-learning-book/blob/master/code/ch03/ch03.ipynb. It might help. — HonzaB, Oct 04 '16 at 17:47
@HonzaB, thanks! Are you telling me to try using k-nearest neighbors? :)
I added a chart for what my "Clusters" currently look like — anshanno, Oct 04 '16 at 18:05
Not necessarily. Both algorithms are quite similar. The thing is that the iris dataset in fact contains 4 variables. So, as you can see, authors usually choose two attributes (petal width and length), which allows them to plot it on a 2D scatter plot. So I'd suggest doing the same. Or you can try silhouette analysis. — HonzaB, Oct 04 '16 at 18:20
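
The silhouette analysis suggested in the comment above could be sketched roughly as follows. This is only an illustration, not the commenter's code: it assumes the first twelve rows of the sample data from the question, reuses the asker's column-wise `normalize`, and the range of `k` values, `n_init`, and `random_state` are arbitrary choices made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import normalize

# First twelve rows of the sample data from the question
# (emp_type, title, work_country, director_userid, dept_name,
#  business_unit_name, UserCNT).
data = np.array([
    [0, 9, 7, 29, 20, 2, 2],
    [0, 13, 7, 8, 14, 6, 5],
    [0, 4, 3, 56, 29, 8, 3],
    [0, 15, 3, 36, 32, 2, 3],
    [0, 4, 3, 32, 16, 2, 0],
    [0, 4, 1, 40, 13, 6, 0],
    [0, 4, 3, 62, 12, 4, 1],
    [0, 13, 7, 61, 5, 13, 4],
    [2, 1, 3, 70, 35, 15, 2],
    [0, 4, 3, 64, 4, 13, 0],
    [2, 1, 3, 43, 43, 2, 0],
    [0, 13, 7, 50, 17, 16, 0],
], dtype=float)

# Same preprocessing as in the question: scale each column to unit L2 norm.
X = normalize(data, axis=0)

# Fit KMeans for several cluster counts (an arbitrary range chosen for
# this sketch) and record the mean silhouette coefficient for each.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: mean silhouette = {score:.3f}")
print("highest mean silhouette at k =", best_k)
```

The silhouette coefficient lies between -1 and 1, and higher values indicate tighter, better-separated clusters. On twelve rows the numbers mean little; the same loop would be run on the full data set.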

score 0 · Accepted Answer · edited May 23 '17 at 12:38

This has been answered in other places e.g. here.

You could run Principal Component Analysis (or another dimensionality reduction technique) and plot the clusters on the first two principal components.
You could plot the results for two variables at a time.
You could encode a third or fourth variable using standard visualization techniques such as color coding, symbols, or faceting.
There are ways to visualize the quality of the fit, e.g. silhouette analysis, or the elbow test for choosing the number of clusters.
Have a quick look at this link
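
As a rough sketch of the first suggestion, the clusters can be fitted on all columns and then projected onto the first two principal components for plotting. The snippet below is an illustration under stated assumptions: it uses the first twelve rows of the sample data from the question, keeps the asker's `normalize` step and `n_clusters=3`, and the helper name `plot_clusters` is made up for this example.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize

# First twelve rows of the sample data from the question.
data = np.array([
    [0, 9, 7, 29, 20, 2, 2],
    [0, 13, 7, 8, 14, 6, 5],
    [0, 4, 3, 56, 29, 8, 3],
    [0, 15, 3, 36, 32, 2, 3],
    [0, 4, 3, 32, 16, 2, 0],
    [0, 4, 1, 40, 13, 6, 0],
    [0, 4, 3, 62, 12, 4, 1],
    [0, 13, 7, 61, 5, 13, 4],
    [2, 1, 3, 70, 35, 15, 2],
    [0, 4, 3, 64, 4, 13, 0],
    [2, 1, 3, 43, 43, 2, 0],
    [0, 13, 7, 50, 17, 16, 0],
], dtype=float)

# Cluster in the full 7-dimensional space, as in the question.
X = normalize(data, axis=0)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Project the points and the cluster centers onto the first two
# principal components, purely for visualization.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
centers_2d = pca.transform(kmeans.cluster_centers_)


def plot_clusters(points, point_labels, centers):
    """Scatter the 2D points colored by cluster, with centers as crosses."""
    fig, ax = plt.subplots()
    ax.scatter(points[:, 0], points[:, 1], c=point_labels, cmap="viridis")
    ax.scatter(centers[:, 0], centers[:, 1], marker="x", s=150,
               linewidths=3, c="black", zorder=10)
    ax.set_xlabel("PC 1")
    ax.set_ylabel("PC 2")
    return ax


if __name__ == "__main__":
    plot_clusters(X_2d, labels, centers_2d)
    plt.show()
```

Coloring the points by `labels` is what makes the groups visible; `pca.explained_variance_ratio_` indicates how much of the variance the two plotted axes retain.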

answered Oct 05 '16 at 03:30 by oW_ · edited May 23 '17 at 12:38 by Community

Thanks, I will look into some of the dimensionality reduction techniques.
Does text data just not plot very pretty?
– anshanno Oct 05 '16 at 12:21
I don't know what your data represents but clustering works by calculating distances. Make sure that it actually makes sense to compute distances for your variables e.g. if business_unit_name is just a hash assigning "random" numbers to business units, then distances between those numbers might not be very meaningful (is the distance between business unit 1 and 9 really greater than the one between 1 and 2?) – oW_ Oct 05 '16 at 14:16
The data represents each employee and the number of times they access a particular system. I wanted to see if I could make nice, neat clusters, but after your comment I am thinking it's going to be a lot more work. The distance between any two business_unit_name values should be equal, whether they are 1 and 9 or 1 and 2, and the same goes for all other attributes aside from count. – anshanno Oct 05 '16 at 15:03
maybe have a quick look at this link – oW_ Oct 05 '16 at 16:21
This is fantastic, exactly what I was looking for! – anshanno Oct 05 '16 at 16:42
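
One common way to act on the point made in the comments, that integer codes for categories imply distances that do not exist, is to one-hot encode the categorical columns so that any two different categories are equally far apart. The sketch below is not taken from the linked material (the link is not preserved here); it assumes the first six columns of the sample data are categorical codes and that UserCNT is the only numeric column, and the use of `StandardScaler` on the count is an extra choice made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# First twelve rows of the sample data from the question.
data = np.array([
    [0, 9, 7, 29, 20, 2, 2],
    [0, 13, 7, 8, 14, 6, 5],
    [0, 4, 3, 56, 29, 8, 3],
    [0, 15, 3, 36, 32, 2, 3],
    [0, 4, 3, 32, 16, 2, 0],
    [0, 4, 1, 40, 13, 6, 0],
    [0, 4, 3, 62, 12, 4, 1],
    [0, 13, 7, 61, 5, 13, 4],
    [2, 1, 3, 70, 35, 15, 2],
    [0, 4, 3, 64, 4, 13, 0],
    [2, 1, 3, 43, 43, 2, 0],
    [0, 13, 7, 50, 17, 16, 0],
])

# Assumption: columns 0-5 are category codes, column 6 (UserCNT) is a count.
categorical = data[:, :6]
counts = data[:, [6]].astype(float)

# One-hot encode every categorical column: each category becomes its own
# 0/1 column, so no ordering between categories is implied.
X_cat = OneHotEncoder().fit_transform(categorical).toarray()
counts_scaled = StandardScaler().fit_transform(counts)
X = np.hstack([X_cat, counts_scaled])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster labels:", labels)

# Check on business_unit_name alone: rows 0, 1, 2 hold units 2, 6, 8.
bu = OneHotEncoder().fit_transform(data[:, [5]]).toarray()
d_2_vs_6 = np.linalg.norm(bu[0] - bu[1])
d_2_vs_8 = np.linalg.norm(bu[0] - bu[2])
print("distance unit 2 vs 6:", d_2_vs_6)
print("distance unit 2 vs 8:", d_2_vs_8)
```

After encoding, business units 2 and 6 are exactly as far apart as 2 and 8, which matches the asker's statement that all categories should be equidistant. The trade-off is a much wider, sparser feature matrix when columns such as director_userid have many distinct values.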
