KMeans for crime patterns

Question

I'm trying to find the patterns in the crime records I have in a database. I thought clustering would be a way to do it.

This is my (cooked up) dataset:

age,nationality,country_of_birth,place_of_birth,no_of_checkedinbaggage,noofcabinbaggage,no_of_co_passengers,watchlist
34,GBR,GBR,London,2,1,0,Drug Trafficker
32,IND,IND,Delhi,2,1,0,Human Trafficker
31,USA,USA,Tampa,2,1,0,Arms Dealer
.....

Basically, I want to identify clusters of watchlist entries and see whether they show a pattern. Based on the clusters, I'd also like to make predictions for future data.

Is clustering (K-Means) the right choice? Also, do all the variables have to be numeric? If so, I'm not sure how to encode them as numbers. Thoughts?

score 1 · Answer 1 · answered Jan 19 '18 at 09:18

Welcome to the site!

As you know, K-Means is an unsupervised learning algorithm, and it helps you find out whether there are any patterns in the data. So yes, the procedure you are following is a reasonable way to look for commonalities or patterns. It is not generally used for prediction, though. FYI, you can use K-Means for prediction too (I came across that recently), but I don't know whether it would yield the desired results.
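One common way to use K-Means for prediction is to assign a new record to the cluster whose centroid is nearest. The sketch below illustrates that idea with the standard library only; the centroid values and the choice of two numeric features are made up for illustration and do not come from the question's data.

```python
import math

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

# Hypothetical centroids from an earlier K-Means run on two numeric features
# (age, no_of_checkedinbaggage); the values are invented for this example.
centroids = [(30.0, 2.0), (55.0, 0.0)]

# A new, unseen record is "predicted" to belong to its nearest cluster.
new_passenger = (33.0, 1.0)
cluster_id = nearest_centroid(new_passenger, centroids)
```

Note that this only tells you which cluster a record falls into; whether that cluster means anything for the watchlist category still has to be checked separately.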

If the data is categorical, you need to apply one-hot encoding, which converts the categorical data into numeric form; you can go through the link for a better understanding. Without that conversion you cannot apply the K-Means algorithm.
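As a rough illustration of what one-hot encoding does, here is a minimal standard-library sketch. The helper name `one_hot_encode` is hypothetical; in practice a library routine such as `pandas.get_dummies` or scikit-learn's `OneHotEncoder` would normally be used instead.

```python
def one_hot_encode(rows, column):
    """Replace a categorical column with one 0/1 column per distinct value."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every other column unchanged.
        new_row = {k: v for k, v in row.items() if k != column}
        # Add one indicator column per category.
        for cat in categories:
            new_row[f"{column}_{cat}"] = 1 if row[column] == cat else 0
        encoded.append(new_row)
    return encoded

# Two columns taken from the sample rows in the question.
rows = [
    {"age": 34, "nationality": "GBR"},
    {"age": 32, "nationality": "IND"},
    {"age": 31, "nationality": "USA"},
]
encoded = one_hot_encode(rows, "nationality")
# encoded[0] is {"age": 34, "nationality_GBR": 1, "nationality_IND": 0, "nationality_USA": 0}
```

Columns with many distinct values (such as place_of_birth) produce one new column per value, so the encoded table can get wide quickly.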

If you cannot convert the categorical data into numeric data, you can use the clustMixType package in R, or kmodes in Python.
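To give a feel for how k-modes-style clustering handles categorical data without numeric encoding, the sketch below shows its two building blocks: a simple matching dissimilarity (count of differing fields) and a per-column mode that plays the role of a centroid. This is a standard-library illustration of the idea, not the API of either package, and the sample records are made up.

```python
from collections import Counter

def matching_dissimilarity(a, b):
    """Count the positions where two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

def cluster_mode(records):
    """Per-column most frequent value; acts like a centroid for categorical data."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

# Invented records: (nationality, country_of_birth, place_of_birth)
records = [
    ("GBR", "GBR", "London"),
    ("GBR", "GBR", "Leeds"),
    ("IND", "GBR", "London"),
]
mode = cluster_mode(records)
distance = matching_dissimilarity(records[1], mode)
```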

You haven't mentioned a target variable in your data. If you do have one ready and it is numeric, you can use:

Neural Network
Regression
SVM
Random Forest and many more

If the target variable is binary (is he a criminal, yes/no), you can use:

Neural Network
SVM
Logistic Regression
Random Forest
Naive Bayes Classifier
KNN and many more.

Please go through this link for a better understanding of clustering mixed data types.

Let me know if you need any help.

If you need some additional information let me know, or else you can accept the answer. Thanks in advance — Toros91, Mar 20 '18 at 08:00

score 0 · Answer 2 · answered Jan 19 '18 at 09:04

Yes, K-Means is a good choice for clustering. The only thing is that you have to choose the number of clusters you want in the result, e.g. k = 4.

As I see, you have categorical variables in your data; you can use one-hot encoding to convert them to numerical features. I would suggest you scale your data after that, too.

Keywords here: "encode categorical feature", "One Hot Encoder", "feature scaling". I suggest you read more about those.
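A minimal sketch of the scale-then-cluster step, assuming the features are already numeric. It uses min-max scaling and a plain Lloyd's-style K-Means written with the standard library; the function names, the choice of k = 2, and the sample rows are all assumptions made for illustration, and a library implementation would normally be preferred on real data.

```python
import math
import random

def min_max_scale(rows):
    """Rescale every column to [0, 1] so no single feature dominates the distance."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

def k_means(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: assign each point to its nearest centroid, then recompute means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Made-up numeric rows: (age, no_of_checkedinbaggage)
rows = [(34, 2), (32, 2), (31, 2), (60, 0), (62, 1), (58, 0)]
scaled = min_max_scale(rows)
labels, centroids = k_means(scaled, k=2)
```

Without scaling, age (range of about 30) would outweigh the baggage count (range of 2) in the distance calculation, which is why scaling comes before clustering here.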
