
I am trying to decide which particular algorithm would be most appropriate for my use-case.

I have a dataset of about 1,000 physical buildings in a city, with a feature space including location, distance, year built, and other characteristics. For each new data point (a building), I'd like to find the 3-5 buildings that are most similar based on a feature-space comparison.

I define similarity as a weighted comparison of features. I'd like to iterate over the entire feature space (with a filter, e.g. on location) and choose the 3-5 buildings most similar to the new building data point.

Here's what my data looks like:

(sample data not shown)

I'm wondering what similarity measure would make sense? I work in Python, so I'd prefer a Pythonic/scikit-learn way of doing this.

– kms

1 Answer


It appears to me that what you're looking for in your use-case is not clustering - it's a distance metric.

When you get a new data point, you want to find the 3-5 most similar data points; there's no need for clustering here. Calculate the distance from the new data point to each of the 'old' data points, and select the top 3-5.
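A minimal sketch of that lookup with scikit-learn's `NearestNeighbors` (the feature columns and values below are invented purely for illustration; you'd substitute your own building features):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical features per building: [latitude, longitude, year_built]
old_buildings = np.array([
    [40.71, -74.00, 1980],
    [40.72, -74.01, 1995],
    [40.73, -73.99, 1930],
    [40.70, -74.02, 2005],
    [40.74, -73.98, 1975],
])

# Fit on the existing buildings; no clustering step needed
nn = NearestNeighbors(n_neighbors=3, metric="euclidean")
nn.fit(old_buildings)

# Query with a new building and get the 3 closest existing ones
new_building = np.array([[40.715, -74.005, 1985]])
distances, indices = nn.kneighbors(new_building)
print(indices[0])  # row indices of the 3 most similar buildings
```

Note that with raw Euclidean distance the unscaled `year_built` column dominates the comparison, which is exactly why normalization (discussed below) matters.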

Now, which distance metric to pick? There are many options. If you're using scikit-learn, I'd look over this page for examples of distance (and similarity) metrics.

If your features are continuous, you can normalize them and use cosine similarity; start with that, and see if it fits.
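That normalize-then-compare idea could look like the following (again, the feature columns and values are made up for the sake of a runnable example):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity

# Invented continuous features: [year_built, floor_area, distance_to_center_km]
old = np.array([
    [1980, 1200.0, 3.1],
    [1995, 5400.0, 0.8],
    [1930,  900.0, 5.6],
    [2005, 8000.0, 1.2],
])

# Normalize so no single feature dominates the similarity
scaler = StandardScaler().fit(old)
old_scaled = scaler.transform(old)

# Score a new building against all existing ones
new = np.array([[1985, 1300.0, 3.0]])
sims = cosine_similarity(scaler.transform(new), old_scaled)[0]
top3 = np.argsort(sims)[::-1][:3]  # indices of the 3 most similar buildings
```

Weighted similarity, as described in the question, can then be added by multiplying each scaled column by its weight before computing the similarity.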

– Itamar Mushkin
  • This makes sense. I am trying to figure out the most appropriate similarity metric for the data I have. (Updated the question with some sample data.) Any thoughts on which distance metric makes sense to rank properties by similarity? – kms Jul 11 '20 at 14:27
  • ... Are these all of your features? It seems to me like almost all of them are categorical – Itamar Mushkin Jul 12 '20 at 05:30
  • There are others. It's a combination of categorical and continuous features. – kms Jul 12 '20 at 05:38
  • 1
    As a start - plug the categorical variables into a one-hot-encoder, normalize all non-binary features (or normalize all with min-max), and see what cosine similarity yields. That's not a magic trick that's sure to work, but it's a start, and it'll help you see where it makes and doesn't make sense. Also, when searching on this site I found your previous question, it has some leads: https://datascience.stackexchange.com/questions/8681/clustering-for-mixed-numeric-and-nominal-discrete-data – Itamar Mushkin Jul 12 '20 at 05:46
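The one-hot-plus-min-max recipe from the last comment can be sketched with a `ColumnTransformer`; all column names and values here are hypothetical stand-ins for the asker's mixed categorical/continuous data:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mixed-type building data
df = pd.DataFrame({
    "building_type": ["residential", "commercial", "residential", "industrial"],
    "year_built":    [1980, 1995, 1930, 2005],
    "floor_area":    [1200, 5400,  900, 8000],
})

# One-hot the categorical column, min-max scale the numeric ones
pre = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["building_type"]),
    ("num", MinMaxScaler(), ["year_built", "floor_area"]),
])
X = pre.fit_transform(df)

# Encode a new building with the same transformer, then rank by cosine similarity
new = pd.DataFrame({
    "building_type": ["residential"],
    "year_built":    [1985],
    "floor_area":    [1100],
})
sims = cosine_similarity(pre.transform(new), X)[0]
top = np.argsort(sims)[::-1][:3]
print(top)  # indices of the 3 most similar buildings
```

`handle_unknown="ignore"` keeps the transform from failing if a new building has a category unseen at fit time; it simply encodes it as all zeros.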