Vectorized String Distance

Question

I am looking for a way to calculate the string distance between two Pandas dataframe columns in a vectorized way. I tried the distance and textdistance libraries, but they require using df.apply, which is incredibly slow. Do you know of any way to compute a string distance using only column operations?

Thanks

You could try using cosine similarity on a bag of words representation of the strings. — Oxbowerce, Feb 22 '22 at 09:20
Thanks @Oxbowerce. That's not ideal, though, because I am measuring distances between email addresses, so the order of the letters matters to me. — Anatole, Feb 22 '22 at 09:47
If you use n-grams (e.g. sequences of 4 characters) instead of single characters for the bag of words representation you should still be able to take into account the order of the characters. — Oxbowerce, Feb 22 '22 at 10:03
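To make the n-gram suggestion above concrete, here is a minimal sketch using only the standard library. It is not code from the commenter: the helper names `char_ngrams` and `ngram_cosine_similarity`, the default of 4-character grams, and the handling of strings shorter than `n` are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=4):
    """Count the overlapping character n-grams of `text`."""
    # Assumption of this sketch: a string no longer than n is treated as one gram.
    if len(text) <= n:
        return Counter([text])
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_cosine_similarity(a, b, n=4):
    """Cosine similarity between the n-gram count vectors of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    # Dot product over the grams of `a`; grams missing from `b` count as 0.
    dot = sum(count * grams_b[gram] for gram, count in grams_a.items())
    norm_a = sqrt(sum(c * c for c in grams_a.values()))
    norm_b = sqrt(sum(c * c for c in grams_b.values()))
    return dot / (norm_a * norm_b)


# Hypothetical addresses, for illustration only.
print(ngram_cosine_similarity("john.doe@example.com", "john.do@example.com"))
# Reversing a string leaves no shared 4-grams, so character order is reflected.
print(ngram_cosine_similarity("abcdefgh", "hgfedcba"))
```

Because each gram spans several consecutive characters, two strings that contain the same letters in a different order share few or no grams. This is a similarity (1 means identical gram counts), not an edit distance, so it is not a drop-in replacement for Levenshtein.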

score 0 · Answer 1 · answered Feb 22 '22 at 10:04

I have a similar problem and tried parallel computing with joblib. Performance with this approach seems acceptable. However, joblib appears to hold on to RAM when the parallel job is repeated very often, so I'm open to alternatives (or to suggestions on how to terminate the parallel job properly).

from joblib import Parallel, delayed
import distance
import pandas as pd

# https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html

# Define some distance measure
def calc_dist(myrow):
    # myrow is a (text1, text2) tuple
    return distance.levenshtein(myrow[0], myrow[1])

# Some fake data
df = pd.DataFrame({
    "text1": ["some text", "foo", "bar", "new text", "more words"],
    "text2": ["text", "hello", "bar", "bar", "move words"]})

# Columns to lists / zip them
l1 = df['text1'].tolist()
l2 = df['text2'].tolist()
nlist = list(zip(l1, l2))

# Calculate distances
dist_vec = Parallel(n_jobs=2)(delayed(calc_dist)(i) for i in nlist)
print(dist_vec)
# Output: [5, 4, 0, 8, 1]

score 0 · Accepted Answer · answered Feb 22 '22 at 10:46

I found that performance varies greatly across string distance libraries: https://github.com/life4/textdistance#benchmarks

The python-Levenshtein library is lightning fast compared to the others, so I will use that one. If it's not sufficient, I will use parallelism as suggested by @Peter.
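As a rough sketch of what that could look like: the snippet below loops over the two columns with a list comprehension rather than df.apply. It assumes python-Levenshtein is installed and falls back to a pure-Python implementation (much slower, included only so the example runs anywhere). The helper name `pairwise_distances` is made up for illustration, and this is still a Python-level loop rather than true vectorization.

```python
try:
    # Assumes the python-Levenshtein package is installed.
    from Levenshtein import distance as levenshtein
except ImportError:
    def levenshtein(a, b):
        """Pure-Python fallback: two-row dynamic-programming edit distance."""
        previous = list(range(len(b) + 1))
        for i, char_a in enumerate(a, start=1):
            current = [i]
            for j, char_b in enumerate(b, start=1):
                current.append(min(
                    previous[j] + 1,                       # deletion
                    current[j - 1] + 1,                    # insertion
                    previous[j - 1] + (char_a != char_b),  # substitution or match
                ))
            previous = current
        return previous[-1]


def pairwise_distances(left, right):
    """Edit distance for each aligned pair of strings in two equal-length sequences."""
    return [levenshtein(a, b) for a, b in zip(left, right)]


# Same fake data as in the other answer.
text1 = ["some text", "foo", "bar", "new text", "more words"]
text2 = ["text", "hello", "bar", "bar", "move words"]
print(pairwise_distances(text1, text2))  # [5, 4, 0, 8, 1]
```

With a dataframe, the same helper can be fed the columns directly, e.g. `df["dist"] = pairwise_distances(df["text1"], df["text2"])`, since zipping two Series iterates over their values.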
