
I am looking for a way to calculate the string distance between two Pandas dataframe columns in a vectorized way. I tried the distance and textdistance libraries, but they have to be called through df.apply, which is incredibly slow. Do you know of any way to compute a string distance using only column operations?

Thanks

Anatole
  • You could try using cosine similarity on a bag of words representation of the strings. – Oxbowerce Feb 22 '22 at 09:20
  • Thanks @Oxbowerce. It's not ideal, though, because I am measuring distances between email addresses, so the order of the letters matters to me. – Anatole Feb 22 '22 at 09:47
  • If you use n-grams (e.g. sequences of 4 characters) instead of single characters for the bag of words representation you should still be able to take into account the order of the characters. – Oxbowerce Feb 22 '22 at 10:03
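
The n-gram suggestion in the comments above could be sketched roughly as follows. This is only an illustration in plain Python, not a vectorized implementation: the helper names `char_ngrams` and `cosine_similarity`, the default `n=3`, and the sample email addresses are all assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count the overlapping character n-grams of `text` (hypothetical helper)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a, b, n=3):
    """Cosine similarity between the n-gram count vectors of two strings."""
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    # Counter returns 0 for n-grams that are missing from vb.
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    # Strings shorter than n have no n-grams; treat that case as similarity 0.
    return dot / norm if norm else 0.0


# Made-up sample data standing in for the two dataframe columns.
emails_a = ["john.doe@example.com", "alice@example.org"]
emails_b = ["jon.doe@example.com", "bob@example.org"]
similarities = [cosine_similarity(a, b) for a, b in zip(emails_a, emails_b)]
```

Because each n-gram covers several consecutive characters, local character order still influences the score. For large columns, scikit-learn's `CountVectorizer(analyzer="char", ngram_range=(4, 4))` can build the same kind of representation as a sparse matrix, which is closer to the column-wise operation asked for.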

2 Answers


I have a similar problem and tried parallel computing with joblib. In terms of performance this approach seems okay. However, joblib appears to hold on to RAM when the parallel call is repeated very often, so I'm open to alternatives (or suggestions on how to terminate the parallel job properly).

from joblib import Parallel, delayed
import distance
import pandas as pd
# https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html

Define some distance measure

def calc_dist(myrow):
    return distance.levenshtein(myrow[0], myrow[1])

Some fake data

df = pd.DataFrame({ "text1":["some text","foo","bar","new text","more words"], "text2":["text","hello","bar","bar","move words"]})

Columns to lists / zip them

l1 = df['text1'].tolist()
l2 = df['text2'].tolist()
nlist = list(zip(l1, l2))

Calculate distances

dist_vec = Parallel(n_jobs=2)(delayed(calc_dist)(i) for i in nlist)

print(dist_vec)  # [5, 4, 0, 8, 1]

Peter

I found that performance varies greatly across string distance libraries; see the benchmarks here: https://github.com/life4/textdistance#benchmarks

According to those benchmarks, the python-Levenshtein library is much faster than the others, so I will use that one. If it's not sufficient, I will use parallelism as suggested by @Peter.
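
A minimal sketch of that idea, assuming the python-Levenshtein package is installed and exposes `Levenshtein.distance`; the pure-Python fallback and the name `levenshtein` are only there so the example runs without the package. The sample data is taken from the other answer, and the lists stand in for `df['text1'].tolist()` and `df['text2'].tolist()`.

```python
try:
    # Provided by the python-Levenshtein package (assumed to be installed).
    from Levenshtein import distance as levenshtein
except ImportError:
    def levenshtein(a, b):
        """Pure-Python fallback: classic dynamic-programming edit distance."""
        previous = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            current = [i]
            for j, cb in enumerate(b, start=1):
                current.append(min(
                    previous[j] + 1,               # deletion
                    current[j - 1] + 1,            # insertion
                    previous[j - 1] + (ca != cb),  # substitution
                ))
            previous = current
        return previous[-1]


# Plain lists standing in for the two dataframe columns.
text1 = ["some text", "foo", "bar", "new text", "more words"]
text2 = ["text", "hello", "bar", "bar", "move words"]

# A list comprehension over zipped columns avoids the per-row overhead of df.apply.
distances = [levenshtein(a, b) for a, b in zip(text1, text2)]
print(distances)  # [5, 4, 0, 8, 1]
```

This is still a Python-level loop rather than a true column operation, but with a C-backed distance function most of the time is spent in compiled code, and the result can be assigned straight back to a new dataframe column.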

Anatole