I had a similar problem and tried parallel computing with joblib. In terms of performance this approach works fine; however, joblib seems to hold on to RAM when the call is repeated many times, so I'm open to alternatives (or to suggestions on how to terminate the parallel job properly and release its workers).
```python
from joblib import Parallel, delayed
import distance
import pandas as pd
# https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html

# Define some distance measure
def calc_dist(myrow):
    return distance.levenshtein(myrow[0], myrow[1])

# Some fake data
df = pd.DataFrame({
    "text1": ["some text", "foo", "bar", "new text", "more words"],
    "text2": ["text", "hello", "bar", "bar", "move words"]})

# Columns to lists / zip them into (text1, text2) pairs
l1 = df['text1'].tolist()
l2 = df['text2'].tolist()
nlist = list(zip(l1, l2))

# Calculate distances in parallel
dist_vec = Parallel(n_jobs=2)(delayed(calc_dist)(i) for i in nlist)
print(dist_vec)
# > [5, 4, 0, 8, 1]
```
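Since I'm open to alternatives: one option worth sketching is the standard library's `concurrent.futures.ProcessPoolExecutor`, whose context manager shuts the worker processes down when the `with` block exits, so no memory stays pinned between repeated runs. This is only a sketch, not a drop-in fix for the joblib issue; to keep it dependency-free it uses a plain-Python Levenshtein function (a hypothetical stand-in for `distance.levenshtein`):

```python
from concurrent.futures import ProcessPoolExecutor

def levenshtein(a, b):
    # Classic two-row dynamic-programming edit distance
    # (stand-in for distance.levenshtein from the distance package)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def calc_dist(pair):
    return levenshtein(pair[0], pair[1])

if __name__ == "__main__":
    nlist = [("some text", "text"), ("foo", "hello"), ("bar", "bar"),
             ("new text", "bar"), ("more words", "move words")]
    # Exiting the with-block joins and terminates the workers,
    # so their memory is released between repeated calls.
    with ProcessPoolExecutor(max_workers=2) as ex:
        dist_vec = list(ex.map(calc_dist, nlist))
    print(dist_vec)  # [5, 4, 0, 8, 1]
```

The `if __name__ == "__main__":` guard matters here: process pools re-import the main module on platforms that spawn workers, and the guard prevents that import from recursively starting new pools.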