Performance issues when merging two dataframe columns into one on millions rows with Pandas

Question

I am trying to merge two address columns into one and separate the resulting string with '--'. The dataset has 10 million rows and 33 columns - but the number of rows grows for a million or so a month. This line in Pandas/Python is very slow. Any ideas of how can I make it faster and scalable for future use?

df['address'] = df[['address1', 'address2']].apply(lambda x: '--'.join(x.dropna().astype(str).values), axis=1)

Jean S · Answer 1 · 2017-04-26T13:00:38.670

With my solution, you have to parse your column with string type:

First:

df[["address1", "address2"]] = df[["address1", "address2"]].dropna().astype(str)

Then define concatenation fonction to concatenate two strings

def concat_string(a, b):
    return a + '--' +  b

Finally I advise you to work with pandas series, it will be improve your operations.

Here I apply combine (from pandas series) with in param the function define above concat_string. concat_string takes in param element from df['address1'] and combine them with df['address2'].

Combine vectorize the operation, it "replaces" traditional loop.

df['address'] = df['address1'].combine(df['address2'], concat_string)

Benchmark:

df.shape => (10000, 2)

%time df['address'] = df[['address1', 'address2']].apply(lambda x: '--'.join(x.dropna().astype(str).values), axis=1)
CPU times: user 4.6 s, sys: 18.8 ms, total: 4.62 s
Wall time: 4.64 s

%time df['address_Test'] = df['address1'].combine(df['address2'], concat_string)
CPU times: user 302 ms, sys: 27.6 ms, total: 329 ms
Wall time: 321 ms

Thank you @Jean s. I can't comment on your answer but you are not calling concat_string method with any parameters here: df['address_Test'] = df['address1'].combine(df['address2'], concat_string) I guess I will still need to loop there, to call it with params which will affect performance? — BlueIvy, Apr 26 '17 at 12:23

Performance issues when merging two dataframe columns into one on millions rows with Pandas

1 Answers1

Linked