4

I am trying to merge two address columns into one and separate the resulting string with '--'. The dataset has 10 million rows and 33 columns - but the number of rows grows for a million or so a month. This line in Pandas/Python is very slow. Any ideas of how can I make it faster and scalable for future use?

df['address'] = df[['address1', 'address2']].apply(lambda x: '--'.join(x.dropna().astype(str).values), axis=1)
VividD
  • 656
  • 7
  • 18
user31602
  • 41
  • 1
  • 2

1 Answers1

1

With my solution, you have to parse your column with string type:

First:

df[["address1", "address2"]] = df[["address1", "address2"]].dropna().astype(str)

Then define concatenation fonction to concatenate two strings

def concat_string(a, b):
    return a + '--' +  b

Finally I advise you to work with pandas series, it will be improve your operations.

Here I apply combine (from pandas series) with in param the function define above concat_string. concat_string takes in param element from df['address1'] and combine them with df['address2'].

Combine vectorize the operation, it "replaces" traditional loop.

df['address'] = df['address1'].combine(df['address2'], concat_string)

Benchmark:

df.shape => (10000, 2)

%time df['address'] = df[['address1', 'address2']].apply(lambda x: '--'.join(x.dropna().astype(str).values), axis=1)
CPU times: user 4.6 s, sys: 18.8 ms, total: 4.62 s
Wall time: 4.64 s

%time df['address_Test'] = df['address1'].combine(df['address2'], concat_string)
CPU times: user 302 ms, sys: 27.6 ms, total: 329 ms
Wall time: 321 ms
Jean S
  • 111
  • 2
  • Thank you @Jean s. I can't comment on your answer but you are not calling concat_string method with any parameters here: df['address_Test'] = df['address1'].combine(df['address2'], concat_string) I guess I will still need to loop there, to call it with params which will affect performance? – BlueIvy Apr 26 '17 at 12:23