Create new column of cleaned string data with Python/pandas

Question

I have a DataFrame with some user input (it's supposed to just be a plain email address), along with some other values, like this:

import pandas as pd
from pandas import Series, DataFrame

df = pd.DataFrame({'input': ['Captain Jean-Luc Picard <[email protected]>','[email protected]','[email protected]','William Riker <[email protected]>'],'val_1':[1.5,3.6,2.4,2.9],'val_2':[7.3,-2.5,3.4,1.5]})

Due to a bug, the input sometimes has the user's name as well as brackets around the email address; this needs to be fixed before continuing with the analysis.

To move forward, I want to create a new column that has cleaned versions of the emails: if the email contains names/brackets then remove those, else just give the already correct email.

There are numerous examples of cleaning string data with Python/pandas, but I've yet to successfully implement any of these suggestions. Here are a few examples of what I've tried:

# as noted in pandas docs, turns all non-matching strings into NaN
df['cleaned'] = df['input'].str.extract('<(.*)>')

# AttributeError: type object 'str' has no attribute 'contains'
df['cleaned'] = df['input'].apply(lambda x: str.extract('<(.*)>') if str.contains('<(.*)>') else x)

# AttributeError: 'DataFrame' object has no attribute 'str'
df['cleaned'] = df[df['input'].str.contains('<(.*)>')].str.extract('<(.*)>')

Thanks!

score 0 · Accepted Answer · answered Oct 15 '14 at 14:40

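Use np.where to apply str.extract only to the rows that contain an embedded email; for every other row, return the 'input' value unchanged. On recent pandas versions, pass expand=False so that str.extract returns a Series rather than a one-column DataFrame:

In [63]:

import numpy as np

df['cleaned'] = np.where(df['input'].str.contains('<'), df['input'].str.extract('<(.*)>', expand=False), df['input'])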

df

Out[63]:
                                            input  val_1  val_2  \
0  Captain Jean-Luc Picard <[email protected]>    1.5    7.3   
1                       [email protected]    3.6   -2.5   
2                              [email protected]    2.4    3.4   
3             William Riker <[email protected]>    2.9    1.5   

                     cleaned  
0       [email protected]  
1  [email protected]  
2         [email protected]  
3        [email protected]
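
For reference, here is a self-contained sketch of the same idea that can be pasted into a fresh session. The example.com addresses are placeholders I made up (the real ones are obscured in the post), and the cleaned_alt column is an alternative I am adding rather than part of the answer above: since str.extract leaves NaN in non-matching rows, fillna can fall back to the original value.

```python
import numpy as np
import pandas as pd

# Placeholder addresses (hypothetical; the originals are obscured in the post).
df = pd.DataFrame({
    'input': [
        'Captain Jean-Luc Picard <picard@example.com>',
        'deanna@example.com',
        'data@example.com',
        'William Riker <riker@example.com>',
    ],
})

# expand=False makes str.extract return a Series for a single capture group.
extracted = df['input'].str.extract(r'<(.*)>', expand=False)

# Variant 1: take the extracted value only where a '<' is present.
df['cleaned'] = np.where(df['input'].str.contains('<'), extracted, df['input'])

# Variant 2: non-matching rows are NaN after extract, so fall back to the input.
df['cleaned_alt'] = extracted.fillna(df['input'])

print(df[['cleaned', 'cleaned_alt']])
```

Both columns should end up holding the same bare addresses; the fillna variant just avoids scanning the column a second time with str.contains.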

score 0 · Answer 2 · answered Oct 15 '14 at 15:09

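If you would rather use Python's re module with a plain function:

import re

# Capture whatever sits between '<' and '>'
rex = re.compile(r'<(.*)>')

def fix(s):
    m = rex.search(s)
    if m is None:
        # No brackets found: the value is already a bare email
        return s
    return m.group(1)

fixed = df['input'].apply(fix)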
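
To get the new column the question asks for, the result can be assigned straight back to the DataFrame. A minimal sketch, assuming made-up example.com addresses in place of the obscured ones:

```python
import re

import pandas as pd

# Capture whatever sits between '<' and '>'.
rex = re.compile(r'<(.*)>')

def fix(s):
    """Return the bracketed part of s if present, otherwise s unchanged."""
    m = rex.search(s)
    return s if m is None else m.group(1)

# Placeholder addresses (hypothetical; the originals are obscured in the post).
df = pd.DataFrame({
    'input': [
        'William Riker <riker@example.com>',
        'data@example.com',
    ],
})

# Store the cleaned values as a new column next to the raw input.
df['cleaned'] = df['input'].apply(fix)

print(df)
```

Note that apply calls fix once per row in Python, so on large frames the vectorized str methods are likely to be faster.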

Nathan Lloyd
