I have a DataFrame with some user input (it's supposed to be just a plain email address), along with some other values, like this:
import pandas as pd
from pandas import Series, DataFrame
df = pd.DataFrame({
    'input': [
        'Captain Jean-Luc Picard <[email protected]>',
        '[email protected]',
        '[email protected]',
        'William Riker <[email protected]>',
    ],
    'val_1': [1.5, 3.6, 2.4, 2.9],
    'val_2': [7.3, -2.5, 3.4, 1.5],
})
Due to a bug, the input sometimes has the user's name as well as brackets around the email address; this needs to be fixed before continuing with the analysis.
To move forward, I want to create a new column that has cleaned versions of the emails: if the input contains a name and brackets, strip those and keep only the address; otherwise keep the already-correct email unchanged.
There are numerous examples of cleaning string data with Python/pandas, but I've yet to successfully implement any of these suggestions. Here are a few examples of what I've tried:
# as noted in pandas docs, turns all non-matching strings into NaN
df['cleaned'] = df['input'].str.extract('<(.*)>')
# AttributeError: type object 'str' has no attribute 'contains'
df['cleaned'] = df['input'].apply(lambda x: str.extract('<(.*)>') if str.contains('<(.*)>') else x)
# AttributeError: 'DataFrame' object has no attribute 'str'
df['cleaned'] = df[df['input'].str.contains('<(.*)>')].str.extract('<(.*)>')
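For reference, here is a minimal sketch of the behavior I'm after, not a confirmed solution. It builds on the first attempt: since `str.extract` leaves NaN wherever the pattern doesn't match, those gaps could presumably be filled from the original column with `fillna`. The sample addresses below are made up for illustration, and `extracted` is just a name I picked.

```python
import pandas as pd

# Hypothetical sample data; these addresses are invented for illustration.
df = pd.DataFrame({
    'input': [
        'Jane Doe <jane@example.com>',
        'bob@example.com',
    ],
})

# expand=False returns a Series for a single capture group.
# Rows without angle brackets do not match and come back as NaN.
extracted = df['input'].str.extract(r'<(.*)>', expand=False)

# Fall back to the original value wherever nothing was extracted.
df['cleaned'] = extracted.fillna(df['input'])
```

This assumes every bracketed entry contains exactly one `<...>` pair; the greedy `.*` would span from the first `<` to the last `>` if there were more than one.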
Thanks!