I want to delete the substring between a '+' and the '@' symbol, together with the '+' itself, if a '+' exists.

d = {'1' : '[email protected]', '2' : '[email protected]', '3' : '[email protected]', '4':'[email protected]'}

test_frame = pd.Series(d)

test_frame
Out[6]: 
1    [email protected]
2            [email protected]
3      [email protected]
4               [email protected]
dtype: object

So, the result should be:

s = {'1' : '[email protected]', '2' : '[email protected]', '3' : '[email protected]', '4':'[email protected]'}

test_frame_result = pd.Series(s)

test_frame_result
Out[10]: 
1    [email protected]
2      [email protected]
3    [email protected]
4         [email protected]
dtype: object

I tried it with split, but it fails because only some lines contain a '+'.

Is there an elegant solution that avoids looping through all the lines? (The original dataset has quite a lot of them.)

Thanks!

maxtenzin
  • If you don't "loop through all the lines" how can you process all of them? – user202729 Feb 06 '18 at 15:24
  • Does [this](https://stackoverflow.com/questions/4444477/how-to-tell-if-a-string-contains-a-certain-character-in-javascript) solve your problem "only some lines contain a +"? – user202729 Feb 06 '18 at 15:24
  • Have to execute this in Pandas. – maxtenzin Feb 06 '18 at 15:26
  • Sorry, wrong language. – user202729 Feb 06 '18 at 15:27
  • Re the first comment: if I only wanted the first 5 letters, I could do that without looping: test_frame_result.str[:5] – maxtenzin Feb 06 '18 at 15:27
  • What about [this](https://stackoverflow.com/questions/26577516/pandas-test-if-string-contains-one-of-the-substrings-in-a-list)? Also implicitly the slice operator is (most likely) implemented using loops. Just that a loop in C is (often) faster than a loop in a higher level language. – user202729 Feb 06 '18 at 15:28

2 Answers

Is this sufficient?

import pandas as pd
d = {'1' : '[email protected]', 
         '2' : '[email protected]', 
         '3' : '[email protected]', 
         '4':'[email protected]'}

test_frame = pd.Series(d)
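print(test_frame)

# Select only the entries that contain a literal '+'
found = test_frame[test_frame.str.contains(r'\+')]
# Drop the '+' and everything after it up to (not including) the '@'.
# regex=True is needed on pandas >= 2.0, where the default is a literal match.
test_frame[found.index] = found.str.replace(r'\+[^@]*', "", regex=True)
print(test_frame)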

Output:

(Before)

1    [email protected]
2            [email protected]
3      [email protected]
4               [email protected]
dtype: object

(After)

1    [email protected]
2      [email protected]
3    [email protected]
4         [email protected]
dtype: object
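
As a side note, the mask is probably not strictly necessary: a regex replacement leaves entries without a match untouched, so a single call over the whole Series should be enough. A minimal sketch, assuming a pandas version that accepts the `regex` keyword; the addresses are hypothetical stand-ins, since the ones in the post are redacted:

```python
import pandas as pd

# Hypothetical sample addresses (the originals are redacted in the post).
emails = pd.Series({
    '1': 'alice+news@example.com',
    '2': 'bob@example.com',
    '3': 'carol+x1@example.org',
    '4': 'dave@example.org',
})

# Remove the '+' and everything after it up to (not including) the '@'.
# Entries without a '+' do not match the pattern and come back unchanged.
cleaned = emails.str.replace(r'\+[^@]*', '', regex=True)
print(cleaned)
```

The character class `[^@]*` stops the match right before the '@', so the domain part is preserved.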

Found a solution - probably not the most elegant though:

import pandas as pd

test_frame = pd.DataFrame({'email':['[email protected]','[email protected]','[email protected]','[email protected]']})

test_frame
Out[22]: 
                      email
0  [email protected]
1          [email protected]
2    [email protected]
3             [email protected]

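# Rows whose address contains a literal '+' (raw string avoids an invalid-escape warning)
has_plus = test_frame.email.str.contains(r'\+')
# Text before the '+', and the domain that follows the '@' after the '+'
local = test_frame.loc[has_plus, 'email'].str.partition('+')[0]
domain = test_frame.loc[has_plus, 'email'].str.partition('+')[2].str.partition('@')[2]
test_frame.loc[has_plus, 'email'] = local + '@' + domain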

test_frame
Out[24]: 
                email
0  [email protected]
1    [email protected]
2  [email protected]
3       [email protected]
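
The reason the mask matters for the partition approach can be sketched as follows. When the separator is missing, `str.partition` puts the whole string in the first column and leaves the other two empty, so rebuilding every row unconditionally would append a stray '@'. The addresses below are hypothetical stand-ins for the redacted ones, and `Series.where` is used here only as one possible way to keep the untouched rows:

```python
import pandas as pd

# Hypothetical sample addresses (the originals are redacted in the post).
emails = pd.Series(['alice+news@example.com', 'bob@example.com'])

# partition returns a DataFrame with three columns:
# 0 = text before '+', 1 = the separator, 2 = text after '+'.
parts = emails.str.partition('+')

# Rebuilding every row without a mask breaks addresses that have no '+'.
naive = parts[0] + '@' + parts[2].str.partition('@')[2]
print(naive.tolist())

# Keep the rebuilt value only where a '+' was present, else the original.
has_plus = emails.str.contains('+', regex=False)
fixed = naive.where(has_plus, emails)
print(fixed.tolist())
```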
maxtenzin