Column level parsing in pandas data frame

Question

Currently I am working with 20M records with 5 columns. My data frame looks like -

tran_id   id       code
123        1    1759@1@83@0#[email protected]@[email protected]#1094@[email protected]@14.4
254        1    [email protected]@[email protected]
831        2    [email protected]@[email protected]#1101@2@40@0#1108@2@30@0
732        5    1430@[email protected]@2.15#1431@[email protected]@60.29#1074@[email protected]@58.8#1109
141        2    1809@[email protected]@292.66#1816@[email protected]@95.44#1076@[email protected]@1110.61

Desired output -

id       new_code
1        1759
1        1362
1        1094
1        1356
2        1354
2        1101
2        1108
5        1430
5        1431
5        1074
5        1109
2        1809
2        1816
2        1076

What I have done so far -

import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dd= pd.DataFrame({'col' : d["code"].apply(lambda x: re.split('[#  @ ]', x))})
dd.head()    
s = dd['col'].str[:]
dd= pd.DataFrame(s.values.tolist())
dd.head()

cols = range(len(list(dd)))
num_cols = len(list(dd))
new_cols = ['col' + str(i) for i in cols]
dd.columns = new_cols[:num_cols]

Just remember the size of the data is huge...20 million.Can't do any looping.

Thanks in advance

jezrael · Answer 1 · 2019-03-11T07:45:29.510

4

You can use Series.str.findall for extract integers with length 4 between separators:

#https://stackoverflow.com/a/55096994/2901002
s = df['code'].str.findall(r'(?<![^#])\d{4}(?![^@])')
#alternative
#s = df['code'].str.replace('[#@]', ' ').str.findall(r'(?<!\S)\d{4}(?!\S)')

And then create new DataFrame by numpy.repeat with str.len and flaten by chain.from_iterable:

from itertools import chain

df = pd.DataFrame({
    'id' : df['id'].values.repeat(s.str.len()),
    'new_code' : list(chain.from_iterable(s.tolist()))
})
print (df)
    id new_code
0    1     1759
1    1     1362
2    1     1094
3    1     1356
4    2     1354
5    2     1101
6    2     1108
7    5     1430
8    5     1431
9    5     1074
10   5     1109
11   2     1809
12   2     1816
13   2     1076

edited Mar 11 '19 at 07:45

answered Mar 11 '19 at 06:57

jezrael

822,522
95
1,334
1,252

got this error TypeError Traceback (most recent call last) in 1 df = pd.DataFrame({ ----> 2 'id' : df['id'].values.repeat(s.str.len()), 3 'new_code' : list(chain.from_iterable(s.tolist())) 4 }) 5 TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe' – Nikita Agarwal Mar 11 '19 at 08:00
@NikitaAgarwal - How working change `s.str.len()` to `s.str.len().astype(np.int64)` ? – jezrael Mar 11 '19 at 08:03
I am not changing anything..my code is given below - df = pd.DataFrame({ 'id' : df['id'].values.repeat(s.str.len()), 'new_code' : list(chain.from_iterable(s.tolist())) }) df.head(20) Currently I am working on full dataset...is this a reason?some of the values are in different format – Nikita Agarwal Mar 11 '19 at 08:12
So same error if use `df = pd.DataFrame({ 'id' : df['id'].values.repeat(s.str.len().astype(np.int64)), 'new_code' : list(chain.from_iterable(s.tolist())) })` ? – jezrael Mar 11 '19 at 08:14

Chris Adams · Answer 2 · 2019-03-11T11:31:06.920

An alternative approach using Series.str.extractall with a different regex pattern:

(df.set_index('id').code.str.extractall(r'(?:[^\.]|^)(?P<new_code>\d{4})')
 .reset_index(0)
 .reset_index(drop=True)
)

[out]

    id new_code
0    1     1759
1    1     1362
2    1     1094
3    1     1356
4    2     1354
5    2     1101
6    2     1108
7    5     1430
8    5     1431
9    5     1074
10   5     1109
11   2     1809
12   2     1816
13   2     1076
14   2     1110

Column level parsing in pandas data frame

2 Answers2