Currently I am working with 20M records with 5 columns. My data frame looks like -
tran_id id code
123 1 1759@1@83@0#[email protected]@[email protected]#1094@[email protected]@14.4
254 1 [email protected]@[email protected]
831 2 [email protected]@[email protected]#1101@2@40@0#1108@2@30@0
732 5 1430@[email protected]@2.15#1431@[email protected]@60.29#1074@[email protected]@58.8#1109
141 2 1809@[email protected]@292.66#1816@[email protected]@95.44#1076@[email protected]@1110.61
Desired output -
id new_code
1 1759
1 1362
1 1094
1 1356
2 1354
2 1101
2 1108
5 1430
5 1431
5 1074
5 1109
2 1809
2 1816
2 1076
What I have done so far -
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dd= pd.DataFrame({'col' : d["code"].apply(lambda x: re.split('[# @ ]', x))})
dd.head()
s = dd['col'].str[:]
dd= pd.DataFrame(s.values.tolist())
dd.head()
cols = range(len(list(dd)))
num_cols = len(list(dd))
new_cols = ['col' + str(i) for i in cols]
dd.columns = new_cols[:num_cols]
Just remember the size of the data is huge...20 million.Can't do any looping.
Thanks in advance