3

I am trying to read a large log file, which has been parsed using different delimiters (legacy changes).

This code works

import os, subprocess, time, re
import pandas as pd

for root, dirs, files in os.walk('.', topdown=True):
    for file in files:
        df = pd.read_csv(file, sep='[,|;: \t]+', header=None, engine='python', skipinitialspace=True)

        for index, row in df.iterrows():
            print(row[0], row[1])

This works well for the following data

[email protected] address1
[email protected];address2
[email protected],address3
[email protected];;address4
[email protected],,address5

Issue #1: the following row in the input file will break the code. I wish for this to be parsed into 2 columns (not 3)

[email protected],,address;6

Issue #2: I wish to replace all single and double quotes in address, but neither of the following seem to work.

df[1]=df[1].str.replace('"','DQUOTES')
df.replace('"', 'DQUOTES', regex=True)

Pls help!

rogerwhite
  • 335
  • 4
  • 16
  • Try doing pre-processing of normalizing the delimiter before reading the file into pandas dataframe. – sushanth Jun 11 '20 at 05:26

1 Answers1

0

You can first read the file into one column and then do the processing step by step in pandas:

  • split into two columns (n=1)
  • replace the quotes
  • if needed (i.e. if there are possibly further columns you don't need) split the address column again and take the first column ([0]) only (here you may want to remove the space from the list of separators). If any commas and semicolons etc. are part of the address then you of course don't need this step.


import io
s= """[email protected] address1
[email protected];address2
[email protected],address3
[email protected];;address4
[email protected],,address5
[email protected],,address;6
[email protected],,address with "double quotes"
[email protected],,address with 'single quotes'
"""
df = pd.read_csv(io.StringIO(s), sep='\n', header=None)

df = df[0].str.split('[,|;: \t]+', 1, expand=True).rename(columns={0: 'email', 1: 'address'})
df.address = df.address.str.replace('\'|"', 'DQUOTES')
df.address = df.address.str.split('[,|;:]+', 1, expand=True)[0]  #depending on what you need

Result:

             email                                   address
0  [email protected]                                  address1
1  [email protected]                                  address2
2  [email protected]                                  address3
3  [email protected]                                  address4
4  [email protected]                                  address5
5  [email protected]                                   address
6  [email protected]  address with DQUOTESdouble quotesDQUOTES
7  [email protected]  address with DQUOTESsingle quotesDQUOTES
Stef
  • 28,728
  • 2
  • 24
  • 52
  • Thank you @Stef. I have posted a follow-up question. if you could help with that too pls: https://stackoverflow.com/questions/62336455/python-pandas-data-cleaning – rogerwhite Jun 12 '20 at 02:17