awk regex: Help matching CSV file that has fields with comma within double quotes

Question

Need your help !!!

I have an input file 1.txt as follows:

"[email protected],[email protected],[email protected]",field2,field3,field4
[email protected],field2,field3,field4
field1,"[email protected],[email protected], [email protected]",field3,field4
field1,"[email protected],[email protected]",,field4
field1,"[email protected]""1234,999,[email protected]",,field4

I want to identify rows that do not have exactly four(4) columns and remove them from the file before loading to a PostGres Table using the copy command.

I am using awk as follows:

awk -v FPAT='([^,]+)||("[^"]+")||(^[ ]*$)' '{printf("Row:%d,NF=%d, %s, %s, %s, %s\n",NR, NF, $1,$2,$3,$4)}' 1.txt

And the output is;

Row:1,NF=4, "[email protected],[email protected], [email protected]", field2, field3, field4
Row:2,NF=4, [email protected], field2, field3, field4
Row:3,NF=4, field1, "[email protected],[email protected], [email protected]", field3, field4
Row:4,NF=4, field1, "[email protected],[email protected]", , field4
Row:5,NF=6, field1, "[email protected]""1234, 999, [email protected]"

You can see the problem with the last row. It has both double quotes and two commas in the value and that is screwing things up. The NF for that row is 6 when it should be 4.

Can anyone suggest the regex to identify the value "[email protected]""1234,999,[email protected]"

Thanks. Venki

Just to be clear the value "[email protected]""1234,999,[email protected]" is a valid value, only problem is that it has double quotes and comma in the value. — Venki R, Jan 01 '19 at 11:32

Pushpesh Kumar Rajwanshi · Accepted Answer · 2019-01-01T18:58:25.027

0

One way for solving your problem can be, change the quoted regex part from ("[^"]+") to (".+") which will allow it to capture quoted text including comma and double quote and actually any character. And also, change the order of regex in alternations. Basically, change your regex from this,

([^,]+)||("[^"]+")||(^[ ]*$)

to,

(".+")||([^,]+)||(^[ ]*$)

This way, it will first attempt to capturing a quoted data (based upon first regex (".+")), and will capture "[email protected]""1234,999,[email protected]" where as earlier it was trying to capture this "[email protected]" because your regex was ("[^"]+") which won't allow any double quote or comma to get captured inside the data. So this should fix your capturing problem.

Upon actually running this command,

awk -v FPAT='(".+")||([^,]+)||(^[ ]*$)' '{printf("Row:%d,NF=%d, %s, %s, %s, %s\n",NR, NF, $1,$2,$3,$4)}' awk.txt

on your file data 1.txt1 as mentioned in your post, I got following output, which hopefully should be correct as you expected.

Row:1,NF=4, "[email protected],[email protected],[email protected]", field2, field3, field4
Row:2,NF=4, [email protected], field2, field3, field4
Row:3,NF=4, field1, "[email protected],[email protected], [email protected]", field3, field4
Row:4,NF=4, field1, "[email protected],[email protected]", , field4
Row:5,NF=4, field1, "[email protected]""1234,999,[email protected]", , field4

Let me know if this works for you?

edited Jan 01 '19 at 18:58

answered Jan 01 '19 at 12:08

Pushpesh Kumar Rajwanshi

18,127
2
19
36

1

Absolute genius, worked like a charm!!!, I was almost there but not quite there and that is the difference you all make. You saved my hide, thanks so much. Will close the issue now. – Venki R Jan 02 '19 at 05:20
@user3579374: Glad I could help. Please consider accepting my answer if it solved your issue. Thanks. – Pushpesh Kumar Rajwanshi Jan 02 '19 at 05:40
Sorry - it did create one more problem. There is one more field with the same pattern. And so now it is considering everything in-between the first double quote " to the last double quote in the row and identifying them as one field. Basically my problem is to identify one field as anything that starts with a double quote, ends with a double-quote and then followed by a comma. Imagine $3 == $2 in the last row, can we still get NF == 4 for that row? – Venki R Jan 02 '19 at 07:07
@VenkiR: Yes, that will be a problem if you have such kind of data in more than one column, because it `".+"` is a greedy capture and it will capture everything until last `"`. One work around can be to make it non-greedy capture and have a positive look ahead to stop as soon as `"` is immediately followed by a comma but I don't think look aheads are supported in AWK. However, in that case, you may try using Perl for same. Let me know if you need me to help with a Perl based solution. – Pushpesh Kumar Rajwanshi Jan 02 '19 at 08:19
1

Pushpesh: For now, I did some pre-processing with sed and removed all occurrences of "", and went back to my previous regex FPAT='([^,]+)||("[^"]+")||(^[ ]*$)' and that seems to work. So I will accept your answer and close the issue, thanks for all the help. Much appreciated!!!! – Venki R Jan 02 '19 at 08:37

awk regex: Help matching CSV file that has fields with comma within double quotes

1 Answers1