Need your help !!!
I have an input file 1.txt as follows:
"[email protected],[email protected],[email protected]",field2,field3,field4
[email protected],field2,field3,field4
field1,"[email protected],[email protected], [email protected]",field3,field4
field1,"[email protected],[email protected]",,field4
field1,"[email protected]""1234,999,[email protected]",,field4
I want to identify rows that do not have exactly four(4) columns and remove them from the file before loading to a PostGres Table using the copy command.
I am using awk as follows:
awk -v FPAT='([^,]+)||("[^"]+")||(^[ ]*$)' '{printf("Row:%d,NF=%d, %s, %s, %s, %s\n",NR, NF, $1,$2,$3,$4)}' 1.txt
And the output is;
Row:1,NF=4, "[email protected],[email protected], [email protected]", field2, field3, field4
Row:2,NF=4, [email protected], field2, field3, field4
Row:3,NF=4, field1, "[email protected],[email protected], [email protected]", field3, field4
Row:4,NF=4, field1, "[email protected],[email protected]", , field4
Row:5,NF=6, field1, "[email protected]""1234, 999, [email protected]"
You can see the problem with the last row. It has both double quotes and two commas in the value and that is screwing things up. The NF for that row is 6 when it should be 4.
Can anyone suggest the regex to identify the value "[email protected]""1234,999,[email protected]"
Thanks. Venki