Run awk in parallel

Question

I have the code below, which works successfully, and is used to parse, clean log files (very large in size) and output into smaller sized files. Output filename is the first 2 characters of each line. However, if there is a special character in these 2 characters, then it needs to be replaced with a '_'. This will help ensure there is no illegal character in the filename.

This would take about 12-14 mins to process 1 GB worth of logs (on my laptop). Can this be made faster?

Is it possible to run this is parallel? I am aware I could do }' "$FILE" &. However, I tested and that does not help much. Is it possible to ask awk to output in parallel - what is the equivalent of print $0 >> Fpath & ?

Any help will be appreciated.

Sample log file

"[email protected]:datahere2     
[email protected]:datahere2
[email protected] datahere2
[email protected];dtat'ah'ere2 
wrongemailfoo.com
[email protected];data.is.junk-Œœ
[email protected]:datahere2

Expected Output

# cat em 
[email protected]:datahere2     
[email protected]:datahere2
[email protected]:datahere2
[email protected]:dtat'ah'ere2 
[email protected]:datahere2

# cat errorfile
wrongemailfoo.com
[email protected];data.is.junk-Œœ

Code:

#/bin/sh
pushd "_test2" > /dev/null
for FILE in *
do
    awk '
    BEGIN {
        FS=":"
    }
    {
        gsub(/^[ \t"'\'']+|[ \t"'\'']+$/, "")
        $0=gensub("[,|;: \t]+",":",1,$0)
        if (NF>1 && $1 ~ /^[[:alnum:]_.+-]+@[[:alnum:]_.-]+\.[[:alnum:]]+$/ && $0 ~ /^[\x00-\x7F]*$/)
        {
            Fpath=tolower(substr($1,1,2))
            Fpath=gensub("[^[:alnum:]]","_","g",Fpath)
            print $0 >> Fpath
        }
        else
            print $0 >> "errorfile"
    }' "$FILE"
done
popd > /dev/null

How many distinct files will be generated. The example that you have has only one output file (em). It is safe to assume real code will have ~1000 output files (36*36) ? Also, can you provide line count (approximate), as processing is impacted by line count and total size — dash-o, Jun 18 '20 at 06:01
Suggest you check what is the bottleneck: IO, CPU, ... Run 'time ...', and check user time vs clock time. If you machine has multiple cores (most like) parallel processing can reduce the impact of user time. If ratio user time/clock time is low, the problem is somewhere else — dash-o, Jun 18 '20 at 06:08
potential files: 37*37 [26 alphabets, 10 digits, underscore] ; indeed i wish to run this in parallel. But how do i do that? pls help with the code (am a newbie) — rogerwhite, Jun 18 '20 at 06:15
I haven't tested this idea because there is no sensibly sized data available to test on, but I wonder if you could prefilter the files through `tr -d` to get rid of spaces, tabs, quotes, slashes apostrophes before feeding them to `awk` so you don't have to do all that checking with `gensub`. I also wonder how costly that check is that every character is less than \x7F - maybe that is best pre-filtered by a `grep`. — Mark Setchell, Jun 18 '20 at 08:24
I also wonder about the overhead of calling `gensub()` and compiling a regex to work out a 2-letter `Fpath` and would maybe try a pair of `if` statements to check the 2 letters in turn. — Mark Setchell, Jun 18 '20 at 08:33
It might be an idea to profile the code a bit to see what the bottlenecks are. First, try commenting out the whole `if-then-else` and running it to see how long the initial `gsub()` and `gensub()` take. Then try still running the `if` test but commenting out the output to Fpath to see how much longer the `if` test takes. — Mark Setchell, Jun 18 '20 at 08:38

Ed Morton · Accepted Answer · 2020-06-20T11:32:47.743

Look up the man page for the GNU tool named parallel if you want to run things in parallel but we can vastly improve the execution speed just by improving your script.

Your current script makes 2 mistakes that greatly impact efficiency:

Calling awk once per file instead of once for all files, and
Leaving all output files open while the script is running so awk has to manage them

You currently, essentially, do:

for file in *; do
    awk '
        {
            Fpath = substr($1,1,2)
            Fpath = gensub(/[^[:alnum:]]/,"_","g",Fpath)
            print > Fpath
        }
     ' "$file"
done

If you do this instead it'll run much faster:

sort * |
    awk '
        { curr = substr($0,1,2) }
        curr != prev {
            close(Fpath)
            Fpath = gensub(/[^[:alnum:]]/,"_","g",curr)
            prev = curr
        }
        { print > Fpath }
    '

Having said that, you're manipulating your input lines before figuring out the output file names so - this is untested but I THINK your whole script should look like this:

#/usr/bin/env bash

pushd "_test2" > /dev/null

awk '
    {
        gsub(/^[ \t"'\'']+|[ \t"'\'']+$/, "")
        sub(/[,|;: \t]+/, ":")
        if (/^[[:alnum:]_.+-]+@[[:alnum:]_.-]+\.[[:alnum:]]+:[\x00-\x7F]+$/) {
            print
        }
        else {
            print > "errorfile"
        }
    } 
' * |
sort -t':' -k1,1 |
awk '
    { curr = substr($0,1,2) }
    curr != prev {
        close(Fpath)
        Fpath = gensub(/[^[:alnum:]]/,"_","g",curr)
        prev = curr
    }
    { print > Fpath }
'

popd > /dev/null

Note the use of $0 instead of $1 in the scripts - that's another performance improvement because awk only does field splitting (which takes time of course) if you name specific fields in your script.

Ed: to save space, I tried to change `print > Fpath` to `print | "gzip -f -9 >> " Fpath`. This throws out strange errros, with random filenames. Any thoughts pls? `awk: cmd. line:10: (FILENAME=- FNR=92538) fatal: cannot open pipe `gzip -f -9 >> 5z' for output: Resource temporarily unavailable` — rogerwhite, Jun 20 '20 at 01:42
Or would it be better if I put in a `gzip -9 -f -r .` at the end of the script — rogerwhite, Jun 20 '20 at 02:00
You're probably still closing `Fpath` instead of the full pipe you have open ending in `Fpath`. Try changing `Fpath = gensub(/[^[:alnum:]]/,"_","g",curr)` to `Fpath = "gzip -f -9 >> " gensub(/[^[:alnum:]]/,"_","g",curr)` and `print > Fpath` to `print | Fpath`. — Ed Morton, Jun 20 '20 at 02:00
Ed: pls refer to : https://stackoverflow.com/questions/62480910/make-awk-efficient-again — rogerwhite, Jun 20 '20 at 02:21

dash-o · Answer 2 · 2020-06-18T06:39:13.497

Assuming multiple cores are available, the simple way to run parallel is to use xargs, Depending on your config try 2, 3, 4, 5, ... until you find the optimal number. This assumes that there are multiple input files, and that there is NO single files that is much larger than all other files.

Notice added 'fflush' so that lines will not be split. This will have some negative performance impact, but is required, assuming you the individual input files to get merged into single set of output files. Possible to wrokaround this problem by splitting each file, and then merging the combined files.

#! /bin/sh
pushd "_test2" > /dev/null
ls * | xargs --max-procs=4 -L1 awk '
    BEGIN {
        FS=":"
    }
    {
        gsub(/^[ \t"'\'']+|[ \t"'\'']+$/, "")
        $0=gensub("[,|;: \t]+",":",1,$0)
        if (NF>1 && $1 ~ /^[[:alnum:]_.+-]+@[[:alnum:]_.-]+\.[[:alnum:]]+$/ && $0 ~ /^[\x00-\x7F]*$/)
        {
            Fpath=tolower(substr($1,1,2))
            Fpath=gensub("[^[:alnum:]]","_","g",Fpath)
            print $0 >> Fpath
            fflush(Fpath)
        }
        else
            print $0 >> "errorfile"
            fflush("errorfile")

    }' "$FILE"
popd > /dev/null

From practical point of view you might want to create an awk script, e.g., split.awk

#! /usr/bin/awk -f -
BEGIN {
    FS=":"
}
{
    gsub(/^[ \t"'\'']+|[ \t"'\'']+$/, "")
    $0=gensub("[,|;: \t]+",":",1,$0)
    if (NF>1 && $1 ~ /^[[:alnum:]_.+-]+@[[:alnum:]_.-]+\.[[:alnum:]]+$/ && $0 ~ /^[\x00-\x7F]*$/)
    {
        Fpath=tolower(substr($1,1,2))
        Fpath=gensub("[^[:alnum:]]","_","g",Fpath)
        print $0 >> Fpath
    }
    else
        print $0 >> "errorfile"
}

And then the 'main' code will look like below, easier to manage.

xargs --max-procs=4 -L1 awk -f split.awk

Run awk in parallel

2 Answers2