This snippet of code pulls an array of hostnames from some log files. It has to parse around 3GB of log files, so I'm keen on making it as efficient as possible. Can you think of any way to optimize this to run faster?
HOSTS=()
for host in $(grep -h -o "[-.0-9a-z][-.0-9a-z]*.com" ${TMPDIR}/* | sort | uniq); do
    HOSTS+=("$host")
done
Sean Carolan wrote:
This snippet of code pulls an array of hostnames from some log files. It has to parse around 3GB of log files, so I'm keen on making it as efficient as possible. Can you think of any way to optimize this to run faster?
HOSTS=()
for host in $(grep -h -o "[-.0-9a-z][-.0-9a-z]*.com" ${TMPDIR}/* | sort | uniq); do
    HOSTS+=("$host")
done
For one, do the sort in one step: sort -u. For another, are the hostnames always in the same field? For example, if the files are all in /var/log/messages format, I'd do awk '{print $4;}' | sort -u
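Putting both suggestions together in the original loop might look like this (a sketch only; the field number depends on your log format):

HOSTS=()
for host in $(awk '{ print $4 }' ${TMPDIR}/* | sort -u); do
    HOSTS+=("$host")
done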
mark
On 06/28/2012 11:30 AM, Sean Carolan wrote:
Can you think of any way to optimize this to run faster?
HOSTS=()
for host in $(grep -h -o "[-.0-9a-z][-.0-9a-z]*.com" ${TMPDIR}/* | sort | uniq); do
    HOSTS+=("$host")
done
You have two major performance problems in this script. First, UTF-8 processing is slow. Second, wildcards are EXTREMELY SLOW!
You'll get a small performance improvement by using a C locale, *if* you know that all of your text will be ASCII (hostnames will be). You can set LANG either for the whole script or just for grep/sort:
$ export LANG=C
$ env LANG=C grep ... | env LANG=C sort
I don't think you'll get much from running uniq in a C locale.
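One caveat worth adding: LC_ALL, when set, overrides LANG and every LC_* category, so exporting LANG alone won't help on a system that already sets LC_ALL. Forcing the whole pipeline into the C locale regardless looks like:

$ export LC_ALL=C
$ grep ... | sort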
You'll get a HUGE performance boost from prefixing your search with some known prefix to your regex. As it is written, your regex will iterate over every character in each line. If that character is a member of the first set, grep will then iterate over all of the following characters until it finds one that isn't a match, then check for ".com". That second loop increases the processing load tremendously. If you know the prefix, use it, and cut it out in a subsequent stage.
$ grep 'host: [-.0-9a-z][-.0-9a-z]*.com' ${TMPDIR}/*
$ egrep '(host:|hostname:|from:) [-.0-9a-z][-.0-9a-z]*.com' \
    ${TMPDIR}/*
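For the "cut it out in a subsequent stage" step, one possible follow-up (assuming the literal "host: " prefix from the first example, and escaping the dot so ".com" is matched literally):

$ grep -h -o 'host: [-.0-9a-z][-.0-9a-z]*\.com' ${TMPDIR}/* | cut -d' ' -f2 | sort -u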
On 06/28/2012 12:15 PM, Gordon Messmer wrote:
You have two major performance problems in this script. First, UTF-8 processing is slow. Second, wildcards are EXTREMELY SLOW!
Naturally, you should test both on your own data. I'm amused to admit that I tested my own advice against my mail log and got more improvement from the LANG setting than from the string prefix. The combination of the two reduced the time to run your pattern against my mail logs by about 90%.
Thank you Mark and Gordon. The hostnames I needed to collect are in the same field, at least in the lines of the file that matter, so I ended up using suggestions from both of you; the code looks like this now. The egrep is there to make sure whatever is in the 9th field looks like a domain name.
for host in $(awk '{ print $9 }' ${TMPDIR}/* | egrep "[-.0-9a-z][-.0-9a-z]*.com" | sort -u); do
    HOSTS+=("$host")
done
Original script:
    real    28m11.488s
    user    26m57.043s
    sys     0m30.634s

Using awk instead of grepping the entire batch:
    real    6m14.949s
    user    5m0.629s
    sys     0m26.914s

Using awk and with export LANG=C:
    real    2m50.611s
    user    1m20.849s
    sys     0m27.366s
Awesome, thanks for the tips!
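As a side note, the loop can be dropped entirely on bash 4+ by reading the pipeline straight into the array with mapfile (shown with the dot escaped, since an unescaped "." in the pattern matches any character):

mapfile -t HOSTS < <(awk '{ print $9 }' ${TMPDIR}/* | egrep "[-.0-9a-z][-.0-9a-z]*\.com" | sort -u)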
Sean Carolan wrote:
Thank you Mark and Gordon. The hostnames I needed to collect are in the same field, at least in the lines of the file that matter, so I ended up using suggestions from both of you; the code looks like this now. The egrep is there to make sure whatever is in the 9th field looks like a domain name.
for host in $(awk '{ print $9 }' ${TMPDIR}/* | egrep "[-.0-9a-z][-.0-9a-z]*.com" | sort -u); do
    HOSTS+=("$host")
done
*sigh* awk is not "cut". What you want is:

awk '{if (/[-.0-9a-z][-.0-9a-z]*.com/) { print $9;}}' | sort -u
No grep needed; awk looks for what you want *first* this way.
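The same test in awk's terser pattern-action form (a pattern placed before the action block is equivalent to the if inside it):

awk '/[-.0-9a-z][-.0-9a-z]*.com/ { print $9 }' ${TMPDIR}/* | sort -u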
mark, who learned awk in the very early nineties, writing 100-200 line awk scripts....
*sigh* awk is not "cut". What you want is:

awk '{if (/[-.0-9a-z][-.0-9a-z]*.com/) { print $9;}}' | sort -u
No grep needed; awk looks for what you want *first* this way.
Thanks, Mark. This is cleaner code, but it benchmarked slower than the awk-then-egrep version.
    real    3m35.550s
    user    2m7.186s
    sys     0m27.793s
I'll run it a few more times to make sure that it wasn't some other process slowing it down.
I really need to brush up some more on my awk skills!
*sigh* awk is not "cut". What you want is:

awk '{if (/[-.0-9a-z][-.0-9a-z]*.com/) { print $9;}}' | sort -u
I ended up using this construct in my code; this one pulls out servers that are having trouble checking in with puppet:
awk '{if (/Could not find default node or by name with/) { print substr($15, 2, length($15)-2);}}' ${TMPDIR}/* | sort -u
Thanks again, your knowledge and helpfulness are much appreciated.
On Thu, Jun 28, 2012 at 01:30:33PM -0500, Sean Carolan wrote:
This snippet of code pulls an array of hostnames from some log files. It has to parse around 3GB of log files, so I'm keen on making it as efficient as possible. Can you think of any way to optimize this to run faster?
If the key phrase is *as efficient as possible*, then I would say you want a compiled pattern search. Lex is the tool for this, and for this job it is not hard. Lex will generate a specific scanner(*) in C or C++ (depending on what flavor of lex you use), and it will probably be table-based. Grep and awk, in contrast, build their scanners on the fly, and specifying complicated regular expressions is somewhat clumsier in grep and awk.
(*) Strictly speaking, you are *scanning*, not *parsing*. Parsing involves a grammar, and there's no grammar here. If it develops that these domain names are context-sensitive, then you will need a grammar.
The suggestions of others -- setting LANG, cutting a specific field, and so on -- are all very valuable, and may be *practically* more valuable than writing a scanner with lex, or could be used in conjunction with a "proper" scanner.
Note that lex will allow you to use a much better definition for "domain name" -- such as more than one suffix, names of arbitrary complexity, names that may violate RFC, numeric type names, case sensitivity, names that match certain special templates, like "*.cn" or "goog*.*" and so on.
If you are unfamiliar with lex, note that it is the front end for many a compiler.
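A minimal sketch of that approach using flex (the file name, build commands, and exact pattern here are illustrative, not from the thread):

$ cat > hosts.l <<'EOF'
%option noyywrap
%%
[-.0-9a-z][-.0-9a-z]*\.com    { printf("%s\n", yytext); }
.|\n                          ;   /* discard non-matching input */
%%
int main(void) { yylex(); return 0; }
EOF
$ flex hosts.l && cc lex.yy.c -o scanhosts
$ cat ${TMPDIR}/* | ./scanhosts | sort -u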
BTW, you could easily incorporate a sorting function in lex that would eliminate the need for an external sort. This might be done in awk, too, but in lex it would be more natural. You simply would not enter duplicates in the tree. When the run is over, traverse the tree and out come the unique hostnames. I'm assuming you'll have many collisions. (You could even keep a count of collisions, if you're interested in which hosts are "popular".) Consider btree(3) for this or hash(3).
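For comparison, the collision-count idea expressed as an external pipeline (using the hypothetical scanner built above) rather than as a btree inside the scanner:

$ cat ${TMPDIR}/* | ./scanhosts | sort | uniq -c | sort -rn | head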
Dave