[CentOS] Optimizing grep, sort, uniq for speed

Thu Jun 28 20:07:28 UTC 2012
Woodchuck <marmot at pennswoods.net>

On Thu, Jun 28, 2012 at 01:30:33PM -0500, Sean Carolan wrote:
> This snippet of code pulls an array of hostnames from some log files.
> It has to parse around 3GB of log files, so I'm keen on making it as
> efficient as possible.  Can you think of any way to optimize this to
> run faster?

If the key phrase is *as efficient as possible*, then I would say
you want a compiled pattern search.  Lex is the tool for this, and
for this job it is not hard to use.  Lex will generate a specific scanner(*)
in C or C++ (depending on what flavor of lex you use). It will probably
be table-based.  Grep and awk, in contrast, generate scanners on the
fly, and specifying complicated regular expressions is somewhat
clumsier in grep and awk.

(*) strictly speaking, you are *scanning* not *parsing*.  Parsing
involves a grammar, and there's no grammar here.  If it turns out that
these domain names are context-sensitive, then you will need a grammar.
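
To make "table-based" concrete, here is a minimal hand-written sketch in C
of the kind of scanner lex would generate for you.  It is not lex output,
and the pattern is my assumption: a hostname-like token is two or more
dot-separated labels of letters, digits and inner hyphens.  The names
(next_state, match_host, next_host) are made up for illustration.

```c
#include <ctype.h>
#include <stddef.h>

/* Character classes and DFA states (all names are illustrative). */
enum { C_ALNUM, C_HYPHEN, C_DOT, C_OTHER, N_CLASSES };
enum { S_START, S_LABEL1, S_HYPHEN1, S_DOT, S_LABELN, S_HYPHENN, N_STATES };
#define DEAD (-1)

/* Transition table: next_state[state][class].  S_LABELN is the only
 * accepting state: at least one dot seen, last character alphanumeric. */
static const signed char next_state[N_STATES][N_CLASSES] = {
    /* S_START   */ { S_LABEL1, DEAD,      DEAD,  DEAD },
    /* S_LABEL1  */ { S_LABEL1, S_HYPHEN1, S_DOT, DEAD },
    /* S_HYPHEN1 */ { S_LABEL1, S_HYPHEN1, DEAD,  DEAD },
    /* S_DOT     */ { S_LABELN, DEAD,      DEAD,  DEAD },
    /* S_LABELN  */ { S_LABELN, S_HYPHENN, S_DOT, DEAD },
    /* S_HYPHENN */ { S_LABELN, S_HYPHENN, DEAD,  DEAD },
};

static int char_class(unsigned char c)
{
    if (isalnum(c)) return C_ALNUM;
    if (c == '-')   return C_HYPHEN;
    if (c == '.')   return C_DOT;
    return C_OTHER;
}

/* Length of the longest hostname-like token starting at s, or 0 if none. */
static size_t match_host(const char *s)
{
    int state = S_START;
    size_t i, last_accept = 0;

    for (i = 0; s[i] != '\0'; i++) {
        state = next_state[state][char_class((unsigned char)s[i])];
        if (state == DEAD)
            break;
        if (state == S_LABELN)
            last_accept = i + 1;   /* remember the longest match so far */
    }
    return last_accept;
}

/* Find the next token at or after text[*pos].  On success returns 1, sets
 * *start and *len, and moves *pos past the token; returns 0 at end of text. */
int next_host(const char *text, size_t *pos, const char **start, size_t *len)
{
    while (text[*pos] != '\0') {
        size_t n = match_host(text + *pos);
        if (n > 0) {
            *start = text + *pos;
            *len = n;
            *pos += n;
            return 1;
        }
        (*pos)++;                  /* no token here, try the next byte */
    }
    return 0;
}
```

Call next_host() in a loop over each line of the log, starting with *pos
at 0.  Note this pattern is deliberately crude: it will also pick up things
like "index.html" or a dotted IP address, which is exactly where a better
definition of "domain name" comes in.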

The suggestions of others -- setting LANG, cutting a specific field,
and so on, are all very valuable, and may be *practically* more valuable
than writing a scanner with lex, or could be used in conjunction
with a "proper" scanner.

Note that lex will allow you to use a much better definition for
"domain name" -- such as more than one suffix, names of arbitrary
complexity, names that may violate the RFCs, numeric type names, case
sensitivity, names that match certain special templates, like
"*.cn" or "goog*.*" and so on.

If you are unfamiliar with lex, note that it is the front end for
many a compiler.  

BTW, you could easily build the sort and uniq steps into the lex
scanner itself, which would eliminate the need for an external sort.
This might be done in awk, too, but in lex it would be more natural:
each time a hostname matches, the action inserts it into a tree, and
you simply do not enter duplicates.  When the run is over, traverse the
tree and out come the unique hostnames, already in order.  I'm assuming
you'll have many duplicates.  (You could even keep a count of them, if
you're interested in which hosts are "popular".)  Consider btree(3) or
hash(3) for this; note that hash(3) will not hand the names back in
sorted order.
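
A rough sketch of the tree idea in C, using a plain unbalanced binary tree
rather than btree(3), just to show the shape of it.  The names (host_node,
host_insert, host_walk) are made up, and I'm assuming the hostnames have
already been folded to lower case, since strcmp() is case-sensitive.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One node per unique hostname, with a hit count (illustrative names). */
struct host_node {
    char *name;
    unsigned long count;
    struct host_node *left, *right;
};

/* Insert name into the tree rooted at *root.
 * Returns 1 if a new node was added, 0 if the name was already present
 * (its count is bumped), or -1 if memory ran out. */
int host_insert(struct host_node **root, const char *name)
{
    struct host_node **link = root;
    struct host_node *n;
    size_t len;

    while (*link != NULL) {
        int cmp = strcmp(name, (*link)->name);
        if (cmp == 0) {
            (*link)->count++;      /* duplicate: count it, store nothing */
            return 0;
        }
        link = (cmp < 0) ? &(*link)->left : &(*link)->right;
    }

    n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    len = strlen(name) + 1;
    n->name = malloc(len);
    if (n->name == NULL) {
        free(n);
        return -1;
    }
    memcpy(n->name, name, len);
    n->count = 1;
    n->left = n->right = NULL;
    *link = n;
    return 1;
}

/* In-order traversal: prints "count name" lines in sorted name order. */
void host_walk(const struct host_node *node, FILE *out)
{
    if (node == NULL)
        return;
    host_walk(node->left, out);
    fprintf(out, "%lu %s\n", node->count, node->name);
    host_walk(node->right, out);
}

/* Release the whole tree. */
void host_free(struct host_node *node)
{
    if (node == NULL)
        return;
    host_free(node->left);
    host_free(node->right);
    free(node->name);
    free(node);
}
```

In a lex scanner the action for the hostname rule would call something
like host_insert(&root, yytext), and host_walk(root, stdout) would run
once at end of input.  An unbalanced tree degrades badly if the names
arrive nearly sorted, which is one reason to prefer btree(3) or a balanced
tree for real use.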

Dave
-- 
   Programming is tedious, but it is still fun after all these years.