[CentOS] Optimizing grep, sort, uniq for speed

Thu Jun 28 20:39:02 UTC 2012
m.roth at 5-cent.us <m.roth at 5-cent.us>

Woodchuck wrote:
> On Thu, Jun 28, 2012 at 01:30:33PM -0500, Sean Carolan wrote:
>> This snippet of code pulls an array of hostnames from some log files.
>> It has to parse around 3GB of log files, so I'm keen on making it as
>> efficient as possible.  Can you think of any way to optimize this to
>> run faster?
>
> If the key phrase is *as efficient as possible*, then I would say
> you want a compiled pattern search.  Lex is the tool for this, and

That, to me, would be a Big Deal.
<snip>
> BTW, you could easily incorporate a sorting function in lex that
> would eliminate the need for an external sort.  This might be done in awk,
> too, but in lex it would be more natural.  You simply would not
<snip>
Hello, mark, wake up.

Of course, there's an even easier way, just using awk:

awk '/[-.0-9a-z][-.0-9a-z]*\.com/ { hostarray[$9] = 1 }
     END { for (i in hostarray) print i }'

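This dumps it into an associative array - that's one whose indices are
strings - so each hostname is stored only once, with no need for uniq.
Note that awk does not promise any particular order for
"for (i in hostarray)", so pipe the output through sort if you need it
ordered.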
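
A minimal sketch of the same idea wrapped in a shell function, with an
explicit sort at the end because awk leaves the traversal order of
"for (i in array)" unspecified. The function name unique_hosts and the
sample log lines are made up for illustration; field 9 and the .com
pattern are taken from the one-liner above.

```shell
# Sketch only: collect unique hostnames with awk, then sort them explicitly.
# Assumptions: the hostname is field 9 and lines of interest contain a
# name ending in ".com", as in the one-liner above.
unique_hosts() {
    awk '/[-.0-9a-z]+\.com/ { seen[$9] = 1 }       # array keys deduplicate
         END { for (host in seen) print host }' "$@" | sort
}

# Made-up input; real log files would be passed as arguments instead,
# e.g. unique_hosts access.log (hypothetical file name).
printf '%s\n' \
    'a b c d e f g h www.example.com' \
    'a b c d e f g h mail.example.com' \
    'a b c d e f g h www.example.com' | unique_hosts
```

The sort now only sees the unique hostnames rather than every matching
log line, which should be a much smaller input than the 3GB of logs.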

       mark