On Thu, Jun 28, 2012 at 01:30:33PM -0500, Sean Carolan wrote:
> This snippet of code pulls an array of hostnames from some log files. It has to parse around 3GB of log files, so I'm keen on making it as efficient as possible. Can you think of any way to optimize this to run faster?
If the key phrase is *as efficient as possible*, then I would say you want a compiled pattern search. Lex is the tool for this, and for this job it is not hard to use. Lex will generate a specific scanner(*) in C or C++ (depending on which flavor of lex you use), and it will probably be table-based. Grep and awk, in contrast, build their matchers on the fly at run time, and specifying complicated regular expressions in them is somewhat clumsier.
(*) strictly speaking, you are *scanning* not *parsing*. Parsing involves a grammar, and there's no grammar here. If it develops that these domain names are context sensitive, then you will need a grammar.
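To make that concrete, here is a minimal sketch of the kind of scanner I mean (this assumes flex, and a deliberately loose idea of what a hostname looks like; the pattern is mine, not taken from your logs):

%{
/* Print anything that looks roughly like a hostname; ignore the rest.
   Error checking is omitted for brevity. */
#include <stdio.h>
%}
%option noyywrap
LABEL   [A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?
%%
{LABEL}(\.{LABEL})+    { printf("%s\n", yytext); }
.|\n                   { /* discard everything else */ }
%%
int main(int argc, char **argv)
{
    if (argc > 1)
        yyin = fopen(argv[1], "r");    /* default input is stdin */
    yylex();
    return 0;
}

Build it with something like "flex scan.l && cc -O2 lex.yy.c -o scan" and point it at a log file; the generated table-driven matcher does the whole extraction in one pass.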
The suggestions of others -- setting LANG, cutting a specific field, and so on -- are all very valuable, and may be *practically* more valuable than writing a scanner with lex; or they could be used in conjunction with a "proper" scanner.
Note that lex will allow you a much better definition of "domain name" -- names with more than one suffix, names of arbitrary complexity, names that violate the RFCs, purely numeric names, case-insensitive matching, names that fit special templates such as "*.cn" or "goog*.*", and so on.
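Continuing the sketch above (same caveats: flex, and purely illustrative patterns), a handful of such rules might read:

%{
#include <stdio.h>
%}
%option noyywrap case-insensitive
LABEL   [A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?
HOST    {LABEL}(\.{LABEL})+
%%
{LABEL}(\.{LABEL})*\.cn   { printf("CN     %s\n", yytext); }
goog{LABEL}(\.{LABEL})+   { printf("GOOG   %s\n", yytext); }
{HOST}                    { printf("OTHER  %s\n", yytext); }
.|\n                      ;
%%
int main(void) { yylex(); return 0; }

Lex takes the longest match, with ties going to the earlier rule, so "foo.cnn.com" still falls through to the general rule while "foo.example.cn" is caught by the first.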
If you are unfamiliar with lex, note that it is the front end for many a compiler.
BTW, you could easily fold a sorting step into the lex program and eliminate the need for an external sort. This might be done in awk, too, but in lex it would be more natural: you simply never enter duplicates in the tree, and when the run is over you traverse the tree and out come the unique hostnames. I'm assuming you'll have many duplicates. (You could even keep a count per name, if you're interested in which hosts are "popular".) Consider btree(3) or hash(3) for this; a rough sketch follows.
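Here is one way that might look. I have used the POSIX tsearch(3) tree rather than btree(3)/hash(3), only because it keeps the example self-contained; the names and patterns are illustrative, not lifted from your script:

%{
/* Fold the dedup/count into the scanner itself.  tsearch(3) stands in
   for btree(3)/hash(3).  Error checking is omitted for brevity. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <search.h>

struct entry { char *name; long count; };
static void *root;                            /* tsearch tree root */

static int cmp(const void *a, const void *b)
{
    return strcmp(((const struct entry *)a)->name,
                  ((const struct entry *)b)->name);
}

static void note(const char *name)            /* insert, or bump the count */
{
    struct entry *e = malloc(sizeof *e), **found;
    e->name = strdup(name);
    e->count = 1;
    found = tsearch(e, &root, cmp);
    if (*found != e) {                        /* duplicate: keep the old node */
        (*found)->count++;
        free(e->name);
        free(e);
    }
}

static void dump(const void *nodep, VISIT which, int depth)
{
    const struct entry *e = *(const struct entry *const *)nodep;
    (void)depth;
    if (which == postorder || which == leaf)  /* in-order, i.e. sorted */
        printf("%ld\t%s\n", e->count, e->name);
}
%}
%option noyywrap
LABEL   [A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?
%%
{LABEL}(\.{LABEL})+    { note(yytext); }
.|\n                   ;
%%
int main(void)
{
    yylex();                                  /* reads stdin by default */
    twalk(root, dump);                        /* unique names, sorted, with counts */
    return 0;
}

Something like "flex count.l && cc -O2 lex.yy.c -o count && ./count < big.log" then does the extraction, dedup and sort in a single pass, with no external sort or uniq.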
Dave