[CentOS] Optimizing grep, sort, uniq for speed

Thu Jun 28 19:15:14 UTC 2012
Gordon Messmer <yinyang at eburg.com>

On 06/28/2012 11:30 AM, Sean Carolan wrote:
> Can you think of any way to optimize this to run faster?
>
> HOSTS=()
> for host in $(grep -h -o "[-\.0-9a-z][-\.0-9a-z]*.com" ${TMPDIR}/* |
> sort | uniq); do
>      HOSTS+=("$host")
> done

You have two major performance problems in this script.  First, UTF-8
processing is slow.  Second, a regex that starts with a wildcard match
is EXTREMELY SLOW!

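You'll get a small performance improvement by using the C locale, *if* you
know that all of your text will be ASCII (hostnames will be).  You can
set LANG either for the whole script or just for grep/sort.  If LC_ALL is
already set in your environment it overrides LANG, so set LC_ALL instead
in that case: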

---
$ export LANG=C
---
$ env LANG=C grep ... | env LANG=C sort
---

I don't think you'll get much from running uniq in a C locale.
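
As a rough sketch of the per-command approach, something like the following
should work in bash.  The sample files and the names workdir and
unique_hosts are made up for illustration.  It sets LC_ALL rather than LANG
because LC_ALL takes precedence when both are set, and it uses "sort -u" in
place of "sort | uniq", which is a substitution of mine and not part of the
original script.  The dot before "com" is escaped here so that it only
matches a literal dot.

```shell
#!/usr/bin/env bash
# Hypothetical sample data standing in for the files under ${TMPDIR}.
workdir=$(mktemp -d)
printf 'host: b.example.com\nhost: a.example.com\n' > "$workdir/one.log"
printf 'host: a.example.com\n' > "$workdir/two.log"

# Option 1: set the locale once for everything the script runs.
# export LC_ALL=C

# Option 2: set it only for the commands that benefit from it.
# LC_ALL overrides LANG and the other LC_* variables when it is set.
unique_hosts=$(LC_ALL=C grep -h -o '[-.0-9a-z][-.0-9a-z]*\.com' "$workdir"/* |
    LC_ALL=C sort -u)

printf '%s\n' "$unique_hosts"
rm -rf "$workdir"
```

Whether this helps measurably depends on your grep version and data, so
time it with and without the locale override on a real sample.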

You'll get a HUGE performance boost from starting your regex with some
known literal prefix.  As it is written, your regex has to be tried at
every character in each line.  If that character is a member of the
first set, grep will then iterate over all of the following characters
until it finds one that isn't a match, then check for ".com".  That
second loop increases the processing load tremendously.  With a literal
prefix, the expensive part only runs where the prefix appears.  If you
know the prefix, use it, and cut it out in a subsequent stage.

$ grep 'host: [-.0-9a-z][-.0-9a-z]*\.com' "${TMPDIR}"/*
$ grep -E '(host:|hostname:|from:) [-.0-9a-z][-.0-9a-z]*\.com' \
   "${TMPDIR}"/*
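
Putting the pieces together, a minimal sketch of the whole pipeline might
look like this in bash.  The sample log lines, the workdir name, and the
sed stage that strips the prefix are my assumptions; your real prefixes and
file layout may differ.  It also reads the results with a while/read loop
instead of "for host in $(...)", so the shell does not word-split the
output.

```shell
#!/usr/bin/env bash
# Hypothetical sample data standing in for the files under ${TMPDIR}.
workdir=$(mktemp -d)
printf 'GET / HTTP/1.1\nhost: www.example.com\n' > "$workdir/a.log"
printf 'from: mail.example.com\nhostname: www.example.com\nnoise xcom\n' \
    > "$workdir/b.log"

HOSTS=()
# 1. grep matches only where a known prefix appears (-o prints the match,
#    -h suppresses file names).
# 2. sed cuts the prefix: everything up to and including the first space.
# 3. sort -u sorts and removes duplicates in one step.
while IFS= read -r host; do
    HOSTS+=("$host")
done < <(LC_ALL=C grep -E -h -o \
             '(host:|hostname:|from:) [-.0-9a-z]+\.com' "$workdir"/* |
         sed 's/^[^ ]* //' |
         LC_ALL=C sort -u)

printf '%s\n' "${HOSTS[@]}"
rm -rf "$workdir"
```

With the sample data above, HOSTS should end up holding
mail.example.com and www.example.com; the "noise xcom" line is skipped
because it has no prefix and no literal ".com".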