On 06/28/2012 11:30 AM, Sean Carolan wrote:
Can you think of any way to optimize this to run faster?
HOSTS=()
for host in $(grep -h -o "[-.0-9a-z][-.0-9a-z]*.com" ${TMPDIR}/* | sort | uniq); do
    HOSTS+=("$host")
done
You have two major performance problems in this script. First, UTF-8 processing is slow. Second, wildcards are EXTREMELY SLOW!
You'll get a small performance improvement by using the C locale, *if* you know that all of your text will be ASCII (hostnames will be). You can set LANG either for the whole script or just for grep/sort (if LC_ALL is already set in your environment it overrides LANG, so set LC_ALL instead):
---
$ export LANG=C
---
$ env LANG=C grep ... | env LANG=C sort
---
I don't think you'll get much from running uniq in a C locale.
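As a rough sketch of the locale advice, the pipeline below forces the C locale for grep and sort only. It uses LC_ALL rather than LANG so that an existing LC_ALL or LC_* setting cannot override it. The sample directory and its contents are made up for illustration, the dot before "com" is escaped (a small deviation from the original pattern), and "sort -u" stands in for "sort | uniq".

```shell
#!/bin/sh
# Hypothetical sample data standing in for the files under ${TMPDIR}.
SAMPLE_DIR=$(mktemp -d)
printf 'host: mail.example.com\nfrom: relay.example.com\n' > "$SAMPLE_DIR/a.log"
printf 'host: mail.example.com\nhost: www.example.com\n' > "$SAMPLE_DIR/b.log"

# The C locale applies to grep and sort only; the rest of the script
# keeps whatever locale it was started with.
# sort -u sorts and de-duplicates in a single process.
result=$(LC_ALL=C grep -h -o '[-.0-9a-z][-.0-9a-z]*\.com' "$SAMPLE_DIR"/* |
         LC_ALL=C sort -u)

printf '%s\n' "$result"
rm -rf "$SAMPLE_DIR"
```

Timing the pipeline with and without LC_ALL=C on your real data is the only way to tell how much this buys you.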
You'll get a HUGE performance boost from starting your regex with some known literal prefix. As it is written, your regex will be tried at every character in each line. If that character is a member of the first set, grep will then iterate over all of the following characters until it finds one that isn't a match, and only then check for ".com". That second loop increases the processing load tremendously. If you know the prefix, use it, and cut it out in a subsequent stage.
$ grep 'host: [-.0-9a-z][-.0-9a-z]*.com' ${TMPDIR}/*
$ egrep '(host:|hostname:|from:) [-.0-9a-z][-.0-9a-z]*.com' \
    ${TMPDIR}/*
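The "cut it out in a subsequent stage" step might look something like the sketch below. It assumes the hostnames follow a literal "host: " prefix, which is a guess about your data, and it uses sed to strip that prefix after grep has matched it. The sample files are invented for the example, and the dot before "com" is escaped here.

```shell
#!/bin/sh
# Hypothetical sample data; the "host: " prefix is an assumption about the input.
SAMPLE_DIR=$(mktemp -d)
printf 'x host: mail.example.com y\nnoise.example.com\n' > "$SAMPLE_DIR/a.log"
printf 'host: www.example.com\nhost: mail.example.com\n' > "$SAMPLE_DIR/b.log"

# Stage 1: grep matches only hostnames that follow the known prefix.
# Stage 2: sed removes the prefix from each match.
# Stage 3: sort -u sorts and de-duplicates.
hosts=$(LC_ALL=C grep -h -o 'host: [-.0-9a-z][-.0-9a-z]*\.com' "$SAMPLE_DIR"/* |
        sed 's/^host: //' |
        LC_ALL=C sort -u)

printf '%s\n' "$hosts"
rm -rf "$SAMPLE_DIR"
```

Note that the bare "noise.example.com" line is skipped because it lacks the prefix, so this only works if every hostname you care about is preceded by it. The output can be fed to the same for loop as in the original script to fill the HOSTS array.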