Thank you, Mark and Gordon. The hostnames I needed to collect are always in the same field, at least in the lines of the file that matter, so I ended up using suggestions from both of you. The code now looks like this; the egrep is there to make sure whatever is in the 9th field looks like a domain name.
for host in $(awk '{ print $9 }' "${TMPDIR}"/* | egrep '[-.0-9a-z][-.0-9a-z]*\.com' | sort -u); do HOSTS+=("$host"); done
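As a sketch of one further simplification, awk can do the pattern check itself, so the separate egrep stage goes away. The sample file and its contents below are made up for illustration, and the pattern is just a shorter spelling of the one above with the dot escaped.

```shell
# Hypothetical sample data standing in for the files under ${TMPDIR}.
workdir=$(mktemp -d)
printf '%s\n' \
  'f1 f2 f3 f4 f5 f6 f7 f8 www.example.com tail' \
  'f1 f2 f3 f4 f5 f6 f7 f8 mail.example.com tail' \
  'f1 f2 f3 f4 f5 f6 f7 f8 www.example.com tail' \
  'f1 f2 f3 f4 f5 f6 f7 f8 not-a-host tail' \
  'short line' > "$workdir/sample.log"

# awk prints field 9 only when it matches the pattern; sort -u then
# sorts and removes duplicates in a single step.
hosts=$(awk '$9 ~ /[-.0-9a-z]+\.com/ { print $9 }' "$workdir"/* | LC_ALL=C sort -u)

printf '%s\n' "$hosts"
rm -rf "$workdir"
```

In bash 4 or later, mapfile -t HOSTS <<< "$hosts" would load the result into the array without a for loop.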
Original script: real 28m11.488s user 26m57.043s sys 0m30.634s
Using awk instead of grepping the entire batch: real 6m14.949s user 5m0.629s sys 0m26.914s
Using awk and with export LANG=C: real 2m50.611s user 1m20.849s sys 0m27.366s
Awesome, thanks for the tips!
For one, do the sort in one step: sort -u. For another, are the hostnames always in the same field? For example, if the lines all come from /var/log/messages, I'd do awk '{print $4;}' | sort -u
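To illustrate the two suggestions above, here is a minimal sketch on a few made-up syslog-style lines (the hostnames alpha and beta and the messages are hypothetical). It assumes the hostname is the 4th whitespace-separated field and checks that sort -u gives the same result as a separate sort and uniq.

```shell
# Hypothetical syslog-style lines; the hostname is the 4th field.
log='Jan  1 00:00:01 alpha sshd[101]: session opened
Jan  1 00:00:02 beta cron[202]: job started
Jan  1 00:00:03 alpha sshd[101]: session closed'

# Two processes after awk: sort, then uniq.
two_step=$(printf '%s\n' "$log" | awk '{ print $4 }' | sort | uniq)

# One process after awk: sort -u sorts and removes duplicates together.
one_step=$(printf '%s\n' "$log" | awk '{ print $4 }' | sort -u)

printf '%s\n' "$one_step"
```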
You have two major performance problems in this script. First, UTF-8 processing is slow. Second, wildcards are EXTREMELY SLOW!
You'll get a HUGE performance boost from starting your regex with some known prefix instead of a wildcard.
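One way to read both points, as a rough sketch rather than a measured recommendation: run grep in the C locale so it works on bytes instead of UTF-8 characters, and give it a fixed string to look for before the bracket-expression regex runs. The candidate list is invented for illustration, and the fixed-string pre-filter and the ^...$ anchoring are assumptions of this sketch, not part of the original script.

```shell
# Hypothetical input: one candidate hostname per line.
candidates='www.example.com
mail.example.org
localhost
db-01.example.com'

# grep -F does a plain substring search for ".com" (no regex matching),
# which cheaply discards lines before the more general pattern runs.
# LC_ALL=C makes both greps treat the input as bytes rather than UTF-8.
matches=$(printf '%s\n' "$candidates" \
  | LC_ALL=C grep -F '.com' \
  | LC_ALL=C grep -E '^[-.0-9a-z]+\.com$')

printf '%s\n' "$matches"
```

Whether the pre-filter pays off depends on how many lines it rejects, so it is worth timing on the real data.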