[CentOS] Optimizing grep, sort, uniq for speed
Sean Carolan
scarolan at gmail.com
Thu Jun 28 19:27:24 UTC 2012
Thank you Mark and Gordon. The hostnames I needed to collect are in
the same field, at least in the lines of the file that matter, so I
ended up using suggestions from both of you. The code looks like this
now. The egrep is there to make sure whatever is in the 9th field
looks like a domain name.
for host in $(awk '{ print $9 }' ${TMPDIR}/* |
    egrep "[-\.0-9a-z][-\.0-9a-z]*.com" | sort -u); do
    HOSTS+=("$host")
done
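
A possible tightening of the loop above, offered as a sketch rather than as the script that produced the timings below. It assumes bash and that the hostnames of interest are lowercase and end in .com. The function name collect_hosts and the sample log lines are made up for illustration. Two things change: awk does the pattern test itself, so egrep drops out of the pipeline, and the dot before "com" is escaped and the pattern anchored (in the original regex that dot matches any character).

```shell
#!/usr/bin/env bash
# Sketch only: collect_hosts and the sample data are hypothetical.

# Print the unique .com-looking hostnames found in field 9 of the given files.
collect_hosts() {
    # awk extracts field 9 and tests it in one pass; the pattern is anchored
    # at both ends and the dot before "com" is a literal dot.
    LC_ALL=C awk '$9 ~ /^[-.0-9a-z]+\.com$/ { print $9 }' "$@" | LC_ALL=C sort -u
}

# Throwaway demo data standing in for the files under ${TMPDIR}.
demo_dir=$(mktemp -d)
printf '%s\n' \
    'a b c d e f g h www.example.com tail' \
    'a b c d e f g h mail.example.com tail' \
    'a b c d e f g h www.example.com tail' \
    'a b c d e f g h not_a_host tail' \
    'short line' > "$demo_dir/sample.log"

# Read one hostname per line into the array, with no word splitting or globbing.
HOSTS=()
while IFS= read -r host; do
    HOSTS+=("$host")
done < <(collect_hosts "$demo_dir"/*)

printf '%s\n' "${HOSTS[@]}"
rm -rf "$demo_dir"
```

Reading with while/read instead of for host in $(...) keeps the shell from word-splitting and glob-expanding the pipeline output, which matters little for hostnames but costs nothing.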
Original script:
real 28m11.488s
user 26m57.043s
sys 0m30.634s
Using awk instead of grepping the entire batch:
real 6m14.949s
user 5m0.629s
sys 0m26.914s
Using awk with export LANG=C:
real 2m50.611s
user 1m20.849s
sys 0m27.366s
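
A small sketch of scoping the C locale to a single command instead of exporting it for the whole script. The sample names are made up. One caveat that is safe to state: if LC_ALL is already set in the environment it overrides LANG, so export LANG=C would then have no effect, while prefixing the command with LC_ALL=C always applies.

```shell
# Sketch only: the input names are hypothetical.
# The LC_ALL=C prefix applies to this one sort invocation; the rest of the
# script keeps whatever locale it had.
sorted=$(printf '%s\n' b.example.com A.example.com a.example.com b.example.com |
    LC_ALL=C sort -u)

# In the C locale the comparison is by byte value, so uppercase sorts first.
printf '%s\n' "$sorted"
```

The same prefix works on awk and grep, so each stage of a pipeline can be switched and timed separately.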
Awesome, thanks for the tips!
> For one, do the sort in one step: sort -u. For another, are the hostnames
> always the same field? For example, if they're all /var/log/messages, I'd
> do awk '{print $4;}' | sort -u
> You have two major performance problems in this script. First, UTF-8
> processing is slow. Second, wildcards are EXTREMELY SLOW!
> You'll get a HUGE performance boost from prefixing your search with some
> known prefix to your regex.