Re: [CentOS] Optimizing grep, sort, uniq for speed

28 Jun 2012


On 06/28/2012 11:30 AM, Sean Carolan wrote:
> ...
> Can you think of any way to optimize this to run faster?
>
> HOSTS=()
> for host in $(grep -h -o "[-.0-9a-z][-.0-9a-z]*.com" ${TMPDIR}/* |
> sort | uniq); do
>      HOSTS+=("$host")
> done

You have two performance problems in this script.  First, UTF-8 
processing is slow.  Second, a regex that starts with an open-ended 
wildcard match is EXTREMELY SLOW!
You'll get a small performance improvement by using a C locale, *if* you 
know that all of your text will be ASCII (hostnames will be).  You can 
set LANG (or LC_ALL, which takes precedence over LANG when both are 
set) either for the whole script or just for grep/sort:
---
$ export LANG=C
---
$ env LANG=C grep ... | env LANG=C sort
---
I don't think you'll get much from running uniq in a C locale.
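
As a rough sketch of the per-command variant: the example below sets the locale only for grep and sort and leaves uniq alone.  It uses LC_ALL instead of LANG so that it still takes effect when the environment already exports LC_ALL or LC_CTYPE.  The file name sample.log, its contents, and the variable names workdir and locale_result are made up for illustration; the real input would be ${TMPDIR}/*.

```shell
# Hypothetical sample data standing in for the files under ${TMPDIR}.
workdir=$(mktemp -d)
printf 'host: b.example.com\nhost: a.example.com\nhost: b.example.com\n' \
    > "$workdir/sample.log"

# LC_ALL=C is applied to grep and sort only; uniq runs in the default locale.
# The dot before "com" is escaped so it matches a literal dot.
locale_result=$(
    LC_ALL=C grep -h -o '[-.0-9a-z][-.0-9a-z]*\.com' "$workdir"/* |
    LC_ALL=C sort |
    uniq
)

printf '%s\n' "$locale_result"
rm -rf "$workdir"
```

Whether this helps measurably depends on the grep version and the data, so it is worth timing both variants on a real sample.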
You'll get a HUGE performance boost from anchoring your regex with some 
known prefix.  As it is written, grep has to try your regex at every 
character in each line.  If that character is a member of the first 
set, grep will then iterate over all of the following characters until 
it finds one that isn't a match, then check for ".com".  That second 
loop increases the processing load tremendously.  If you know the 
prefix, use it, and cut it out in a subsequent stage.
$ grep 'host: [-.0-9a-z][-.0-9a-z]*\.com' ${TMPDIR}/*
$ grep -E '(host:|hostname:|from:) [-.0-9a-z][-.0-9a-z]*\.com' \
   ${TMPDIR}/*
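
A minimal sketch of the "cut it out in a subsequent stage" step, assuming the prefixes are the ones shown above and that each prefix is followed by a single space.  The sample file, its contents, and the variable names workdir and unique_hosts are hypothetical.  sort -u is used here in place of sort | uniq; it produces the same output with one fewer process.

```shell
# Hypothetical sample data; one line has no known prefix and should be skipped.
workdir=$(mktemp -d)
printf '%s\n' \
    'host: b.example.com ok' \
    'from: a.example.com' \
    'noise c.example.com' \
    'host: b.example.com' > "$workdir/sample.log"

# Stage 1: match only hostnames that follow a known prefix.
# Stage 2: strip the prefix (lowercase letters, a colon, one space).
# Stage 3: sort and de-duplicate in a single step.
unique_hosts=$(
    LC_ALL=C grep -E -h -o '(host:|hostname:|from:) [-.0-9a-z]+\.com' "$workdir"/* |
    LC_ALL=C sed 's/^[a-z]*: //' |
    LC_ALL=C sort -u
)

printf '%s\n' "$unique_hosts"
rm -rf "$workdir"
```

The resulting list can then be loaded into the HOSTS array in the same way as in the original loop.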
