On 2008-05-01 11:13, greg@raystedman.org wrote:
Good Morning,
The discussion about "RayStedman.org Bandwidth" inspired me to write a script that reports the largest bandwidth consumers by IP address and host name. The report looks like this:
  3,867,534,553   66.159.202.142 adsl-66-159-202-142.dslextreme.com.
  3,847,010,060    190.82.182.19 190-82-182-19.adsl.cust.tie.cl.
  1,410,308,739  130.160.110.250
  1,051,088,947    216.57.200.57
I'm sure this kind of thing has been done many times in the past by other tools. I thought I would post the script I created just in case it might be helpful to others on this forum.
Thanks again for your feedback on this topic.
Greg
Thanks, Greg, for the handy script. There are, of course, some possible optimizations, e.g. eliminating the temporary files, reducing the number of commands, etc. There's also a slight bug in your first loop: the last "$thisipbw $thisip" pair is never written to bw.sum, because the loop only writes a pair when the next ip address shows up.
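For what it's worth, here is one way the first loop could be patched, as a rough sketch rather than a drop-in replacement. The function name sum_by_ip and the sample addresses are made up for illustration; the idea is simply to skip the empty group before the first line and to flush the running total once more after the loop ends.

```shell
#!/bin/bash
# Sketch only: sum_by_ip is a hypothetical helper, not part of big_bw.
# Reads "ip bytes" lines (sorted by ip) on stdin, prints "total ip" per address.
sum_by_ip() {
  local thisip="" thisipbw=0 ip bw
  while read -r ip bw; do
    # Apache logs "-" when no bytes were sent; treat that (or a blank) as 0.
    case "$bw" in ''|-) bw=0 ;; esac
    if [ "$thisip" != "$ip" ]; then
      # Write out the previous group, but not the empty one before line 1.
      if [ -n "$thisip" ]; then
        echo "$thisipbw $thisip"
      fi
      thisip=$ip
      thisipbw=$bw
    else
      thisipbw=$(( thisipbw + bw ))
    fi
  done
  # Flush the final group, which the loop body never gets to write.
  if [ -n "$thisip" ]; then
    echo "$thisipbw $thisip"
  fi
}

# Made-up sample input, already sorted by ip address.
printf '%s\n' '192.0.2.1 100' '192.0.2.1 -' '192.0.2.1 50' '192.0.2.7 2000' | sum_by_ip
```

With the sample input, both addresses show up in the output, including the last one.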
Here's my slightly obfuscated one-liner, which I think accomplishes more or less the same thing as your script. (I've broken it into multiple lines, with indents, for readability.) ...
cat "$basedir"/access_log{,.processed}|
  cut -d' ' -f1,10|
  awk '{b[$1]+=$2}END{for(i in b)print b[i],i}'|
  sort -nr|head -20|
  while read b i;do
    echo -n "$b"|
      sed -e :a -e 's/\(.*[0-9]\)\([0-9]\{3\}\)/\1,\2/;ta'|
      sed -e :a -e 's/^.\{1,14\}$/ &/;ta';
    echo -n "$i"|sed -e :a -e 's/^.\{1,15\}$/ &/;ta';
    echo -n " ";
    echo `host "$i"`|sed -e 's/.*)//' -e 's/.*pointer //';
  done
I use the "awk" command (and its associative array feature) to eliminate your first loop entirely. The second loop (to format the output) has been simplified. Your use of "sed" to format the numbers and pad the fields was very clever (and I copied it pretty much as is). I really have to go back and study all the new regular expression features that have been added since the "good old days" when I first picked this up.
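In case the awk part is not obvious, here is the associative-array idea on its own, as a small sketch. The function name total_by_ip and the sample addresses are invented for the example; the input is assumed to be "ip bytes" pairs like the ones cut produces, in any order.

```shell
#!/bin/bash
# Sketch only: total_by_ip is a hypothetical name used for illustration.
total_by_ip() {
  # b[] is an awk associative array keyed by ip address, so no pre-sorting
  # is needed. A "-" byte count is non-numeric, so awk adds it as 0.
  awk '{ b[$1] += $2 } END { for (i in b) print b[i], i }' | sort -nr
}

# Made-up sample input, deliberately unsorted.
printf '%s\n' '192.0.2.7 2000' '192.0.2.1 100' '192.0.2.7 -' '192.0.2.1 50' | total_by_ip
```

The order of "for (i in b)" is unspecified in awk, which is why the totals go through sort -nr afterwards.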
Gilbert
#!/bin/bash
# big_bw -- written by Greg Sims 05/01/08
#
# this script takes as input apache httpd log files access_log and
# access_log.processed. a report is generated that contains one line
# per ip address with the following fields: bandwidth consumed,
# the ip address and the host name associated with the ip address.
#
# it is important to use mod_logio in the creation of the log files
# to ensure the proper number of bytes are recorded in each log
# entry. please see http://www.devside.net/guides/config/bytes-sent
# for how to accomplish this.
# directory where access_log and access_log.processed are located
#
basedir="/var/www/vhosts/raystedman.net/statistics/logs/"
# create bw.raw containing the ip address and bandwidth for each record;
# sort the resulting file by ip address
#
cd /tmp
cat $basedir"access_log" >bw.log
cat $basedir"access_log.processed" >>bw.log

cat bw.log | cut -d' ' --field=1,10 | sort >bw.raw
# read through bw.raw and create bw.sum which contains one line per
# ip address. each line in bw.sum contains the amount of bandwidth
# consumed and the ip address that used the bandwidth
#
thisip=""
rm -f bw.sum
while read inputline; do
  ip=$(echo "$inputline" | cut -d " " -f 1)
  bw=$(echo "$inputline" | cut -d " " -f 2)
  if [ "$bw" = "-" ]; then
    bw=0
  fi

  if [ "$thisip" != "$ip" ]; then
    echo $thisipbw $thisip >>bw.sum
    thisip=$ip
    thisipbw=$bw
  else
    if [ $bw != "-" ]; then
      thisipbw=$(( $thisipbw + $bw ))
    fi
  fi

done < "bw.raw"
# sort bw.sum so the largest amount of bandwidth used is at the top.
# create bw.sum.sort which is the largest 35 consumers of bandwidth.
# write a report to stdout doing some formatting in the process.
#
sort -nr bw.sum | head -n 35 >bw.sum.sort
while read inputline; do
  bw=$(echo "$inputline" | cut -d " " -f 1)
  bw=$(echo "$bw" | sed -e :a -e 's/\(.*[0-9]\)\([0-9]\{3\}\)/\1,\2/;ta')
  ip=$(echo "$inputline" | cut -d " " -f 2)

  echo -n $bw | sed -e :a -e 's/^.\{1,14\}$/ &/;ta'
  echo -n " "
  echo -n $ip | sed -e :a -e 's/^.\{1,15\}$/ &/;ta'
  echo -n " "
  host_name=$(host $ip | sed 's/^.*pointer //' | sed 's/.*DOMAIN)//')
  host_name=$(echo "$host_name" | sed 's/.*alias for //')
  echo $host_name

done <"bw.sum.sort"
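
For anyone who wants to try the two sed formatting loops from the report step in isolation, here is a small sketch. The helper names add_commas and pad_left are made up for this example and are not part of big_bw; the sed expressions are assumed to run under a sed that supports the "\{n,m\}" interval syntax, as GNU sed does.

```shell
#!/bin/bash
# Sketch only: add_commas and pad_left are hypothetical helper names.

# Insert a comma before each trailing group of three digits. The "ta"
# command branches back to label "a" whenever a substitution was made,
# so the substitution repeats until no more commas can be added.
add_commas() {
  echo "$1" | sed -e :a -e 's/\(.*[0-9]\)\([0-9]\{3\}\)/\1,\2/;ta'
}

# Right-align $1 in a field of $2 characters: prepend one space per pass
# for as long as the line is still shorter than the field width.
pad_left() {
  echo "$1" | sed -e :a -e "s/^.\{1,$(( $2 - 1 ))\}\$/ &/;ta"
}

pad_left "$(add_commas 3867534553)" 15
pad_left "$(add_commas 1234)" 15
```

Both lines come out 15 characters wide, with the numbers lined up on the right.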