[CentOS] Find reason for heavy load

Wed Dec 30 04:59:48 UTC 2009
John R Pierce <pierce at hogranch.com>

Noob Centos Admin wrote:
> My Centos 5 server has seen its average load jump through the roof 
> recently despite having no major additional clients placed on it. 
> Previously, I was looking at an average load of less than 0.6. I have a 
> monitoring script that sends an email warning me if the current load 
> stays above 0.6 for more than 2 minutes. This script used to trigger 
> perhaps once an hour during peak periods. Even so, I seldom saw 
> numbers higher than 1.x
>
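A minimal sketch of the kind of check described above, assuming Python is available. The class name SustainedLoadAlarm and its parameters are hypothetical, not the poster's actual script. It only fires once the load has stayed above the threshold for longer than the hold time:

```python
import os


class SustainedLoadAlarm:
    """Report True only after load has stayed above a threshold long enough."""

    def __init__(self, threshold=0.6, hold_seconds=120.0):
        self.threshold = threshold
        self.hold_seconds = hold_seconds
        self.above_since = None  # timestamp of the first sample above threshold

    def update(self, now, load):
        # Any sample at or below the threshold resets the timer.
        if load <= self.threshold:
            self.above_since = None
            return False
        if self.above_since is None:
            self.above_since = now
        # Fire only when the excursion has lasted longer than hold_seconds.
        return now - self.above_since > self.hold_seconds


def read_load1():
    # 1-minute load average as reported by the operating system.
    return os.getloadavg()[0]
```

Calling update(time.time(), read_load1()) every few seconds from a loop would approximate the behaviour described; sending the mail is left out.
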
> On 4th Dec, somebody from an Indian IP range started hammering my SMTP 
> service, attempting to use it as an open relay. Naturally that didn't 
> work and only ended up bloating my typical 400KB daily log report into 
> 2MB~4MB affairs.
>
> After observing for a few days to determine the IP range, I started 
> blocking the Indian subnet with apf. Initially I had problems 
> getting apf to work properly, but after a couple of days I managed to get 
> the block working and my daily log went back down to the expected size 
> once all those connection attempts disappeared from exim's log.
>
> This is when my server load started to shoot through the roof, with 
> figures like 8.64 5.90 3.62 being reported by my monitoring script, 
> which was triggering so often that I had to raise my threshold to 1.6 
> to keep my own script from spamming me.
>
> I've tried changing several things on the server, since initially it 
> seemed like the high load might be due to I/O wait. So I tried turning off 
> non-essential services like OpenNMS to see if that had any effect. I 
> also turned off apf and inserted rules manually into iptables to 
> reduce the number of iptables rules the system has to process.
>
> None of that seems to help much. I'm still getting consistent 
> server loads in the 2.x to 3.x range almost all the time.
>
> The problem is that in top, none of my processes are showing abnormal 
> CPU%. Most are well under 5%, and manually adding them up comes nowhere 
> near the 200% to 300% that load figures of 2.x and 3.x seem to indicate.
>
> Even top's own summary says CPU % is in the 20~30% range; what's 
> worrying is that System% is in the same range. I have no idea what 
> "system" is doing, since it appears that anything running inside the 
> kernel is lumped under "system". Even totalling both percentages, I 
> would expect 50~60% to translate to a load of 0.5~0.6, yet the 
> reported system load is 5x that.
>
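For what it's worth, on Linux the load average is not a CPU percentage: it counts tasks that are runnable plus tasks in uninterruptible sleep (state D, typically waiting on disk I/O), so a load of 2 to 3 with 20~30% CPU is possible when tasks are stuck waiting rather than computing. A rough sketch, assuming Python and a Linux /proc; the function names are made up for illustration. It tallies task states so the R and D counts can be compared against the load figure:

```python
import glob
from collections import Counter


def parse_state(stat_text):
    # The command name sits in parentheses and may itself contain spaces
    # or ')', so split on the LAST ')' and take the next field (the state).
    rest = stat_text.rsplit(")", 1)[1]
    return rest.split()[0]


def count_states(stat_texts):
    # Tally one-letter states, e.g. R (runnable), D (uninterruptible), S (sleeping).
    return Counter(parse_state(text) for text in stat_texts)


def read_task_stats(pattern="/proc/[0-9]*/task/[0-9]*/stat"):
    # One stat file per thread; tasks can exit while we are reading.
    texts = []
    for path in glob.glob(pattern):
        try:
            with open(path) as f:
                texts.append(f.read())
        except OSError:
            continue
    return texts


def load_contributors():
    # Snapshot of how many tasks are currently in R and in D state.
    states = count_states(read_task_stats())
    return states["R"], states["D"]
```

A single snapshot is noisy, so sampling load_contributors() a few times is more telling. A D count that stays high points at I/O wait rather than CPU.
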
> I've installed utilities like dstat to try to figure out 
> which process is making the system calls that are clogging up the 
> server, but either I don't understand it or it's not the right tool.
>
> So I'd appreciate some advice on what I should do next to 
> identify the cause. Thanks in advance!

last time I saw something like that, it was a bunch of Chinese 'bots' 
hammering on my public services like ssh.   another admin had turned 
pop3 on too. this created a very heavy load, yet they didn't show up in 
top (bunches of pop3 and ssh processes showed up in ps -auxww, however, 
plus netstat -an