[CentOS] Find reason for heavy load

Wed Dec 30 05:55:04 UTC 2009
Ross Walker <rswwalker at gmail.com>

On Dec 29, 2009, at 11:44 PM, Noob Centos Admin  
<centos.admin at gmail.com> wrote:

> My CentOS 5 server has seen its average load jump through the roof
> recently despite having no major additional clients placed on it.
> Previously, I was looking at an average load of less than 0.6. I have
> a monitoring script that sends an email warning me if the current
> load stays above 0.6 for more than 2 minutes. This script used to
> trigger perhaps once an hour during peak periods. Even so, I seldom
> saw numbers higher than 1.x.
>
> On 4th Dec, somebody from an Indian IP range started hammering my
> SMTP service, attempting to use it as an open relay. Naturally that
> didn't work and only ended up bloating my typical 400KB daily log
> report into 2MB~4MB affairs.
>
> After observing for a few days to determine the IP range, I started
> blocking the Indian subnet with apf. Initially I had problems
> getting apf to work properly, but after a couple of days I managed to
> get the block working and my daily log went back down to its expected
> size once all those connection attempts disappeared from exim's log.
>
> This is when my server load started to shoot through the roof,
> with figures like 8.64 5.90 3.62 being reported by my monitoring
> script. It triggered so often that I had to raise my threshold to 1.6
> to keep my own script from spamming me.
>
> I've tried changing several things on the server, since initially it
> seemed like the high load might be due to I/O wait. So I tried turning
> off non-essential services like OpenNMS to see if that had any effect.
> I also turned off apf and inserted rules manually into iptables to
> reduce the number of iptables rules the system has to process.
>
> None of that seems to help much. I'm still getting consistent
> server loads in the 2.x to 3.x range almost all the time.
>
> The problem is that, in top, none of my processes show abnormal
> CPU%. Most are well under 5%, and manually adding them up doesn't come
> close to the 200% to 300% that load figures of 2.x and 3.x indicate.
>
> Even top's own summary says CPU% is in the 20~30% range. What's
> worrying is that System% is also in the same range. I have no idea
> what "system" is doing, since it appears that anything running inside
> the kernel is lumped under "system". Nor do I see why, even after
> totalling both percentages, the 50~60% that I would expect to translate
> to a load of 0.5~0.6 comes with load stats 5x higher than that.
>
> I've installed utilities like dstat to try to figure out which
> process is making the system calls that are clogging up the server,
> but either I don't understand it or it's not the right tool.
>
> So I'd appreciate some advice on what I should do next to
> identify the cause. Thanks in advance!

Try blocking the IPs on the router and see if that helps.

You can also run iostat and look at disk usage, since heavy disk I/O
also generates load.
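
If iostat isn't handy, a rough Python sketch along these lines could show how much
time the CPUs spend in iowait versus user and system. It assumes the aggregate
"cpu" line of /proc/stat with the field order given in proc(5) (user, nice,
system, idle, iowait, irq, softirq). The function names are made up for
illustration, and it needs a reasonably recent Python rather than the one that
ships with CentOS 5.

```python
def parse_cpu_line(line):
    """Return the aggregate 'cpu' tick counters from a /proc/stat line."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Field order per proc(5); any later fields (e.g. steal) are ignored here.
    names = ["user", "nice", "system", "idle", "iowait", "irq", "softirq"]
    return dict(zip(names, (int(v) for v in fields[1:8])))


def cpu_breakdown(before, after):
    """Percentage of elapsed ticks spent in each state between two samples."""
    deltas = dict((k, after[k] - before[k]) for k in before)
    total = sum(deltas.values())
    if total == 0:
        return dict((k, 0.0) for k in deltas)
    return dict((k, 100.0 * v / total) for k, v in deltas.items())


def read_cpu_sample(path="/proc/stat"):
    """Read the current counters; the aggregate line is the first line."""
    with open(path) as f:
        return parse_cpu_line(f.readline())
```

Calling read_cpu_sample() twice a few seconds apart and passing both results to
cpu_breakdown() gives the split for that interval. A large iowait share would
point at the disks rather than at any one CPU-hungry process.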

How many cores does your machine have? Load avg is not normalized by
core count, so a quad core reaches 100% utilization at a load of 4.
High iowait can also inflate the load avg, because Linux counts
processes blocked on I/O as well as runnable ones (which is why the
load can look like more than 100% utilization).
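
To make that concrete, here is a minimal Python sketch, not a polished tool. It
divides a load figure by the core count and lists the processes currently in
state R (runnable) or D (uninterruptible sleep, usually waiting on I/O), which
are the two states Linux counts toward the load average. The helper names are
hypothetical, and it only scans /proc/<pid>/stat, so individual threads under
/proc/<pid>/task are not inspected.

```python
import os


def load_per_core(load_avg, cores):
    """Normalize a load average by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_avg / float(cores)


def parse_stat_state(stat_line):
    """Return (command, state) from the contents of /proc/<pid>/stat.

    The command is wrapped in parentheses and may itself contain spaces
    or parentheses, so split on the first '(' and the last ')'.
    """
    start = stat_line.index("(")
    end = stat_line.rindex(")")
    comm = stat_line[start + 1:end]
    state = stat_line[end + 1:].split()[0]
    return comm, state


def load_contributors(proc="/proc"):
    """List (pid, command, state) for processes in state R or D."""
    found = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                comm, state = parse_stat_state(f.read())
        except (IOError, OSError, ValueError, IndexError):
            continue  # the process most likely exited during the scan
        if state in ("R", "D"):
            found.append((int(entry), comm, state))
    return found
```

With the 8.64 figure from the original post, load_per_core(8.64, 4) comes out
at 2.16 per core on a quad core. Running load_contributors() a few times while
the load is high should show whether the same commands keep turning up in D
state.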

I really wish load were broken down as CPU/memory/disk instead of
the ambiguous load avg, and that ifconfig showed network read/write
utilization.
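
Until then, the raw byte counters in /proc/net/dev can be turned into a rough
throughput figure. The sketch below is only an illustration under that
assumption: the function names are invented here, and it ignores counter
wrap-around, which can happen on 32-bit kernels.

```python
def parse_net_dev(text):
    """Map interface name -> (rx_bytes, tx_bytes) from /proc/net/dev text."""
    counters = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain '|' but no ':'
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:
            continue  # not the 8 receive + 8 transmit columns we expect
        # Receive bytes is the 1st column, transmit bytes the 9th.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def throughput(before, after, seconds):
    """Bytes per second (rx, tx) for each interface seen in both samples."""
    rates = {}
    for name, (rx1, tx1) in after.items():
        if name in before:
            rx0, tx0 = before[name]
            rates[name] = ((rx1 - rx0) / float(seconds),
                           (tx1 - tx0) / float(seconds))
    return rates
```

Reading /proc/net/dev twice, some seconds apart, and feeding both parsed
results to throughput() gives per-interface read/write rates for that window.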

-Ross