[CentOS] Find reason for heavy load

Wed Dec 30 21:44:27 UTC 2009
Ugo Bellavance <ugob at lubik.ca>

On 2009-12-29 23:44, Noob Centos Admin wrote:
> My CentOS 5 server has seen the average load jump through the roof
> recently despite having no major additional clients placed on it.
> Previously, I was looking at an average of less than 0.6 load, I had a
> monitoring script that sends an email warning me if the current load
> stayed above 0.6 for more than 2 minutes. This script used to trigger
> perhaps once an hour during peak periods. Even so, I seldom see numbers
> higher than 1.x
> On 4th Dec, somebody from an Indian IP range started hammering my SMTP
> service, attempting to use it as an open relay. Naturally that didn't
> work and only ended up bulging my typical 400KB daily log report into
> 2MB~4MB affairs.
> After observing a few days to determine the IP range, I started blocking
> the Indian subnet with apf. Initially I had problems with getting apf to
> work properly but after a couple of days managed to get the block working
> and my daily log went back down to expected size when all those
> connection attempts disappear from exim's log.
> Now this is when my server load started to shoot through the roof with
> figures like 8.64 5.90 3.62 being reported by my monitoring script,
> triggering so often. I had to raise my threshold to 1.6 to keep my own
> script from spamming myself.
> I've tried changing several things on the server, since initially it
> seems like the high load may be due to I/O wait. So I turned off
> non-essential services like OpenNMS to see if that had any effect. I
> also turned off apf and inserted rules manually into iptables to reduce
> the number of iptable rules the system has to process.
> All that doesn't seem to help much, I'm still getting consistent server
> loads in the 2.x to 3.x range almost all the time.
> The problem is using top, none of my processes are showing abnormal
> CPU%, most are well under 5%, manually adding them up doesn't equate the
> 200% to 300% the load figures of 2.x and 3.x are indicating.
> Even top's own summary says CPU % is in the 20~30% range, what's
> worrying is the System% is also in the same range. I have no idea what
> is "system" doing since it appears that anything running inside the
> kernel is lumped under "system". Or why even totalling both % up, I
> would expect 50~60% to translate to the expected load of 0.5~0.6 yet
> system load stats is 5x what's expected.
> I've installed utilities like dstat to try to see if I can figure out
> which process is making the system calls that is clogging up the server
> but either I don't understand it or it's not the right tool.
> So I'll appreciate some advice on how/what should I do next to identify
> the cause. Thanks in advance!

Dstat could at least tell you if your problem is CPU or I/O.

Even better, run

vmstat 2 10

Look at the first two columns.  Which column has the higher numbers?  If r, 
you're CPU-bound.  If b, you're I/O-bound.
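
If you want to do that comparison over a longer capture instead of eyeballing 
it, here is a rough Python sketch.  The function name classify_vmstat and the 
SAMPLE text are made up for illustration (the numbers are not from any real 
server).  It assumes the usual vmstat layout where r and b are the first two 
columns, and it skips the first data row because vmstat reports averages since 
boot there.

```python
def classify_vmstat(text, skip_first=True):
    """Return (mean_r, mean_b, verdict) for captured `vmstat` output.

    Assumes r and b are the first two columns of every data row.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with two integers; header rows do not.
        if len(fields) >= 2 and fields[0].isdigit() and fields[1].isdigit():
            rows.append((int(fields[0]), int(fields[1])))
    # The first data row is the average since boot, not a live sample.
    if skip_first and len(rows) > 1:
        rows = rows[1:]
    if not rows:
        raise ValueError("no vmstat sample rows found")
    mean_r = sum(r for r, _ in rows) / len(rows)
    mean_b = sum(b for _, b in rows) / len(rows)
    if mean_b > mean_r:
        verdict = "io-bound"
    elif mean_r > mean_b:
        verdict = "cpu-bound"
    else:
        verdict = "inconclusive"
    return mean_r, mean_b, verdict


# Invented sample, for illustration only.
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 100000  20000 300000    0    0    10    20  100  200 10  5 80  5  0
 0  3      0 100000  20000 300000    0    0   900   400  300  500  5 10 20 65  0
 1  2      0 100000  20000 300000    0    0   800   500  300  500  5 10 25 60  0
"""

print(classify_vmstat(SAMPLE))
```

To use it on real data, save the output of "vmstat 2 10" to a file and pass 
the file contents to classify_vmstat.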

If you're I/O bound, I suggest you use atop to determine which processes 
take disk time.
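
As a crude complement to atop: on Linux, processes in uninterruptible sleep 
(state D, usually waiting on disk) count toward the load average even though 
they use almost no CPU, which would fit a high load with low CPU%.  The sketch 
below lists them by reading /proc/<pid>/stat.  The helper names are my own, 
and it only gives a snapshot, so run it a few times while the load is high.

```python
import glob


def parse_stat_line(line):
    """Extract (pid, command, state) from the text of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the first "(" and the last ")".
    """
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren])
    comm = line[open_paren + 1:close_paren]
    state = line[close_paren + 1:].split()[0]
    return pid, comm, state


def uninterruptible_processes():
    """Return [(pid, command), ...] for processes currently in state D."""
    found = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as handle:
                pid, comm, state = parse_stat_line(handle.read())
        except (OSError, ValueError, IndexError):
            # The process exited while we were reading, or the line was odd.
            continue
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in uninterruptible_processes():
        print(pid, comm)
```

If the same process names keep showing up in the output, those are the ones 
to look at in atop.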

You can also use iostat -x 2 10.

I really suggest you read up on vmstat and iostat; they will always be helpful.

Did you check whether you have a defective disk or a rebuilding array?  That 
could be the cause.
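
For the rebuilding-array case, and assuming Linux software RAID (md), a 
rebuild shows up as a progress line in /proc/mdstat.  Here is a small sketch 
that picks those lines out.  The function name and the sample text are 
invented for illustration, and it will not see hardware RAID controllers or 
states that are reported without a percentage.

```python
import re


def md_sync_activity(mdstat_text):
    """Map md device name -> (operation, percent done) for arrays that
    show a progress line in /proc/mdstat-style text."""
    active = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\S*)\s*:", line)
        if header:
            # Start of a new array stanza, e.g. "md0 : active raid1 ..."
            current = header.group(1)
            continue
        progress = re.search(
            r"(resync|recovery|reshape|check)\s*=\s*([\d.]+)%", line
        )
        if progress and current is not None:
            active[current] = (progress.group(1), float(progress.group(2)))
    return active


# Invented sample, for illustration only.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]

md1 : active raid1 sdb2[1] sda2[0]
      488279488 blocks [2/2] [UU]
      [==>..................]  resync = 12.6% (61523072/488279488) finish=127.5min speed=55738K/sec

unused devices: <none>
"""

print(md_sync_activity(SAMPLE_MDSTAT))
```

On a live box you would pass it the contents of /proc/mdstat; an empty result 
means no md array is reporting sync progress at that moment.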