On 2009-12-29 23:44, Noob Centos Admin wrote:
> My Centos 5 server has seen the average load jump through the roof
> recently despite having no major additional clients placed on it.
> Previously, I was looking at an average of less than 0.6 load. I had a
> monitoring script that sends an email warning me if the current load
> stayed above 0.6 for more than 2 minutes. This script used to trigger
> perhaps once an hour during peak periods. Even so, I seldom saw numbers
> higher than 1.x.
>
> On 4th Dec, somebody from an Indian IP range started hammering my SMTP
> service, attempting to use it as an open relay. Naturally that didn't
> work and only ended up bloating my typical 400KB daily log report into
> 2MB~4MB affairs.
>
> After observing for a few days to determine the IP range, I started
> blocking the Indian subnet with apf. Initially I had problems getting
> apf to work properly, but after a couple of days I managed to get the
> block working and my daily log went back down to the expected size when
> all those connection attempts disappeared from exim's log.
>
> Now this is when my server load started to shoot through the roof, with
> figures like 8.64 5.90 3.62 being reported by my monitoring script,
> triggering so often that I had to raise my threshold to 1.6 to keep my
> own script from spamming me.
>
> I've tried changing several things on the server, since initially it
> seemed like the high load might be due to I/O wait. So I turned off
> non-essential services like OpenNMS to see if that had any effect. I
> also turned off apf and inserted rules manually into iptables to reduce
> the number of iptables rules the system has to process.
>
> All that doesn't seem to help much. I'm still getting consistent server
> loads in the 2.x to 3.x range almost all the time.
>
> The problem is that, using top, none of my processes are showing
> abnormal CPU%. Most are well under 5%, and manually adding them up
> doesn't come to the 200% to 300% that load figures of 2.x and 3.x are
> indicating.
>
> Even top's own summary says CPU % is in the 20~30% range. What's
> worrying is that System% is also in the same range. I have no idea what
> "system" is doing, since it appears that anything running inside the
> kernel is lumped under "system". And even totalling both % up, I
> would expect 50~60% to translate to a load of 0.5~0.6, yet the
> system load stats are 5x what's expected.
>
> I've installed utilities like dstat to try to see if I can figure out
> which process is making the system calls that are clogging up the
> server, but either I don't understand it or it's not the right tool.
>
> So I'd appreciate some advice on what I should do next to identify
> the cause. Thanks in advance!

Dstat could at least tell you whether your problem is CPU or I/O. Even
better, run

    vmstat 2 10

and look at the first two columns. Which column has the higher numbers?
If r, you're CPU-bound. If b, you're I/O-bound.

If you're I/O-bound, I suggest you use atop to determine which processes
take disk time. You can also use

    iostat -x 2 10

I really suggest you read up on vmstat and iostat; they will always be
helpful.

Did you check whether you have a defective disk or a rebuilding array?
That could be the cause.

Regards,
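
P.S. In case it helps with reading the vmstat output: on Linux the load average counts tasks in uninterruptible sleep (usually waiting on disk) as well as runnable ones, which is why load can sit at 2.x or 3.x while CPU% stays low. Below is a rough sketch of the r-versus-b comparison, not a polished tool. The function names (summarize_vmstat, verdict, sample_vmstat) are made up for illustration, and it assumes Python 3.7 or later, which a stock CentOS 5 box will not have, so you may need to paste the vmstat output into it on another machine.

```python
import subprocess


def summarize_vmstat(output):
    """Return (avg_r, avg_b) from the text output of `vmstat <interval> <count>`."""
    rows = []
    for line in output.splitlines():
        fields = line.split()
        # Data lines start with two integers (r and b); header lines do not.
        if len(fields) >= 2 and fields[0].isdigit() and fields[1].isdigit():
            rows.append((int(fields[0]), int(fields[1])))
    # The first report covers the time since boot, so skip it when we can.
    samples = rows[1:] if len(rows) > 1 else rows
    if not samples:
        raise ValueError("no vmstat data lines found")
    avg_r = sum(r for r, _ in samples) / len(samples)
    avg_b = sum(b for _, b in samples) / len(samples)
    return avg_r, avg_b


def verdict(avg_r, avg_b):
    """Apply the rule of thumb: more blocked than runnable means I/O-bound."""
    if avg_b > avg_r:
        return "I/O-bound"
    if avg_r > avg_b:
        return "CPU-bound"
    return "inconclusive"


def sample_vmstat(interval=2, count=10):
    """Run vmstat and return its output as text (assumes vmstat is on PATH)."""
    result = subprocess.run(
        ["vmstat", str(interval), str(count)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

To try it on a live box, something like `print(verdict(*summarize_vmstat(sample_vmstat())))` should do. It only automates the rule of thumb above; atop and iostat are still what tell you which process or disk is responsible.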