My CentOS 5 server's average load has jumped through the roof recently, despite no major additional clients being placed on it. Previously I was looking at an average load of less than 0.6, and I had a monitoring script that sends an email warning if the current load stays above 0.6 for more than 2 minutes. That script used to trigger perhaps once an hour during peak periods, and even then I seldom saw numbers higher than 1.x.
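[For readers: a monitor of this kind can be a short cron'd shell script reading /proc/loadavg. The sketch below assumes Linux; the 2-minute persistence check is left out and the mail step is replaced by an echo.]

```shell
#!/bin/sh
# Sketch of a load-threshold check (assumes Linux /proc/loadavg).
# A real monitor would also require the load to stay high for 2 minutes
# and send mail instead of echoing.
THRESHOLD=0.6
load=$(cut -d' ' -f1 /proc/loadavg)   # 1-minute load average
if [ "$(awk -v l="$load" -v t="$THRESHOLD" 'BEGIN { print ((l > t) ? 1 : 0) }')" = "1" ]; then
    echo "WARNING: load $load exceeds $THRESHOLD"   # replace with mail(1) in practice
fi
```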
On 4th Dec, somebody from an Indian IP range started hammering my SMTP service, attempting to use it as an open relay. Naturally that didn't work, and it only ended up bloating my typical 400KB daily log report into a 2MB~4MB affair.
After observing for a few days to determine the IP range, I started blocking the Indian subnet with apf. Initially I had problems getting apf to work properly, but after a couple of days I managed to get the block working, and my daily log went back down to the expected size once all those connection attempts disappeared from exim's log.
This is when my server load started to shoot through the roof, with figures like 8.64 5.90 3.62 being reported by my monitoring script, which triggered so often that I had to raise my threshold to 1.6 to keep my own script from spamming me.
I've tried changing several things on the server, since initially it seemed like the high load might be due to I/O wait. I turned off non-essential services like OpenNMS to see if that had any effect. I also turned off apf and inserted rules manually into iptables to reduce the number of iptables rules the system has to process.
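[For illustration, static rules of that sort would look something like the following; the subnet here is a documentation placeholder, not the actual range involved.]

```shell
# Hypothetical static rules: drop SMTP from an offending subnet as early as
# possible, before any other INPUT rules run. 203.0.113.0/24 is a placeholder.
iptables -I INPUT 1 -s 203.0.113.0/24 -p tcp --dport 25 -j DROP
# Verify rule order and watch the packet counters climb:
iptables -L INPUT -n -v --line-numbers
```

(These need root and a netfilter-enabled kernel, so they are shown as a config fragment only.)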
None of that seems to help much; I'm still getting consistent server loads in the 2.x to 3.x range almost all the time.
The problem is that in top, none of my processes show an abnormal CPU%; most are well under 5%, and manually adding them up doesn't come close to the 200% to 300% that load figures of 2.x and 3.x would suggest.
Even top's own summary says CPU% is in the 20~30% range; what's worrying is that System% is also in the same range. I have no idea what "system" is doing, since it appears that anything running inside the kernel is lumped under "system". Nor do I understand why, totalling both percentages up, 50~60% doesn't translate to the expected load of 0.5~0.6; the system load stats are 5x what I'd expect.
I've installed utilities like dstat to try to figure out which process is making the system calls that are clogging up the server, but either I don't understand it or it's not the right tool.
So I'd appreciate some advice on how/what I should do next to identify the cause. Thanks in advance!
Noob Centos Admin wrote:
Last time I saw something like that, it was a bunch of Chinese 'bots' hammering on my public services like ssh. Another admin had turned pop3 on too; this created a very heavy load, yet they didn't show up in top (bunches of pop3 and ssh processes showed up in ps -auxww, however, plus netstat -an).
Hi,
Last time I saw something like that, it was a bunch of Chinese 'bots' hammering on my public services like ssh. Another admin had turned pop3 on too; this created a very heavy load, yet they didn't show up in top (bunches of pop3 and ssh processes showed up in ps -auxww, however, plus netstat -an).
Unfortunately the server is meant for web/email purposes, so I can't turn off pop3/smtp. Naturally ps shows a lot of httpd/mysql and exim/dovecot processes, but a cursory glance doesn't turn up any suspicious IPs.
Similarly, I had a quick look at netstat -an, and most of the IPs are from the local ISPs my clients use.
One thing that occurred to me: does using iptables to block smtp attempts use more "system" resources than letting the bot flood my smtp logs with pointless attempts? :)
On Dec 29, 2009, at 11:44 PM, Noob Centos Admin centos.admin@gmail.com wrote:
Try blocking the IPs on the router and see if that helps.
You can also run iostat and look at the disk usage which also generates load.
How many cores does your machine have? The load average is calculated per core, so a quad core would reach 100% utilization at a load of 4, but high iowait can generate an artificially high load average as well (which is why one sees greater than 100% utilization).
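[On a Linux box the core count can be read straight from /proc/cpuinfo, e.g.:]

```shell
# Count logical CPUs, the number to interpret the load average against.
grep -c '^processor' /proc/cpuinfo
```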
I really wish load were broken down as CPU/memory/disk instead of the ambiguous load avg, and that network read/write utilization showed up in ifconfig.
-Ross
Hi,
Try blocking the IPs on the router and see if that helps.
Unfortunately the server's in a DC so the router is not under our control.
You can also run iostat and look at the disk usage which also generates load.
I did try iostat, and its iowait% coincided with top's report, which is basically in the low 1~2%.
However, iostat reports much lower %user and %system than top running at the same time, so I'm not quite sure I can rely on its figures.
How many cores does your machine have? The load average is calculated per core, so a quad core would reach 100% utilization at a load of 4, but high iowait can generate an artificially high load average as well (which is why one sees greater than 100% utilization).
It's a dual core; that's why I was getting concerned, since loads above 2.0 would imply the system's processing capacity was maxed out. However, the load and the percentages don't add up.
For example, right now I'm seeing:

top - 14:04:30 up 171 days, 7:14, 1 user, load average: 3.33, 3.97, 3.81
Tasks: 246 total, 2 running, 236 sleeping, 0 stopped, 8 zombie
Cpu(s): 13.3%us, 16.0%sy, 0.0%ni, 67.5%id, 3.0%wa, 0.0%hi, 0.2%si, 0.0%st
iostat
Linux 2.6.18-128.1.16.el5xen 12/30/2009
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           3.28   0.20     1.16     2.38    0.01  92.97
I really wish load were broken down as CPU/memory/disk instead of the ambiguous load avg, and that network read/write utilization showed up in ifconfig.
Totally agreed. All the load number does is tell me something is using up resources somewhere, without a single clue as to what! Confusing, frustrating and worrying at the same time :(
Noob Centos Admin wrote:
However, iostat reports much lower %user and %system than top running at the same time, so I'm not quite sure I can rely on its figures.
...
iostat
Linux 2.6.18-128.1.16.el5xen 12/30/2009
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           3.28   0.20     1.16     2.38    0.01  92.97
iostat, if run with no parameters, shows the average since reboot or the last statistics reset.
Run `iostat -x 5` to a) show details on all devices, and b) show 5-second samples. Ignore the first output, as that's the average since boot; the 2nd and later outputs represent 5-second samples.
Note, btw, 'load average' isn't CPU usage; it's the number of processes that are waiting to run. A load average of 8 means there are 8 processes waiting to use system resources. This does include processes in iowait, but doesn't include processes that are sleeping on semaphores and such, so it can be quite a lot higher than the CPU workload.
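[That can be seen directly with a rough one-liner: count the processes in runnable (R) and uninterruptible-sleep (D) state, the two states the load average is built from. "ps -eo state" is standard procps.]

```shell
# Tally processes by scheduler state; R (runnable) and D (uninterruptible,
# typically waiting on disk I/O) are what the load average counts.
ps -eo state= | awk '{ s[substr($1, 1, 1)]++ }
    END { printf "running=%d blocked=%d\n", s["R"] + 0, s["D"] + 0 }'
```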
On Dec 30, 2009, at 1:05 AM, Noob Centos Admin centos.admin@gmail.com wrote:
Hi,
Try blocking the IPs on the router and see if that helps.
Unfortunately the server's in a DC so the router is not under our control.
That sucks, oh well.
You can also run iostat and look at the disk usage which also generates load.
I did try iostat, and its iowait% coincided with top's report, which is basically in the low 1~2%.
However, iostat reports much lower %user and %system than top running at the same time, so I'm not quite sure I can rely on its figures.
Yes, I'm not sure whether iostat's CPU numbers represent the full CPU utilization or only the CPU utilization for I/O.
How many cores does your machine have? The load average is calculated per core, so a quad core would reach 100% utilization at a load of 4, but high iowait can generate an artificially high load average as well (which is why one sees greater than 100% utilization).
It's a dual core; that's why I was getting concerned, since loads above 2.0 would imply the system's processing capacity was maxed out. However, the load and the percentages don't add up.
They never do, because of the time-scaled averages.
For example, right now I'm seeing:

top - 14:04:30 up 171 days, 7:14, 1 user, load average: 3.33, 3.97, 3.81
Tasks: 246 total, 2 running, 236 sleeping, 0 stopped, 8 zombie
Cpu(s): 13.3%us, 16.0%sy, 0.0%ni, 67.5%id, 3.0%wa, 0.0%hi, 0.2%si, 0.0%st
iostat
Linux 2.6.18-128.1.16.el5xen 12/30/2009
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           3.28   0.20     1.16     2.38    0.01  92.97
I really wish load were broken down as CPU/memory/disk instead of the ambiguous load avg, and that network read/write utilization showed up in ifconfig.
Totally agreed. All the load number does is tell me something is using up resources somewhere, without a single clue as to what! Confusing, frustrating and worrying at the same time :(
Maybe someone could write a command-line utility that outputs the system load broken down into CPU/memory/disk/network. Call it 'sysload' and have it take the system configuration into account.
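[A crude sketch of what such a 'sysload' could print, assuming Linux /proc; note the CPU figure here is averaged since boot, so a real tool would sample twice and take the difference:]

```shell
#!/bin/sh
# Toy 'sysload': one line each for CPU and memory, straight from /proc.
awk '/^cpu / { idle = $5; total = 0
               for (i = 2; i <= NF; i++) total += $i
               printf "cpu: %.1f%% busy (since boot)\n", 100 * (total - idle) / total
             }' /proc/stat
awk '/^MemTotal/ { t = $2 } /^MemFree/ { f = $2 }
     END { printf "mem: %.1f%% used\n", 100 * (t - f) / t }' /proc/meminfo
```

(Disk and network lines could be added the same way from /proc/diskstats and /proc/net/dev.)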
Take a look at your iptables setup; make sure the blocked-IP rules are checked first, before any others, and drop the packets without any ICMP response (give 'em a black hole to stare at).
-Ross
On Wednesday, 30.12.2009, at 05:44 +0100, Noob Centos Admin wrote:
since initially it seemed like the high load might be due to I/O wait
Maybe this will help you to identify the IO loading process:
http://dag.wieers.com/blog/red-hat-backported-io-accounting-to-rhel5
Chris
financial.com AG
Munich head office/Hauptsitz München: Maria-Probst-Str. 19 | 80939 München | Germany Frankfurt branch office/Niederlassung Frankfurt: Messeturm | Friedrich-Ebert-Anlage 49 | 60327 Frankfurt | Germany Management board/Vorstand: Dr. Steffen Boehnert | Dr. Alexis Eisenhofer | Dr. Yann Samson | Matthias Wiederwach Supervisory board/Aufsichtsrat: Dr. Dr. Ernst zur Linden (chairman/Vorsitzender) Register court/Handelsregister: Munich – HRB 128 972 | Sales tax ID number/St.Nr.: DE205 370 553
Hi,
since initially it seemed like the high load might be due to I/O wait
Maybe this will help you to identify the IO loading process:
http://dag.wieers.com/blog/red-hat-backported-io-accounting-to-rhel5
Thanks for the suggestion; I did install dstat earlier while trying to figure things out on my own. However, I think my kernel, being an older version, does not support the feature that page describes. Given that it's a live server not within physical reach, I'm a little wary of doing kernel updates that might just kill it :D
I'll try other methods first and see if they help; if not, I'll probably have to bite the bullet and do it over a weekend, when I have more time to repair any inadvertent damage.
On 12/29/2009 11:44 PM, Noob Centos Admin wrote:
You should also try out "atop" instead of just using top. The major advantage is that it gives you more information about the disk and network utilization.
Hi,
You should also try out "atop" instead of just using top. The major advantage is that it gives you more information about the disk and network utilization.
Thanks for the tip. I tried it, and if the red lines are any indication, atop thinks my disks (md raid 1) are the problem, being busy 60~70% of the time. However, that is sort of expected, since most of the activity on the server is smtp/pop3.
Unfortunately, I did not know about atop previously, so I don't have a baseline to compare against :(
On 2009-12-29 23:44, Noob Centos Admin wrote:
Dstat could at least tell you whether your problem is CPU or I/O.
Even better, run
vmstat 2 10
Look at the first two columns. Which column has higher numbers? If r, you're CPU-bound. If b, you're I/O-bound.
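[The same CPU-versus-I/O question can be put to /proc/stat, which is where vmstat gets its numbers; this sketch prints the since-boot shares, assuming the standard Linux field order of user, nice, system, idle, iowait:]

```shell
# Compare busy (user+nice+system) CPU time against iowait time since boot.
awk '/^cpu / { busy = $2 + $3 + $4; idle = $5; iow = $6
               total = busy + idle + iow
               printf "busy=%.1f%% iowait=%.1f%%\n",
                      100 * busy / total, 100 * iow / total }' /proc/stat
```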
If you're I/O bound, I suggest you use atop to determine which processes take disk time.
You can also use iostat -x 2 10.
I really suggest you read up on vmstat and iostat; they will always be helpful.
Did you check whether you have a defective disk or a rebuilding array? That could be the cause.
Regards,
Hi,
Dstat could at least tell you if your problem is CPU or I/O.
This was the result of running the following command, which I picked up while reading up about two weeks ago, when I started trying to investigate the abnormal server behaviour.
dstat -c --top-cpu -d --top-bio --top-latency

usr sys idl wai hiq siq| cpu process | read  writ | latency process
  4   1  93   2   0   0|mysqld   0.0|  80k   82k |khelper        8
 42  46   0  12   0   0|httpd     12| 648k    0  |ksoftirqd/0  111
 26  37  12  26   0   0|httpd    1.5| 520k   11M |ksoftirqd/1   75
 23  49   8  19   0   0|exim     1.0| 652k   16k |ksoftirqd/0   44
 26  44   3  28   0   0|exim     1.0| 652k 1296k |ksoftirqd/0   44
 32  41   4  23   0   0|exim     1.5| 620k   16k |ksoftirqd/0   50
 28  52   3  16   0   0|exim     1.5| 700k    0  |ksoftirqd/1   47
 21  41  11  28   0   0|exim     1.0| 556k   11M |ksoftirqd/0   79
 27  46   3  24   0   0|exim     1.5| 684k   16k |ksoftirqd/1   40
 29  45   2  24   0   0|exim     1.0| 672k  944k |ksoftirqd/0   25
 28  33   3  37   0   0|httpd     14| 852k 5992k |ksoftirqd/1   39
 36  39   2  23   0   0|httpd    5.0|1024k    0  |ksoftirqd/0   84
Even better, run
vmstat 2 10
Look at the first two columns. Which column has higher numbers? If r, you're CPU-bound. If b, you're I/O-bound.
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 8  1   3092 131460 100692 833668    0    0    40    21    1    0  4  1 92  2  0
 9  1   3092 130708 100700 835016    0    0   578   206  577 1420 32 50  3 15  0
 7  1   3092 128324 100716 836148    0    0   546  2866  594 1465 31 44  7 18  0
 4  1   3092 126860 100724 837268    0    0   540   256  596 1505 28 43  6 23  0
 7  2   3092 125600 100740 838564    0    0   620   234  661 1442 30 41  2 26  0
 5  1   3092 124028 100756 839752    0    0   570  2692  635 1430 24 45  6 25  0
 6  0   3092 122040 100784 840964    0    0   584  1464  682 1434 27 44  2 28  0
 6  1   3092 120588 100792 842232    0    0   602   278  624 1562 32 46  2 20  0
 2  3   3092 120556 100840 843064    0    0   440  2908  603 1299 22 35  6 37  0
 3  1   3092 119832 100876 844088    0    0   430  1104  605 1348 23 36  1 40  0
According to this, am I correct to conclude that I'm CPU-bound and the system is busy doing some unknown processing?
Did you check whether you have a defective disk or a rebuilding array? That could be the cause.
I usually run "cat /proc/mdstat" whenever I log into the server to check my MD raid status. So far the array appears OK. There are no disk warnings when I run "dmesg". smartctl also reports no errors logged and a passed status for both disks, although no self-test was run. Would I be safe to conclude that the disks are OK and not part of the problem?
Thanks again to everybody for the suggestions and help so far.
Look at the first two columns. Which column has higher numbers? If r, you're CPU-bound. If b, you're I/O-bound.
According to this, am I correct to conclude that I'm CPU-bound and the system is busy doing some unknown processing?
Yes, these figures indicate that you are fairly close to being CPU-bound.
What kind of filtering are you doing? If you have any connection tracking/state-related rules set, you will be using a fair amount of CPU.
On Thursday, 31.12.2009, at 12:34 +0100, Chan Chung Hang Christopher wrote:
Look at the first two columns. Which column has higher numbers? If r, you're CPU-bound. If b, you're I/O-bound.
According to this, am I correct to conclude that I'm CPU-bound and the system is busy doing some unknown processing?
Yes, these figures indicate that you are fairly close to being CPU-bound.
Really? 20-30% user and ~40% sys/wait looks more like I/O to me.
Chris
Christoph Maser wrote:
On Thursday, 31.12.2009, at 12:34 +0100, Chan Chung Hang Christopher wrote:
Look at the first two columns. Which column has higher numbers? If r, you're CPU-bound. If b, you're I/O-bound.
According to this, am I correct to conclude that I'm CPU-bound and the system is busy doing some unknown processing?
Yes, these figures indicate that you are fairly close to being CPU-bound.
Really? 20-30% user and ~40% sys/wait looks more like I/O to me.
user accounts for processing done by processes, while sys accounts for processing done by the kernel (like netfilter), and idle tells you what is left. The idle numbers are below 10 and near 0; that is what I'd call nearly CPU-bound. If he had high idle scores and high wa scores, then he'd be completely I/O-bound.
In the last line there, he got an idle score of 1 while wa was 40, which indicates that even though there is some I/O waiting, it is not starving the CPUs.
Hi,
Yes, these figures indicate that you are fairly close to being CPU-bound.
What kind of filtering are you doing? If you have any connection tracking/state-related rules set, you will be using a fair amount of CPU.
Initially, when the load started going up, I thought the APF filtering rules were the problem, since the Indian fellow is still hammering away at the server even now. However, I've since taken the risk of turning off APF and relying on static iptables rules, which add up to less than one screenful over SSH.
I also thought it might have to do with exim/spamassassin, but making a few changes to reduce the number of emails that go to spamd doesn't seem to be helping much.
In fact, as you can see from the stats, the load has gone up even further since. I've been averaging 10+ for the whole working day. At the moment it's between 6 and 10, when past months of logs say it should be at 0.3.
This is despite the fact that most of my clients should be out celebrating New Year's Eve. From weeks of logs, the Indian spammer is also a very punctual fellow who should have knocked off work about 17 minutes ago. So there shouldn't be any heavy 'known' activity on the server at this point.
So I'm quite stumped as to what's chewing up the CPU cycles. I'm also starting to worry that the server's been compromised and is now doing something I don't want it to.
I'm probably going to shut down the mail/httpd services after midnight, when the impact is least, and see how the server reacts for a couple of minutes with everything else cut off.
Noob Centos Admin wrote:
Hi,
Yes, these figures indicate that you are fairly close to being CPU-bound.
What kind of filtering are you doing? If you have any connection tracking/state-related rules set, you will be using a fair amount of CPU.
Initially, when the load started going up, I thought the APF filtering rules were the problem, since the Indian fellow is still hammering away at the server even now. However, I've since taken the risk of turning off APF and relying on static iptables rules, which add up to less than one screenful over SSH.
I don't know about now, but back then I had to unload the modules in question. Just clearing the rules was not enough to ensure that the netfilter connection tracking modules were not using any CPU at all.
I also thought it might have to do with exim/spamassassin, but making a few changes to reduce the number of emails that go to spamd doesn't seem to be helping much.
In fact, as you can see from the stats, the load has gone up even further since. I've been averaging 10+ for the whole working day. At the moment it's between 6 and 10, when past months of logs say it should be at 0.3.
This is despite the fact that most of my clients should be out celebrating New Year's Eve. From weeks of logs, the Indian spammer is also a very punctual fellow who should have knocked off work about 17 minutes ago. So there shouldn't be any heavy 'known' activity on the server at this point.
/me shrugs. When I was the MTA admin at Outblaze Ltd. (a messaging business now owned by IBM and called Lotus Live), spammers always ensured I got called. All they do is press the big red button (aka start the script/system) and then go and play, while I would have to deal with whatever was started. I remember only one occasion when the spams were launched but neutralized very quickly: they were pushing a website, and I found a sample real early, so the anti-spam system could just dump the spams and knock out the accounts being used to send the crap.
So I'm quite stumped as to what's chewing up the CPU cycles. I'm also starting to worry that the server's been compromised and is now doing something I don't want it to.
I'm probably going to shut down the mail/httpd services after midnight, when the impact is least, and see how the server reacts for a couple of minutes with everything else cut off.
First, try rmmod'ing the netfilter modules after you have cleared away the state-related rules, to make sure you are only using static rules in netfilter... unless you have done that already.
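[Roughly like this, assuming CentOS 5-era module names; lsmod shows what is actually loaded, and the rules must be flushed first or rmmod will refuse:]

```shell
# Flush rules and user-defined chains so nothing pins the conntrack modules.
iptables -F
iptables -X
iptables -t nat -F
# Then unload connection tracking; module names vary by kernel version.
rmmod iptable_nat ip_conntrack
```

(Root-only, shown as a config fragment.)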
Hi,
I don't know about now, but back then I had to unload the modules in question. Just clearing the rules was not enough to ensure that the netfilter connection tracking modules were not using any CPU at all.
Thanks for pointing this out. Being a noob admin, as my pseudonym states, I'd assumed stopping apf and restarting iptables was sufficient. I'll have to look up unloading modules later.
/me shrugs. When I was the MTA admin at Outblaze Ltd. (a messaging business now owned by IBM and called Lotus Live), spammers always ensured I got called. All they do is press the big red button (aka start the script/system) and then go and play, while I would have to deal with whatever was started.
Based on the almost precise timing of around 9:30 to 5:30 India time, I'm inclined to think that in my case it wasn't so much a spammer pressing a red button as a compromised machine in an office, starting up when the user gets in and knocking off on time at 5:30 :D
Could I ask how I would knock out the accounts sending the crap if they are not within my systems?
I think I'm only using static rules, because after restarting iptables I would run service iptables status to check that my rules were in, and that list was very short compared to when APF was active.
The good news is, I think I've fixed the big problem after doing my shutdown tests, which brings me back to the original problem.
I initiated the services shutdown as previously planned, and once the external services like exim, dovecot, httpd and crond (included because it kept restarting those services) were stopped, the problem child stood out like a sore thumb.
There were two exim instances that didn't go away despite service exim stop. Once I killed those two PIDs, the load average started dropping rapidly. After a minute or so, the server went back to a happy 0.2~0.3 load and disk activity became almost negligible.
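In hindsight, before killing them I could have peeked at what those processes were actually busy doing; /proc makes that cheap. The PID below is a stand-in, not the real one:

```shell
PID=12345                    # stand-in for the stuck exim PID
ls -l /proc/$PID/fd          # which files and sockets it has open
cat /proc/$PID/wchan; echo   # kernel function it is currently sleeping in
kill "$PID"                  # polite SIGTERM first
sleep 5
kill -0 "$PID" 2>/dev/null && kill -9 "$PID"   # escalate only if it ignored SIGTERM
```

The open file descriptors in particular would have shown which mailbox it was hammering.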
I think these orphaned (zombied?) exim instances were related to a mail loop problem I discovered earlier today, where one of my clients, away on holiday, had a full mailbox and kept bouncing mails from a contact whose site was suspended. Although I terminated that loop, it seems exim had those two instances stuck in limbo, sucking up processing power and hitting the disk somewhere unknown, since they weren't showing up in my exim logs.
After observing for a while, I brought the services back up, and once exim started, my load went back to 2.x ~ 3.x. Unfortunately, while I was typing this email, I realized it didn't stop there; I'm up to the 4.x ~ 5.x load level by now.
So the application causing the load is definitely exim; more specifically, I think it's SpamAssassin, because now that the mail log entries come in slowly, I can read the spamd details, and mails are taking between 3 and 8 seconds to be checked.
Thanks again to everybody who offered suggestions and advice, and do have a Happy New Year :)
On 1/1/10, Noob Centos Admin <centos.admin@gmail.com> wrote:
Just a concluding update for anybody who might be interested :)
My apologies for blaming SpamAssassin in the earlier email. It was only taking so long because of the real problem.
Apparently the odd exim processes related to the mail loop problem I'd nipped were still the culprit. I had overlooked the fact that by the time I caught on to the mail loop issue, there were already hundreds if not thousands of bounced and re-bounced messages in the queue. Those exim processes were attempting to deliver the messages queued before I terminated the loop.
This would have been OK if not for the other problem. The user has apparently been on a two-week vacation since the 15th, and thought it was a good idea to enlarge his mailbox before leaving. So there was this 2.5GB mailbox chock-full of both valid and re-bounced mails, plus a queue of still more re-bounced mails. Every time exim attempted to deliver the queued mails to the user's account, the quota system rejected them. The CPU load was probably due to this never-ending ping-pong match between exim and the quota system.
Yeah, I can't help but feel it was such a noob mistake to let that develop without realizing it.
Now that I've purged the queue of those bounced messages and done some other housekeeping for that user, the server load has finally gone back to the expected sub-1.0 levels, so I can finally go and enjoy my holiday :)
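For anyone who hits the same kind of loop, exim's own tools make the cleanup short. This is a sketch of roughly what the purge amounts to, not the exact commands I ran; bounces carry the null sender <>, which is what exiqgrep matches on here:

```shell
# How many messages are sitting in the queue?
exim -bpc

# List the IDs of queued bounce messages (null sender <>)
exiqgrep -f '^<>$' -i

# Remove them from the queue; exim -Mrm takes message IDs
exiqgrep -f '^<>$' -i | xargs --no-run-if-empty exim -Mrm
```

Worth double-checking the exiqgrep listing before piping it into -Mrm, since removed messages are gone for good.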
On 2009-12-31 15:13, Noob Centos Admin wrote:
If you (and other people) have learned something, it was worth it :).
Ugo