Hi, I have been experiencing a problem on our dedicated server running CentOS 5 and have been unable to track down the cause.
About 6 days ago, I noticed a spike in load/CPU utilization, which went from a typical 0.2x-0.3x to 3.x. At the same time, average traffic went up and so did the log volume. Prior to this, the server was working fine and there had been no changes to the configuration.
Initially, I narrowed it down to the mail system: Exim was generating significantly more log data than usual. This turned out to be our server and another server playing ping pong between two users who coincidentally were both on vacation and both had full mailboxes, which caused an endless loop of "Message Undelivered" and "Auto-reply" messages.
Once this was identified and cleared up, I expected things to go back to normal. However, load and traffic remained high.
Looking at the "top" output, I noted that %sys was as high as, and often much higher than, %user. However, the individual process %CPU figures just didn't add up to the total top was reporting. Top reports 160~170 sleeping tasks and only 4 running most of the time, mostly exim, followed by httpd/mysql/php.
top Snapshot
============
top - 17:25:03 up 7 days, 19:16, 1 user, load average: 2.03, 2.84, 3.04
Tasks: 168 total, 4 running, 164 sleeping, 0 stopped, 0 zombie
Cpu(s): 26.5%us, 50.3%sy, 0.0%ni, 16.6%id, 6.1%wa, 0.0%hi, 0.5%si, 0.0%st
Mem: 1915208k total, 1880256k used, 34952k free, 142100k buffers
Swap: 16777208k total, 66140k used, 16711068k free, 1276564k cached
iostat Snapshot
===============
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          18.96    0.00   25.57    5.16    0.01   50.30

Device:     tps  Blk_read/s  Blk_wrtn/s    Blk_read    Blk_wrtn
sda       54.19       63.31     2460.80    42689802  1659234904
sdb       55.12       76.41     2460.80    51521720  1659234904
md1      315.95      139.72     2442.00    94207644  1646554216
md0        0.01        0.00        0.02        1422       14736
dm-0      39.13       65.85      292.50    44399402   197219496
dm-1     267.18       36.18     2110.08    24398010  1422756072
dm-2       9.64       37.68       39.42    25408576    26578648
fd0        0.00        0.00        0.00          16           0
sr0        0.00        0.00        0.00         136           0
Searching around for ways to interpret the output, I tried sar and iostat. From what I could find online, there doesn't appear to be a disk problem: %iowait is relatively low and mdadm shows the RAID 1 disks working fine. Since %sys is consistently the highest figure, it appears that the kernel is doing something out of the ordinary.
The problem is that I have no idea what else to do to determine what that "something" is.
I've looked at netstat and there don't appear to be excessive connections. The logwatch summary doesn't give any clue either, as there are no records of unusual failed login attempts.
Please advise what else I can look into or check. Thanks in advance!
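One way to double-check the %user/%sys split that top and iostat report is to sample the aggregate "cpu" line of /proc/stat twice and compare the deltas. The following is only a minimal sketch, assuming the Linux 2.6 field order (user, nice, system, idle, iowait, irq, softirq, steal); the function names are made up for illustration and are not part of any tool mentioned in this thread.

```python
import time

# Field order of the aggregate "cpu" line in /proc/stat (Linux 2.6 layout assumed).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn a 'cpu  u n s i w q sq st' line into a dict of jiffy counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("not an aggregate cpu line: %r" % line)
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def cpu_percentages(before, after):
    """Percentage of elapsed CPU time spent in each state between two samples."""
    deltas = {name: after.get(name, 0) - before.get(name, 0) for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat, which is the aggregate cpu line."""
    with open(path) as handle:
        return handle.readline()


def sample(interval=5.0):
    """Take two samples `interval` seconds apart and return the percentages."""
    before = parse_cpu_line(read_cpu_line())
    time.sleep(interval)
    after = parse_cpu_line(read_cpu_line())
    return cpu_percentages(before, after)
```

Calling sample() on the affected server and looking at the "system", "irq" and "softirq" entries would show whether the kernel-side time is plain system-call work or interrupt handling, which top's summary line lumps less visibly.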
Noob Centos Admin wrote:
> [original message quoted in full; snipped]
Well .. top says you have 4 processes running ... if that is consistent (4 processes always in a run state) then you should be able to determine the running processes with the command:
ps -ef r
(I think)
I would think one of the always-running processes is the one that is taking up CPU time.
Also while in top, <Shift>-H might show some hidden threads in the output.
Maybe those will help to find the processes that are taking the CPU time.
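The same idea can be sketched without ps by reading /proc/<pid>/stat directly, which also exposes how much user and system time each process has accumulated. This is an illustrative sketch only, assuming the standard Linux layout of that file (state is field 3, utime field 14, stime field 15); the helper names are hypothetical.

```python
import os


def parse_pid_stat(text):
    """Extract pid, command name, state, utime and stime from /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the first "(" and the last ")".
    left = text.index("(")
    right = text.rindex(")")
    rest = text[right + 1:].split()
    return {
        "pid": int(text[:left]),
        "comm": text[left + 1:right],
        "state": rest[0],        # field 3
        "utime": int(rest[11]),  # field 14, user-mode jiffies
        "stime": int(rest[12]),  # field 15, kernel-mode jiffies
    }


def list_busy_processes(states="RD", proc_root="/proc"):
    """Return processes currently running (R) or in uninterruptible sleep (D)."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                info = parse_pid_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited or the line was not parseable
        if info["state"] in states:
            found.append(info)
    # Processes with the most accumulated kernel time come first.
    return sorted(found, key=lambda item: item["stime"], reverse=True)
```

Running list_busy_processes() a few times in a row and noting which entries keep showing up with a growing stime would point at the process responsible for the high %sy figure, assuming the time is being charged to a process at all.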
On Sat, Mar 29, 2008 at 6:37 PM, Johnny Hughes johnny@centos.org wrote:
> [reply quoted in full; snipped]
Thanks for the advice, although I never got a chance to use it.
For some inexplicable Murphy-like reason, the server load went back to normal levels shortly after I sent off the email to the list.
The only possible explanation I can think of is that I killed the setroubleshootd process because it froze up after I tried to fiddle with the SELinux settings. There was an error in the log about being unable to connect to the audit socket.
After observing the back-to-normal load for a few hours to confirm it wasn't a momentary drop, I restarted the setroubleshootd process, and the load remained normal.
So my current uneducated guess is that the barrage of undeliverable email messages on the very first day caused SELinux to choke at the system/kernel level, and that killing the reporting daemon allowed whatever was getting tied up to move on?