[CentOS] Help in troubleshoot cause of high kernel activity

Sat Mar 29 09:38:42 UTC 2008
Noob Centos Admin <centos.admin at gmail.com>

Hi, I have been experiencing a problem on our dedicated server running CentOS
5 and have been unable to track down the cause.

About 6 days ago, I noticed a spike in load/CPU utilization, which went from
a typical 0.2x-0.3x to 3.x. At the same time, average traffic went up, and so
did log volume. Prior to this, the server was working fine and there had been
no changes to the configuration.

Initially, I narrowed it down to the mail system: Exim was generating
significantly more log data than usual. The cause turned out to be our server
and another server bouncing messages back and forth between two users who,
coincidentally, were both on vacation with full mailboxes. This created an
endless loop of "Message Undelivered" and "Auto-reply" messages.

Once this was identified and cleared up, I expected things to go back to
normal. However, load and traffic remained high.

Looking at "top" output, I noted that %sys was as high as, and often much
higher than, %user. However, the individual process %CPU figures did not add
up to the total that top was reporting. Top reports 160~170 sleeping tasks and
only 4 running most of the time, mostly exim, followed by httpd/mysql/php.
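
One way to cross-check the per-process figures is to read the kernel's own
per-process tick counters and see which processes accumulate system (kernel)
time between two samples. The following is only a minimal sketch, assuming
Python 3 and the standard Linux /proc/<pid>/stat layout (utime and stime are
fields 14 and 15); the function names are hypothetical and not part of any
existing tool.

```python
import os

def parse_pid_stat(line):
    """Return (pid, comm, utime, stime) from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST closing parenthesis.
    head, _, tail = line.rpartition(")")
    pid_str, _, comm = head.partition("(")
    fields = tail.split()
    # fields[0] is the state (field 3), so utime/stime (fields 14/15)
    # sit at indexes 11 and 12.
    return int(pid_str), comm, int(fields[11]), int(fields[12])

def snapshot(proc="/proc"):
    """Map pid -> (comm, utime, stime) for every readable process."""
    result = {}
    for name in os.listdir(proc):
        if not name.isdigit():
            continue
        try:
            with open(os.path.join(proc, name, "stat")) as f:
                pid, comm, utime, stime = parse_pid_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited between listdir() and open()
        result[pid] = (comm, utime, stime)
    return result

def sys_time_deltas(before, after):
    """Return [(stime_delta, utime_delta, pid, comm)], most kernel time first."""
    rows = []
    for pid, (comm, utime, stime) in after.items():
        if pid in before:
            _, u0, s0 = before[pid]
            rows.append((stime - s0, utime - u0, pid, comm))
    rows.sort(reverse=True)
    return rows
```

Calling snapshot() twice a few seconds apart and passing both results to
sys_time_deltas() lists the processes that spent the most ticks in kernel
mode during the interval. Short-lived processes that start and exit between
the two samples are not captured, which may be one reason the per-process
numbers fall short of the total.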

top Snapshot
==========
top - 17:25:03 up 7 days, 19:16,  1 user,  load average: 2.03, 2.84, 3.04
Tasks: 168 total,   4 running, 164 sleeping,   0 stopped,   0 zombie
Cpu(s): 26.5%us, 50.3%sy,  0.0%ni, 16.6%id,  6.1%wa,  0.0%hi,  0.5%si,  0.0%st
Mem:   1915208k total,  1880256k used,    34952k free,   142100k buffers
Swap: 16777208k total,    66140k used, 16711068k free,  1276564k cached


iostat Snapshot
============
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          18.96    0.00   25.57    5.16    0.01   50.30

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda              54.19        63.31      2460.80   42689802 1659234904
sdb              55.12        76.41      2460.80   51521720 1659234904
md1             315.95       139.72      2442.00   94207644 1646554216
md0               0.01         0.00         0.02       1422      14736
dm-0             39.13        65.85       292.50   44399402  197219496
dm-1            267.18        36.18      2110.08   24398010 1422756072
dm-2              9.64        37.68        39.42   25408576   26578648
fd0               0.00         0.00         0.00         16          0
sr0               0.00         0.00         0.00        136          0

While searching for ways to interpret the output, I tried sar and iostat.
Based on what I found on the net, there does not appear to be a disk problem:
%iowait is relatively low and mdadm shows the RAID 1 disks working perfectly
fine. Since %sys is consistently the highest figure, it appears that the
kernel is doing something out of the ordinary.
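
The CPU breakdown reported by top and iostat comes from the counters in
/proc/stat, which also carries running totals of context switches (ctxt),
interrupts (intr) and forks (processes). Comparing two samples gives rates
that can hint at what kind of kernel work dominates. This is only a rough
sketch, assuming Python 3 and the usual /proc/stat field order (user, nice,
system, idle, iowait, irq, softirq, steal); the function names are
hypothetical.

```python
def parse_proc_stat(text):
    """Extract aggregate CPU tick counters plus ctxt/intr/processes totals."""
    names = ("user", "nice", "system", "idle",
             "iowait", "irq", "softirq", "steal")
    out = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "cpu":  # aggregate line only, not cpu0, cpu1, ...
            values = [int(v) for v in parts[1:1 + len(names)]]
            out["cpu"] = dict(zip(names, values))
        elif parts[0] in ("ctxt", "intr", "processes"):
            # For "intr", the first number is the total across all IRQs.
            out[parts[0]] = int(parts[1])
    return out

def compare(before, after, seconds):
    """Return (CPU share per state in percent, event rates per second)."""
    deltas = {k: after["cpu"][k] - before["cpu"][k] for k in after["cpu"]}
    total = sum(deltas.values()) or 1  # avoid division by zero
    shares = {k: round(100.0 * v / total, 1) for k, v in deltas.items()}
    rates = {k: (after[k] - before[k]) / seconds
             for k in ("ctxt", "intr", "processes")
             if k in before and k in after}
    return shares, rates
```

To use it, read /proc/stat, wait a known number of seconds, read it again, and
pass both parsed results to compare(). An unusually high processes rate would
point at many short-lived forks, while a high ctxt or intr rate would point at
scheduling or device activity; what counts as "high" depends on the machine's
normal baseline.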

The problem is I have no idea what else to do to determine what "something"
is.

I've looked at netstat and there don't appear to be excessive connections.
The logwatch summary does not give any clue either, as there are no records
of unusual failed login attempts.
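
For a quick tally of connections by state, the same data netstat shows for
IPv4 TCP is available in /proc/net/tcp. The snippet below is only an
illustrative sketch, assuming Python 3 and the standard column layout of that
file (state code in the fourth column); count_tcp_states is a hypothetical
helper name.

```python
from collections import Counter

# State codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}

def count_tcp_states(text):
    """Count sockets per TCP state from the contents of /proc/net/tcp."""
    counts = Counter()
    for line in text.splitlines()[1:]:  # first line is the column header
        parts = line.split()
        if len(parts) < 4:
            continue
        code = parts[3].upper()
        counts[TCP_STATES.get(code, code)] += 1
    return counts
```

Passing the contents of /proc/net/tcp to count_tcp_states() returns a Counter
keyed by state name. A large number of SYN_RECV or TIME_WAIT entries would be
worth a closer look; IPv6 sockets live in /proc/net/tcp6 and are not included.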

Please advise what else I can look into or check. Thanks in advance!