[CentOS] Help in troubleshooting cause of high kernel activity

Noob Centos Admin wrote:
> Hi, I have been experiencing a problem on our dedicated server running
> CentOS 5 and have been unable to track down the cause.
> 
> About 6 days ago, I noticed a spike in load/CPU utilization, which went
> from a typical 0.2x-0.3x to 3.x. At the same time, average traffic went
> up and so did the log volume. Prior to this, the server was working fine
> and there had been no changes to the configuration.
> 
> Initially, I narrowed it down to the mail system: Exim was generating
> significantly more log data than usual. The cause turned out to be our
> server and another server playing ping-pong between two users who
> coincidentally were both on vacation with full mailboxes. This caused an
> endless loop of "Message Undelivered" and "Auto-reply" messages.
> 
> Once this was identified and cleared up, I had expected things to go back to
> normal. However, load/traffic remained high.
> 
> Looking at "top" output, I noted that %sys was as high as, and often much
> higher than, %user. However, the individual process %CPU figures didn't
> add up to the total top was reporting. Top reports 160~170 sleeping tasks
> and only 4 running most of the time, mostly exim, then httpd/mysql/php.
> 
> top Snapshot
> ==========
> top - 17:25:03 up 7 days, 19:16,  1 user,  load average: 2.03, 2.84, 3.04
> Tasks: 168 total,   4 running, 164 sleeping,   0 stopped,   0 zombie
> Cpu(s): 26.5%us, 50.3%sy,  0.0%ni, 16.6%id,  6.1%wa,  0.0%hi,  0.5%si,  0.0%st
> Mem:   1915208k total,  1880256k used,    34952k free,   142100k buffers
> Swap: 16777208k total,    66140k used, 16711068k free,  1276564k cached
> 
> 
> iostat Snapshot
> ============
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           18.96    0.00   25.57    5.16    0.01   50.30
> 
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda              54.19        63.31      2460.80   42689802 1659234904
> sdb              55.12        76.41      2460.80   51521720 1659234904
> md1             315.95       139.72      2442.00   94207644 1646554216
> md0               0.01         0.00         0.02       1422      14736
> dm-0             39.13        65.85       292.50   44399402  197219496
> dm-1            267.18        36.18      2110.08   24398010 1422756072
> dm-2              9.64        37.68        39.42   25408576   26578648
> fd0               0.00         0.00         0.00         16          0
> sr0               0.00         0.00         0.00        136          0
> 
> Searching around for ways to interpret the output, I tried sar/iostat.
> From what I could find on the net, the numbers don't indicate a disk
> problem: %iowait is relatively low and mdadm shows the RAID 1 disks working
> perfectly fine. Since %sys is consistently the highest figure, it appears
> that the kernel is doing something out of the ordinary.
> 
> The problem is I have no idea what else to do to determine what "something"
> is.
> 
> I've looked at netstat and there doesn't appear to be an excessive number
> of connections. The logwatch summary doesn't give any clue either, as there
> are no records of unusual failed login attempts.
> 
> Please advise what else I can look into or check. Thanks in advance!

Well .. top says you have 4 processes running ... if that is consistent 
(4 processes always in a run state) then you should be able to determine 
the running processes with the command:

ps -ef r

(I think)

I would think one of the always-running processes is the one that is
taking up CPU time.
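
A single ps run only catches whatever is on the run queue at that instant, so
it may help to take several samples and count which commands show up most
often. A rough sketch, assuming the procps ps shipped with CentOS; the
function names (tally_running, sample_running) and the 10 x 1 second sampling
are made up for illustration, not anything standard:

```shell
#!/bin/sh
# tally_running: read "STATE COMMAND" lines on stdin and count how often
# each command was seen in state R (running) or D (uninterruptible sleep).
# Only the first word of the command name is used.
tally_running() {
    awk '$1 ~ /^[RD]/ { seen[$2]++ }
         END { for (c in seen) print seen[c], c }' | sort -rn
}

# sample_running: take 10 samples, one second apart, and tally them.
# The separate "-o name=" options suppress the header line.
sample_running() {
    i=0
    while [ "$i" -lt 10 ]; do
        ps -e -o state= -o comm=
        sleep 1
        i=$((i + 1))
    done | tally_running
}
```

Running sample_running prints one line per command with the number of
samples in which it was runnable or blocked; a command that appears in most
samples is a reasonable first suspect.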

Also while in top, <Shift>-H might show some hidden threads in the output.

Maybe those will help to find the processes that are taking the CPU time.
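
Since the complaint is high %sy rather than %us, it might also be worth
checking which processes are accumulating system time. The sketch below
assumes the usual Linux /proc/<pid>/stat layout, where field 14 is utime and
field 15 is stime, both in clock ticks; split_times and top_sys_time are
hypothetical helper names:

```shell
#!/bin/sh
# split_times: given the contents of one /proc/<pid>/stat line, print
# "utime stime" (fields 14 and 15, in clock ticks).
split_times() {
    # Drop "pid (comm) " first; comm may itself contain spaces or ")".
    rest=${1##*) }
    set -- $rest
    # $1 is now field 3 (state), so field 14 is ${12} and field 15 is ${13}.
    echo "${12} ${13}"
}

# top_sys_time: print "stime utime pid" for every process,
# highest system time first.
top_sys_time() {
    for f in /proc/[0-9]*/stat; do
        # A process may exit between the glob and the read; skip it.
        line=$(cat "$f" 2>/dev/null) || continue
        [ -n "$line" ] || continue
        pid=${f#/proc/}
        pid=${pid%/stat}
        set -- $(split_times "$line")
        echo "$2 $1 $pid"
    done | sort -rn | head
}
```

These counters are cumulative since each process started, so a long-lived
daemon will rank high regardless. Comparing two runs of top_sys_time taken a
minute or so apart gives a better idea of who is burning kernel time now.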
