[CentOS] bizarre system slowness

Wed Apr 13 20:06:06 UTC 2011
Florin Andrei <florin at andrei.myip.org>

Running CentOS 5 (64-bit) on a Dell 1950.

A cluster of 3 DB machines, identical hardware. One of them suddenly 
became slower 2 weeks ago.

Running tar -zxf on a large file takes 1.5 minutes on this machine, but 
only 10 seconds on any of its siblings. CPU usage seems high while 
untarring, with lots of user and sys cycles being used, but almost no 
wait cycles. It doesn't matter whether I untar to a local disk or to a 
Fibre Channel SAN volume; it's slow either way.

Copying a file over the network with scp is slow too: 6 MB/s to this 
machine, 70 MB/s to its siblings.

However, this is just as fast on all systems, including the "sick" one:

# time dd if=/dev/zero of=/dev/null bs=1M count=100000
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 2.59213 seconds, 40.5 GB/s

real	0m2.600s
user	0m0.025s
sys	0m2.550s
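
Note that the dd run above spends almost all of its time in sys and 
barely any in user, so it may not exercise whatever makes tar and scp 
slow. A minimal sketch of a user-space, CPU-bound comparison that could 
be run on each machine, assuming GNU dd and gzip are installed; the 
function name cpu_check and the default size are made up for 
illustration:

```shell
# Hypothetical helper: gzip N MB of zeros and report elapsed wall-clock seconds.
cpu_check() {
    mb="${1:-500}"                 # input size in MB (arbitrary default)
    start=$(date +%s)
    # dd only produces the input; gzip does the CPU-bound work in user space.
    dd if=/dev/zero bs=1M count="$mb" 2>/dev/null | gzip -c > /dev/null
    end=$(date +%s)
    echo "$mb MB gzipped in $((end - start)) s"
}
```

Calling "cpu_check 1000" on the slow box and on a sibling would give two 
numbers to compare. The timing is only accurate to the second, which 
should be enough for a difference as large as the one seen with tar.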

/proc/cpuinfo looks fine. Nothing suspect in dmesg.
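
One field that might be worth comparing side by side across the three 
machines is "cpu MHz". A rough sketch, assuming the usual x86 
/proc/cpuinfo layout; the function name cpu_mhz_summary is hypothetical:

```shell
# Hypothetical helper: count how many cores report each "cpu MHz" value.
# Reads /proc/cpuinfo by default; pass a file to inspect a saved copy instead.
cpu_mhz_summary() {
    awk -F': *' '
        /^cpu MHz/ { n[$2]++ }
        END { for (m in n) print n[m], "core(s) at", m, "MHz" }
    ' "${1:-/proc/cpuinfo}"
}
```

Running "cpu_mhz_summary" on each box, ideally while the untar is in 
progress, would show whether the slow machine reports a different clock 
than its siblings.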

Reboot doesn't fix it. Power off / power on doesn't fix it. Single-user 
mode is slow too, and I tried a couple of different kernels.

Dell's online diagnostics program could find nothing wrong with it.

/var/log/messages was full of "ntpd[7313]: frequency error -1707 PPM 
exceeds tolerance 500 PPM" messages. There were also a lot of messages 
about "the system limit for the maximum number of semaphore sets has 
been exceeded"; there were indeed a lot of leftover semaphores created 
by NRPE (owned by the nagios user). I deleted them but nothing changed, 
so they were a symptom, not the cause.
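
For anyone who needs to clear leftover semaphores the same way, a 
minimal sketch using ipcs and ipcrm from util-linux. The function names 
are hypothetical, and it assumes the usual Linux "ipcs -s" column order 
(key, semid, owner, perms, nsems):

```shell
# Print the IDs of semaphore sets owned by a user.
# Expects `ipcs -s` output on stdin; header lines are skipped because
# their second column is not numeric.
sem_ids_for() {
    awk -v u="$1" '$3 == u && $2 ~ /^[0-9]+$/ { print $2 }'
}

# Remove every semaphore set owned by the given user.
# Needs to run as root or as that user.
remove_sems_for() {
    ipcs -s | sem_ids_for "$1" | while read -r id; do
        ipcrm -s "$id"
    done
}
```

"ipcs -s | sem_ids_for nagios" only lists the IDs, so it is safe to 
check first; "remove_sems_for nagios" then deletes them.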

I'm still kind of hoping it's a software issue, but chances are slim. 
OTOH, I can't imagine any hardware problem that would exhibit these 
symptoms.

Any idea what to test?

-- 
Florin Andrei
http://florin.myip.org/