[CentOS] bizarre system slowness

Wed Apr 13 20:34:54 UTC 2011
Cal Webster <cwebster at ec.rr.com>

On Wed, 2011-04-13 at 13:06 -0700, Florin Andrei wrote:
> Running v5 64bit on a Dell 1950.
> 
> A cluster of 3 DB machines, identical hardware. One of them suddenly 
> became slower 2 weeks ago.
> 
> tar -zxf with a large file on this machine takes 1.5 minutes, but takes 
> only 10 seconds on any of its siblings. CPU usage seems high while 
> untarring, with lots of user and sys cycles being used, but almost no 
> wait cycles. It doesn't matter whether I untar on a local disk, or on a 
> fiber channel SAN volume, it's slow anyway.
> 
> scp a file over the network is slow too: 6 MB/s to this machine, 70 MB/s 
> to its siblings.
> 
> However, this is just as fast on all systems, including the "sick" one:
> 
> # time dd if=/dev/zero of=/dev/null bs=1M count=100000
> 100000+0 records in
> 100000+0 records out
> 104857600000 bytes (105 GB) copied, 2.59213 seconds, 40.5 GB/s
> 
> real	0m2.600s
> user	0m0.025s
> sys	0m2.550s
> 
> /proc/cpuinfo looks fine. Nothing suspect in dmesg.
> 
> Reboot doesn't fix it. Power off / power on doesn't fix it. Single-user
> mode is slow too, and I tried a couple of different kernels.
> 
> Dell's online diagnostics program could find nothing wrong with it.
> 
> /var/log/messages was full of "ntpd[7313]: frequency error -1707 PPM
> exceeds tolerance 500 PPM" messages. There were also a lot of messages
> about "the system limit for the maximum number of semaphore sets has
> been exceeded"; there were indeed a lot of leftover semaphores created
> by NRPE (owned by the nagios user). I deleted them but nothing changed,
> so they were a symptom, not the cause.

Are the system times different between the siblings?
Are all three siblings running ntpd and using the same time
server(s)?
Do the symptoms change with ntpd stopped/running?
Are the frequency offsets the same on each sibling?
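
To answer the questions above, something like the following could be run
on each of the three machines and the results compared. This is only a
sketch: freq_ppm is a made-up helper name, and it assumes ntptime prints
a line containing "frequency <value> ppm".

```shell
#!/bin/sh
# Sketch: extract the kernel frequency offset (in PPM) from ntptime output.
# Assumption: the output has a line containing "frequency <value> ppm".
freq_ppm() {
    sed -n 's/.*frequency \(-\{0,1\}[0-9][0-9.]*\) ppm.*/\1/p'
}

# Usage on each machine (needs the ntp package installed):
#   date -u                 # system time, to compare across siblings
#   ntpq -pn                # time sources in use and their offsets
#   ntptime | freq_ppm      # kernel frequency offset in PPM
#   cat /var/lib/ntp/drift  # drift value ntpd last saved
```

If the sick machine reports a frequency far from its siblings, that
points at its clock rather than at disk or network.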

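Since your log messages appear to be ntp-related, you might try
resetting your frequency offset and drift values. A frequency error of
-1707 PPM is about 147 seconds of drift per day, more than three times
the 500 PPM that ntpd tolerates, and it could cause many issues like
the ones you describe.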

service ntpd stop
ntptime -f 0
echo "0" > /var/lib/ntp/drift
service ntpd start
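
To see whether the reset helped, one option is to count the frequency
error messages before the reset and again some time after it. A minimal
sketch, assuming the messages land in /var/log/messages as in your
report (count_freq_errors is just an illustrative name):

```shell
#!/bin/sh
# Sketch: count ntpd "frequency error ... exceeds tolerance" lines in a log.
count_freq_errors() {
    # grep -c prints the number of matching lines (0 if none match);
    # "|| true" hides grep's non-zero exit status when nothing matches.
    grep -c 'frequency error .* exceeds tolerance' "$1" || true
}

# Usage:
#   count_freq_errors /var/log/messages
```

If the count keeps growing after ntpd has been running again for a
while, the reset did not cure the underlying frequency error.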


> I'm still kind of hoping it's a software issue, but chances are slim. 
> OTOH, I can't imagine any hardware problem that would exhibit these 
> symptoms.
> 
> Any idea what to test?
>