[CentOS] High load average, low CPU utilization

Thu Mar 27 22:20:22 UTC 2014
Matt Garman <matthew.garman at gmail.com>

I have a dual Xeon 5130 (four total CPUs) server running CentOS 5.7.
Approximately every 17 hours, the load on this server slowly creeps up
until it hits 20, then slowly goes back down.

The most recent example started around 2:00am this morning.  Outside of
these weird times, the load never exceeds 2.0 (and in fact spends the
overwhelming majority of its time at 1.0).  So this morning, a few data
points:

    - 2:06 to 2:07 load increased from 1.0 to 2.0
    - At 2:09 it hit 4.0
    - At 2:10 it hit 5.34
    - At 2:16 it hit 10.02
    - At 2:17 it hit 11.0
    - At 2:24 it hit 17.0
    - At 2:27 it hit 19.0 and stayed there (+/- 1.0) until 2:48
    - At 2:48 it was 18.96 and looked like it started to go down (very
      slowly)
    - At 2:57 it was 17.84
    - At 3:05 it was 16.76
    - At 3:16 it was 15.03
    - At 3:27 it was 9.3
    - At 3:39 it was 4.08
    - At 3:44 it was 1.92, and stayed under 2.0 from there on

This is the 1m load average by the way (i.e. first number in /proc/loadavg,
given by top, uptime, etc).
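
For anyone who wants to log the same number outside of top or uptime, here
is a minimal Python sketch that parses /proc/loadavg.  The function names
and dictionary keys are only illustrative, and it assumes the usual
five-field Linux layout (three averages, runnable/total tasks, last PID).

```python
def parse_loadavg(text):
    """Parse the contents of /proc/loadavg into a dict.

    Assumes the standard Linux layout, e.g. "1.00 0.98 0.95 2/345 12345".
    """
    fields = text.split()
    runnable, total = fields[3].split("/")
    return {
        "load1": float(fields[0]),    # 1-minute average (the number above)
        "load5": float(fields[1]),    # 5-minute average
        "load15": float(fields[2]),   # 15-minute average
        "runnable": int(runnable),    # currently runnable tasks
        "total_tasks": int(total),    # total tasks on the system
        "last_pid": int(fields[4]),   # most recently created PID
    }


def read_loadavg(path="/proc/loadavg"):
    """Read and parse the live load average (Linux only)."""
    with open(path) as f:
        return parse_loadavg(f.read())
```

Calling read_loadavg() once a minute and appending the result to a log
would give a timeline like the one above without a person watching top.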

Running top while this occurs shows very little CPU usage.  It seems the
standard cause of this is processes in the "D" state (uninterruptible
sleep), which usually means waiting on I/O.  But we're not seeing any
processes in that state.
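
Since the Linux load average counts tasks that are runnable (R) as well as
tasks in uninterruptible sleep (D), one way to double-check what top shows
is to tally task states straight from /proc.  The following is a rough
sketch, not a polished tool; task_state and count_states are made-up
names, and it only looks at /proc/<pid>/stat, so threads (which live under
/proc/<pid>/task/) are not counted individually.

```python
import os


def task_state(stat_line):
    """Return the one-letter state from a /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or ')', so split after the LAST ')' rather than on whitespace.
    """
    return stat_line.rsplit(")", 1)[1].split()[0]


def count_states(proc_root="/proc"):
    """Tally process states (R, S, D, Z, ...) found under proc_root."""
    counts = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-PID entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                state = task_state(f.read())
        except (OSError, IndexError):
            continue  # process exited, or the line was not parseable
        counts[state] = counts.get(state, 0) + 1
    return counts
```

Sampling count_states() every second or so during a spike and comparing
the "R" and "D" totals against the load average would show whether the
load is coming from tasks that top's process view is not displaying.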

In fact, the system runs sar, and I've collected copious amounts of data,
but I don't see anything that jumps out as correlating with these events:
no surges in disk I/O, disk read/write bytes, network traffic, etc.
The system *never* uses any swap.

I also used "dstat" to collect all data that it can for 24 hours (so it
captured one of these events).  I used 1 second samples, loaded the info up
into a huge spreadsheet, but again, didn't see any obvious "trigger" or
interesting stuff going on while the load spiked.

All the programs running on the system seem to work fine while this is
happening, but it triggers all kinds of monitoring alerts, which is
annoying.  We've been collecting data on these events too, and as I said
above, they seem to happen every 17 hours.
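
To put a number on the "every 17 hours" observation, the gaps between
recorded spike start times can be computed directly.  This is just a
sketch: hours_between is a hypothetical helper, and the start times would
have to come from the monitoring alerts or load logs.

```python
from datetime import datetime


def hours_between(starts):
    """Return the gaps, in hours, between consecutive event start times.

    `starts` is a list of datetime objects in any order.
    """
    ordered = sorted(starts)
    return [
        (later - earlier).total_seconds() / 3600.0
        for earlier, later in zip(ordered, ordered[1:])
    ]
```

If the gaps come out consistently near 17.0, a fixed-interval timer is more
likely than a cron job, since 17 hours does not divide evenly into a day
and the start time would drift around the clock.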

I checked all our cron jobs, and nothing jumped out as an obvious culprit.

Anyone seen anything like this?  Any thoughts or ideas?

Thanks,
Matt