[CentOS] High load average, low CPU utilization

Fri Mar 28 14:30:17 UTC 2014
Matt Garman <matthew.garman at gmail.com>

On Fri, Mar 28, 2014 at 9:01 AM, Mr Queue <lists at mrqueue.com> wrote:

> On Thu, 27 Mar 2014 17:20:22 -0500
> Matt Garman <matthew.garman at gmail.com> wrote:
>
> > Anyone seen anything like this?  Any thoughts or ideas?
>
> Post some data.. This public facing? Are you getting sprayed down by
> packets? Array? Soft/hard? Someone have screens
> laying around? Write a trap to catch a process list when the loads spike?
> Look at crontab(s)? User accounts? Malicious
> shells? Any guest containers around? Possibilities are sort of endless
> here.
>


Not public-facing (no Internet access at all).  Linux software RAID-1.  No
screen or tmux sessions.  No guest access of any kind.  In fact, there are
only three logged-in users.

I've reviewed crontabs (there are only a couple), and I don't see anything
out of the ordinary.  Malicious shells or programs: possibly, but I think
that is highly unlikely... if someone were going to do something malicious,
*this* particular server is not the one to target.

What kind of data would help?  I have sar running at a five-second
interval.  I also did a 24-hour run of dstat at a one-second interval,
collecting all the information it could.  I have tons of data, but I'm not
sure how to "distill" it down to a mailing-list-friendly format.  A
colleague and I reviewed the data, though, and didn't see any correlation
with other system data before, during, or after these load spike events.

I did a little research on the loadavg number, and my understanding is that
it's simply a function of the number of tasks on the system.  (There's some
fancy stuff thrown in for exponential decay and curve smoothing and all
that, but it's still based on the number of system tasks.)
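
For reference, here is a minimal sketch of that decay calculation as I
understand it.  It assumes the usual Linux behaviour: a sample roughly every
5 seconds, windows of 1, 5 and 15 minutes, and a sample value equal to the
number of tasks that are runnable or in uninterruptible sleep at that
instant (not the total task count).  The kernel does this in fixed-point
arithmetic; the float version and the function names below are my own
simplification, not kernel code.

```python
import math

def update_loadavg(load, active, interval=5.0, window=60.0):
    """One step of an exponentially damped moving average.

    load     -- previous load average value
    active   -- tasks counted as active at this sample (assumed: R + D state)
    interval -- seconds between samples (assumed 5 s)
    window   -- averaging window in seconds (60 for the 1-minute figure)
    """
    decay = math.exp(-interval / window)
    return load * decay + active * (1.0 - decay)

def simulate(active_counts, load=0.0, interval=5.0, window=60.0):
    """Feed a sequence of per-sample active-task counts through the average."""
    for n in active_counts:
        load = update_loadavg(load, n, interval, window)
    return load
```

With this model, starting from a load of 1 and holding 20 active tasks per
sample, e.g. simulate([20] * 60, load=1.0), the 1-minute figure climbs most
of the way to 20 within a few minutes, regardless of how many tasks exist
in total.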

I did a simple run of "top -b > top_output.txt" for a 24-hour period, which
captured another one of these events.  I haven't had a chance to study it
in detail, but I expected the number of tasks to shoot up dramatically
around the time of these load spikes.  Instead, the number of tasks stayed
fairly constant: about 200 +/- 5.
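
A rough sketch of how the captured file could be boiled down to one row per
top iteration, pairing the 1-minute load with the task counts.  It assumes
the summary lines look like the usual procps layout ("... load average:
a, b, c" followed by "Tasks: N total, N running, N sleeping, ..."); other
top versions or locales may format these differently, and the function name
is just for illustration.

```python
import re

# Assumed procps-style summary lines; adjust if your top prints them differently.
LOAD_RE = re.compile(r"load average:\s*([\d.]+)")
TASKS_RE = re.compile(
    r"^Tasks:\s*(\d+)\s+total,\s*(\d+)\s+running,\s*(\d+)\s+sleeping"
)

def summarize(lines):
    """Yield (load1, total, running, sleeping) for each top iteration."""
    load1 = None
    for line in lines:
        m = LOAD_RE.search(line)
        if m:
            # Header line of a new iteration: remember the 1-minute load.
            load1 = float(m.group(1))
            continue
        m = TASKS_RE.match(line)
        if m and load1 is not None:
            total, running, sleeping = map(int, m.groups())
            yield load1, total, running, sleeping
            load1 = None
```

Passing the open top_output.txt file object to summarize() and printing
each tuple would give a compact table.  One caveat: I believe top's summary
folds tasks in uninterruptible sleep (state D) into the "sleeping" count,
so the per-process S column may be the better place to look for them.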

How can the loadavg shoot up (from ~1 to ~20) without a corresponding
uptick in number of tasks?
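
One way to dig into that question would be to snapshot, during a spike,
which threads are in state R (runnable) or D (uninterruptible sleep), since
those are the states I understand the load figure to be built from.  Below
is a minimal sketch that reads /proc directly; the function names are made
up for illustration, and it assumes the standard "pid (comm) state ..."
layout of the stat files.

```python
import glob

def parse_stat(text):
    """Return (tid, comm, state) from the contents of a /proc stat file.

    comm may itself contain spaces or parentheses, so split on the first
    "(" and the last ")" rather than on whitespace.
    """
    lpar = text.index("(")
    rpar = text.rindex(")")
    tid = int(text[:lpar])
    comm = text[lpar + 1:rpar]
    state = text[rpar + 1:].split()[0]
    return tid, comm, state

def active_tasks():
    """List (tid, comm, state) for threads currently in state R or D."""
    found = []
    for path in glob.glob("/proc/[0-9]*/task/[0-9]*/stat"):
        try:
            with open(path) as f:
                tid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # thread exited or file unreadable while scanning
        if state in ("R", "D"):
            found.append((tid, comm, state))
    return found
```

Running active_tasks() in a loop and logging the result whenever the first
field of /proc/loadavg crosses some threshold would show whether the spike
lines up with a burst of D-state threads even though the total task count
stays flat.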