[CentOS] High load average, low CPU utilization

Fri Mar 28 15:04:46 UTC 2014
Matt Garman <matthew.garman at gmail.com>

On Fri, Mar 28, 2014 at 9:37 AM, John R. Dennison <jrd at gerdesas.com> wrote:
>
> On Fri, Mar 28, 2014 at 09:30:17AM -0500, Matt Garman wrote:
> >
> > How can the loadavg shoot up (from ~1 to ~20) without a corresponding
> > uptick in number of tasks?
>
> loadavg is based on number of processes vying for cpu time on the runq; the
> number of over-all processes on the system is not really relevant unless
> they are all competing for cpu.


Is there a way to see the number of processes on the runq, either from
the shell or programmatically?
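
For what it's worth, here is a rough sketch of one way to get at this
programmatically on Linux. As far as I know, the fourth field of
/proc/loadavg has the form "runnable/total" (currently runnable scheduling
entities over the total number), and /proc/stat carries "procs_running"
and "procs_blocked" lines. The function names below are my own, made up
for illustration. Since the Linux load average also counts tasks in
uninterruptible sleep, the blocked count may be worth watching alongside
the runnable count.

```python
#!/usr/bin/env python3
"""Sketch: report runnable and blocked task counts on Linux."""


def parse_loadavg(text):
    """Parse the contents of /proc/loadavg.

    The fourth field looks like "runnable/total": the number of currently
    runnable scheduling entities over the total number that exist.
    """
    fields = text.split()
    runnable, total = fields[3].split("/")
    return {
        "load1": float(fields[0]),
        "load5": float(fields[1]),
        "load15": float(fields[2]),
        "runnable": int(runnable),
        "total": int(total),
    }


def parse_proc_stat(text):
    """Pull the procs_running and procs_blocked counters out of /proc/stat."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("procs_running", "procs_blocked"):
            counts[parts[0]] = int(parts[1])
    return counts


if __name__ == "__main__":
    # Both files are Linux-specific; this will not work on other systems.
    with open("/proc/loadavg") as f:
        print(parse_loadavg(f.read()))
    with open("/proc/stat") as f:
        print(parse_proc_stat(f.read()))
```

Running this in a loop during a spike should show whether the load is
coming from runnable tasks or from blocked ones.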


> What's the i/o wait on the box when you see load spikes?  If the box is
> i/o bound (indicated by high i/o) the load average will spike due to
> processes blocked on i/o cycles.

I ran "top -b" directed to a file and captured one of these spikes.
Here's a sample from the approximate start, peak, and end of the load
spike (respectively):

top - 18:40:29 up 14 days,  1:34, 4 users,  load average: 0.80, 0.48, 0.29
Tasks: 205 total,   1 running, 204 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.2%us,  4.9%sy,  0.0%ni, 92.1%id,  0.0%wa,  0.1%hi,  1.7%si,  0.0%st

top - 19:16:00 up 14 days,  2:09, 4 users,  load average: 19.67, 19.02, 15.75
Tasks: 203 total,   1 running, 202 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.1%us,  4.6%sy,  0.0%ni, 92.3%id,  0.0%wa,  0.2%hi,  1.9%si,  0.0%st

top - 20:20:27 up 14 days,  3:14, 4 users,  load average: 0.93, 3.58, 8.69
Tasks: 212 total,   1 running, 211 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.2%us,  4.8%sy,  0.0%ni, 91.7%id,  0.6%wa,  0.1%hi,  1.6%si,  0.0%st

I collected 17277 top samples in total.  The maximum "%wa" over this
time was 61.1%, and fewer than 40 of those samples had "%wa" over
10.0%.  In other words, over many hours, the system had I/O wait over
10% for less than a minute, whereas the load spike lasts for almost
two hours.
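
In case anyone wants to repeat this kind of count, here is a minimal
sketch of how the "%wa" figures can be pulled out of a saved "top -b" log.
It assumes the "Cpu(s):" line format shown in the samples above; the
function names and the log file name are hypothetical, and this is not
necessarily the exact method used for the numbers quoted here.

```python
#!/usr/bin/env python3
"""Sketch: summarize the %wa field from a saved "top -b" log."""
import re
import sys

# Matches the iowait value on a "Cpu(s):" summary line such as
# "Cpu(s):  1.2%us,  4.9%sy,  0.0%ni, 92.1%id,  0.0%wa, ..."
WA_RE = re.compile(r"^Cpu\(s\):.*?(\d+\.\d+)%wa")


def iowait_samples(lines):
    """Yield the %wa value (as a float) from each Cpu(s) line."""
    for line in lines:
        match = WA_RE.match(line.strip())
        if match:
            yield float(match.group(1))


def summarize(lines, threshold=10.0):
    """Return the sample count, the max %wa, and how many exceed threshold."""
    values = list(iowait_samples(lines))
    return {
        "samples": len(values),
        "max_wa": max(values) if values else None,
        "over_threshold": sum(1 for v in values if v > threshold),
    }


if __name__ == "__main__":
    # Usage: python3 wa_summary.py top.log   (both names are hypothetical)
    with open(sys.argv[1]) as f:
        print(summarize(f))
```

Multiplying "over_threshold" by the top refresh interval gives a rough
estimate of the total time spent above the threshold.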