On Fri, Mar 28, 2014 at 9:37 AM, John R. Dennison jrd@gerdesas.com wrote:
On Fri, Mar 28, 2014 at 09:30:17AM -0500, Matt Garman wrote:
How can the loadavg shoot up (from ~1 to ~20) without a corresponding uptick in number of tasks?
loadavg is based on the number of processes vying for cpu time on the run queue (runq); the overall number of processes on the system is not really relevant unless they are all competing for cpu.
Is there a way to see this number of processes in the runq? From the shell or programmatically?
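One possible way to look at this, as a minimal sketch that assumes a Linux box with /proc mounted: the fourth field of /proc/loadavg is "currently runnable / total" scheduling entities, and /proc/stat carries procs_running and procs_blocked lines. On Linux the load average also counts tasks in uninterruptible sleep, so the blocked count is worth watching alongside the runnable count. The function names parse_loadavg and parse_proc_stat are made up for illustration.

```python
import os


def parse_loadavg(text):
    """Parse the contents of /proc/loadavg into a dict."""
    fields = text.split()
    # Fourth field looks like "1/205": runnable entities / total entities.
    runnable, total = fields[3].split("/")
    return {
        "load1": float(fields[0]),
        "load5": float(fields[1]),
        "load15": float(fields[2]),
        "runnable": int(runnable),
        "total": int(total),
    }


def parse_proc_stat(text):
    """Pull the procs_running and procs_blocked counters out of /proc/stat."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("procs_running", "procs_blocked"):
            counts[parts[0]] = int(parts[1])
    return counts


if __name__ == "__main__":
    # Only read the files if they exist, so the sketch is harmless elsewhere.
    for path, parser in (("/proc/loadavg", parse_loadavg),
                         ("/proc/stat", parse_proc_stat)):
        if os.path.exists(path):
            with open(path) as f:
                print(path, parser(f.read()))
```

From the shell, "cat /proc/loadavg" shows the same fields, and the "r" and "b" columns of vmstat report runnable and uninterruptibly sleeping processes respectively.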
What's the i/o wait on the box when you see load spikes? If the box is i/o bound (indicated by high i/o wait) the load average will spike due to processes blocked on i/o cycles.
I ran "top -b" directed to a file and captured one of these spikes. Here's a sample from the approximate start, peak, and end of the load spike (respectively):
top - 18:40:29 up 14 days, 1:34, 4 users, load average: 0.80, 0.48, 0.29
Tasks: 205 total, 1 running, 204 sleeping, 0 stopped, 0 zombie
Cpu(s): 1.2%us, 4.9%sy, 0.0%ni, 92.1%id, 0.0%wa, 0.1%hi, 1.7%si, 0.0%st

top - 19:16:00 up 14 days, 2:09, 4 users, load average: 19.67, 19.02, 15.75
Tasks: 203 total, 1 running, 202 sleeping, 0 stopped, 0 zombie
Cpu(s): 1.1%us, 4.6%sy, 0.0%ni, 92.3%id, 0.0%wa, 0.2%hi, 1.9%si, 0.0%st

top - 20:20:27 up 14 days, 3:14, 4 users, load average: 0.93, 3.58, 8.69
Tasks: 212 total, 1 running, 211 sleeping, 0 stopped, 0 zombie
Cpu(s): 1.2%us, 4.8%sy, 0.0%ni, 91.7%id, 0.6%wa, 0.1%hi, 1.6%si, 0.0%st
Looks like I collected 17277 total top samples. The max "%wa" over this time was 61.1%, and fewer than 40 of those samples had "%wa" over 10.0. In other words, over many hours, the system had IOwait over 10% for less than a minute. And note that my load spike lasts for almost two hours.
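For anyone who wants to repeat this kind of tally, here is a rough sketch of how the "%wa" numbers could be pulled out of a saved "top -b" log. It assumes the "Cpu(s):" line format shown in the samples above (other top versions print the summary differently), and the names iowait_samples and summarize are just illustrative, not the script actually used for the figures quoted here.

```python
import re

# Matches the iowait field as printed above, e.g. "0.6%wa".
WA_RE = re.compile(r"(\d+(?:\.\d+)?)%wa")


def iowait_samples(lines):
    """Yield the %wa value from every line containing a Cpu(s) summary."""
    for line in lines:
        if "Cpu(s):" in line:
            match = WA_RE.search(line)
            if match:
                yield float(match.group(1))


def summarize(lines, threshold=10.0):
    """Count samples, find the max %wa, and count samples above threshold."""
    values = list(iowait_samples(lines))
    return {
        "samples": len(values),
        "max_wa": max(values, default=0.0),
        "over_threshold": sum(1 for v in values if v > threshold),
    }


if __name__ == "__main__":
    # The three Cpu(s) lines from the samples in this message.
    sample = [
        "Cpu(s): 1.2%us, 4.9%sy, 0.0%ni, 92.1%id, 0.0%wa, 0.1%hi, 1.7%si, 0.0%st",
        "Cpu(s): 1.1%us, 4.6%sy, 0.0%ni, 92.3%id, 0.0%wa, 0.2%hi, 1.9%si, 0.0%st",
        "Cpu(s): 1.2%us, 4.8%sy, 0.0%ni, 91.7%id, 0.6%wa, 0.1%hi, 1.6%si, 0.0%st",
    ]
    print(summarize(sample))
```

An open file object can be passed to summarize() directly, since it only iterates over lines.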