On Thu, 2007-12-06 at 16:48 +0100, Tomasz 'Zen' Napierala wrote:
On Wednesday 05 December 2007 15:39:41, J. Potter wrote:
Hi List,
I'm stumped by this:
load average: 10.65, 594.71, 526.58
We're monitoring load every ~3 minutes. It'll be fine (i.e. something like load average: 2.14, 1.27, 1.03) and then, in a single sample, jump to something like the above. This seems to happen once a week or so on a few different servers (all running a similar application). I've never seen the 1 minute sample spike as high as the 5 or 15 minute samples.
Since that last value is a 15-minute average, it doesn't seem possible to get a 500+ reading there without having seen a spike in the 5-minute figure at least 5 minutes earlier.
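For reference, my possibly shaky understanding is that the kernel recomputes all three figures every 5 seconds as exponentially damped moving averages of n, the number of runnable (plus uninterruptible-sleep) tasks, roughly:

  load_m = load_m * exp(-5/(60*m)) + n * (1 - exp(-5/(60*m)))    # m = 1, 5 or 15 minutes

so a 15-minute figure above 500 should mean n sat in the hundreds for several minutes straight, which is what makes this look so wrong.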
Also, there aren't 500+ processes on these systems -- it's typically around 100 total processes (ps auxw | wc -l). (Is there a way to see the total count of kernel-level threads?)
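Would something like this cover it?

  ps -eLf | wc -l      # one line per thread (LWP) rather than per process
  cat /proc/loadavg    # the fourth field appears to be running/total scheduling entities

I'm not sure either of those is the right way to count kernel threads, though.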
Thoughts?
As mentioned before, IO could give such strange results. I suggest launching dstat with logging to a file, and analyzing the file afterwards.
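Something along these lines should work, assuming your dstat is recent enough to have --output (adjust the columns and interval to taste):

  dstat -tclpy --output /var/log/dstat.csv 60
  # -t time, -c cpu, -l load averages, -p procs (runnable/uninterruptible/new),
  # -y interrupts+context switches; one sample every 60 seconds, logged to CSV as well as the screen

The -p column is the interesting one here, since it separates runnable tasks from those blocked on IO.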
What about using sar to report the previous run queue history? AFAIK the run queue figures don't include processes in an uninterruptible sleep state (disk IO).
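If sysstat's data collection cron job is enabled, something like this should pull the history back out (the file name is an assumption; DD is the day of month on most distros):

  sar -q -f /var/log/sa/saDD    # runq-sz, plist-sz and the 1/5/15 minute load averages for that day

If plist-sz and the ldavg columns jump while runq-sz stays low, that would point at uninterruptible sleep rather than a genuinely long run queue.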