On Wednesday 05 December 2007 15:39:41, J. Potter wrote:
Hi List,
I'm stumped by this:
load average: 10.65, 594.71, 526.58
We're monitoring load every ~3 minutes. It'll be fine (e.g. load average: 2.14, 1.27, 1.03), and then in a single sample it jumps to something like the above. This happens about once a week on a few different servers, all running a similar application. I've never seen the 1-minute figure spike as high as the 5- or 15-minute figures.
Since that last value covers a 15-minute period, it doesn't seem possible to reach a 15-minute average of 500+ without having observed a spike in the 5-minute average at least 5 minutes earlier.
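The arithmetic seems to back this intuition up. Below is a rough sketch, assuming the Linux-style scheme in which each figure is an exponentially damped moving average updated about every 5 seconds. The function names and the 5-second tick are assumptions for illustration, not something taken from these servers.

```python
import math

TICK = 5.0  # seconds between updates (assumed, as on Linux)

def step(load, n_active, window):
    """One update of an exponentially damped average over `window` seconds."""
    decay = math.exp(-TICK / window)
    return load * decay + n_active * (1.0 - decay)

def simulate(start, n_active, seconds):
    """Hold n_active tasks for `seconds`; return the (1, 5, 15) minute averages."""
    loads = list(start)
    windows = (60.0, 300.0, 900.0)
    for _ in range(int(seconds / TICK)):
        loads = [step(l, n_active, w) for l, w in zip(loads, windows)]
    return tuple(loads)

def needed_active(before, after, window, seconds):
    """Constant task count needed to move one average from before to after."""
    decay = math.exp(-seconds / window)
    return (after - before * decay) / (1.0 - decay)

# How many tasks would have to be active for a full 3-minute gap between
# samples to lift the 15-minute figure from 1.03 to 526.58?
n = needed_active(1.03, 526.58, 900.0, 180.0)
one, five, fifteen = simulate((2.14, 1.27, 1.03), n, 180.0)
print(f"tasks needed: {n:.0f}; after 3 min: {one:.0f}, {five:.0f}, {fifteen:.0f}")
```

Under these assumptions it would take on the order of 2,900 active tasks for the whole 3 minutes, and the 1-minute figure would then end up far above the 15-minute one, not at 10.65. So the reported triple does not look like the result of ordinary averaging.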
Also, there aren't 500+ processes on these systems -- it's typically around 100 total processes (ps auxw | wc -l). (Is there a way to see the total count of kernel-level threads?)
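On the thread-count question: on Linux, the fourth field of /proc/loadavg is "runnable/total" scheduling entities, which counts threads as well as processes (ps -eLf also prints one line per thread). A minimal sketch for reading it, assuming a Linux /proc; the helper names and the sample line are made up:

```python
def parse_loadavg(text):
    """Split a /proc/loadavg line into averages and task counts."""
    fields = text.split()
    one, five, fifteen = (float(x) for x in fields[:3])
    # Fourth field looks like "3/140": runnable entities / total entities.
    running, total = (int(x) for x in fields[3].split("/"))
    return {"load": (one, five, fifteen), "running": running, "total": total}

def read_loadavg(path="/proc/loadavg"):
    """Read and parse the live file (Linux only)."""
    with open(path) as f:
        return parse_loadavg(f.read())

sample = "2.14 1.27 1.03 1/312 4711"  # made-up sample line
print(parse_loadavg(sample))
```

Logging "total" next to the load figures at each sample would show whether the thread count really stays near 100 when the spike appears.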
Thoughts?
As mentioned before, I/O could give such strange results. I suggest running dstat with logging to a file and analyzing the file afterwards.
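If dstat is not at hand, something similar can be sketched by hand. On Linux, tasks in uninterruptible sleep (state D, usually waiting on I/O) count toward the load average, so logging the number of R and D threads next to the load can show whether I/O is behind a spike. This is only a sketch assuming a Linux /proc layout; the function names are hypothetical.

```python
import glob
import time

def task_state(stat_line):
    """Return the state letter from a /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces
    # or ')', so take the first field after the last ')'.
    return stat_line[stat_line.rindex(")") + 1:].split()[0]

def count_states(pattern="/proc/[0-9]*/task/[0-9]*/stat"):
    """Count threads by state letter (R = runnable, D = uninterruptible)."""
    counts = {}
    for path in glob.glob(pattern):
        try:
            with open(path) as f:
                state = task_state(f.read())
        except (OSError, ValueError, IndexError):
            continue  # thread exited while we were reading, or odd line
        counts[state] = counts.get(state, 0) + 1
    return counts

def log_once(out):
    """Write one timestamped line with the load and the R/D thread counts."""
    with open("/proc/loadavg") as f:
        load = f.read().strip()
    c = count_states()
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    out.write(f"{stamp} load={load} R={c.get('R', 0)} D={c.get('D', 0)}\n")
```

Calling log_once on an open log file every few seconds (for example in a loop with time.sleep(5)) gives a much finer picture than one sample every 3 minutes.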