Hi List,
I'm stumped by this:
load average: 10.65, 594.71, 526.58
We're monitoring load every ~3 minutes. It'll be fine (e.g. load average: 2.14, 1.27, 1.03), and then in a single sample it jumps to something like the above. This happens about once a week on a few different servers (all running a similar application). I've never seen the 1-minute value spike as high as the 5- or 15-minute values.
Since that last value covers a 15-minute window, it doesn't seem possible to reach a 500+ 15-minute average without having seen a spike in the 5-minute average at least 5 minutes earlier.
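If I remember the kernel's accounting right, all three figures are exponentially damped moving averages of the number of runnable plus uninterruptible-sleep tasks, recomputed roughly every 5 seconds, something like:

    avg_new = avg_old * e^(-5/T) + n * (1 - e^(-5/T))    (T = 60, 300 or 900 seconds)

so the only way the 5- and 15-minute figures can sit in the hundreds while the 1-minute figure is low is if the instantaneous task count n was in the hundreds for several minutes and has since dropped.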
Also, there aren't 500+ processes on these systems -- it's typically around 100 total processes (ps auxw | wc -l). (Is there a way to see the total count of kernel-level threads?)
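One way to count them, assuming the procps ps on RHEL5 supports the -L thread listing (I believe it does):

    ps -eLf | wc -l      # one line per kernel thread (LWP), plus one header line
    cat /proc/loadavg    # the 4th field is runnable/total scheduling entities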
Thoughts?
best, Jeff
Linux someHostName 2.6.18-8.1.8.el5 #1 SMP Tue Jul 10 06:39:17 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
CentOS release 5 (Final)
09:31:15 up 65 days, 17:45, 2 users, load average: 0.92, 200.91, 371.30
Wednesday 05 December 2007 15:39:41 J. Potter wrote:
> I'm stumped by this:
> load average: 10.65, 594.71, 526.58
> [...]
As mentioned before, IO could give such strange results. I suggest launching dstat with logging to a file, and analyzing the file afterwards.
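Something along these lines should do it (flags are from memory, so check them against your dstat version; the file path is just an example):

    # log cpu, disk, net, paging and interrupt/context-switch totals every 5s to CSV
    dstat -tcdngy --output /var/log/dstat.csv 5 > /dev/null &

    # afterwards, pull out the samples where interrupts/sec exceeded 50000
    # (int is the second-to-last column in that CSV layout)
    awk -F, '$(NF-1)+0 > 50000' /var/log/dstat.csv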
On Thu, 2007-12-06 at 16:48 +0100, Tomasz 'Zen' Napierala wrote:
> Wednesday 05 December 2007 15:39:41 J. Potter wrote:
> > load average: 10.65, 594.71, 526.58
> > [...]
> As mentioned before, IO could give such strange results. I suggest launching dstat with logging to a file, and analyzing the file afterwards.
What about using sar to report the run queue history? AFAIK the run queue figures don't include processes in an uninterruptible sleep state (disk IO).
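Assuming the sysstat package (and its sa1 cron job) is installed on these CentOS boxes, the recorded history for a given day of the month is available with something like:

    # run-queue size, total task count and load averages for the 5th
    sar -q -f /var/log/sa/sa05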
> As mentioned before, IO could give such strange results. I suggest launching dstat with logging to a file, and analyzing the file afterwards.
Thanks, much appreciated!
This has yielded some interesting data, which I've included below: a few seconds from before and after one of these events.
System interrupts per second -- note the ~200x jump to almost 200,000 interrupts per second:

    2907  6714  1371  194218  2456  2907
Network received -- it ramps up over about 5 seconds, peaks at ~50x background, and ramps back down in about 3 seconds; the peak is from the same sample as the 200x interrupt spike above:

    108784  389794  1070850  4843956  352226  353102  96392
Everything else looks sane -- there's enough RAM, nothing's being swapped out, etc. This is a private-network server with a load balancer in front of it, so if it's network related, it isn't random misdirected traffic from the outside.
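A rough way to see which device is behind the ~200,000 interrupts/sec would be to snapshot /proc/interrupts around a spike and diff the per-IRQ counters. A minimal sketch (it only diffs the CPU0 column, which is usually enough to spot the noisy IRQ):

    cat /proc/interrupts > /tmp/irq.1; sleep 5; cat /proc/interrupts > /tmp/irq.2
    awk 'NR==FNR {c[$1]=$2; next} ($1 in c) {print $1, $2 - c[$1]}' /tmp/irq.1 /tmp/irq.2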
Has anyone seen this sort of behavior before? What was the cause? What should I look at next to track this down and keep the load averages from spiraling out of control?
(This isn't something as lame as a counter rolling over somewhere internal to the kernel, is it? Wouldn't think so, but thought to ask. Running 2.6.18-8.1.8.el5. We could reboot to run 2.6.18-8.1.15 if that'd be a potential fix.)
Thanks for any insight!
best, Jeff
---total-cpu-usage--- ----dsk/total---- -----net/total----- -----system-----
   usr     sys |    read     writ |     recv      send |     int     csw
  10.5    3.25 |       0   409600 |   108784     72286 |    2907   20376
  3.99   2.993 |       0   319488 |   389794    661170 |    6714   23941
  0.25    0.25 |       0   720896 |  1070850   1189720 |    1371   16648
  9.167  90.442|   12288  1122304 |  4843956    399998 |  194218   55433
 56.931  16.832|       0  1273856 |   352226    334506 |    2456   12844
 46.25   20    |       0   454656 |   353102    384496 |    2907   20631
 24.25    1.25 |       0  3260416 |    96392     72316 |    1342   17307
 23.25    2.25 |       0   610304 |    91086     71194 |    1458   17584
 10.973   1.496|       0        0 |    84192     46276 |    1349   18135
  0       0    |       0    94208 |    71892     33304 |    1220   16979
  0.25    0.25 |       0   126976 |    71184     47576 |    1268   16973