I have a dual Xeon 5130 (four total CPUs) server running CentOS 5.7. Approximately every 17 hours, the load on this server slowly creeps up until it hits 20, then slowly goes back down.
The most recent example started around 2:00am this morning. Outside of these weird times, the load never exceeds 2.0 (and in fact spends the overwhelming majority of its time at 1.0). So this morning, a few data points:
- 2:06 to 2:07 load increased from 1.0 to 2.0
- At 2:09 it hit 4.0
- At 2:10 it hit 5.34
- At 2:16 it hit 10.02
- At 2:17 it hit 11.0
- At 2:24 it hit 17.0
- At 2:27 it hit 19.0 and stayed there +/- 1.0 until
- At 2:48 it was 18.96 and looked like it started to go down (very slowly)
- At 2:57 it was 17.84
- At 3:05 it was 16.76
- At 3:16 it was 15.03
- At 3:27 it was 9.3
- At 3:39 it was 4.08
- At 3:44 it was 1.92, and it stayed under 2.0 from there on
This is the 1m load average by the way (i.e. first number in /proc/loadavg, given by top, uptime, etc).
Running top while this occurs shows very little CPU usage. It seems the standard cause of this is processes in a "D" state, which means waiting on I/O. But we're not seeing that.
In fact, the system runs sar, and I've collected copious amounts of data. But I don't see anything that jumps out as correlating with these events: no surges in disk I/O, disk read/write bytes, network traffic, etc. The system *never* uses any swap.
I also used "dstat" to collect all data that it can for 24 hours (so it captured one of these events). I used 1 second samples, loaded the info up into a huge spreadsheet, but again, didn't see any obvious "trigger" or interesting stuff going on while the load spiked.
All the programs running on the system seem to work fine while this is happening... but it triggers all kinds of monitoring alerts, which is annoying. We've been collecting data too, and as I said above, it seems to happen every 17 hours.
I checked all our cron jobs, and nothing jumped out as an obvious culprit.
Anyone seen anything like this? Any thoughts or ideas?
Thanks, Matt
On 2014/03/27 12:20, Matt Garman wrote:
Anyone seen anything like this? Any thoughts or ideas?
Thanks, Matt
Something of a shot in the dark, but when we had a server with a high load average where nothing obvious was causing it, it turned out to be multiple df commands hanging on a stale NFS mount. This command helped us identify it:
top -b -n 1 | awk '{ if (NR <= 7) print; else if ($8 == "D") { print; count++ } } END { print "Total status D: " count+0 }'
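
If it does turn out to be stuck I/O, ps can show roughly the same thing without relying on top's column layout, e.g.:

ps -eo state,pid,wchan:20,comm | awk '$1 ~ /^D/'

That lists anything currently in uninterruptible sleep, along with the kernel wait channel it's sitting in.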
Hope that helps, Miranda
On 2014-03-27, Miranda Hawarden-Ogata hawarden@ifa.hawaii.edu wrote:
On 2014/03/27 12:20, Matt Garman wrote:
Anyone seen anything like this? Any thoughts or ideas?
Something of a shot in the dark, but when we had a server with a high load average where nothing obvious was causing it, it turned out to be multiple df commands hanging on a stale NFS mount.
Wouldn't those show up in D state? The OP mentioned that he didn't see processes hanging on IO.
--keith
On Thu, 27 Mar 2014 17:20:22 -0500 Matt Garman matthew.garman@gmail.com wrote:
Anyone seen anything like this? Any thoughts or ideas?
Post some data.. This public facing? Are you getting sprayed down by packets? Array? Soft/hard? Someone have screens laying around? Write a trap to catch a process list when the loads spike? Look at crontab(s)? User accounts? Malicious shells? Any guest containers around? Possibilities are sort of endless here.
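
Trap could be as dumb as this.. threshold and paths are just examples:

#!/bin/bash
# dumb load-spike trap: snapshot the process table whenever the
# 1-minute load average crosses a threshold (3.0 here, pick your own)
while true; do
    load1=$(cut -d' ' -f1 /proc/loadavg)
    if [ "$(echo "$load1 > 3.0" | bc)" -eq 1 ]; then
        ps -eo pid,stat,pcpu,pmem,wchan:20,args --sort=-pcpu \
            > /var/tmp/loadspike-$(date +%Y%m%d-%H%M%S).txt
    fi
    sleep 10
done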
On Fri, Mar 28, 2014 at 9:01 AM, Mr Queue lists@mrqueue.com wrote:
On Thu, 27 Mar 2014 17:20:22 -0500 Matt Garman matthew.garman@gmail.com wrote:
Anyone seen anything like this? Any thoughts or ideas?
Post some data.. This public facing? Are you getting sprayed down by packets? Array? Soft/hard? Someone have screens laying around? Write a trap to catch a process list when the loads spike? Look at crontab(s)? User accounts? Malicious shells? Any guest containers around? Possibilities are sort of endless here.
Not public facing (no Internet access at all). Linux software RAID-1. No screen or tmux sessions. No guest access of any kind. In fact, only three logged-in users.
I've reviewed crontabs (there are only a couple), and I don't see anything out of the ordinary. Malicious shells or programs: possibly, but I think that is highly unlikely... if someone were going to do something malicious, *this* particular server is not the one to target.
What kind of data would help? I have sar running at a five-second interval. I also did a 24-hour run of dstat at a one-second interval, collecting all the information it could. I have tons of data, but I'm not sure how to "distill" it into a mailing-list-friendly format. A colleague and I reviewed the data and don't see any correlation with other system metrics before, during, or after these load-spike events.
I did a little research on the loadavg number, and my understanding is that it's simply a function of the number of tasks on the system. (There's some fancy stuff thrown in for exponential decay and curve smoothing and all that, but it's still based on the number of system tasks.)
I did a simple run of "top -b > top_output.txt" for a 24-hour period, which captured another one of these events. I haven't had a chance to study it in detail, but I expected the number of tasks to shoot up dramatically around the time of these load spikes. Instead, the number of tasks remained fairly constant: about 200 +/- 5.
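
(For reference, I eyeballed that with a quick histogram over the Tasks: lines, something like:

grep '^Tasks:' top_output.txt | awk '{ print $2 }' | sort -n | uniq -c

i.e., the number of samples at each total-task count.)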
How can the loadavg shoot up (from ~1 to ~20) without a corresponding uptick in number of tasks?
On Fri, Mar 28, 2014 at 09:30:17AM -0500, Matt Garman wrote:
How can the loadavg shoot up (from ~1 to ~20) without a corresponding uptick in number of tasks?
loadavg is based on the number of processes vying for cpu time on the runq; the overall number of processes on the system is not really relevant unless they are all competing for cpu.
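
(If memory serves, roughly every 5 seconds the kernel folds the current count of runnable-plus-uninterruptible tasks into an exponentially damped average, something like

load_1m = load_1m * e^(-5/60) + n * (1 - e^(-5/60))

where n is that instantaneous count -- so a couple hundred sleeping processes contribute nothing, but twenty tasks stuck runnable or in D state for a while will drive the figure right up to ~20.)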
What's the i/o wait on the box when you see load spikes? If the box is i/o bound (indicated by high i/o wait), the load average will spike due to processes blocked on i/o.
John
On Fri, Mar 28, 2014 at 9:37 AM, John R. Dennison jrd@gerdesas.com wrote:
On Fri, Mar 28, 2014 at 09:30:17AM -0500, Matt Garman wrote:
How can the loadavg shoot up (from ~1 to ~20) without a corresponding uptick in number of tasks?
loadavg is based on the number of processes vying for cpu time on the runq; the overall number of processes on the system is not really relevant unless they are all competing for cpu.
Is there a way to see this number of processes in the runq? From the shell or programmatically?
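
(The closest I've found so far is the fourth field of /proc/loadavg -- currently-runnable tasks over total tasks -- and the r and b columns of vmstat, e.g.:

cat /proc/loadavg
vmstat 1 5

but maybe there's a better way.)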
What's the i/o wait on the box when you see load spikes? If the box is i/o bound (indicated by high i/o wait), the load average will spike due to processes blocked on i/o.
I ran "top -b" directed to a file and captured one of these spikes. Here's a sample from the approximate start, peak, and end of the load spike (respectively):
top - 18:40:29 up 14 days, 1:34, 4 users, load average: 0.80, 0.48, 0.29
Tasks: 205 total, 1 running, 204 sleeping, 0 stopped, 0 zombie
Cpu(s): 1.2%us, 4.9%sy, 0.0%ni, 92.1%id, 0.0%wa, 0.1%hi, 1.7%si, 0.0%st

top - 19:16:00 up 14 days, 2:09, 4 users, load average: 19.67, 19.02, 15.75
Tasks: 203 total, 1 running, 202 sleeping, 0 stopped, 0 zombie
Cpu(s): 1.1%us, 4.6%sy, 0.0%ni, 92.3%id, 0.0%wa, 0.2%hi, 1.9%si, 0.0%st

top - 20:20:27 up 14 days, 3:14, 4 users, load average: 0.93, 3.58, 8.69
Tasks: 212 total, 1 running, 211 sleeping, 0 stopped, 0 zombie
Cpu(s): 1.2%us, 4.8%sy, 0.0%ni, 91.7%id, 0.6%wa, 0.1%hi, 1.6%si, 0.0%st
Looks like I collected 17277 total top samples. The max "%wa" over this time was 61.1%, and fewer than 40 of those samples had "%wa" over 10.0. In other words, over many hours, the system had I/O wait over 10% for less than a minute. And note that my load spike lasts almost two hours.
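
For what it's worth, this is roughly how I pulled the %wa numbers out of the top output -- a quick-and-dirty awk pass over the Cpu(s) header lines:

grep '^Cpu(s):' top_output.txt | awk '
    { for (i = 2; i <= NF; i++)
        if ($i ~ /%wa/) {
            sub(/%wa.*/, "", $i)   # strip the "%wa," suffix, leaving the number
            wa = $i + 0; n++
            if (wa > max) max = wa
            if (wa > 10) over++
        } }
    END { printf "samples: %d  max %%wa: %.1f  over 10%%: %d\n", n, max, over + 0 }'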
From: Matt Garman matthew.garman@gmail.com
I did a little research on the loadavg number, and my understanding is that it's simply a function of the number of tasks on the system. (There's some fancy stuff thrown in for exponential decay and curve smoothing and all that, but it's still based on the number of system tasks.)
Any USB device? Each time I access USB disks, load goes through the roof.
JD
On Fri, Mar 28, 2014 at 10:30 AM, John Doe jdmls@yahoo.com wrote:
Any USB device? Each time I access USB disks, load goes through the roof.
Nope, it's a rack server in a secure remote location, with no peripherals attached at all. The only cables attached are power and network.
Am 28.03.2014 um 15:30 schrieb Matt Garman matthew.garman@gmail.com:
On Fri, Mar 28, 2014 at 9:01 AM, Mr Queue lists@mrqueue.com wrote:
On Thu, 27 Mar 2014 17:20:22 -0500 Matt Garman matthew.garman@gmail.com wrote:
Anyone seen anything like this? Any thoughts or ideas?
Post some data.. This public facing? Are you getting sprayed down by packets? Array? Soft/hard? Someone have screens laying around? Write a trap to catch a process list when the loads spike? Look at crontab(s)? User accounts? Malicious shells? Any guest containers around? Possibilities are sort of endless here.
Not public facing (no Internet access at all). Linux software RAID-1. No screen or tmux sessions. No guest access of any kind. In fact, only three logged-in users.
I've reviewed crontabs (there are only a couple), and I don't see anything out of the ordinary. Malicious shells or programs: possibly, but I think that is highly unlikely... if someone were going to do something malicious, *this* particular server is not the one to target.
- update the OS (the current release is well past 5.7)
- partition alignment?
- "heuristic/try and error"-approach: disable all crontabs and check the behavior - any load?
-- LF
On Thu, 27 Mar 2014 17:20:22 -0500 Matt Garman matthew.garman@gmail.com wrote:
Any thoughts or ideas?
Start digging into your array. Perhaps you're starting to lose a drive and it's running daily integrity checks or something, i.e., a disk dropping in and out of the array or the like. /var/log/messages might have some clues.
(not cat, but tac) tac /var/log/messages | less
Don't forget about the crons in /etc/cron*
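
You can also ask the array directly whether a resync or check is in flight (md0 below is just a placeholder for whatever your md device is):

cat /proc/mdstat
cat /sys/block/md0/md/sync_action

If sync_action reports check or resync while the load is climbing, there's your answer.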