I'm trying to figure out what's causing an average system load of 3+ to 5+ on an Intel quad core. The server has 2 KVM guests (assigned 1 core and 2 cores), each lightly loaded (0.1~0.4). Both guests and the host are running 64-bit CentOS 5.6.
Originally I suspected maybe it's i/o but on checking, there is very little i/o wait % as well. Plenty of free disk space available on all physical drives and memory sufficient for the usage with barely any swap in use.
While performance/responsiveness of the host and guests doesn't seem to be affected, I'm still concerned about these odd load figures. I would appreciate it if anybody could suggest what else I should be looking at.
Output of various commands on the host:
Top
===
top - 15:15:39 up 5 days, 8:48, 1 user, load average: 4.76, 4.18, 4.43
Tasks: 210 total, 1 running, 209 sleeping, 0 stopped, 0 zombie
Cpu(s): 4.1%us, 2.5%sy, 0.0%ni, 76.4%id, 17.0%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 3759496k total, 2813076k used, 946420k free, 357300k buffers
Swap: 8193016k total, 5648k used, 8187368k free, 736640k cached
free -m
======
                    total   used   free  shared  buffers  cached
Mem:                 3671   2750    921       0      348     719
-/+ buffers/cache:          1682   1989
Swap:                8000      5   7995
vmstat 3
======
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd    free    buff   cache   si   so    bi     bo     in     cs  us  sy  id  wa  st
 2  0   5648  949072  357096  735788    0    0   105     50     17     22   2   1  95   1   0
 0  0   5648  949428  357104  735788    0    0    44   2682  14142  14697   6   4  84   6   0
 0  0   5648  949304  357104  735788    0    0     1    324  14047  14753   2   1  96   1   0
 1  1   5648  946916  357104  735788    0    0     0    215  14410  14496   2   2  90   6   0
 0  0   5648  946520  357104  735788    0    0    23    267  13703  14664   2   2  91   6   0
sar
===
02:40:01 PM  CPU  %user  %nice  %system  %iowait  %steal  %idle
02:50:01 PM  all   2.17   0.00     2.18     4.30    0.00  91.35
03:00:01 PM  all   2.47   0.00     2.23     3.57    0.00  91.73
03:10:01 PM  all   2.29   0.00     2.07     3.77    0.00  91.87
03:20:01 PM  all   2.63   0.00     2.07     3.28    0.00  92.03
Average:     all   2.39   0.00     1.95     4.77    0.00  90.89
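When reading vmstat output like the sample above, it helps to remember that the first sample line reports averages since boot, so only the later lines reflect current activity. The following is a minimal sketch, not a definitive tool: vmstat_io_summary is a hypothetical helper name, and it assumes the classic column layout shown above (bo in column 10, wa in column 16).

```shell
# Hypothetical helper: summarise vmstat output read from stdin.
# Assumes the classic layout where column 10 is "bo" and column 16 is "wa".
vmstat_io_summary() {
    awk '
        $1 !~ /^[0-9]+$/ { next }   # skip the two header lines
        !seen++          { next }   # skip the first sample (averages since boot)
        { wa += $16; bo += $10; n++ }
        END {
            if (n) printf "samples=%d avg_wa=%.2f avg_bo=%.1f\n", n, wa / n, bo / n
        }
    '
}

# Example use on a live system (takes about 15 seconds):
#   vmstat 3 5 | vmstat_io_summary
```

Fed the five sample lines posted above, this reports an average wa of 4.75 over the last four samples, which is in line with the sar figures rather than the 17% shown in the single top snapshot.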
Hello Emmanuel,
On Wed, 2011-06-08 at 15:26 +0800, Emmanuel Noobadmin wrote:
Originally I suspected maybe it's i/o but on checking, there is very little i/o wait % as well.
Cpu(s): 4.1%us, 2.5%sy, 0.0%ni, 76.4%id, 17.0%wa, 0.0%hi, 0.1%si, 0.0%st
17% i/o wait time seems a significant amount to me. I'm not sure if that's unusual when running multiple VMs, but it's probably worth investigating a bit further.
Regards, Leonard.
On 06/08/11 02:26, Emmanuel Noobadmin wrote:
Cpu(s): 4.1%us, 2.5%sy, 0.0%ni, 76.4%id, 17.0%wa, 0.0%hi, 0.1%si, 0.0%st
02:50:01 PM  all   2.17   0.00     2.18     4.30    0.00  91.35
03:00:01 PM  all   2.47   0.00     2.23     3.57    0.00  91.73
top Cpu(s) line is averaged for all cpus/cores.
to display individual cpus/cores press:
    1
you'll likely see one cpu/core being pegged with iowait.
to identify the offending process within top press:
    fj<enter>
to display the P column (last used CPU).
watch top for a few minutes to see what is using all of the disk io.
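The same per-core breakdown can be approximated without top by comparing two snapshots of /proc/stat. This is only a sketch under stated assumptions: per_cpu_iowait is a hypothetical helper name, the snapshot file names in the usage note are made up, and it relies on the standard /proc/stat field order (user, nice, system, idle, iowait, irq, softirq, steal).

```shell
# Hypothetical helper: per-CPU iowait percentage between two /proc/stat snapshots.
# $1 = earlier snapshot file, $2 = later snapshot file.
per_cpu_iowait() {
    awk '
        /^cpu[0-9]+ / {
            # Sum user..steal (fields 2-9); guest time is already counted in user.
            total = 0
            n = (NF < 9) ? NF : 9
            for (i = 2; i <= n; i++) total += $i
            if (NR == FNR) {            # first file: remember the baseline
                t0[$1] = total; w0[$1] = $6
                next
            }
            dt = total - t0[$1]         # second file: print the delta as a percentage
            if (dt > 0) printf "%s %.1f\n", $1, 100 * ($6 - w0[$1]) / dt
        }
    ' "$1" "$2"
}
```

A possible way to use it: cat /proc/stat > /tmp/stat.a; sleep 5; cat /proc/stat > /tmp/stat.b; per_cpu_iowait /tmp/stat.a /tmp/stat.b. A core that shows a much higher figure than the others is the one to watch in top.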
sar output is averaged over the 10 minute interval. for smaller sar time slices edit cron file: /etc/cron.d/sysstat
disks are often swamped by "two things happening at once"...
  backups
  migrating a VM
  database upgrades
  .rrd average updates
On 6/9/11, Steven Tardy sjt5@its.msstate.edu wrote:
top Cpu(s) line is averaged for all cpus/cores. to display individual cpus/cores press: 1 you'll likely see one cpu/core being pegged with iowait. to identify the offending process within top press: fj<enter> to display the P column(last used CPU). watch top for a few minutes to see what is using all of the disk io.
Thanks for these tips, they really helped narrow down the issue. It became quite clear that cpu 0 was taking up most of the user and sys time, roughly 10x that of the other 3.
Based on the VM memory usage, I think I know which VM it is but I'm going to start pinning it to confirm it's the culprit.
sar output is averaged over the 10 minute interval. for smaller sar time slices edit cron file: /etc/cron.d/sysstat
disks are often swamped by "two things happening at once"... backups migrating a VM database upgrades .rrd average updates
Unfortunately, the VMs are public facing and the offending one has got a relatively popular Wordpress blog as well as relatively high email traffic. So it's likely the result of those two things happening at once. I'm increasing its memory allocation in the hope that more of the Wordpress content gets cached, and will see if that helps.
The odd thing is I set the VM to 512MB but a max of 1.5G assuming that KVM will assign the extra memory as needed but it seems to be stuck at 512MB.
Emmanuel Noobadmin wrote:
On 6/9/11, Steven Tardy sjt5@its.msstate.edu wrote:
<snip>
The odd thing is I set the VM to 512MB but a max of 1.5G assuming that KVM will assign the extra memory as needed but it seems to be stuck at 512MB.
*sigh* Is this a java process? If so, look at the configuration, and see what its mem.max and shm.mem are.
You might also look at this thread, and see if it relates to what you're doing: http://communities.vmware.com/thread/291824
mark "I hate java"
On 6/9/2011 12:02 PM, Emmanuel Noobadmin wrote:
disks are often swamped by "two things happening at once"... backups migrating a VM database upgrades .rrd average updates
Unfortunately, the VMs are public facing and the offending one has got a relatively popular Wordpress blog as well as relatively high email traffic. So it's likely the result of those two things happening at once.
Don't forget the VMs don't isolate contention for the physical disk heads on the host device(s). They just sort of hide your ability to see where that time goes.
On 6/9/2011 1:02 PM, Emmanuel Noobadmin wrote:
On 6/9/11, Steven Tardy sjt5@its.msstate.edu wrote:
top Cpu(s) line is averaged for all cpus/cores. to display individual cpus/cores press: 1 you'll likely see one cpu/core being pegged with iowait. to identify the offending process within top press: fj<enter> to display the P column(last used CPU). watch top for a few minutes to see what is using all of the disk io.
Thanks for these tips, it really helped narrow down the issue. Became quite clear that cpu 0 was taking up most of the user and sys time, somewhere in the 10x compared to the other 3.
Also consider installing "atop", which I find to be a bit more self-explanatory than regular "top".
On 06/09/11 11:52 AM, Thomas Harold wrote:
Also consider installing "atop", which I find to be a bit more self-explanatory than regular "top".
another cool tool is IBM's NMON, works something like TOP but has a lot more types of info you can selectively display, including disk utilization.
On 06/08/2011 12:26 AM, Emmanuel Noobadmin wrote:
I'm trying to figure out what's causing an average system load of 3+ to 5+ on an Intel quad core
watch 'ps axf | awk "{ if ( \$3 !~ /S/ ) { print; } }"'
The processes that aren't sleeping count toward your load. The above command will print non-sleeping processes. If I'm not mistaken, that will tell you what's causing the load regardless of whether it's contention over CPU, IO, or other causes.
You can get similar data from "top" if you tell it to sort by process status ( F, w, Enter ) and then reverse the sort ( R ).
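On Linux, the load average counts processes that are runnable (state R) as well as those in uninterruptible sleep (state D, usually waiting on disk I/O). A minimal sketch that lists and counts only those two states follows; load_contributors is a hypothetical helper name, and it expects "STATE PID COMMAND" lines on stdin, for example from ps -eo state=,pid=,comm=.

```shell
# Hypothetical helper: show processes that count toward the Linux load average.
# stdin: lines of "STATE PID COMMAND", e.g. from: ps -eo state=,pid=,comm=
load_contributors() {
    awk '
        $1 ~ /^[RD]/ {
            n[substr($1, 1, 1)]++   # tally by first state letter
            print                   # show the contributing process
        }
        END { printf "running=%d uninterruptible=%d\n", n["R"], n["D"] }
    '
}

# Example use on a live system (the ps process itself will show up as R):
#   ps -eo state=,pid=,comm= | load_contributors
```

A high uninterruptible count with a low running count points at I/O rather than CPU contention as the source of the load.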