[CentOS-virt] Logrotate/cron and major I/O contention with KVM.

On Thu, Mar 11, 2010 at 11:02:33AM -0800, Mathew S. McCarrell wrote:
> Is anyone else having major I/O peaks due to logrotate or other jobs
> running simultaneously across multiple guests? I have one KVM server
> running CentOS 5.4 with local disk that is seriously suffering as
> most of the guests rotate their syslog at the same time.
> 
> Looking at the KVM server I'm seeing
> 
> 11:00:01 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
> 03:40:01 AM       all      0.07      0.00      2.74      0.93      0.00     96.26
> 03:50:01 AM       all      0.07      0.00      1.17      1.18      0.00     97.58
> 04:00:01 AM       all      0.08      0.00      1.51      0.82      0.00     97.59
> 04:10:02 AM       all      0.53      0.03     15.31     51.61      0.00     32.53
> 04:20:01 AM       all      0.28      0.12      4.12     22.21      0.00     73.27
> 04:30:01 AM       all      0.07      0.00      0.80      1.21      0.00     97.92
> 04:40:01 AM       all      0.07      0.00      2.60      1.81      0.00     95.52
> 04:50:01 AM       all      0.08      0.00      0.79      1.44      0.00     97.69
> 
> On one of the guests running CentOS 4.6 the impact is so bad I get
> DMA timeout errors in the syslog, and occasional kernel panics.
> 
> Mar 11 04:05:04 localhost kernel: hda: dma_timer_expiry: dma status == 0x21
> Mar 11 04:05:14 localhost kernel: hda: DMA timeout error
> Mar 11 04:05:14 localhost kernel: hda: dma timeout error: status=0x50 { DriveReady SeekComplete }
> Mar 11 04:05:14 localhost kernel:
> Mar 11 04:05:14 localhost kernel: ide: failed opcode was: unknown
> Mar 11 04:05:59 localhost kernel: hda: dma_timer_expiry: dma status == 0x21
> Mar 11 04:06:14 localhost kernel: hda: DMA timeout error
> Mar 11 04:06:14 localhost kernel: hda: dma timeout error: status=0x50 { DriveReady SeekComplete }
> 
> One reference I've found is at
>  * http://lonesysadmin.net/linux-virtual-machine-tuning-guide/
> 
> This suggests avoiding running scheduled jobs simultaneously across
> guests, and suggests using a random sleep.

I think this is a pretty good suggestion.
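
A rough sketch of how the sleep could be done, in case it helps. Note this
swaps the random sleep for a delay derived from the guest's hostname, so each
guest gets a stable but different offset; the function name stagger_delay and
the 1800 second window are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print a delay in the range 0..max-1 seconds,
# derived from a checksum of the given name (e.g. the guest's hostname).
stagger_delay() {
    name=$1
    max=$2
    # cksum prints "<checksum> <byte count>" for stdin; keep the checksum.
    sum=$(printf '%s' "$name" | cksum | cut -d' ' -f1)
    echo $(( sum % max ))
}

# Possible use in a guest's /etc/crontab wrapper (left as a comment so
# that sourcing this file does not actually sleep):
#   sleep "$(stagger_delay "$(hostname)" 1800)" && run-parts /etc/cron.daily
```

If a truly random delay is preferred and the jobs run under bash,
sleep $(( RANDOM % 1800 )) should do much the same thing.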

> 
> Does anyone else have suggestions on reducing the impact of
> cron/logrotate?

You might also consider increasing the device timeouts on your block
devices at the guest level:

  echo 120 > /sys/block/sda/device/timeout

and likewise for each other disk.  That, or increase the performance of
your storage :)
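
To apply the timeout to every disk in one go, something like the following
might work. It is only a sketch: the function name set_block_timeouts and the
sd* glob are my assumptions, and it only touches devices that actually expose
a writable device/timeout file. The setting is not expected to survive a
reboot, so it would need to be reapplied at boot (rc.local, for instance).

```shell
#!/bin/sh
# Hypothetical helper: write the given timeout (in seconds) to every
# sd* block device under the sysfs directory that has a writable
# device/timeout file. The second argument defaults to /sys/block.
set_block_timeouts() {
    seconds=$1
    sysroot=${2:-/sys/block}
    for f in "$sysroot"/sd*/device/timeout; do
        # Skips devices without the file, and the unexpanded glob
        # when nothing matches.
        [ -w "$f" ] || continue
        echo "$seconds" > "$f"
    done
}

# Example, as root inside the guest:
#   set_block_timeouts 120
```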

> 
> I ran into this issue as well on a box running Xen with local
> storage.
> 
> My solution was to modify /etc/crontab to run /etc/cron.weekly at
> different times for each guest and for the dom0.  I modified the
> entry on each VM to be 10 minutes after the previous one and have not
> seen any load spikes since then.
> 
> Matt
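
For what it's worth, the 10 minute staggering described above could be
generated rather than edited by hand. A sketch, assuming a base time of 04:22
on Sundays for cron.weekly (check your own /etc/crontab) and a made-up
function name weekly_entry; each guest gets an index 0, 1, 2, ...

```shell
#!/bin/sh
# Hypothetical helper: print an /etc/crontab line that runs cron.weekly
# 10 minutes later for each successive guest index. The 04:22 Sunday
# base time is an assumption. Intended for a handful of guests; it does
# not adjust the day of week if the offset runs past midnight.
weekly_entry() {
    index=$1
    total=$(( 4 * 60 + 22 + index * 10 ))
    hour=$(( (total / 60) % 24 ))
    minute=$(( total % 60 ))
    printf '%02d %d * * 0 root run-parts /etc/cron.weekly\n' "$minute" "$hour"
}

# Example: entry for the fourth guest (index 3)
#   weekly_entry 3
```

The same offset could be applied to the cron.daily entry, since that is
usually where logrotate runs.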

Ray