[CentOS-virt] SATA vs RAID5 vs VMware

Fri Sep 25 04:29:46 UTC 2009
Jeff <jlar310 at gmail.com>

On Thu, Sep 24, 2009 at 8:18 PM, Philip Gwyn <liste at artware.qc.ca> wrote:
> Hello,
>
> I have strange behaviour on a server that I can't get a handle on.  I have a
> reasonably powerful server running VMware server 1.0.4-56528.  It has a RAID5
> built with mdadm on 5 SATA drives.  Masses of RAM and 2 Xeon CPUs.  But it
> stutters.
>
> Example : fire up vi, press i and keep your finger on it.  After filling 2-3
> lines, the display freezes for 2-12 seconds.  Then the characters continue.
> This happens even on the host OS, at the console.
>
> Host system running CentOS 5.2 x86-64:
>
>  CPU : 2x Xeon E5430 @ 2.66GHz
>  RAM : 24GB
>  Mobo : DSBV-DX
>   HD : 5 x SATA ST3750330AS 750GB in RAID5
>
> There are 5 VMs, detailed at http://www.awale.qc.ca/vmware/stj1.txt to make
> this mail shorter.
>
> Seems to me this system should be more than adequate to handle the load.
>
> This is what vmstat on the host looks like when the server is "unhappy" :
>   http://www.awale.qc.ca/vmware/vmstat.txt
> It spends a lot of time in 'wa', but 'bo' and 'bi' are minuscule.
>
> The problem looks like a disk problem.  I am starting to suspect that SATA
> isn't ready for the big time.  I am also starting to dislike RAID5.
>
> Questions :
>
> - Does anyone have a clue how to track down my bottleneck?
>
> - SATA NCQ is limited to 15 queue depth.  Is this per-SATA-port or
>  per-SATA-chip? Or does this question make no sense?
>
> - I realise there are more recent versions of CentOS out.  Are there specific
>  items in the changelogs that would affect my problem?

VMware Server 1.0.x was never supported on RHEL/CentOS 5.x, especially
as early as 1.0.4. Not that it can't be made to work, but it just
wasn't made for newer kernel versions. We run up to 10 guests in
VMware Server 1.0.9 on a single Xeon quad core with the host running
CentOS 4, SATA hardware RAID 1. Admittedly, our guests are pretty low
CPU, low throughput, but it works just fine for us. If your guests are
not really hammering the disk system, then you may be on a wild goose
chase blaming RAID 5.
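
If you want a number to go with the 'wa' column from vmstat, here is a
rough Python sketch (mine, not anything shipped with VMware or CentOS)
that works out the iowait share from two samples of the aggregate "cpu"
line in /proc/stat. The function names are made up for illustration, and
it assumes the usual field order of user, nice, system, idle, iowait,
irq, softirq, steal.

```python
def cpu_times(stat_text):
    """Return the aggregate 'cpu' counters from /proc/stat text as ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(x) for x in fields[1:]]
    raise ValueError("no aggregate cpu line found")


def iowait_percent(before, after):
    """Percentage of elapsed CPU time spent in iowait between two samples."""
    # Only the first eight counters are summed; later "guest" counters are
    # already included in "user" on kernels that report them.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    # Assumed field order: user nice system idle iowait irq softirq steal
    return 100.0 * deltas[4] / total
```

Read /proc/stat twice, a few seconds apart, and pass both results of
cpu_times() to iowait_percent(). If the percentage is high while 'bi'
and 'bo' stay small, my guess would be requests that are slow to
complete rather than a large volume of I/O.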

In my time on the VMware forums, it was always suggested to use single
CPU guests running non-SMP kernels for Server 1.0.x. It might help to
convert the one SMP guest you have. If you can afford some downtime,
reconfigure the host to use compatible CentOS/VMware versions
(4.x/1.0.x or 5.x/2.x respectively). At the very least, get the latest
VMware Server 1.0.9.
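
To double-check which guests are configured with more than one virtual
CPU, something like the following Python sketch could be run over each
guest's .vmx file. It assumes the setting appears as a line of the form
numvcpus = "2" and that a guest without that line has a single CPU;
please verify both against your own .vmx files. The function name is
just for illustration.

```python
import re


def vcpu_count(vmx_text):
    """Return the numvcpus value from .vmx text, or 1 when the key is absent."""
    for line in vmx_text.splitlines():
        # Assumed format: numvcpus = "2" (quotes optional, case-insensitive)
        m = re.match(r'\s*numvcpus\s*=\s*"?(\d+)"?', line, re.IGNORECASE)
        if m:
            return int(m.group(1))
    return 1
```

Any guest that reports more than 1 is a candidate for conversion to a
single CPU and a non-SMP kernel.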

--
Jeff