[CentOS] CentOS 3 - I/O performance with Promise HW RAID

Sat Oct 21 20:13:27 UTC 2006
Bart Schaefer <barton.schaefer at gmail.com>

(Suggestions for other forums in which to post this question are welcome.)

We have CentOS 3.6 x86_64 running on a server with dual 2.2GHz
Opterons and a Promise UltraTrak RM8000 connected via an Adaptec SCSI
card.  We are seeing what seems to be gradual I/O performance
degradation over time; it seems to be OK for up to about 90 days, but
not long after that both CPUs end up continuously spending 50-99% of
their time in "iowait" state when reading/writing the RAID device, and
processes begin to be stuck for minutes at a time in disk wait state,
until finally the server becomes unusable.

A simple reboot, even with a forced fsck, does NOT clear this up, but
a full shutdown followed by power cycling the RAID device and then
rebooting seems to return things to normal.

After doing some research when this most recently happened, we have
used elvtune to lower the read and write latency on /dev/sda4 (which
is the primary filesystem on the RAID) to 128 and 256 respectively.
However, we don't yet know whether this will make any difference, as
it has only been 48 hours since the power cycle and it usually takes
months for the problem to become noticeable.  I'd like to get out ahead
of it this time if I can, so that we either know when to schedule a
power cycle or have some confidence that we won't need to.
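One way to get ahead of the problem might be to log the iowait share at
regular intervals and watch for an upward trend.  The following is only a
minimal sketch, not something we have deployed.  It assumes the kernel
exposes an iowait column as the fifth counter on the aggregate "cpu" line
of /proc/stat (user, nice, system, idle, iowait, ...), and it assumes a
Python 3 interpreter, which the stock CentOS 3 Python is not.  The
function names and the 60-second interval are hypothetical.

```python
import time


def parse_cpu_line(stat_text):
    """Return the counters from the aggregate 'cpu' line of /proc/stat text."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Percentage of elapsed CPU time spent in iowait between two samples.

    Assumes the field order user, nice, system, idle, iowait, ...
    Only the first eight counters are summed, because newer kernels add
    guest columns that are already included in the user counter.
    """
    if len(before) < 5 or len(after) < 5:
        raise ValueError("this kernel does not report an iowait column")
    deltas = [new - old for old, new in zip(before[:8], after[:8])]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def sample(path="/proc/stat"):
    """Read one set of aggregate CPU counters from the given stat file."""
    with open(path) as stat_file:
        return parse_cpu_line(stat_file.read())


def log_once(interval_seconds=60):
    """Take two samples interval_seconds apart and print a timestamped line."""
    first = sample()
    time.sleep(interval_seconds)
    second = sample()
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    print("%s iowait=%.1f%%" % (stamp, iowait_percent(first, second)))
```

Calling log_once() from cron and appending the output to a file would give
a history to compare against the roughly 90-day window described above.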

Any information would be appreciated.

(Below this point is just hardware data in case it is helpful.)

Some data from "lshw":

       description: Motherboard
       product: GT24-B2891
       vendor: TYAN Computer Corp
       physical id: 0
       slot: H1 L1 Cache

The SCSI card:

       description: SCSI storage controller
       product: AIC-7892A U160/m
       vendor: Adaptec
       physical id: 8
       bus info: pci@09:08.0
       logical name: scsi0
       version: 02
       width: 64 bits
       clock: 66MHz
       capabilities: scsi bus_master cap_list scsi-host
       configuration: driver=aic7xxx latency=72 maxlatency=25 mingnt=40
       resources: ioport:3000-30ff iomemory:df300000-df300fff irq:24

/proc/cpuinfo:

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 5
model name      : AMD Opteron(tm) Processor 248
physical id     : 255
siblings        : 1
stepping        : 10
cpu MHz         : 2210.197
cache size      : 1024 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow
bogomips        : 4404.01
TLB size        : 1088 4K pages
clflush size    : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

processor       : 1
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 5
model name      : AMD Opteron(tm) Processor 248
physical id     : 255
siblings        : 1
stepping        : 10
cpu MHz         : 2210.197
cache size      : 1024 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow
bogomips        : 4404.01
TLB size        : 1088 4K pages
clflush size    : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp