(Suggestions for other forums in which to post this question are welcome.)
We have CentOS 3.6 x86_64 running on a server with dual 2.2GHz
Opterons and a Promise UltraTrak RM8000 connected via an Adaptec SCSI
card. We are seeing what appears to be gradual I/O performance
degradation over time; things are fine for up to about 90 days, but
not long after that both CPUs end up continuously spending 50-99% of
their time in "iowait" state when reading/writing the RAID device, and
processes begin to be stuck for minutes at a time in disk wait state,
until finally the server becomes unusable.
A simple reboot, even with a forced fsck, does NOT clear this up, but
a full shutdown followed by power cycling the RAID device and then
rebooting does seem to return things to normal.
After doing some research when this most recently happened, we have
used elvtune to lower the read and write latency on /dev/sda4 (which
is the primary filesystem on the RAID) to 128 and 256 respectively.
However, we don't yet know whether this will make any difference, as
it has only been 48 hours since the power cycle and it usually takes
months for the problem to become noticeable. I'd like to get out ahead
of it this time if I can, so that we either know when to schedule a
power cycle or have some confidence that we won't need to.
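To track the trend in the meantime, one option is to log the iowait
percentage at regular intervals and watch for it creeping upward. The
following is only a minimal sketch, not something we have deployed. It
assumes the aggregate "cpu" line in /proc/stat lists its counters in the
order user, nice, system, idle, iowait (true of kernels that report
iowait at all), and it assumes a reasonably recent Python interpreter,
which may not be what CentOS 3 ships. The function names are
hypothetical.

```python
import time


def parse_cpu_line(stat_text):
    """Return the counters from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Percent of elapsed CPU time spent in iowait between two samples.

    Assumption: column order is user, nice, system, idle, iowait, ...
    so iowait is index 4. Only the first eight columns are summed, so
    the guest columns on newer kernels are not counted twice.
    """
    if len(before) < 5 or len(after) < 5:
        raise ValueError("this kernel does not report an iowait column")
    deltas = [a - b for a, b in zip(after, before)][:8]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def read_sample(path="/proc/stat"):
    """Read one snapshot of the aggregate CPU counters."""
    handle = open(path)
    try:
        return parse_cpu_line(handle.read())
    finally:
        handle.close()


def log_once(interval=60):
    """Take two samples `interval` seconds apart and return one log line."""
    first = read_sample()
    time.sleep(interval)
    second = read_sample()
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    return "%s iowait=%.1f%%" % (stamp, iowait_percent(first, second))
```

Calling log_once() from an hourly cron job and appending the result to a
file would give a record to compare against the roughly 90-day point
where the problem has shown up before.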
Any information would be appreciated.
(Below this point is just hardware data in case it is helpful.)
Some data from "lshw":
description: Motherboard
product: GT24-B2891
vendor: TYAN Computer Corp
physical id: 0
slot: H1 L1 Cache
The SCSI card:
description: SCSI storage controller
product: AIC-7892A U160/m
vendor: Adaptec
physical id: 8
bus info: pci@09:08.0
logical name: scsi0
version: 02
width: 64 bits
clock: 66MHz
capabilities: scsi bus_master cap_list scsi-host
configuration: driver=aic7xxx latency=72 maxlatency=25 mingnt=40
resources: ioport:3000-30ff iomemory:df300000-df300fff irq:24
/proc/cpuinfo:
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 5
model name : AMD Opteron(tm) Processor 248
physical id : 255
siblings : 1
stepping : 10
cpu MHz : 2210.197
cache size : 1024 KB
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm
3dnowext 3dnow
bogomips : 4404.01
TLB size : 1088 4K pages
clflush size : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp
processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 5
model name : AMD Opteron(tm) Processor 248
physical id : 255
siblings : 1
stepping : 10
cpu MHz : 2210.197
cache size : 1024 KB
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm
3dnowext 3dnow
bogomips : 4404.01
TLB size : 1088 4K pages
clflush size : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp