(Suggestions for other forums in which to post this question are welcome.)
We have CentOS 3.6 x86_64 running on a server with dual 2.2GHz Opterons and a Promise UltraTrak RM8000 connected via an Adaptec SCSI card. We are seeing what seems to be gradual I/O performance degradation over time. Things seem OK for up to about 90 days, but not long after that both CPUs end up continuously spending 50-99% of their time in the "iowait" state when reading from or writing to the RAID device. Processes then begin to get stuck in disk wait state for minutes at a time, until finally the server becomes unusable.
A simple reboot, even with a forced fsck, does NOT clear this up, but a full shutdown followed by power cycling the RAID device and then rebooting seems to return things to normal.
After doing some research when this most recently happened, we used elvtune to lower the read and write latency on /dev/sda4 (which is the primary filesystem on the RAID) to 128 and 256 respectively. However, we don't yet know whether this will make any difference, as it has only been 48 hours since the power cycle and it usually takes months for the problem to become noticeable. I'd like to get out ahead of it this time if I can, so that we either know when to schedule a power cycle or have some confidence that we won't need to.
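To get out ahead of it, one option we are considering is logging the iowait share at regular intervals so that a slow upward trend shows up before the server becomes unusable. The following is only a minimal sketch under several assumptions: that the first line of /proc/stat on this kernel is the aggregate "cpu" line with counters in the order user, nice, system, idle, iowait; that a Python 3 interpreter is available on the box (a stock CentOS 3 install would not have one); and the function names are our own invention, not part of any existing tool.

```python
import time


def parse_cpu_line(line):
    """Return the numeric counters from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(x) for x in fields[1:]]


def iowait_percent(before, after, iowait_index=4):
    """Percentage of elapsed CPU time spent in iowait between two samples.

    iowait_index=4 assumes the counter order user, nice, system, idle, iowait.
    """
    if len(before) <= iowait_index or len(after) <= iowait_index:
        raise ValueError("cpu line has no iowait counter at the assumed index")
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[iowait_index] / total


def log_iowait(interval=5.0, path="/proc/stat"):
    """Sample the counters twice, 'interval' seconds apart, and print one log line."""
    def read():
        with open(path) as f:
            return parse_cpu_line(f.readline())

    before = read()
    time.sleep(interval)
    after = read()
    pct = iowait_percent(before, after)
    print("%s iowait=%.1f%%" % (time.strftime("%Y-%m-%d %H:%M:%S"), pct))
    return pct
```

Calling log_iowait() from cron every few minutes and appending the output to a file would give a history to compare against the roughly 90-day window described above.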
Any information would be appreciated.
(Below this point is just hardware data in case it is helpful.)
Some data from "lshw":
description: Motherboard
product: GT24-B2891
vendor: TYAN Computer Corp
physical id: 0
slot: H1 L1 Cache
The SCSI card:
description: SCSI storage controller
product: AIC-7892A U160/m
vendor: Adaptec
physical id: 8
bus info: pci@09:08.0
logical name: scsi0
version: 02
width: 64 bits
clock: 66MHz
capabilities: scsi bus_master cap_list scsi-host
configuration: driver=aic7xxx latency=72 maxlatency=25 mingnt=40
resources: ioport:3000-30ff iomemory:df300000-df300fff irq:24
/proc/cpuinfo:
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 5
model name : AMD Opteron(tm) Processor 248
physical id : 255
siblings : 1
stepping : 10
cpu MHz : 2210.197
cache size : 1024 KB
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow
bogomips : 4404.01
TLB size : 1088 4K pages
clflush size : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp
processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 5
model name : AMD Opteron(tm) Processor 248
physical id : 255
siblings : 1
stepping : 10
cpu MHz : 2210.197
cache size : 1024 KB
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow
bogomips : 4404.01
TLB size : 1088 4K pages
clflush size : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp