On Sep 26, 2011, at 3:11 PM, Benjamin Smith <lists at benjamindsmith.com> wrote:

> I'm trying to figure out why 2 machines have a "hard I/O lock" on the HDD when
> running EL6.
>
> I have 4 identical machines, all were stable with EL5. 2 work great with EL6,
> 2 do not. I've checked motherboard BIOS versions and settings, SAS controller
> BIOS versions and settings, they are the same between the working and
> non-working systems.
>
> When booting a non-working system, it boots straight up to the boot prompt
> (runlevel 3) without issue, and everything works fine. When the machine sits
> idle for a period of time (ranging from 15 minutes or so and up) the HDD
> becomes unreadable/unwritable and the system is useless for any purpose and
> must be hard restarted with a full power cycle - it won't even shut down.
>
> Since nothing is logged, I've had precious little information to diagnose
> with. After several attempts to find out what's going on, I find the following
> emitted to the screen:
>
> mpt2sas0: diag reset: FAILED
> mpt2sas0: diag reset: FAILED
> mpt2sas0: diag reset: FAILED
> end_request: I/O error, dev sda, sector 226972349
> Buffer I/O error, device sda5, logical block 2719747
> sd 0:0:0:0rejecting I/O to offline device
> sd 0:0:0:0rejecting I/O to offline device
> sd 0:0:0:0rejecting I/O to offline device
>
> This is NOT due to a faulty HDD: I've tried new hard disks, SATA/SAS, I've
> swapped hard disks with an identical working unit and verified that the working
> unit remains working and the failing unit continues to fail. I've reformatted
> and re-installed EL6 numerous times with consistent results.
>
> Googling this error returned very little useful information: where should I go
> now? Below, please find outputs of dmesg and lspci. I've compared outputs of
> dmesg between working and nonworking systems, the output of anything with
> "mpt" at the beginning is identical except for different IRQ ports.

Tried upgrading BIOS?
Errors during idle periods might point to C-state or P-state compatibility issues. You could try disabling power management (SpeedStep) in the BIOS and see if that makes a difference.

-Ross
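
Before changing anything in the BIOS, it might be worth checking which idle states the kernel is actually entering on a working box versus a failing one. Below is a rough sketch, not a tested diagnostic: it assumes the kernel exposes the usual cpuidle sysfs directory at /sys/devices/system/cpu/cpu0/cpuidle, and the function name list_idle_states is just something I made up for illustration.

```python
import os

# Assumed default location of the cpuidle sysfs entries for CPU 0.
DEFAULT_CPUIDLE_DIR = "/sys/devices/system/cpu/cpu0/cpuidle"


def _read_field(path):
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except (IOError, OSError):
        return None


def list_idle_states(cpuidle_dir=DEFAULT_CPUIDLE_DIR):
    """Hypothetical helper: return [(state_dir, name, usage), ...].

    Each tuple describes one stateN directory under cpuidle_dir, ordered by N.
    Returns an empty list if the directory does not exist.
    """
    if not os.path.isdir(cpuidle_dir):
        return []
    # Keep only entries of the form "state<number>".
    entries = [e for e in os.listdir(cpuidle_dir)
               if e.startswith("state") and e[5:].isdigit()]
    entries.sort(key=lambda e: int(e[5:]))  # numeric order, so state10 follows state9
    states = []
    for entry in entries:
        base = os.path.join(cpuidle_dir, entry)
        states.append((entry,
                       _read_field(os.path.join(base, "name")),
                       _read_field(os.path.join(base, "usage"))))
    return states


if __name__ == "__main__":
    for state, name, usage in list_idle_states():
        print("%s: name=%s usage=%s" % (state, name, usage))
```

If the failing machines show deeper states being used and the working ones do not, that would support the power management theory. Booting with kernel parameters such as processor.max_cstate=1 (and intel_idle.max_cstate=0 if the intel_idle driver is in use) may be another way to test the same idea without touching the BIOS.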