[CentOS] Hard I/O lockup with EL6

Mon Sep 26 22:52:29 UTC 2011
Ross Walker <rswwalker at gmail.com>

On Sep 26, 2011, at 3:11 PM, Benjamin Smith <lists at benjamindsmith.com> wrote:

> I'm trying to figure out why 2 machines have a "hard I/O lock" on the HDD when 
> running EL6. 
> 
> I have 4 identical machines, all were stable with EL5. 2 work great with EL6, 
> 2 do not. I've checked motherboard BIOS versions and settings, and SAS 
> controller BIOS versions and settings; they are the same between the working 
> and non-working systems. 
> 
> When booting a non-working system, it boots straight up to the boot prompt 
> (runlevel 3) without issue, and everything works fine. When the machine sits 
> idle for a period of time (ranging from 15 minutes or so and up) the HDD 
> becomes unreadable/unwritable and the system is useless for any purpose and 
> must be hard restarted with a full power cycle - it won't even shut down. 
> 
> Since nothing is logged, I've had precious little information to diagnose 
> with. After several attempts to find out what's going on, I find the following 
> emitted to the screen: 
> 
> mpt2sas0: diag reset: FAILED 
> mpt2sas0: diag reset: FAILED 
> mpt2sas0: diag reset: FAILED 
> end_request: I/O error, dev sda, sector 226972349
> Buffer I/O error, device sda5, logical block 2719747
> sd 0:0:0:0rejecting I/O to offline device 
> sd 0:0:0:0rejecting I/O to offline device 
> sd 0:0:0:0rejecting I/O to offline device 
> 
> This is NOT due to a faulty HDD: I've tried new hard disks, SATA/SAS, I've 
> swapped hard disks with an identical working unit and verified that the working 
> unit remains working and the failing unit continues to fail. I've reformatted 
> and re-installed EL6 numerous times with consistent results. 
> 
> Googling this error returned very little useful information: where should I go 
> now? Below, please find outputs of dmesg and lspci. I've compared outputs of 
> dmesg between working and non-working systems; the output of anything with 
> "mpt" at the beginning is identical except for different IRQ numbers. 

Have you tried upgrading the BIOS?

Errors that only show up after idle periods might point to C-state or P-state compatibility issues.

You could try disabling power management (SpeedStep) in the BIOS and see if that makes a difference.
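
To compare which C-states the kernel actually uses on a working box versus a failing one, something like the sketch below might help. It is only an illustration and makes some assumptions: that the kernel exposes the cpuidle sysfs directory under /sys/devices/system/cpu/cpu0/cpuidle, and that a Python 3 interpreter is available (the stock EL6 system Python is 2.6, which will not run this). The function name list_idle_states is made up for the example.

```python
from pathlib import Path


def list_idle_states(cpu_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return (name, usage, time_us) for each cpuidle state, ordered by state index.

    cpu_dir is assumed to be the cpuidle sysfs directory of one CPU.
    An empty list is returned if that directory does not exist.
    """
    base = Path(cpu_dir)
    if not base.is_dir():
        return []

    states = []
    # Sort numerically so that state10 comes after state2.
    for state in sorted(base.glob("state[0-9]*"), key=lambda p: int(p.name[5:])):
        name = (state / "name").read_text().strip()
        usage = int((state / "usage").read_text())    # times the state was entered
        time_us = int((state / "time").read_text())   # total residency in microseconds
        states.append((name, usage, time_us))
    return states


if __name__ == "__main__":
    for name, usage, time_us in list_idle_states():
        print(f"{name}: entered {usage} times, {time_us} us total")
```

If the failing machines spend time in deeper states than the working ones, it may also be worth booting with the kernel parameters processor.max_cstate=1 and intel_idle.max_cstate=0 as a test, to see whether the lockup goes away.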

-Ross