[CentOS] [CentOS-devel] disk i/o stalls with mptsas since upgrade to centos 5.4

On Mar 16, 2010, at 3:43 AM, Lennert Buytenhek  
<buytenh at wantstofly.org> wrote:

> Hi!
>
> On two different machines, I've been experiencing disk I/O stalls after
> upgrading to the CentOS 5.4 kernel.  Both machines have an LSI 1068E
> MPT SAS (mptsas) controller connected to a Chenbro CK13601 36-port SAS
> expander, with one machine having 16 1T WD disks hooked up to it, and
> the other having a mix of about 20 WD/Seagate/Samsung/Hitachi 1T and 2T
> disks.
>
> When there's a disk I/O stall, all reads and writes to any disk behind
> the SAS controller/expander just hang for a while (typically for almost
> exactly eight seconds), so not just the I/O to one particular disk or a
> subset of the disks.  The disks on other (on-board SATA) controllers
> still pass I/O requests when the SAS I/O stalls.
>
> I hacked up the attached (dirty) perl script to demonstrate this effect
> -- it will read /proc/diskstats in a tight loop, and keep track of
> which request entered the request queue when, and when it completed, and
> it will WTF if a request took more than a second.  (The same thing can
> probably be done with blktrace, but I was lazy.)  New requests get
> submitted, but the pending ones fail to complete for a while, and then
> they all complete at once.
>
> This happens on kernel-2.6.18-164.11.1.el5, while reverting to the
> latest CentOS 5.3 kernel (kernel-2.6.18-128.7.1.el5) makes the issue go
> away again, i.e. no more stalls.
>
> It doesn't seem to matter whether the I/O load is high or not -- the
> stalls happen even under almost no load at all.
>
> Before I dig into this further, has anyone experienced anything similar?
> A quick google search didn't come up with much.

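The polling approach described in the quoted message can be sketched
roughly as follows.  This is not the attached Perl script, which is not
reproduced here, and it is coarser: instead of tracking individual
requests it flags a device whose completed-I/O count has not moved for
more than a threshold while requests are still in flight.  The names
parse_diskstats, StallDetector and watch are made up for illustration,
and the field positions assume the 14-column whole-disk layout of
/proc/diskstats.

```python
import time

def parse_diskstats(text):
    """Return {device: (ios_completed, ios_in_flight)} for whole-disk lines."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # short lines (e.g. partitions on 2.6.18) lack an in-flight count
        name = fields[2]
        completed = int(fields[3]) + int(fields[7])  # reads + writes completed
        in_flight = int(fields[11])                  # I/Os currently in progress
        stats[name] = (completed, in_flight)
    return stats

class StallDetector:
    """Hypothetical helper: remembers when each device last made progress."""

    def __init__(self, threshold=1.0):
        self.threshold = threshold
        self.last_progress = {}  # device -> (completed count, timestamp)

    def update(self, stats, now):
        stalled = []
        for dev, (completed, in_flight) in stats.items():
            prev = self.last_progress.get(dev)
            # Progress, an idle queue, or a new device resets the clock.
            if prev is None or completed != prev[0] or in_flight == 0:
                self.last_progress[dev] = (completed, now)
                continue
            if now - prev[1] > self.threshold:
                stalled.append((dev, now - prev[1]))
        return sorted(stalled)

def watch(path="/proc/diskstats", interval=0.05):
    """Poll the stats file and report devices that appear stalled."""
    detector = StallDetector()
    while True:
        with open(path) as f:
            stats = parse_diskstats(f.read())
        for dev, secs in detector.update(stats, time.monotonic()):
            print("%s: no completions for %.1fs with requests in flight"
                  % (dev, secs))
        time.sleep(interval)
```

Calling watch() on the affected machine would print a line for every
device behind the controller during a stall, which matches the "all
disks at once" symptom described above if the sketch's assumptions hold.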
I would run iostat -x and check whether a disk or group of disks
shows abnormal service times and/or utilization.
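
For reference, the service time and utilization figures can be
approximated directly from two /proc/diskstats samples.  The sketch
below is an assumption-laden illustration, not iostat's source: the
function names are hypothetical, the field positions assume the
14-column whole-disk layout, and the formulas are the usual deltas
(time spent on I/O divided by I/Os completed, and busy time divided by
the sampling interval).

```python
def parse_line(line):
    """Pick the counters needed for the metrics out of one diskstats line."""
    f = line.split()
    return {
        "name": f[2],
        "ios": int(f[3]) + int(f[7]),       # reads + writes completed
        "wait_ms": int(f[6]) + int(f[10]),  # ms spent reading + writing
        "busy_ms": int(f[12]),              # ms the device had I/O in progress
    }

def extended_stats(before, after, interval_ms):
    """Approximate await, service time and utilization between two samples."""
    d_ios = after["ios"] - before["ios"]
    d_wait = after["wait_ms"] - before["wait_ms"]
    d_busy = after["busy_ms"] - before["busy_ms"]
    return {
        "await_ms": d_wait / d_ios if d_ios else 0.0,
        "svctm_ms": d_busy / d_ios if d_ios else 0.0,
        "util_pct": 100.0 * d_busy / interval_ms,
    }
```

A single disk whose await or utilization is far above its neighbours
would point at that disk; every disk behind the expander spiking
together would point at the controller, expander or driver instead.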

Are there any errors in the logs?

How are the disks configured? Software RAID?

Is the adapter's firmware at the latest revision?

Was the .128 kernel running stock drivers? Is the .164 kernel running
stock drivers? (Maybe weak-updates from the .128 kernel?)

Which I/O scheduler is this? The default CFQ?
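
The active scheduler for a disk is shown in brackets in
/sys/block/<dev>/queue/scheduler.  A minimal sketch for reading it,
with hypothetical function names and assuming that sysfs layout:

```python
import re

def active_scheduler(text):
    """Return the bracketed entry, e.g. 'cfq' from 'noop deadline [cfq]'."""
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else None

def scheduler_for(dev, sysfs="/sys/block"):
    """Read the scheduler file for one block device, e.g. dev='sda'."""
    with open("%s/%s/queue/scheduler" % (sysfs, dev)) as f:
        return active_scheduler(f.read())
```

Running scheduler_for("sda") under both kernels would show whether the
scheduler changed along with the upgrade.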

I would move this discussion to 'CentOS Users' as that is the more  
appropriate list for this.

-Ross