[CentOS] rejecting I/O to offline device (PERC woes)

Mon Apr 21 21:34:13 UTC 2008
Curtis H. Wilbar Jr. <curtis+centos at cmarket.com>

Haven't gotten any tips on a solution to the problem below.
It happened again this weekend.

My next test steps (order not determined):

1. Downgrade to CentOS 4
2. Swap out PERC controller with a spare

I have never had a problem with the PERC4/DC controllers on our
other machines (RHEL3/4, CentOS 4).  Although, I've no other
machine that has 5 300G Fujitsu SCSI drives either.

Any suggestions on the below, or which order on the above to
try ?

Thanks,

-- Curt

-------------------------------

I have a 6650 with a PERC4/DC running CentOS5.

After 1 to 3 weeks of operation (running VMWare Server) it
'dies' (raid array gets taken offline) and you get rejecting
I/O to offline device.

When this system was setup late last year, the 6650 was
given all the latest firmware along with the PERC4/DC.

using linttylog, the last entries from when the system must
have 'checked out' last night, I see the data attached below.

Some time back I thought I had cured this problem by adding
noapic to the kernel boot parameters in boot.conf.  It had
gone away for a long time... but is now back.

according to lintty, it reports controller firmware is:

T0: Firmware version 352D build on Mar 19 2007 at 17:43:23
T0: MegaRAID Series 518 firmware version 352D

using strings tty.log | grep 'MedErr on pd' | cut -c17- | sort  | uniq
-c | sort -n I see:

    163 REC:log MedErr on pd[1] #retries=0
    165 REC:log MedErr on pd[4] #retries=0
    168 REC:log MedErr on pd[2] #retries=0

If I am to believe this, Patrol read is finding media errors on
physical drives 1, 2, and 4 ! ?

These drives are not even a year old, and to have an almost even
distribution of errors across 3 drives seems far fetched (unless
patrol read is reading past the end of drive ?, but then it would
be doing that with all 5 drives).

Is the PERC busted ?  driver issue ?

I'm running CentOS 5 with kernel 2.6.18-53.1.13.el5PAE

from dmesg, megaraid related driver versions:

megaraid cmm: 2.20.2.7 (Release Date: Sun Jul 16 00:01:03 EST 2006)
megaraid: 2.20.5.1 (Release Date: Thu Nov 16 15:32:35 EST 2006)

Anyone seen this behavior before ?  Anyone have a solution ?
We have several Dells in a hosting environment with PERC4/DC
running RHEL3, RHEL4.X, and CentOS4.X.  We have not had this
issue on any of them (though they do not have 5 300G Fujitsu
SCSI drives in a RAID 5 config either (as this one does)).

Hoping someone can shed some light on this... so far I keep
coming up short on finding a solution.

Here is the full content of the last lines recorded in the PERC
as pulled by linttylog:

03/24 21:43:41: Next PR scheduled to start at 03/25  0:00:00
03/25  3:47:41: REC:log MedErr on pd[4] #retries=0
03/25  3:47:41: LogSense: pd=04, cdb=2f 00 14 30 03 12 00 ff ff 00 
03/25  3:47:41:           sense=f0 00 03 14 30 14 e6 28 00 00 00 00 11
01 00 00 00 3f 
03/25  3:47:41: REC: MedErr on LD[4] BadLba=143014e6
03/25  3:47:41: prCallback: Medium Error on pd=04, StartLba=14300312,
ErrLba=143014e6
03/25  3:47:42: prRecQueue: starting pd=04 recovery - blocking host
commands
03/25  3:47:42: prRecGo: Ready to attempt recovery errLBA=143014e6 on
pd=04
03/25  3:47:42: prGetLDInfo: MediaErr in ld=0, span=0, arm=4
03/25  3:47:42: prRecGo: dataErr found on ld 0 span 0 arm 4
03/25  3:47:42: prRecGo: data NOT in cache; cacheLn=ffffffff,
row=143014, stripe=50c052, refBlk=e6, type=1
03/25  3:47:42: prRecGo: R5-get cacheLn=133, c_ptr=d0dc1045 mem=d0daf9a4
& setup cInx=1db c=d0d26ee6
03/25  3:47:42: RtnFrmPrcRcyRd for arm=1 mem=d0db23f1 stripe=143014
type=2 i=0 status=ffff
03/25  3:47:42: RtnFrmPrcRcyRd for arm=2 mem=d0db2402 stripe=50c050
type=1 i=1 status=ffff
03/25  3:47:42: RtnFrmPrcRcyRd for arm=3 mem=d0db2413 stripe=50c051
type=1 i=2 status=ffff
03/25  3:47:42: RtnFrmPrcRcyRd for arm=0 mem=d0db23e0 stripe=50c053
type=1 i=4 status=ffff
03/25  3:47:43: REC:log MedErr on pd[1] #retries=0
03/25  3:47:43: LogSense: pd=01, cdb=28 00 14 30 14 e6 00 00 01 00 
03/25  3:47:43:           sense=f0 00 03 14 30 14 e6 28 00 00 00 00 11
01 00 00 00 3f 
03/25  3:47:43: REC: MedErr on LD[1] BadLba=143014e6
03/25  3:47:43: REC:log MedErr on pd[2] #retries=0
03/25  3:47:43: LogSense: pd=02, cdb=28 00 14 30 14 e6 00 00 01 00 
03/25  3:47:43:           sense=f0 00 03 14 30 14 e6 28 00 00 00 00 11
01 00 00 00 3f 
03/25  3:47:43: REC: MedErr on LD[2] BadLba=143014e6
03/25  3:47:43: org resgnBlk=e6
03/25  3:47:43: org resgnBlk=ffff
03/25  3:47:43: PatrolReadStart: Log MedErr pd=04 errLBA=143014e6 in
NVRAM_EVENT_LOG
03/25  3:48:06: REC:log MedErr on pd[2] #retries=0
03/25  3:48:06: LogSense: pd=02, cdb=2f 00 14 2f a7 92 00 ff ff 00 
03/25  3:48:06:           sense=f0 00 03 14 30 14 e6 28 00 00 00 00 11
01 00 00 00 3f 
03/25  3:48:06: REC: MedErr on LD[2] BadLba=143014e6
03/25  3:48:06: prCallback: Medium Error on pd=02, StartLba=142fa792,
ErrLba=143014e6
03/25  3:48:06: prRecQueue: starting pd=02 recovery - blocking host
commands
03/25  3:48:06: prRecGo: Ready to attempt recovery errLBA=143014e6 on
pd=02
03/25  3:48:06: prGetLDInfo: MediaErr in ld=0, span=0, arm=2
03/25  3:48:06: prRecGo: dataErr found on ld 0 span 0 arm 2
03/25  3:48:06: prRecGo: data NOT in cache; cacheLn=24a, row=143014,
stripe=50c050, refBlk=e6, type=1
03/25  3:48:06: prRecGo: R5-get cacheLn=24a, c_ptr=d0dcc646 mem=d0d945e2
& setup cInx=10e c=d0d1f9fc
03/25  3:48:06: RtnFrmPrcRcyRd for arm=1 mem=d0d98e11 stripe=143014
type=2 i=0 status=ffff
03/25  3:48:06: RtnFrmPrcRcyRd for arm=3 mem=d0d98e33 stripe=50c051
type=1 i=2 status=ffff
03/25  3:48:06: RtnFrmPrcRcyRd for arm=4 mem=d0d98e44 stripe=50c052
type=1 i=3 status=ffff
03/25  3:48:06: RtnFrmPrcRcyRd for arm=0 mem=d0d98e00 stripe=50c053
type=1 i=4 status=ffff
03/25  3:48:07: REC:log MedErr on pd[1] #retries=0
03/25  3:48:07: LogSense: pd=01, cdb=28 00 14 30 14 e6 00 00 01 00 
03/25  3:48:07:           sense=f0 00 03 14 30 14 e6 28 00 00 00 00 11
01 00 00 00 3f 
03/25  3:48:07: REC: MedErr on LD[1] BadLba=143014e6
03/25  3:48:07: REC:log MedErr on pd[4] #retries=0
03/25  3:48:07: LogSense: pd=04, cdb=28 00 14 30 14 e6 00 00 01 00 
03/25  3:48:07:           sense=f0 00 03 14 30 14 e6 28 00 00 00 00 11
01 00 00 00 3f 
03/25  3:48:07: REC: MedErr on LD[4] BadLba=143014e6
03/25  3:48:07: org resgnBlk=e6
03/25  3:48:07: org resgnBlk=ffff
03/25  3:48:07: PatrolReadStart: Log MedErr pd=02 errLBA=143014e6 in
NVRAM_EVENT_LOG
03/25  3:48:09: REC:log MedErr on pd[1] #retries=0
03/25  3:48:09: LogSense: pd=01, cdb=2f 00 14 30 00 d2 00 ff ff 00 
03/25  3:48:09:           sense=f0 00 03 14 30 14 e6 28 00 00 00 00 11
01 00 00 00 3f 
03/25  3:48:09: REC: MedErr on LD[1] BadLba=143014e6
03/25  3:48:09: prCallback: Medium Error on pd=01, StartLba=143000d2,
ErrLba=143014e6
03/25  3:48:10: prRecQueue: starting pd=01 recovery - blocking host
commands
03/25  3:48:10: prRecGo: Ready to attempt recovery errLBA=143014e6 on
pd=01
03/25  3:48:10: prGetLDInfo: MediaErr in ld=0, span=0, arm=1
03/25  3:48:10: prRecGo: dataErr found on ld 0 span 0 arm 1
03/25  3:48:10: prRecGo: data NOT in cache; cacheLn=99, row=143014,
stripe=143014, refBlk=e6, type=2
03/25  3:48:10: prRecGo: R5-get cacheLn=99, c_ptr=d0dbabcf mem=d0d92c51
& setup cInx=102 c=d0d1f324
03/25  3:48:10: RtnFrmPrcRcyRd for arm=2 mem=d0d9d002 stripe=50c050
type=1 i=1 status=ffff
03/25  3:48:10: RtnFrmPrcRcyRd for arm=3 mem=d0d9d013 stripe=50c051
type=1 i=2 status=ffff
03/25  3:48:10: RtnFrmPrcRcyRd for arm=4 mem=d0d9d024 stripe=50c052
type=1 i=3 status=ffff
03/25  3:48:10: RtnFrmPrcRcyRd for arm=0 mem=d0d9cfe0 stripe=50c053
type=1 i=4 status=ffff
03/25  3:48:11: REC:log MedErr on pd[2] #retries=0
03/25  3:48:11: LogSense: pd=02, cdb=28 00 14 30 14 e6 00 00 01 00 
03/25  3:48:11:           sense=f0 00 03 14 30 14 e6 28 00 00 00 00 11
01 00 00 00 3f 
03/25  3:48:11: REC: MedErr on LD[2] BadLba=143014e6
03/25  3:48:11: REC:log MedErr on pd[4] #retries=0
03/25  3:48:11: LogSense: pd=04, cdb=28 00 14 30 14 e6 00 00 01 00 
03/25  3:48:11:           sense=f0 00 03 14 30 14 e6 28 00 00 00 00 11
01 00 00 00 3f 
03/25  3:48:11: REC: MedErr on LD[4] BadLba=143014e6
03/25  3:48:11: org resgnBlk=e6
03/25  3:48:11: org resgnBlk=ffff
03/25  3:48:11: PatrolReadStart: Log MedErr pd=01 errLBA=143014e6 in
NVRAM_EVENT_LOG

Any help is greatly appreciated,

-- Curt