[CentOS] 3ware disk failure -> hang

Fri Jan 6 17:39:38 UTC 2006
Bryan J. Smith <thebs413 at earthlink.net>

Joshua Baker-LePain <jlb17 at duke.edu> wrote:
> I'm running an all software RAID50 ...
> This morning I came in to find the system hung.
> Turns out a disk went overnight on one of the 7500s,
> and rather than a graceful failover I got this:
> Jan  6 01:03:58 $SERVER kernel: 3w-xxxx: scsi2: Command
> failed: status = 0xc7,flags = 0x40, unit #3.
> Jan  6 01:04:02 $SERVER kernel: 3w-xxxx: scsi2: AEN: ERROR:
> Drive error: Port #3.
> Jan  6 01:04:10 $SERVER 3w-xxxx[2781]: ERROR: Drive error
> encountered on port 3 on controller ID:2. Check cables and
> drives for media errors. (0xa)

Yes, the drive failed.
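
If you want to catch these events from syslog, the messages you quoted can be matched with a short script.  This is only a sketch: the patterns are inferred from the three log lines in this thread, not from the 3w-xxxx driver source, and the name find_drive_errors is made up for illustration.

```python
import re

# Patterns inferred from the log lines quoted in this thread (an assumption,
# not taken from the 3w-xxxx driver source).
KERNEL_AEN = re.compile(r"3w-xxxx: scsi(\d+): AEN: ERROR: Drive error: Port #(\d+)")
DAEMON_ERR = re.compile(r"Drive error encountered on port (\d+) on controller ID:(\d+)")

def find_drive_errors(lines):
    """Return a sorted list of (source, host_or_controller, port) tuples."""
    hits = set()
    for line in lines:
        m = KERNEL_AEN.search(line)
        if m:
            # Kernel message: SCSI host number first, then the port.
            hits.add(("kernel", int(m.group(1)), int(m.group(2))))
        m = DAEMON_ERR.search(line)
        if m:
            # Daemon message: port first, then the controller ID.
            hits.add(("daemon", int(m.group(2)), int(m.group(1))))
    return sorted(hits)
```

Feed it the unwrapped lines from /var/log/messages (or wherever your syslog lands); the "Command failed" line is deliberately ignored because it names a unit, not a port.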

Had you used the 3Ware card's intelligent hardware RAID, it would
have hidden the drive disconnect from the system.  You'd see
a log entry for the failure, and the array would be reported in a
"degraded" state.

Instead, you're using software RAID, and it's up to the
kernel not to panic or hang when a disk is no longer
available.  The problem isn't the 3Ware controller; it's the
software RAID logic in the kernel.
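
If you stay with software RAID, about the best you can do from userspace is watch the md status yourself.  A rough sketch, assuming the usual /proc/mdstat layout where each array's status line ends in a field like [UUU] or [_UU]; the function name degraded_md_arrays is hypothetical, and none of this will stop a hang, it only tells you about a degraded array while the box is still up.

```python
import re

def degraded_md_arrays(mdstat_text):
    """Return names of md arrays whose member map shows a missing disk."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            # Start of a new array stanza, e.g. "md0 : active raid5 ..."
            current = header.group(1)
            continue
        # Member map such as [UUU] (healthy) or [_UU] (one member missing).
        status = re.search(r"\[([U_]+)\]", line)
        if current and status:
            if "_" in status.group(1):
                degraded.append(current)
            current = None
    return degraded
```

On a live system you would call it as degraded_md_arrays(open("/proc/mdstat").read()) from cron and mail yourself the result.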

> Any ideas as to what I can do to prevent this in the
> future?

Use the 3Ware card as it is intended: as a hardware RAID card.

> Having the system hang every time a disk dies is, well,
> less than optimal.

No joke.  It wasn't until kernel 2.6 that hotplug
support was even offered, and it still does _not_ work as
advertised.

It's stuff like this that makes me want to strangle most
advocates of using 3Ware cards with software RAID.  There are
countless issues like this -- far more than the alleged
"hardware lock-in" negative of using hardware RAID.


-- 
Bryan J. Smith     Professional, Technical Annoyance
b.j.smith at ieee.org      http://thebs413.blogspot.com
----------------------------------------------------
*** Speed doesn't kill, difference in speed does ***