Joshua Baker-LePain <jlb17@duke.edu> wrote:
I'm running an all software RAID50 ... This morning I came in to find the system hung. Turns out a disk went overnight on one of the 7500s, and rather than a graceful failover I got this:

Jan 6 01:03:58 $SERVER kernel: 3w-xxxx: scsi2: Command failed: status = 0xc7,flags = 0x40, unit #3.
Jan 6 01:04:02 $SERVER kernel: 3w-xxxx: scsi2: AEN: ERROR: Drive error: Port #3.
Jan 6 01:04:10 $SERVER 3w-xxxx[2781]: ERROR: Drive error encountered on port 3 on controller ID:2. Check cables and drives for media errors. (0xa)
Yes, the drive failed.
Had you used the 3Ware's intelligent hardware RAID, it would have hidden the drive disconnect from the system. You'd see a log entry for the failure, and a note that the array was in a "degraded" state.
Instead, you're using software RAID, and it's up to the kernel not to panic when a disk is no longer available. The problem isn't the 3Ware controller, it's the software RAID logic in the kernel.
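For what it's worth, with software RAID the kernel reports array state in /proc/mdstat, where a status like "[4/3] [U_UU]" means four members expected and only three active. Below is a minimal Python sketch of a check for that. The function names and the sample text (including its device names and block counts) are made up for illustration, and it assumes the usual md status line layout. It only tells you that an array is short a member on a system that is still running; it does nothing about the hang itself.

```python
import re

# An array header line looks like "md0 : active raid5 sdc1[2] sdb1[1] sda1[0]".
ARRAY_RE = re.compile(r"^(md\d+)\s*:")
# The member summary looks like "[4/3] [U_UU]":
# 4 members expected, 3 active, "_" marks the missing one.
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return {array_name: (expected, active)} for arrays missing members."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected:
                degraded[current] = (expected, active)
            current = None
    return degraded


def read_mdstat(path="/proc/mdstat"):
    """Read the kernel's md status file (Linux only)."""
    with open(path) as f:
        return f.read()


# Hypothetical sample for illustration; not taken from the system in this thread.
SAMPLE = """\
Personalities : [raid0] [raid5]
md0 : active raid5 sdd1[3] sdc1[2] sda1[0]
      468872704 blocks level 5, 64k chunk, algorithm 2 [4/3] [U_UU]

md1 : active raid5 sdh1[3] sdg1[2] sdf1[1] sde1[0]
      468872704 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

unused devices: <none>
"""

if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))
```

On a live box you would pass read_mdstat() instead of SAMPLE, for example from a cron job that mails you when the result is non-empty.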
Any ideas as to what I can do to prevent this in the future?
Use the 3Ware card as it was intended: as a hardware RAID card.
Having the system hang every time a disk dies is, well, less than optimal.
No joke. It wasn't until kernel 2.6 that hotplug support was even offered, and it still does _not_ work as advertised.
It's stuff like this that makes me want to strangle most advocates of using 3Ware cards with software RAID. There are countless issues like this -- far more than the alleged "hardware lock-in" negative of using hardware RAID.