On Fri, 2006-01-06 at 16:24, Bryan J. Smith wrote:
>> Or at least the typical hardware/driver errors aren't fatal.
> I think you, and most software RAID users, continue to miss the _root_ cause. If you yank a drive out of a system, one that is being used _actively_, you are going to get a kernel panic. I've seen it on ATA and SCSI. It's _not_ a driver issue. It's the fact that you've lost a resource.
> The MD code does _not_ handle this. You have to tie into the hotplug system for 2.6 to hide the device's status from the MD code.
> Now maybe some SCSI drivers handle it differently. But it is _not_ a driver issue.
I don't understand this distinction. The kernel calls the driver, which talks to the controller. There should be a timeout around this, and the controller's response or the timeout should be fielded by the driver. How can it not be a driver issue, unless the controller actually locks the PC bus (which may be the case with the motherboard IDE controllers; they generally won't boot with a bad drive either)? You don't want to hide the status from the MD code. You want the md driver to kick the device out when it has problems.
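
To make the behaviour I mean concrete, here is a toy sketch in Python. It is not the md code and says nothing about how any real driver is written; the names (Member, Mirror, DiskError, the "pulled" flag) are all made up for illustration. The assumption is simply that the driver turns a dead or missing drive into an error (after a timeout), and the layer above marks that member faulty and carries on with the rest:

```python
class DiskError(Exception):
    """Raised by the (simulated) driver when a request fails or times out."""


class Member:
    """Hypothetical stand-in for one drive behind a driver."""

    def __init__(self, name, blocks):
        self.name = name
        self.blocks = blocks
        self.pulled = False  # simulates a drive yanked out of the system
        self.faulty = False  # set by the array layer, not by the driver

    def read(self, n):
        # A real driver would wait on the controller with a timeout.
        # Here a pulled drive just reports the failure upward.
        if self.pulled:
            raise DiskError(f"{self.name}: timeout reading block {n}")
        return self.blocks[n]


class Mirror:
    """Hypothetical RAID1-like layer sitting above the drivers."""

    def __init__(self, members):
        self.members = members

    def read(self, n):
        for m in self.members:
            if m.faulty:
                continue
            try:
                return m.read(n)
            except DiskError:
                # Kick the member out and keep the array running.
                m.faulty = True
        raise DiskError("no working members left")


a = Member("hda", ["x", "y"])
b = Member("hdc", ["x", "y"])
md0 = Mirror([a, b])

a.pulled = True  # yank the first drive while the array is in use
print(md0.read(1), a.faulty, b.faulty)
```

The point of the sketch is only that the error is reported upward and acted on there, rather than being hidden from the array layer. Whether a given controller lets the driver get that far is a separate question.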