Hi Folks -
Using CentOS on a server destined to have a dozen SATA drives in it. The server is fine; RAID 5 is set up on groups of 4 SATA drives.
Today we decided to disconnect one SATA drive to simulate a failure. The box trucked on fine... a little too fine. We waited some minutes, but no problem was visible in /proc/mdstat, in /var/log/messages, or on the console.
I ran mdadm --monitor /dev/md0 and no problem was shown.
We rebooted still without the drive and finally mdadm --monitor reported that the array was running in a degraded state.
We reconnected the SATA drive and still nothing was reported and nothing happened with the raid state according to /proc/mdstat.
I expected the box to keep on trucking but to become freaked out with warnings all over the shop. What should I have expected in this case and what should I do to become aware of evil events like the drive melting remotely?
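For completeness, the monitoring setup I expected to do the warning is just the stock mdadm one, roughly like this (the mail address is a placeholder, adjust for your site):

```shell
# /etc/mdadm.conf -- tell the monitor where to send alerts
MAILADDR root@example.com

# Run the monitor as a daemon (CentOS ships this as the mdmonitor service):
#   mdadm --monitor --scan --daemonise --delay=60
# You can confirm mail delivery actually works with a one-shot test alert:
#   mdadm --monitor --scan --test --oneshot
```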
-Andy
On Tue, 11 Apr 2006 at 6:36pm, Andy Green wrote
> Using CentOS on a server destined to have a dozen SATA drives in it. The server is fine; RAID 5 is set up on groups of 4 SATA drives.
> Today we decided to disconnect one SATA drive to simulate a failure. The box trucked on fine... a little too fine. We waited some minutes, but no problem was visible in /proc/mdstat, in /var/log/messages, or on the console.
> I ran mdadm --monitor /dev/md0 and no problem was shown.
Did you try doing any I/O to the array? In my limited experience with software RAID, it won't notice a drive missing until it tries to do something with said drive.
To really test it, I'd disconnect the drive while you have something disk-intensive running. I like http://people.redhat.com/dledford/memtest.html, which unpacks and then diffs multiple copies of the Linux source tree. It'll have the system stressed *and* let you know if there are any problems with the array running in degraded mode.
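Roughly what you'd look for in /proc/mdstat once md does notice -- the sample output below is made up to show the format, but the check itself is the one I use from cron:

```shell
# Hypothetical /proc/mdstat lines for a 4-disk RAID 5 running degraded.
# A healthy md0 shows [4/4] [UUUU]; a missing member shows up as an
# underscore in the status field and a lower active count.
mdstat='md0 : active raid5 sdd1[3] sdc1[2] sdb1[1]
      720585216 blocks level 5, 64k chunk, algorithm 2 [4/3] [_UUU]'

# Flag any md line whose status field contains an underscore
# (i.e. a failed or missing member).
if printf '%s\n' "$mdstat" | grep -q '\[[U_]*_[U_]*\]'; then
    echo "DEGRADED"
else
    echo "OK"
fi
```

On a real box you'd read /proc/mdstat itself rather than a variable, and mail yourself the output instead of echoing it.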
Joshua Baker-LePain wrote:
> Did you try doing any I/O to the array? In my limited experience with software RAID, it won't notice a drive missing until it tries to do something with said drive.
Yes I did do this, I copied a file to the mountpoint and did a sync. Nothing.
Only at the end of the shutdown did we see some SCSI IO errors, it tried a few times to flush and then gave up and completed the reboot.
> To really test it, I'd disconnect the drive while you have something disk-intensive running. I like http://people.redhat.com/dledford/memtest.html, which unpacks and then diffs multiple copies of the Linux source tree. It'll have the system stressed *and* let you know if there are any problems with the array running in degraded mode.
That'd for sure stress it :-) But if a cable falls out or a drive shoots out of the box high into the night sky, I would expect to hear about it at least from the logs next time I tried to write one byte.
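Next time I'll probably fail the drive in software first rather than yanking cables, since md reacts to that immediately. Something like this (device names are made up for the example):

```shell
mdadm /dev/md0 --fail /dev/sdb1     # mark the member faulty; md notices at once
cat /proc/mdstat                    # should now show the (F) flag and [_UUU]
mdadm /dev/md0 --remove /dev/sdb1   # detach the failed member
mdadm /dev/md0 --add /dev/sdb1      # add it back and watch the rebuild in mdstat
```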
-Andy
Andy Green wrote:
> Joshua Baker-LePain wrote:
>> Did you try doing any I/O to the array? [snip]
> Yes I did do this, I copied a file to the mountpoint and did a sync. Nothing.
Hm Googling around suggests that everyone with SATA raid may be experiencing the same lack of warning that their safety net just blew a hole through the server farm roof in a bid to reach escape velocity.
''...The error handling is very simple, but at this stage that is an advantage. Error handling code anywhere is inevitably both complex and sorely under-tested. libata error handling is intentionally simple. Positives: Easy to review and verify correctness. Never data corruption. Negatives: if an error occurs, libata will simply send the error back to the block layer. There are limited retries by the block layer, depending on the type of error, but there is never a bus reset.
Or in other words: "it's better to stop talking to the disk than compound existing problems with further problems."
As Serial ATA matures, and host- and device-side errata become apparent, the error handling will be slowly refined. I am planning to work with a few (kind!) disk vendors, to obtain special drives/firmwares that allow me to inject faults, and otherwise exercise error handling code.
Error handling improvements will almost certainly be required in order to implement features such as device hotplug. ...''
http://linux-ata.org/software-status.html
On Tue, 11 Apr 2006 at 7:07pm, Andy Green wrote
> [snip]
> Hm Googling around suggests that everyone with SATA raid may be experiencing the same lack of warning that their safety net just blew a hole through the server farm roof in a bid to reach escape velocity.
And *that's* why I use 3ware...
Joshua Baker-LePain wrote:
> [snip]
> And *that's* why I use 3ware...
Same here. Though I did use software RAID5 for a number of years back in the '96-2000 timeframe and never lost any data, even on very busy production mail servers. But as I look back, I think I was just remarkably lucky. 8-)
Cheers,
Chris Mauritz spake the following on 4/11/2006 11:24 AM:
> [snip]
> Same here. Though I did use software RAID5 for a number of years back in the '96-2000 timeframe and never lost any data, even on very busy production mail servers. But as I look back, I think I was just remarkably lucky. 8-)
If it was on PATA hardware, the error-handling code is pretty well refined by now. But libata is not anywhere near that level of maturity.
Scott Silva wrote:
> [snip]
> If it was on PATA hardware, the error-handling code is pretty well refined by now. But libata is not anywhere near that level of maturity.
It was a combination of PATA disks and other machines with piles of 9gig and 18gig SCSI barracudas.
Cheers,
hardware raid is a good thing..:)
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos