Hi Folks -
Using CentOS on a server destined to have a dozen SATA drives in it. The server is fine; RAID 5 is set up on groups of 4 SATA drives.
Today we decided to disconnect one SATA drive to simulate a failure. The box trucked on fine... a little too fine. We waited some minutes, but no problem was visible in /proc/mdstat, in /var/log/messages, or on the console.
I ran mdadm --monitor /dev/md0 and no problem was shown.
We rebooted still without the drive and finally mdadm --monitor reported that the array was running in a degraded state.
We reconnected the SATA drive and still nothing was reported and nothing happened with the raid state according to /proc/mdstat.
I expected the box to keep on trucking but to become freaked out with warnings all over the shop. What should I have expected in this case and what should I do to become aware of evil events like the drive melting remotely?
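For completeness, the monitoring setup I expected to do the warning is just the stock mdadm one, roughly like this (the mail address is a placeholder, adjust for your site):

```shell
# /etc/mdadm.conf -- tell the monitor where to send alerts
MAILADDR root@example.com

# Run the monitor as a daemon (CentOS ships this as the mdmonitor service):
#   mdadm --monitor --scan --daemonise --delay=60
# You can confirm mail delivery actually works with a one-shot test alert:
#   mdadm --monitor --scan --test --oneshot
```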
-Andy
On Tue, 11 Apr 2006 at 6:36pm, Andy Green wrote
> Using CentOS on a server destined to have a dozen SATA drives in it. The server is fine; RAID 5 is set up on groups of 4 SATA drives.
> Today we decided to disconnect one SATA drive to simulate a failure. The box trucked on fine... a little too fine. We waited some minutes, but no problem was visible in /proc/mdstat, in /var/log/messages, or on the console.
> I ran mdadm --monitor /dev/md0 and no problem was shown.
Did you try doing any I/O to the array? In my limited experience with software RAID, it won't notice a drive missing until it tries to do something with said drive.
To really test it, I'd disconnect the drive while you have something disk-intensive running. I like http://people.redhat.com/dledford/memtest.html, which unpacks and then diffs multiple copies of the Linux source tree. It'll have the system stressed *and* let you know if there are any problems with the array running in degraded mode.
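Roughly what you'd look for in /proc/mdstat once md does notice -- the sample output below is made up to show the format, but the check itself is the one I use from cron:

```shell
# Hypothetical /proc/mdstat lines for a 4-disk RAID 5 running degraded.
# A healthy md0 shows [4/4] [UUUU]; a missing member shows up as an
# underscore in the status field and a lower active count.
mdstat='md0 : active raid5 sdd1[3] sdc1[2] sdb1[1]
      720585216 blocks level 5, 64k chunk, algorithm 2 [4/3] [_UUU]'

# Flag any md line whose status field contains an underscore
# (i.e. a failed or missing member).
if printf '%s\n' "$mdstat" | grep -q '\[[U_]*_[U_]*\]'; then
    echo "DEGRADED"
else
    echo "OK"
fi
```

On a real box you'd read /proc/mdstat itself rather than a variable, and mail yourself the output instead of echoing it.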
Joshua Baker-LePain wrote:
> Did you try doing any I/O to the array? In my limited experience with software RAID, it won't notice a drive missing until it tries to do something with said drive.
Yes I did do this, I copied a file to the mountpoint and did a sync. Nothing.
Only at the end of the shutdown did we see some SCSI IO errors, it tried a few times to flush and then gave up and completed the reboot.
> To really test it, I'd disconnect the drive while you have something disk-intensive running. I like http://people.redhat.com/dledford/memtest.html, which unpacks and then diffs multiple copies of the Linux source tree. It'll have the system stressed *and* let you know if there are any problems with the array running in degraded mode.
That'd for sure stress it :-) But if a cable falls out or a drive shoots out of the box high into the night sky, I would expect to hear about it at least from the logs next time I tried to write one byte.
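Next time I'll probably fail the drive in software first rather than yanking cables, since md reacts to that immediately. Something like this (device names are made up for the example):

```shell
mdadm /dev/md0 --fail /dev/sdb1     # mark the member faulty; md notices at once
cat /proc/mdstat                    # should now show the (F) flag and [_UUU]
mdadm /dev/md0 --remove /dev/sdb1   # detach the failed member
mdadm /dev/md0 --add /dev/sdb1      # add it back and watch the rebuild in mdstat
```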
-Andy
Andy Green wrote:
> Joshua Baker-LePain wrote:
>> Did you try doing any I/O to the array? [snip]
> Yes I did do this, I copied a file to the mountpoint and did a sync. Nothing.
Hm Googling around suggests that everyone with SATA raid may be experiencing the same lack of warning that their safety net just blew a hole through the server farm roof in a bid to reach escape velocity.
''...The error handling is very simple, but at this stage that is an advantage. Error handling code anywhere is inevitably both complex and sorely under-tested. libata error handling is intentionally simple. Positives: Easy to review and verify correctness. Never data corruption. Negatives: if an error occurs, libata will simply send the error back to the block layer. There are limited retries by the block layer, depending on the type of error, but there is never a bus reset.
Or in other words: "it's better to stop talking to the disk than compound existing problems with further problems."
As Serial ATA matures, and host- and device-side errata become apparent, the error handling will be slowly refined. I am planning to work with a few (kind!) disk vendors, to obtain special drives/firmwares that allow me to inject faults, and otherwise exercise error handling code.
Error handling improvements will almost certainly be required in order to implement features such as device hotplug. ...''
http://linux-ata.org/software-status.html
On Tue, 11 Apr 2006 at 7:07pm, Andy Green wrote
> [snip]
> Hm Googling around suggests that everyone with SATA raid may be experiencing the same lack of warning that their safety net just blew a hole through the server farm roof in a bid to reach escape velocity.
And *that's* why I use 3ware...
Joshua Baker-LePain wrote:
> [snip]
> And *that's* why I use 3ware...
Same here. Though I did use software RAID5 for a number of years back in the '96-2000 timeframe and never lost any data, even on very busy production mail servers. But as I look back, I think I was just remarkably lucky. 8-)
Cheers,
Chris Mauritz spake the following on 4/11/2006 11:24 AM:
> [snip]
> Same here. Though I did use software RAID5 for a number of years back in the '96-2000 timeframe and never lost any data, even on very busy production mail servers. But as I look back, I think I was just remarkably lucky. 8-)
If it was on PATA hardware, the error-handling code is pretty well refined by now. But libata is not anywhere near that level of maturity.
Scott Silva wrote:
> [snip]
> If it was on PATA hardware, the error-handling code is pretty well refined by now. But libata is not anywhere near that level of maturity.
It was a combination of PATA disks and other machines with piles of 9gig and 18gig SCSI barracudas.
Cheers,
hardware raid is a good thing..:)
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos