I'm using CentOS 4.5 right now, and I had a RAID 5 array stop because two drives became unavailable. After adjusting the cables on several occasions and shutting down and restarting, I was able to see the drives again. This is when I snatched defeat from the jaws of victory. Please, someone with vast knowledge of how RAID 5 with mdadm works, tell me if I have any chance at all that this array will pull through with most or all of my data.
Background info about the machine:

/dev/md0 is a RAID1 consisting of /dev/sda1 and /dev/sdb1
/dev/md1 is a RAID1 consisting of /dev/sda2 and /dev/sdb2
/dev/md2 (our special friend) is a RAID5 consisting of /dev/sd[c-j]
/dev/sdi and /dev/sdj were the drives that detached from the array and were marked as faulty.
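(For reference, that state shows up with something like:

  cat /proc/mdstat            # md2 listed as degraded, e.g. [8/6]
  mdadm --detail /dev/md2     # sdi and sdj flagged as faulty

exact output varies by mdadm/kernel version.)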
I did the following things that, in hindsight, were probably VERY BAD.

Step 1 (Misassign drives to the wrong array): I could probably have had things going again in a tenth of a second if I hadn't typed this:

  mdadm --manage --add /dev/md0 /dev/sdi
  mdadm --manage --add /dev/md0 /dev/sdj

This clobbered the superblocks and replaced them with that of /dev/md0, yes? Well, that's what mdadm --misc --examine said for /dev/sdi and /dev/sdj, anyhow.
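(A quick sanity check for this, sketched here; the grep just picks the array UUID out of the output:

  mdadm --examine /dev/sdi | grep -i uuid
  mdadm --detail /dev/md0 | grep -i uuid

If the member's UUID now matches md0's rather than md2's, the superblock really was overwritten.)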
Ok, so what next? Step 2 (rebuild the array, but make sure the params are right!): I wipe out the superblocks on all of the drives in the array and rebuild with --assume-clean:

  for i in c d e f g h i j ; do mdadm --zero-superblock /dev/sd$i ; done
  mdadm --create /dev/md2 --assume-clean --level=5 --raid-devices=8 /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj
Ok, now it says that the array is recovering and will take about 10 hours to rebuild. /dev/sd[c-i] say that they are "active sync" and /dev/sdj says it's a spare that's rebuilding. But now I scroll back in my history and see that, oops, the chunk size is WRONG. Not only that, but I don't stop the array until the rebuild is at around 8%.
Ok, I stop the array and rebuild with:

  mdadm --create /dev/md2 --assume-clean --level=5 --chunk --raid-devices=8 /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj
Now it says it's going to take another 10 hours to rebuild.
How likely is it that my data are irretrievable/gone, and at which step would it have happened, if so?
Sorry about that, my previous e-mail had just '--chunk' toward the bottom. It should have been '--chunk=256'. Please see the quoted snippet for detail.
On Apr 17, 2008, at 1:01 PM, Mark Hennessy wrote:
Ok, I stop the array and rebuild with:

  mdadm --create /dev/md2 --assume-clean --level=5 --chunk=256 --raid-devices=8 /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj
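(For anyone double-checking, the chunk size a running array actually got is reported by:

  mdadm --detail /dev/md2 | grep -i chunk

As far as I can tell, mdadm's default chunk at the time was 64K, which is why leaving off --chunk=256 produced the wrong layout.)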
Mark Hennessy wrote:
I'm using CentOS 4.5 right now, and I had a RAID 5 array stop because two drives became unavailable. After adjusting the cables on several occasions and shutting down and restarting, I was able to see the drives again. This is when I snatched defeat from the jaws of victory. Please, someone with vast knowledge of how RAID 5 with mdadm works, tell me if I have any chance at all that this array will pull through with most or all of my data.
It may be possible...
Background info about the machine:

/dev/md0 is a RAID1 consisting of /dev/sda1 and /dev/sdb1
/dev/md1 is a RAID1 consisting of /dev/sda2 and /dev/sdb2
/dev/md2 (our special friend) is a RAID5 consisting of /dev/sd[c-j]
/dev/sdi and /dev/sdj were the drives that detached from the array and were marked as faulty.
I did the following things that, in hindsight, were probably VERY BAD.

Step 1 (Misassign drives to the wrong array): I could probably have had things going again in a tenth of a second if I hadn't typed this:

  mdadm --manage --add /dev/md0 /dev/sdi
  mdadm --manage --add /dev/md0 /dev/sdj

This clobbered the superblocks and replaced them with that of /dev/md0, yes? Well, that's what mdadm --misc --examine said for /dev/sdi and /dev/sdj, anyhow.
Hmm, not good, but we will mark this drive 'sdi' as bad.
Ok, so what next? Step 2 (rebuild the array, but make sure the params are right!): I wipe out the superblocks on all of the drives in the array and rebuild with --assume-clean:

  for i in c d e f g h i j ; do mdadm --zero-superblock /dev/sd$i ; done
  mdadm --create /dev/md2 --assume-clean --level=5 --raid-devices=8 /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj
Nooo, you need to make sure sdi is marked as 'bad' and kept offline. You are going to need to assemble the array degraded, then add sdi as a replacement and let md rebuild sdi off the parity.
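Roughly something like this (a sketch only; it assumes the original superblocks are still intact, sdj is usable, and the member order is as in your setup):

  # start the array degraded, without sdi
  mdadm --assemble --run --force /dev/md2 /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdj
  # then hot-add sdi and let it resync from parity
  mdadm --manage /dev/md2 --add /dev/sdi

That path was closed the moment the superblocks were zeroed in step 2 above.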
Ok, now it says that the array is recovering and will take about 10 hours to rebuild. /dev/sd[c-i] say that they are "active sync" and /dev/sdj says it's a spare that's rebuilding. But now I scroll back in my history and see that, oops, the chunk size is WRONG. Not only that, but I don't stop the array until the rebuild is at around 8%.
Well, now I think it's all messed up.
Ok, I stop the array and rebuild with:

  mdadm --create /dev/md2 --assume-clean --level=5 --chunk --raid-devices=8 /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj
Now it says it's going to take another 10 hours to rebuild.
It's truly hosed now.
How likely is it that my data are irretrievable/gone, and at which step would it have happened, if so?

I hope you have backups because you're going to need them.

If only you had posted to the list BEFORE you tried to recover it without knowing what to do.
-Ross
Thanks for answering my e-mail!!
On Apr 17, 2008, at 1:50 PM, Ross S. W. Walker wrote:
Mark Hennessy wrote:
Ok, now it says that the array is recovering and will take about 10 hours to rebuild. /dev/sd[c-i] say that they are "active sync" and /dev/sdj says it's a spare that's rebuilding. But now I scroll back in my history and see that, oops, the chunk size is WRONG. Not only that, but I don't stop the array until the rebuild is at around 8%.
Well, now I think it's all messed up.
Ok, I stop the array and rebuild with:

  mdadm --create /dev/md2 --assume-clean --level=5 --chunk --raid-devices=8 /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj
Now it says it's going to take another 10 hours to rebuild.
It's truly hosed now.
I was thinking that too, but I waited until the rebuild was about 5% along and mounted the array read-only. It mounted successfully. I was able to cat log files stored there as well as do full listings of tarballs there without interruption. I went ahead and copied a bunch of important things off of that array onto another one and received no complaints from the OS.
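(Concretely, that was nothing fancier than the following; /mnt/recovery is a made-up mount point, and the fs type is whatever the array carries:

  mount -o ro /dev/md2 /mnt/recovery
  cp -a /mnt/recovery/important /some/other/array/

A non-destructive fsck pass, fsck -n /dev/md2 with the filesystem unmounted, would be another cheap sanity check.)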
What did I miss? I just want to learn and to understand. Perhaps you could point me at documentation, beyond what I found via Google and Wikipedia, that explains in more detail how this works.
Thanks for your kind assistance!
How likely is it that my data are irretrievable/gone, and at which step would it have happened, if so?

I hope you have backups because you're going to need them.
What's the likelihood of data corruption despite the fs being browsable and the files accessible like I describe?
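(The best check I can think of is comparing checksums of recovered files against copies that live elsewhere, e.g.:

  # hypothetical paths; any pair of supposedly identical files will do
  md5sum /mnt/recovery/var/log/messages /backup/var/log/messages

Blocks rewritten during the ~8% rebuild at the wrong chunk size could read back as garbage without any I/O error, so a clean mount alone doesn't prove much.)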
If only you had posted to the list BEFORE you tried to recover it without knowing what to do.
Agreed (strongly).
-Ross