From: Les Mikesell Sent: October 7, 2007 16:57
Hi Les. Thanks for your response.
Hugh E Cruickshank wrote:
I now find myself in the situation where I have a failed drive on a non-MegaRAID controller, specifically an Adaptec 29160 SCSI controller. The system is an Acer G700 with 8 internal hot-swappable SCSI drives arranged in two banks of 4 drives. Each bank is connected to a separate channel on the 29160 controller. When I installed CentOS 4 I enabled software mirroring between the two banks so that I ended up with 4 pairs of mirrored drives (sda/sde, sdb/sdf, sdc/sdg, sdd/sdh).
Normally with software mirroring you would mirror partitions, not drives. What does "cat /proc/mdstat" say about them?
You are correct. I keep falling back to thinking the "MegaRAID" way where I have the drives mirrored at the controller level and then partitioned at the software level. The /proc/mdstat reports:
Personalities : [raid0] [raid1]
md1 : active raid1 sde2[1] sda2[2](F)
      8193024 blocks [2/1] [_U]

md2 : active raid1 sde3[1] sda3[2](F)
      2048192 blocks [2/1] [_U]

md3 : active raid1 sde5[1] sda5[2](F)
      25085376 blocks [2/1] [_U]

md4 : active raid1 sdf1[1] sdb1[0]
      35840896 blocks [2/2] [UU]

md5 : active raid1 sdg1[1] sdc1[0]
      35840896 blocks [2/2] [UU]

md6 : active raid1 sdh1[1] sdd1[0]
      35840896 blocks [2/2] [UU]

md7 : active raid0 sdn1[5] sdm1[4] sdl1[3] sdk1[2] sdj1[1] sdi1[0]
      213261312 blocks 256k chunks

md0 : active raid1 sde1[1] sda1[2](F)
      513984 blocks [2/1] [_U]

unused devices: <none>
In this configuration sda-sdh are the 29160 attached drives while sdi-sdn are hardware mirrored drive pairs attached to a MegaRAID controller.
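For anyone wanting to pick the degraded arrays out of output like the above without reading it by eye, here is a minimal sketch in Python. The function name degraded_arrays and the shortened sample text are illustrative assumptions; the parsing only covers the line shapes seen in the listing above and is not a general /proc/mdstat parser.

```python
import re

# Shortened excerpt of the /proc/mdstat output quoted above (illustrative only).
MDSTAT_EXCERPT = """\
Personalities : [raid0] [raid1]
md1 : active raid1 sde2[1] sda2[2](F)
      8193024 blocks [2/1] [_U]
md4 : active raid1 sdf1[1] sdb1[0]
      35840896 blocks [2/2] [UU]
"""

def degraded_arrays(text):
    """Return {array: [failed members]} for arrays that are not fully healthy."""
    result = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"^(md\d+) : active \S+ (.*)$", line)
        if header:
            current = header.group(1)
            members = header.group(2).split()
            # A failed member looks like "sda2[2](F)"; strip the slot and flag.
            failed = [re.sub(r"\[\d+\]\(F\)$", "", m)
                      for m in members if m.endswith("(F)")]
            if failed:
                result[current] = failed
            continue
        # Status line such as "[2/1]": configured devices / active devices.
        status = re.search(r"\[(\d+)/(\d+)\]", line)
        if status and current and int(status.group(2)) < int(status.group(1)):
            result.setdefault(current, [])
    return result

print(degraded_arrays(MDSTAT_EXCERPT))
```

On a live system the text would come from reading /proc/mdstat instead of the embedded excerpt.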
The problem I have now is that it is sda (the boot drive) that has failed. I have not encountered this problem before and therefore I need to make sure that I understand what I need to do before I start mucking around with things and dig myself into a deeper hole.
I have spent much time attempting to research the problem but have not been able to come up with any definite information to help. As far as I can see I have two options...
Option 1: Leave the system running and replace the drive. Then either the RAID software will re-sync the drives or I can manually sync them with mdadm. I have not seen anything that will support this option but I am hoping that it is a valid option.
This should work, but you'll probably have to tell the controller that you are removing and adding disks. This used to be done by writing something to /proc/scsi/scsi, but it may have changed and may also be controller-specific, so I'll let someone else point out the documentation for that.
I am glad to hear that. I thought it might be the case but I just did not feel up to trying it by yanking out my boot drive while the system was up and running. That just sounded like a recipe for disaster if I did not have some valid reasoning behind the move. I will wait to see if anyone else weighs in on the subject with some pointers to actual documentation.
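In the meantime, a rough and untested sketch of how the mdadm side of Option 1 might be planned. It only builds the command list as text and runs nothing. The helper name hot_replace_plan is hypothetical, the array-to-partition mapping is read off the /proc/mdstat output above, and it assumes the replacement drive comes back as sda, which may not hold after a hot swap. The step of telling the SCSI controller about the swap is left as a placeholder because its exact form is still an open question here.

```python
def hot_replace_plan(failed_disk, good_disk, arrays):
    """Build (but do not run) shell commands for swapping one half of a mirror.

    arrays maps md device name -> partition number, e.g. {"md0": 1}.
    """
    plan = []
    # 1. Remove the already-failed members from each array.
    for md, part in sorted(arrays.items()):
        plan.append(f"mdadm /dev/{md} --remove /dev/{failed_disk}{part}")
    # 2. Placeholder: controller notification and physical swap happen here.
    plan.append("# notify the SCSI layer, swap the drive, rescan (controller specific)")
    # 3. Copy the partition table from the surviving disk to the new one.
    plan.append(f"sfdisk -d /dev/{good_disk} | sfdisk /dev/{failed_disk}")
    # 4. Add the new partitions back; md then resyncs each mirror.
    for md, part in sorted(arrays.items()):
        plan.append(f"mdadm /dev/{md} --add /dev/{failed_disk}{part}")
    return plan

# Mapping taken from the mdstat listing: md0=sda1, md1=sda2, md2=sda3, md3=sda5.
for command in hot_replace_plan("sda", "sde", {"md0": 1, "md1": 2, "md2": 3, "md3": 5}):
    print(command)
```

Printing the plan first and reviewing it by hand seems safer than scripting the swap directly, since a wrong device name in the sfdisk step would overwrite the surviving disk's partition table.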
Option 2: Create a boot disk (floppy or CD) that I can boot from but that points to sde (the boot mirror). Shut down the system and replace the failed sda drive. Boot from the new boot disk. Format, partition and re-sync the new sda from sde. Shut down, remove the boot disk, and reboot from the new sda.
You have an odd combination of drives... Normally you would want to mirror the partitions on the first 2 disks and install grub on both, in which case the system would still boot. Some of the more sophisticated controllers can boot from more than the first 2, though. Anyway, you should be able to boot from your install CD with 'linux rescue' at the boot prompt and get to a point where you can fix things.
The odd combination of drives was actually intentional on my part. The idea was to provide "separation" between the mirrors. While I did not have separate controllers I thought that using the separate channels on the common controller might provide a shade more resiliency. It was my first attempt at setting up mirrored pairs on a non-MegaRAID SCSI controller. Live and learn!
I will read up on "linux rescue" so that, if I have to fall back on this method, I will have a firm plan in place before I start the work.
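As a starting point for that plan, here is a small sketch of the "install grub on both disks" suggestion, assuming GRUB legacy (as shipped with CentOS 4) and assuming that the first partition of each disk in the boot mirror (sda1/sde1, i.e. md0) holds /boot. Neither assumption has been confirmed here. The helper grub_mirror_script is hypothetical; it only produces the text of a GRUB shell session and does not execute it.

```python
def grub_mirror_script(disk, boot_partition_index=0):
    """Return GRUB legacy shell commands that install the boot loader on `disk`.

    The disk is mapped to (hd0) so that it can boot on its own if it ends up
    being the first BIOS disk after the other half of the mirror is gone.
    """
    lines = [
        f"device (hd0) /dev/{disk}",            # treat this disk as the first BIOS disk
        f"root (hd0,{boot_partition_index})",   # partition assumed to hold /boot
        "setup (hd0)",                          # write the boot loader to the disk's MBR
        "quit",
    ]
    return "\n".join(lines) + "\n"

print(grub_mirror_script("sde"))
```

The output could be typed into the grub shell by hand, or fed to it in batch mode, once the rescue environment is up and the installed system is mounted.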
This particular system is our primary development system and does not get all the "fancy" hardware that our production systems do. I have configured the production systems using only the MegaRAID controllers and there it is a "no-brainer" to replace failed drives - just swap the drive and away you go.
Thanks again for your comments. They are greatly appreciated.
Regards, Hugh