[CentOS] Replacing failed software RAID drive

Mon Oct 8 00:59:54 UTC 2007
Hugh E Cruickshank <hugh at forsoft.com>

From: Les Mikesell Sent: October 7, 2007 16:57
> 

Hi Les. Thanks for your response.

> Hugh E Cruickshank wrote:
> > 
> > I now find myself in the situation where I have a failed drive on a
> > non-MegaRAID controller, specifically an Adaptec 29160 SCSI
> > controller.
> > The system is an Acer G700 with 8 internal hot-swappable SCSI drives
> > arranged in two banks of 4 drives. Each bank is connected to a 
> > separate channel on the 29160 controller. When I installed CentOS 4
> > I enabled software mirroring between the two banks so that I ended up
> > with 4 pairs of mirrored drives (sda/sde, sdb/sdf, sdc/sdg, sdd/sdh).
> 
> Normally with software mirroring you would mirror partitions, not 
> drives.  What does "cat /proc/mdstat" say about them?

You are correct. I keep falling back to thinking the "MegaRAID" way
where I have the drives mirrored at the controller level and then
partitioned at the software level. The /proc/mdstat reports:

Personalities : [raid0] [raid1]
md1 : active raid1 sde2[1] sda2[2](F)
      8193024 blocks [2/1] [_U]

md2 : active raid1 sde3[1] sda3[2](F)
      2048192 blocks [2/1] [_U]

md3 : active raid1 sde5[1] sda5[2](F)
      25085376 blocks [2/1] [_U]

md4 : active raid1 sdf1[1] sdb1[0]
      35840896 blocks [2/2] [UU]

md5 : active raid1 sdg1[1] sdc1[0]
      35840896 blocks [2/2] [UU]

md6 : active raid1 sdh1[1] sdd1[0]
      35840896 blocks [2/2] [UU]

md7 : active raid0 sdn1[5] sdm1[4] sdl1[3] sdk1[2] sdj1[1] sdi1[0]
      213261312 blocks 256k chunks

md0 : active raid1 sde1[1] sda1[2](F)
      513984 blocks [2/1] [_U]

unused devices: <none>

In this configuration sda-sdh are the 29160 attached drives while
sdi-sdn are hardware mirrored drive pairs attached to a MegaRAID
controller.
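
For anyone wanting to check this kind of output automatically, here is a
minimal Python sketch that picks the degraded arrays and their failed
members out of /proc/mdstat text. The sample is an excerpt of the output
above; the function name parse_mdstat and the regular expressions are my
own illustration, not part of any md tool, and they only handle the
simple "mdN : active raidX ..." line form shown here.

```python
import re

# Excerpt of the /proc/mdstat output quoted in this message.
MDSTAT_SAMPLE = """\
Personalities : [raid0] [raid1]
md1 : active raid1 sde2[1] sda2[2](F)
      8193024 blocks [2/1] [_U]

md4 : active raid1 sdf1[1] sdb1[0]
      35840896 blocks [2/2] [UU]

md7 : active raid0 sdn1[5] sdm1[4] sdl1[3] sdk1[2] sdj1[1] sdi1[0]
      213261312 blocks 256k chunks

md0 : active raid1 sde1[1] sda1[2](F)
      513984 blocks [2/1] [_U]

unused devices: <none>
"""

# Assumed line shapes: "mdN : <state> <level> <members...>" followed by a
# status line that contains "[configured/working]" for mirrored arrays.
ARRAY_RE = re.compile(r"^(md\d+) : (\w+) (\w+) (.*)$")
MEMBER_RE = re.compile(r"(\w+)\[\d+\](\(F\))?")
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]")


def parse_mdstat(text):
    """Return {array: {"level", "active", "failed", "degraded"}}."""
    arrays = {}
    current = None
    for line in text.splitlines():
        m = ARRAY_RE.match(line)
        if m:
            name, _state, level, members = m.groups()
            found = MEMBER_RE.findall(members)
            current = {
                "level": level,
                "active": [dev for dev, flag in found if not flag],
                "failed": [dev for dev, flag in found if flag],
                "degraded": False,
            }
            arrays[name] = current
            continue
        s = STATUS_RE.search(line)
        if s and current is not None:
            # Fewer working members than configured means degraded.
            current["degraded"] = int(s.group(2)) < int(s.group(1))
    return arrays


if __name__ == "__main__":
    for name, info in sorted(parse_mdstat(MDSTAT_SAMPLE).items()):
        if info["degraded"]:
            print(name, "degraded, failed members:", info["failed"])
```

On a live system the same function could be fed the contents of
/proc/mdstat instead of the sample string.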

> 
> > The problem I have now is that it is sda (the boot drive) that has
> > failed. I have not encountered this problem before and therefore I
> > need to make sure that I understand what I need to do before I start
> > mucking around with things and dig myself into a deeper hole.
> > 
> > I have spent much time attempting to research the problem but have
> > not been able to come up with any definite information to help. As
> > far as I can see I have two options...
> > 
> > Option 1: Leave the system running and replace the drive. Then either
> > the RAID software will re-sync the drives or I can manually sync them
> > with mdadm. I have not seen anything that will support this option
> > but I am hoping that it is a valid option.
> 
> This should work, but you'll probably have to tell the controller that 
> you are removing and adding disks.  This used to be done by writing 
> something to /proc/scsi/scsi, but it may have changed and also may be 
> controller specific so I'll let someone else point out the
> documentation for that.

I am glad to hear that. I thought it might be the case but I just did
not feel up to trying it by yanking out my boot drive while the system
was up and running. That just sounded like a recipe for disaster if I
did not have some valid reasoning behind the move. I will wait to see
if anyone else weighs in on the subject with some pointers to actual
documentation.
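
As a way of writing the plan down before touching anything, here is a
rough Python sketch that only prints the commands Option 1 would
probably involve; it runs nothing. The md-to-partition mapping comes
from the mdstat output above. The SCSI address (0, 0, 0, 0) is a
placeholder assumption, not the real host/channel/id/lun of sda, and the
/proc/scsi/scsi lines are the older interface Les mentioned, which may
not apply to this kernel or controller. The function name
replacement_plan is just for illustration.

```python
def replacement_plan(failed_disk, good_disk, members, scsi_addr):
    """Build an ordered list of shell commands (as strings, dry run only).

    members maps md array name to the partition number on the failed disk.
    scsi_addr is (host, channel, id, lun); the value used below is a
    placeholder and must be checked against the real system.
    """
    host, channel, target, lun = scsi_addr
    addr = f"{host} {channel} {target} {lun}"
    failed = f"/dev/{failed_disk}"
    plan = []
    # 1. Drop the already-failed partitions from their arrays.
    for md, part in members.items():
        plan.append(f"mdadm /dev/{md} --remove {failed}{part}")
    # 2. Tell the SCSI layer the disk is going away (assumed interface).
    plan.append(f'echo "scsi remove-single-device {addr}" > /proc/scsi/scsi')
    plan.append("# physically swap the drive here")
    # 3. Tell the SCSI layer about the new disk.
    plan.append(f'echo "scsi add-single-device {addr}" > /proc/scsi/scsi')
    # 4. Copy the partition layout from the surviving mirror.
    plan.append(f"sfdisk -d /dev/{good_disk} | sfdisk {failed}")
    # 5. Add the new partitions back; md then re-syncs each mirror.
    for md, part in members.items():
        plan.append(f"mdadm /dev/{md} --add {failed}{part}")
    return plan


# Mapping taken from the mdstat output: md0=sda1, md1=sda2, md2=sda3, md3=sda5.
PLAN = replacement_plan(
    "sda", "sde", {"md0": 1, "md1": 2, "md2": 3, "md3": 5}, (0, 0, 0, 0)
)

if __name__ == "__main__":
    print("\n".join(PLAN))
```

This leaves out reinstalling the boot loader on the new sda, which would
still be needed before the system could boot from it again.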

> 
> > Option 2: Create a boot disk (floppy or CD) that I can boot from but
> > that points to sde (the boot mirror). Shutdown the system and replace 
> > the failed sda drive. Boot from the new boot disk. Format, partition
> > and re-sync the new sda from sde. Shutdown, remove the boot disk, and
> > reboot from the new sda.
> 
> You have an odd combination of drives... Normally you would want to 
> mirror the partitions on the first 2 disks and install grub on both, in 
> which case the system would still boot.  Some of the more sophisticated 
> controllers can boot from more than the first 2, though.  Anyway, you 
> should be able to boot from your install CD with 'linux rescue' at the 
> boot prompt and get to a point where you can fix things.
> 

The odd combination of drives was actually intentional on my part. The
idea was to provide "separation" between the mirrors. While I did not
have separate controllers I thought that using the separate channels 
on the common controller might provide a shade more resiliency. It was
my first attempt at setting up mirrored pairs on a non-MegaRAID SCSI
controller. Live and learn!

I will read up on "linux rescue" so that, if I have to fall back on this
method, I will be able to have a firm plan in place before I start the
work.

This particular system is our primary development system and does not
get all the "fancy" hardware that our production systems do. I have
configured the production systems using only the MegaRAID controllers
and there it is a "no brainer" to replace failed drives - just swap
the drive and away you go.

Thanks again for your comments. They are greatly appreciated.

Regards, Hugh

-- 
Hugh E Cruickshank, Forward Software, www.forward-software.com