RE: [CentOS] Replacing failed software RAID drive

8 Oct 2007


From: Les Mikesell
Sent: October 7, 2007 16:57
...
Hi Les. Thanks for your response.
...
Hugh E Cruickshank wrote:
...
I now find myself in the situation where I have a failed drive on a
non-MegaRAID controller, specifically an Adaptec 29160 SCSI
controller.
The system is an Acer G700 with 8 internal hot-swappable SCSI drives
arranged in two banks of 4 drives. Each bank is connected to a 
separate channel on the 29160 controller. When I installed CentOS 4
I enabled software mirroring between the two banks so that I ended up
with 4 pairs of mirrored drives (sda/sde, sdb/sdf, sdc/sdg, sdd/sdh).
Normally with software mirroring you would mirror partitions, not 
drives.  What does "cat /proc/mdstat" say about them?
You are correct. I keep falling back to thinking the "MegaRAID" way
where I have the drives mirrored at the controller level and then
partitioned at the software level. The /proc/mdstat reports:
Personalities : [raid0] [raid1]
md1 : active raid1 sde2[1] sda2[2](F)
      8193024 blocks [2/1] [_U]
md2 : active raid1 sde3[1] sda3[2](F)
      2048192 blocks [2/1] [_U]
md3 : active raid1 sde5[1] sda5[2](F)
      25085376 blocks [2/1] [_U]
md4 : active raid1 sdf1[1] sdb1[0]
      35840896 blocks [2/2] [UU]
md5 : active raid1 sdg1[1] sdc1[0]
      35840896 blocks [2/2] [UU]
md6 : active raid1 sdh1[1] sdd1[0]
      35840896 blocks [2/2] [UU]
md7 : active raid0 sdn1[5] sdm1[4] sdl1[3] sdk1[2] sdj1[1] sdi1[0]
      213261312 blocks 256k chunks
md0 : active raid1 sde1[1] sda1[2](F)
      513984 blocks [2/1] [_U]
unused devices: <none>
In this configuration sda-sdh are the 29160 attached drives while
sdi-sdn are hardware mirrored drive pairs attached to a MegaRAID
controller.
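
For anyone wanting to pick the degraded arrays out of that listing without reading it by eye, here is a minimal sketch in Python. The function name degraded_arrays and the SAMPLE string are made up for illustration; the sample is just a shortened copy of the output above, and the parsing only covers the line shapes seen in this listing.

```python
import re

def degraded_arrays(mdstat_text):
    """Return {array: [failed members]} for arrays that are missing a member."""
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        # Array header, e.g. "md1 : active raid1 sde2[1] sda2[2](F)"
        header = re.match(r"^(md\d+)\s*:\s*active\s+\S+\s+(.*)$", line)
        if header:
            current = header.group(1)
            failed = re.findall(r"(\w+)\[\d+\]\(F\)", header.group(2))
            if failed:
                result[current] = failed
            continue
        # Status line, e.g. "8193024 blocks [2/1] [_U]"; "_" marks a missing member
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current and "_" in status.group(3):
            result.setdefault(current, [])
    return result

# Shortened copy of the listing above (hypothetical constant name).
SAMPLE = """\
Personalities : [raid0] [raid1]
md1 : active raid1 sde2[1] sda2[2](F)
      8193024 blocks [2/1] [_U]
md4 : active raid1 sdf1[1] sdb1[0]
      35840896 blocks [2/2] [UU]
md0 : active raid1 sde1[1] sda1[2](F)
      513984 blocks [2/1] [_U]
unused devices: <none>
"""

if __name__ == "__main__":
    for array, members in degraded_arrays(SAMPLE).items():
        print(array, "failed:", ", ".join(members) or "(member missing)")
```

On a live system the same function could be fed the contents of /proc/mdstat instead of SAMPLE.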
...
...
The problem I have now is that it is sda (the boot drive) that has
failed. I have not encountered this problem before and therefore I
need to make sure that I understand what I need to do before I start
mucking around with things and dig myself into a deeper hole.
I have spent much time attempting to research the problem but have not
been able to come up with any definite information to help. As far as I
can see I have two options...
Option 1: Leave the system running and replace the drive. Then either
the RAID software will re-sync the drives or I can manually sync them
with mdadm. I have not seen anything that will support this option
but I am hoping that it is a valid option.
This should work, but you'll probably have to tell the controller that 
you are removing and adding disks.  This used to be done by writing 
something to /proc/scsi/scsi, but it may have changed and also may be 
controller specific so I'll let someone else point out the
documentation for that.
I am glad to hear that. I thought it might be the case but I just did
not feel up to trying it by yanking out my boot drive while the system
was up and running. That just sounded like a recipe for disaster if I
did not have some valid reasoning behind the move. I will wait to see
if anyone else weighs in on the subject with some pointers to actual
documentation.
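
As a rough sketch of what Option 1 might look like on this layout, the Python below only prints a candidate command sequence; it does not run anything. Assumptions, none of them confirmed in this thread: the helper name replacement_plan is hypothetical, the host/channel/id/lun values for the /proc/scsi/scsi step are left as placeholders because they are not known here, that step may differ by kernel and controller, and the partition table of the new sda is copied from sde with sfdisk.

```python
def replacement_plan(members, failed_disk="sda", good_disk="sde",
                     hcil="<host> <channel> <id> <lun>"):
    """Build an ordered list of shell commands for a hot swap (dry run only).

    members: list of (array, failed_partition) pairs, e.g. ("md0", "sda1").
    hcil:    SCSI address of the failed drive; placeholder, must be filled in.
    """
    steps = []
    # 1. Drop the already-failed members from each array.
    for array, part in members:
        steps.append(f"mdadm /dev/{array} --remove /dev/{part}")
    # 2. Tell the SCSI layer the device is going away (assumed mechanism).
    steps.append(f'echo "scsi remove-single-device {hcil}" > /proc/scsi/scsi')
    steps.append("# physically swap the drive here")
    steps.append(f'echo "scsi add-single-device {hcil}" > /proc/scsi/scsi')
    # 3. Copy the partition layout from the surviving mirror.
    steps.append(f"sfdisk -d /dev/{good_disk} | sfdisk /dev/{failed_disk}")
    # 4. Re-add the new partitions; md then resyncs each array.
    for array, part in members:
        steps.append(f"mdadm /dev/{array} --add /dev/{part}")
    steps.append("cat /proc/mdstat")
    return steps

if __name__ == "__main__":
    failed = [("md0", "sda1"), ("md1", "sda2"), ("md2", "sda3"), ("md3", "sda5")]
    print("\n".join(replacement_plan(failed)))
```

Printing the plan first and checking every device name against /proc/mdstat seems safer than typing the commands from memory, since a wrong device in the sfdisk step would overwrite the good mirror.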
...
...
Option 2: Create a boot disk (floppy or CD) that I can boot from but
that points to sde (the boot mirror). Shut down the system and replace
the failed sda drive. Boot from the new boot disk. Format, partition
and re-sync the new sda from sde. Shut down, remove the boot disk, and
reboot from the new sda.
You have an odd combination of drives... Normally you would want to 
mirror the partitions on the first 2 disks and install grub on both, in 
which case the system would still boot.  Some of the more sophisticated 
controllers can boot from more than the first 2, though.  Anyway, you 
should be able to boot from your install CD with 'linux rescue' at the 
boot prompt and get to a point where you can fix things.
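
The "install grub on both" suggestion can be sketched as follows, assuming GRUB legacy (the version shipped with CentOS 4) and assuming /boot lives on the first partition of the mirror disk, which this thread does not confirm. The helper name grub_batch is hypothetical; it only builds the text one might feed to the grub shell in batch mode.

```python
def grub_batch(disk, boot_partition=1):
    """Return GRUB legacy shell commands that install the boot loader on `disk`.

    The disk is mapped as (hd0) so that it still boots if it becomes the
    first BIOS drive after the other mirror half is removed.
    boot_partition is 1-based (as in /dev/sde1); GRUB legacy counts from 0.
    """
    lines = [
        f"device (hd0) {disk}",           # pretend this disk is the first drive
        f"root (hd0,{boot_partition - 1})",  # partition holding /boot (assumed)
        "setup (hd0)",                    # write the boot loader to the MBR
        "quit",
    ]
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    print(grub_batch("/dev/sde"), end="")
```

The printed lines would be typed at the grub prompt or piped to `grub --batch`, once for each disk of the boot mirror.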
The odd combination of drives was actually intentional on my part. The
idea was to provide "separation" between the mirrors. While I did not
have separate controllers I thought that using the separate channels 
on the common controller might provide a shade more resiliency. It was
my first attempt at setting up mirrored pairs on a non-MegaRAID SCSI
controller. Live and learn!
I will read up on "linux rescue" so that, if I have to fall back on this
method, I will be able to have a firm plan in place before I start the
work.
This particular system is our primary development system and does not
get all the "fancy" hardware that our production systems do. I have
configured the production systems using only the MegaRAID controllers
and there it is a "no brainer" to replace failed drives - just swap
the drive and away you go.
Thanks again for your comments. They are greatly appreciated.
Regards, Hugh
-- 
Hugh E Cruickshank, Forward Software, www.forward-software.com
