how to replace a raid drive with mdadm


CS_DBA

10 May 2014

4:39 p.m.

Hi all

If we lose a drive in a RAID 10 array (mdadm software RAID), what are the steps needed to correctly do the following:

- identify which physical drive it is
- replace the drive
- add the new drive to the array and force it to re-sync

Thanks in advance


Keith Keller

10 May 2014

5:06 p.m.

On 2014-05-10, CS_DBA cs_dba@consistentstate.com wrote:

If we lose a drive in a RAID 10 array (mdadm software RAID), what are the steps needed to correctly do the following:

identify which physical drive it is

This is controller dependent. Some controllers support blinking the drive light to identify it, others do not. If yours does not, you need to jury-rig something (e.g., physically label the drive slot or the drive itself, or send some dummy data to the drive to get its light to blink).
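One way to send that dummy data is to read from the raw device, which should make the activity light flicker without writing anything. This is only a sketch, assuming the drive still responds and GNU dd is available; the function name `blink_drive` and the `DRY_RUN` variable are made up for illustration:

```shell
# blink_drive DEVICE [MEGABYTES]
# Reads from the start of DEVICE and discards the data, so the drive's
# activity light should flicker. Nothing is written to the disk.
# With DRY_RUN=1 (the default in this sketch) the command is only printed.
blink_drive() {
    dev=$1
    mb=${2:-2048}
    cmd="dd if=$dev of=/dev/null bs=1M count=$mb iflag=direct"
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$cmd"
    else
        $cmd
    fi
}
```

Run it as root with something like `DRY_RUN=0 blink_drive /dev/sdX` once the printed command looks right.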


replace the drive

The md part is easy. If md hasn't failed the drive already, then you need to do that first:

mdadm /dev/mdN --fail /dev/sdXX

Then remove it from the array:

mdadm /dev/mdN --remove /dev/sdXX

The physical part is, again, hardware dependent.
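The md steps above can be sketched as two small helpers. The first lists members that md has already marked faulty, which show up with an `(F)` suffix in /proc/mdstat; the second issues the --fail and --remove commands in order. The names `md_failed_members`, `md_drop_member` and the `MDADM` variable are assumptions for this example, and by default the mdadm commands are only printed:

```shell
# md_failed_members MD [MDSTAT_FILE]
# Prints the members of array MD (e.g. "md0") that carry the (F) flag.
md_failed_members() {
    md=$1
    file=${2:-/proc/mdstat}
    awk -v md="$md" '$1 == md && $2 == ":" {
        for (i = 3; i <= NF; i++)
            if ($i ~ /\(F\)$/) { sub(/\[.*/, "", $i); print $i }
    }' "$file"
}

# md_drop_member ARRAY MEMBER
# Marks MEMBER as failed in ARRAY, then removes it from the array.
# Unless MDADM is set (e.g. MDADM=mdadm), the commands are only echoed.
md_drop_member() {
    array=$1
    member=$2
    run=${MDADM:-echo mdadm}
    $run "$array" --fail "$member" &&
    $run "$array" --remove "$member"
}
```

If `md_failed_members` already lists the drive, the --fail step is redundant but should be harmless.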


add the new drive to the array and force it to re-sync

Again, the physical part is hardware dependent. Once the kernel knows about your new drive, this should work (partition the drive beforehand if needed):

mdadm /dev/mdN --add /dev/sdYY

There may be extra parameters for replacing a failed RAID10 drive, but I suspect that md already knows the needed parameters, so just adding the drive should kick off a rebuild of the failed member.
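To confirm that the rebuild really did start after the --add, you can look for a progress line in /proc/mdstat. A rough sketch; the function name `md_recovery_line` is made up here, and the optional second argument exists only so the parser can be tried on a saved copy of the file:

```shell
# md_recovery_line MD [MDSTAT_FILE]
# Prints the progress line (recovery or resync) for array MD, e.g. "md0".
# Prints nothing if no rebuild is in progress for that array.
md_recovery_line() {
    md=$1
    file=${2:-/proc/mdstat}
    awk -v md="$md" '
        $1 == md && $2 == ":" { in_md = 1; next }  # first line of our array
        /^[^ \t]/             { in_md = 0 }        # another unindented line ends it
        in_md && /(recovery|resync) *=/ { sub(/^[ \t]+/, ""); print }
    ' "$file"
}
```

Something like `watch -n 5 cat /proc/mdstat` gives the same information interactively.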

-- kkeller@wombat.san-francisco.ca.us

Dennis Jacobfeuerborn

5:29 p.m.

On 10.05.2014 19:06, Keith Keller wrote:

On 2014-05-10, CS_DBA cs_dba@consistentstate.com wrote:

If we lose a drive in a RAID 10 array (mdadm software RAID), what are the steps needed to correctly do the following:

identify which physical drive it is

This is controller dependent. Some controllers support blinking the drive light to identify it, others do not. If yours does not, you need to jury-rig something (e.g., physically label the drive slot or the drive itself, or send some dummy data to the drive to get its light to blink).

This can also be inverted, especially if you can no longer send data to the drive because it has died completely: create lots of disk I/O with a command like "grep -nri test /usr", and all drives except the broken one should show activity.

Another way is to write down the serial number of each disk and the slot you put it in, then use hdparm -I /dev/sdX to find which device reports which serial number. That way, once sdX dies, you can check the list to find which slot the disk for the failed device was put in.
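Building that list could look roughly like this. The sketch pulls the "Serial Number:" line out of hdparm -I output; the function names are invented for the example, and the `/dev/sd?` glob is an assumption that only covers sda through sdz:

```shell
# serial_from_hdparm
# Reads `hdparm -I` output on stdin and prints the drive's serial number.
serial_from_hdparm() {
    awk -F: '/Serial Number/ {
        gsub(/^[ \t]+|[ \t]+$/, "", $2)   # trim padding around the value
        print $2
        exit
    }'
}

# list_serials
# Prints "device serial" for every /dev/sd? block device.
# Needs root and hdparm; it is defined here but not called.
list_serials() {
    for dev in /dev/sd?; do
        [ -b "$dev" ] || continue
        printf '%s %s\n' "$dev" "$(hdparm -I "$dev" 2>/dev/null | serial_from_hdparm)"
    done
}
```

Save the output of `list_serials` next to your slot notes while all drives are still healthy, since a dead drive will no longer answer hdparm.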

Regards, Dennis

Keith Keller

11:03 p.m.

On 2014-05-10, Dennis Jacobfeuerborn dennisml@conversis.de wrote:


This can also be inverted, especially if you can no longer send data to the drive because it has died completely: create lots of disk I/O with a command like "grep -nri test /usr", and all drives except the broken one should show activity.

That's certainly a good idea. If you have multiple arrays you'd need to send that IO to each array at mostly the same time, but with only one array it's less difficult. I think the most challenging scenario would be if the array has multiple spares--if the array rebuilds before you can look at it, then you have to generate IO on the array and on the drive(s) that are still spares.

If you have no active spares (either you started with none, or you had one and it's been used to replace the dead drive), one way to make IO is to start a check of the md array (e.g., echo check > /sys/block/mdN/md/sync_action ). The drive that doesn't blink is the dead one.
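A slightly more careful version of that echo first makes sure the array is idle, so it does not interfere with a rebuild that is already running. This is a sketch only; the function name `md_start_check` is made up, and the optional sysfs root argument exists purely so the logic can be tried against a scratch directory:

```shell
# md_start_check MD [SYSFS_ROOT]
# Starts a check of array MD (e.g. "md0") if its sync_action is "idle".
# Returns non-zero if the array is busy or the sysfs file is unreadable.
md_start_check() {
    md=$1
    root=${2:-/sys}
    action=$root/block/$md/md/sync_action
    state=$(cat "$action") || return 1
    if [ "$state" != idle ]; then
        echo "$md is busy: $state" >&2
        return 1
    fi
    echo check > "$action"
}
```

Writing `idle` to the same file stops the check once you have spotted the drive that is not blinking.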


Another way is to write down the serial number of each disk and the slot you put it in, then use hdparm -I /dev/sdX to find which device reports which serial number. That way, once sdX dies, you can check the list to find which slot the disk for the failed device was put in.

Physical labelling in this way (or some other way) is still the best solution, as long as you keep the list up to date (and don't screw up the list, of course). But it's definitely good to have multiple methods in your toolbox--for example, you might try the IO trick, then cross-check it against your physical labels. Better to take some extra time verifying which drive is dead than to pull the wrong one!

--keith

-- kkeller@wombat.san-francisco.ca.us

discuss@lists.centos.org
