Hi all
If we lose a drive in a RAID 10 array (mdadm software RAID), what are the steps needed to correctly do the following:
- identify which physical drive it is
- replace the drive
- add the new drive to the array and force it to re-sync
Thanks in advance
On 2014-05-10, CS_DBA cs_dba@consistentstate.com wrote:
> If we lose a drive in a RAID 10 array (mdadm software RAID), what are the steps needed to correctly do the following:
> - identify which physical drive it is
This is controller-dependent. Some controllers support blinking the drive light to identify it; others do not. If yours does not, you need to jury-rig something (e.g., physically label the drive slot or drive, or send some dummy data to the drive to get it to blink).
> - replace the drive
The md part is easy. If md hasn't failed the drive already, then you need to do that first:
mdadm /dev/mdN --fail /dev/sdXX
Then remove it from the array:
mdadm /dev/mdN --remove /dev/sdXX
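To check whether md has already failed a member before running the commands above, one option is to look for the "(F)" flag in /proc/mdstat. This is only a minimal sketch: the function name failed_members is made up for illustration, and it reads stdin so it can be tried against saved output.

```shell
# Print "<array> <member>" for every member that md has flagged as failed,
# i.e. marked "(F)" in /proc/mdstat-style text read from stdin.
failed_members() {
    awk '
        # Array lines look like: md0 : active raid10 sdd1[3] sdb1[1](F) sda1[0]
        $2 == ":" && $1 ~ /^md/ {
            for (i = 3; i <= NF; i++) {
                if ($i ~ /\(F\)/) {
                    member = $i
                    sub(/\[.*/, "", member)   # strip the "[1](F)" suffix
                    print $1, member
                }
            }
        }
    '
}
```

On a live system this would be run as "failed_members < /proc/mdstat"; "mdadm --detail /dev/mdN" shows the same state in more detail.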
The physical part is, again, hardware dependent.
> - add the new drive to the array and force it to re-sync
Again, the physical part is hardware-dependent. Once the kernel knows about your new drive, this should work (partition the drive beforehand if needed):
mdadm /dev/mdN --add /dev/sdYY
There may be extra parameters for replacing a failed RAID10 drive, but I suspect that md already knows the needed parameters, so just adding the drive should kick off a rebuild of the failed member.
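To confirm that the rebuild actually started, /proc/mdstat shows a progress line while a recovery or resync is running. A rough sketch for pulling out just the percentage follows; the function name rebuild_progress is hypothetical, and the line format is assumed to match what current kernels print.

```shell
# Print "<array> <operation> <percent>" for any running recovery or resync,
# reading /proc/mdstat-style text from stdin.
rebuild_progress() {
    awk '
        # Remember which array the following lines belong to.
        $2 == ":" && $1 ~ /^md/ { array = $1 }
        # Progress lines look like:
        #   [==>......]  recovery = 12.6% (123/456) finish=127.5min speed=33440K/sec
        /(recovery|resync) *=/ {
            for (i = 1; i <= NF; i++)
                if ($i == "recovery" || $i == "resync")
                    print array, $i, $(i + 2)
        }
    '
}
```

Run it as "rebuild_progress < /proc/mdstat". No output means no rebuild is in progress, in which case "mdadm --detail /dev/mdN" is the next place to look.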
On 10.05.2014 19:06, Keith Keller wrote:
> On 2014-05-10, CS_DBA cs_dba@consistentstate.com wrote:
>> If we lose a drive in a RAID 10 array (mdadm software RAID), what are the steps needed to correctly do the following:
>> - identify which physical drive it is
> This is controller-dependent. Some controllers support blinking the drive light to identify it; others do not. If yours does not, you need to jury-rig something (e.g., physically label the drive slot or drive, or send some dummy data to the drive to get it to blink).
This can also be inverted, especially if you cannot send data to the drive anymore because it has died completely: create lots of disk I/O with a command like "grep -nri test /usr", and all drives except the broken one should show activity.
Another way is to write down the serial numbers of the disks and the slots you put them in, then use hdparm -I /dev/sdX to find which device shows which serial number. That way, once sdX dies, you can check the list to find which slot the failed disk was put in.
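As an alternative to running hdparm on each disk, the udev symlinks in /dev/disk/by-id can be used to build the device-to-serial list in one go. The sketch below assumes ATA or SCSI disks whose by-id names embed the model and serial number, which is usual but not guaranteed. The function name disk_serial_map is made up, and the directory is a parameter only so it can be tried on a scratch directory.

```shell
# Print "<device> <by-id name>" for each whole-disk symlink in a
# /dev/disk/by-id style directory (default: the real one).
disk_serial_map() {
    dir=${1:-/dev/disk/by-id}
    for link in "$dir"/ata-* "$dir"/scsi-*; do
        [ -L "$link" ] || continue                      # unmatched glob, or not a symlink
        case $link in *-part[0-9]*) continue ;; esac    # skip partition entries
        target=$(readlink "$link")                      # e.g. ../../sda
        printf '%s %s\n' "${target##*/}" "${link##*/}"
    done | sort
}
```

Saving the output somewhere off the array (or printing it and taping it to the chassis next to the slot numbers) gives you the list to consult once a device has dropped out.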
Regards, Dennis
On 2014-05-10, Dennis Jacobfeuerborn dennisml@conversis.de wrote:
> This can also be inverted, especially if you cannot send data to the drive anymore because it has died completely: create lots of disk I/O with a command like "grep -nri test /usr", and all drives except the broken one should show activity.
That's certainly a good idea. If you have multiple arrays you'd need to send that IO to each array at mostly the same time, but with only one array it's less difficult. I think the most challenging scenario would be if the array has multiple spares--if the array rebuilds before you can look at it, then you have to generate IO on the array and on the drive(s) that are still spares.
If you have no active spares (either you started with none, or you had one and it's been used to replace the dead drive), one way to make IO is to start a check of the md array (e.g., echo check > /sys/block/mdN/md/sync_action ). The drive that doesn't blink is the dead one.
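A small wrapper around that sysfs write might look like the sketch below. The function name md_sync_action is hypothetical, and the sysfs root is a parameter only so the function can be exercised without a real array. Writing "idle" is included because that is how a running check is stopped once you have seen which light stays dark.

```shell
# Start ("check") or stop ("idle") a consistency check on an md array by
# writing to its sync_action file. Needs root on a real system.
md_sync_action() {
    md=$1 action=$2 sysfs=${3:-/sys}
    case $action in
        check|idle) ;;
        *) echo "usage: md_sync_action mdN check|idle [sysfs-root]" >&2
           return 2 ;;
    esac
    echo "$action" > "$sysfs/block/$md/md/sync_action"
}
```

For example, "md_sync_action md0 check" starts the check and "md_sync_action md0 idle" stops it again.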
> Another way is to write down the serial numbers of the disks and the slots you put them in, then use hdparm -I /dev/sdX to find which device shows which serial number. That way, once sdX dies, you can check the list to find which slot the failed disk was put in.
Physical labelling in this way (or some other way) is still the best solution, as long as you keep the list up to date (and don't screw up the list, of course). But it's definitely good to have multiple methods in your toolbox--for example, you might try the IO trick, then cross-check it against your physical labels. Better to take some extra time verifying which drive is dead than to pull the wrong one!
--keith