Hello all,
I have run into a sticky problem with a failed device in an md array, and I asked about it on the linux raid mailing list, but since the problem may not be md-specific, I am hoping to find some insight here. (If you are on the MD list, and are seeing this twice, I humbly apologize.)
The summary is that during a reshape of a raid6 on an up-to-date CentOS 6.3 box, one disk failed and was marked as such in the array, but md is not letting me remove it:
# mdadm /dev/md127 --fail /dev/sdg
mdadm: set /dev/sdg faulty in /dev/md127
# mdadm /dev/md127 --remove /dev/sdg
mdadm: hot remove failed for /dev/sdg: Device or resource busy
And in dmesg, I get an error like so:
md: cannot remove active disk sdg from md127 ...
More details, including mdadm -D output and other diagnostics, are at http://www.spinics.net/lists/raid/msg41928.html. As I note there, the array seems fine otherwise, but is not currently in active use (so perhaps my options are greater than if I wished to keep it deployed). As the other messages in that thread show, I think I've already done the "obvious" steps to try to remove the device from the array.
Checking things out further, I found that it may be that udev did not completely remove the disk, even though the controller no longer believes that the exported unit exists. (udevadm output is here: http://www.spinics.net/lists/raid/msg41950.html ) So my hypothesis is that if I can somehow force udev to drop the references to the disk completely, perhaps I can remove sdg from the array and start a rebuild with the spare already available. I found these docs for Fedora:
https://docs.fedoraproject.org/en-US/Fedora/14/html/Storage_Administration_G...
Of course I can't do step 3, since md is refusing to give up sdg; but sdg is already gone, so I really don't care about outstanding IO, and it's a bit too late to worry about a 100% clean removal. So my questions are: will step 7 actually clean up the references to sdg, and how likely is it that doing so would let me remove it from the array?
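For what it's worth, my understanding (and it is only my reading of the guide, not something I have tried yet) is that the low-level cleanup amounts to something like this, assuming sdg's sysfs entry is still present:

blockdev --flushbufs /dev/sdg           # flush anything still buffered; should be a no-op for a dead disk
echo 1 > /sys/block/sdg/device/delete   # ask the SCSI layer to drop the device entirely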
And finally, if the above is not a wise way to go, are there better things to try? If other diagnostic output is desired please let me know. Thanks!
--keith
Hi Keith,
It seems that the mdadm -D output indicates the root cause of the "device busy" error:
5 8 96 5 faulty spare rebuilding /dev/sdg
Is there any clue in /proc/mdstat and /var/log/messages?
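For example, something along these lines might show what md and the kernel logged around the failure:

cat /proc/mdstat
grep -E 'md127|sdg' /var/log/messages | tail -n 50   # recent kernel/md messages mentioning the array or the disk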
Hi Vincent,
On 2013-02-11, Vincent Li <ruconse@gmail.com> wrote:
Hi Keith,
It seems that the mdadm -D indicates the root cause of "device busy":
5 8 96 5 faulty spare rebuilding /dev/sdg
Well, this is one thing I don't quite get. In the past, when a device has been marked faulty (even on this array), md has permitted me to remove it. These occasions were not during a reshape, however. Naively I would think that md would give up IO on a failed device, and so it would no longer be busy. And the dmesg report implies that md thinks the device is still "active" even though it marked it faulty.
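One thing I suppose I can poke at is the per-device state md exposes under sysfs, which is presumably where that "active" flag lives (assuming the dev-sdg entry is still there):

cat /sys/block/md127/md/dev-sdg/state              # shows e.g. faulty, in_sync, spare
echo remove > /sys/block/md127/md/dev-sdg/state    # I believe this is the sysfs equivalent of mdadm --remove

I'd expect the echo to fail the same way mdadm --remove does, but the error might be more informative.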
Is there any clue in /proc/mdstat and /var/log/messages?
Not really. Here's mdstat:
Personalities : [raid6] [raid5] [raid4]
md127 : active raid6 sdm[13](S) sdg[5](F) sdj[8] sdi[7] sdk[10] sdc[1] sdn[12] sdd[2] sde[3] sdf[4] sdh[6] sdb[0] sdl[11]
      17578013184 blocks super 1.2 level 6, 512k chunk, algorithm 2 [12/11] [UUUUU_UUUUUU]
      resync=PENDING
unused devices: <none>
As you may expect, sdg is set as faulty, and sdm is marked as a spare; in the past, if things were nice, sdg would be removed automatically and a new rebuild would start with sdm.
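For reference, the manual equivalent of that automatic behaviour would be something like:

mdadm /dev/md127 --remove /dev/sdg   # clear the failed slot (this is the step that is refusing to work)
# with sdm already attached as a spare, the rebuild would then start on its own;
# otherwise it would be: mdadm /dev/md127 --add /dev/sdm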
There isn't anything compelling in /var/log/messages, either. The only items that seem relevant are the read errors logged when tools like mdadm -E /dev/sdg try to touch the device. This is in fact what led me to look at udevadm info; I expected that mdadm -E would not find anything on sdg, because I thought that sdg no longer existed at all. That's when I found sdg in this limbo state. Oddly enough, though, the udevadm info output has changed:
# udevadm info --name=sdg --query=all
P: /devices/pci0000:00/0000:00:0b.0/0000:01:03.0/host2/target2:0:5/2:0:5:0/block/sdg
N: sdg
W: 102
S: block/8:96
S: disk/by-path/pci-0000:01:03.0-scsi-0:0:5:0
E: UDEV_LOG=3
E: DEVPATH=/devices/pci0000:00/0000:00:0b.0/0000:01:03.0/host2/target2:0:5/2:0:5:0/block/sdg
E: MAJOR=8
E: MINOR=96
E: DEVNAME=/dev/sdg
E: DEVTYPE=disk
E: SUBSYSTEM=block
E: MPATH_SBIN_PATH=/sbin
E: ID_SCSI=1
E: ID_TYPE=generic
E: ID_BUS=scsi
E: ID_PATH=pci-0000:01:03.0-scsi-0:0:5:0
E: LVM_SBIN_PATH=/sbin
E: DEVLINKS=/dev/block/8:96 /dev/disk/by-path/pci-0000:01:03.0-scsi-0:0:5:0
It no longer thinks there is any connection to the mdraid or the controller, but it's still different from what I'd expect if there were no udev entries for the device at all.
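For comparison, I can run the same query against one of the healthy members; assuming the stock blkid-based udev rules are in play, I'd expect those to still show the raid-member identification that sdg has lost:

udevadm info --name=sdb --query=all | grep -iE 'fs_type|raid|devlinks'   # healthy members should show ID_FS_TYPE=linux_raid_member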
--keith
Hello all,
Just for posterity's sake, I was able to resolve this issue by stopping the array, reassembling it using --force, and removing the device from udev using the Fedora docs I referenced earlier. I don't know if there is a way to resolve it without stopping the array, but if I read anything on the raid list that might be relevant to CentOS users, I will pass it on.
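For anyone who hits the same thing, the sequence amounted to roughly the following (reconstructed from memory, with the member list abbreviated by globs; I may have the exact order slightly different):

mdadm --stop /dev/md127                                          # release md's claim on the stale device
echo 1 > /sys/block/sdg/device/delete                            # drop the dead disk at the SCSI layer, per the Fedora guide
mdadm --assemble --force /dev/md127 /dev/sd[b-f] /dev/sd[h-n]    # reassemble from the surviving members plus the spare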
--keith