We're just using Linux software RAID for the first time - RAID1, and the other day, a drive failed. We have a clone machine to play with, so it's not that critical, but....
I partitioned a replacement drive. On the clone, I marked the RAID partitions on /dev/sda as failed, removed them, and pulled the drive. After several iterations, I waited a minute or two, until all messages had stopped and there was only /dev/sdb*, and then put the new one in... and it appears as /dev/sdc. I don't want to reboot the box, and after googling a bit, it looks as though I *might* be able to use udevadm to change that... but the manpage leaves something to be desired... like a man page. The one that looks interesting to me is udevadm --test --action=<string>, which shows the actual command that would run, but there is *ZERO* information as to what actions are available, other than the default of "add".
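For the record, the fail/remove steps above look roughly like this - just a sketch, with /dev/md0 and /dev/sda1 as placeholders for whatever your actual arrays and members are:

    # mark the member failed, then pull it out of the array
    mdadm /dev/md0 --fail /dev/sda1
    mdadm /dev/md0 --remove /dev/sda1
    # once the replacement shows up (here it came back as /dev/sdc):
    cat /proc/partitions                        # confirm the kernel's name for the new disk
    udevadm info --query=all --name=/dev/sdc    # what udev knows about it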
Clues for the poor, folks?
mark
On 06.12.2011 19:28, m.roth@5-cent.us wrote:
We're just using Linux software RAID for the first time - RAID1, and the other day, a drive failed. We have a clone machine to play with, so it's not that critical, but....
I partitioned a replacement drive. On the clone, I marked the RAID partitions on /dev/sda as failed, removed them, and pulled the drive.
dd if=/dev/sdx of=/dev/sdy bs=512 count=1
reboot

to clone the whole MBR and partition table
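An alternative sketch, if you only want the partition table and not the boot code - sdX is the surviving disk, sdY the new one:

    # dump the partition table of the good disk and write it onto the new one
    sfdisk -d /dev/sdX | sfdisk /dev/sdY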
After several iterations, I waited a minute or two, until all messages had stopped and there was only /dev/sdb*, and then put the new one in... and it appears as /dev/sdc.
the device name is totally uninteresting; the UUIDs are what matter:
mdadm /dev/mdX --add /dev/sdXY
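To see the IDs it actually goes by (a sketch, assuming /dev/sdb1 is an existing member of /dev/md0):

    # array UUID recorded in the member's RAID superblock
    mdadm --examine /dev/sdb1 | grep UUID
    # the same UUID as reported by the assembled array
    mdadm --detail /dev/md0 | grep UUID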
Reindl Harald wrote:
On 06.12.2011 19:28, m.roth@5-cent.us wrote:
We're just using Linux software RAID for the first time - RAID1, and the other day, a drive failed. We have a clone machine to play with, so it's not that critical, but....
I partitioned a replacement drive. On the clone, I marked the RAID partitions on /dev/sda as failed, removed them, and pulled the drive.
<snip>
After several iterations, I waited a minute or two, until all messages had stopped and there was only /dev/sdb*, and then put the new one in... and it appears as /dev/sdc.
the device name is totally uninteresting; the UUIDs are what matter:
mdadm /dev/mdX --add /dev/sdXY
No, it's not uninteresting. I can't be sure that when it reboots, it won't come back as /dev/sda. And the few places I find that have howtos on replacing failed RAID drives don't seem to have run into this issue with udev (I assume) and /dev/sda.
mark
On Tuesday, December 06, 2011 02:21:09 PM m.roth@5-cent.us wrote:
Reindl Harald wrote:
the device name is totally uninteresting; the UUIDs are what matter:
mdadm /dev/mdX --add /dev/sdXY
No, it's not uninteresting. I can't be sure that when it reboots, it won't come back as /dev/sda.
The RAIDsets will be assembled by UUID, not by device name. It doesn't matter whether it comes up as /dev/sda or /dev/sdah or whatever, except for booting purposes, which you'll need to handle manually by making sure all the bootloader sectors (some of which can be outside any partition) are properly copied over; just getting the MBR is not enough in some cases.
I have, on upstream EL 6.1, a box with two 750G drives in RAID1 (they are not the boot drives). The device names for the component devices do not come up the same on every boot; some boots they come up as /dev/sdx and /dev/sdy, and some boots /dev/sdw and /dev/sdab, and others. The /dev/md devices come up fine and get mounted fine even though the component device names are somewhat nondeterministic.
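You can see that for yourself by comparing what the running arrays report against mdadm.conf - a sketch, assuming a stock /etc/mdadm.conf:

    # running arrays, printed with the UUIDs they were assembled by
    mdadm --detail --scan
    # a typical /etc/mdadm.conf line references the array by UUID, not by member name:
    # ARRAY /dev/md0 UUID=<array-uuid>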
Lamar Owen wrote:
On Tuesday, December 06, 2011 02:21:09 PM m.roth@5-cent.us wrote:
Reindl Harald wrote:
the device name is totally uninteresting; the UUIDs are what matter:
mdadm /dev/mdX --add /dev/sdXY
No, it's not uninteresting. I can't be sure that when it reboots, it won't come back as /dev/sda.
The RAIDsets will be assembled by UUID, not by device name. It doesn't matter whether it comes up as /dev/sda or /dev/sdah or whatever, except for booting purposes, which you'll need to handle manually by making sure all the bootloader sectors (some of which can be outside any partition) are properly copied over; just getting the MBR is not enough in some cases.
<snip>

Booting purposes is the point: /dev/md0 is /boot. And as the slot's ATA0, it should come up as sda. You mention getting the bootloader sectors over - do you mean, after it's rebuilt and active, to then rerun grub-install?

Remember, the drive that's failing is /dev/sda. I also don't see why, if I replace the original drive in the original slot (I'm doing this on the clone, whose drive is just fine, thankyouveddymuch), it isn't recognized as /dev/sda.
Another question, for you, or Johnny - where's the source code for udevadm? The documentation doesn't say, and I was just trying to yum install udev-devel so I could look at the source code, but there's no such package.
mark
On Tuesday, December 06, 2011 02:46:24 PM m.roth@5-cent.us wrote:
Booting purposes is the point: /dev/md0 is /boot. And as the slot's ATA0, it should come up as sda. You mention getting the bootloader sectors over
- do you mean, after it's rebuilt and active, to then rerun grub-install?
Either that, or dd the stage1.5 sectors over. See https://en.wikipedia.org/wiki/GNU_GRUB#GRUB_version_1
for more information on where the stage1.5 is located on the disk.
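If you go the dd route, a sketch of copying that embedding area - the sectors between the MBR and the first partition, usually sectors 1 through 62 on a DOS-labelled disk; sdX is the good disk, sdY the new one, and do check that your first partition really starts at sector 63 before trying this:

    # copy the post-MBR gap, where grub's stage1.5 normally lives
    dd if=/dev/sdX of=/dev/sdY bs=512 skip=1 seek=1 count=62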
Remember, the drive that's failing is /dev/sda. I also don't see why, if I replace the original drive in the original slot (I'm doing this on the clone, whose drive is just fine, thankyouveddymuch), it isn't recognized as /dev/sda.
I've seen that happen before. It was a tad disconcerting at first, but, yes, in the case I saw it a reboot made it back to sda.
Another question, for you, or Johnny - where's the source code for udevadm? The documentation doesn't say, and I was just trying to yum install udev-devel so I could look at the source code, but there's no such package.
The udev source RPM should contain this. As CentOS 5 doesn't include udevadm, I'm assuming this is CentOS 6, which does, so you'd want to get
http://vault.centos.org/6.0/cr/SRPMS/Packages/udev-147-2.35.el6.src.rpm
which is the latest CentOS source package. (6.2 has a newer one; you could, if you wanted to, go to ftp.redhat.com and grab the src.rpm there).
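Once you have the src.rpm, something along these lines will unpack the source without installing anything - the tarball name is from memory, so adjust if it differs:

    # extract the tarball and patches from the source RPM, then unpack it
    rpm2cpio udev-147-2.35.el6.src.rpm | cpio -idmv
    tar xjf udev-147.tar.bz2
    ls udev-147/udev/        # udevadm*.c should be in here, if memory serves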
On Tuesday, December 06, 2011 03:09:42 PM Lamar Owen wrote:
I've seen that happen before. It was a tad disconcerting at first, but, yes, in the case I saw it a reboot made it back to sda.
Oh, and /dev/sda is not necessarily the BIOS boot device, by the way. For instance, on my upstream EL 6.1 box:
[root@www ~]# mount|grep boot
/dev/sdag1 on /boot type ext4 (rw)
[root@www ~]#
That is not constant; I have seen it at /dev/sda1, and I have seen it everywhere in between.
It will be interesting to see if 6.2 has deterministic behavior in device names.
Lamar Owen wrote:
On Tuesday, December 06, 2011 03:09:42 PM Lamar Owen wrote:
I've seen that happen before. It was a tad disconcerting at first, but, yes, in the case I saw it a reboot made it back to sda.
Oh, and /dev/sda is not necessarily the BIOS boot device, by the way. For instance, on my upstream EL 6.1 box:
[root@www ~]# mount|grep boot
/dev/sdag1 on /boot type ext4 (rw)
[root@www ~]#
That is not constant; I have seen it at /dev/sda1, and I have seen it everywhere in between.
It will be interesting to see if 6.2 has deterministic behavior in device names.
Yeah, over the years, RH has been non-deterministic (thinking of ethX). I don't see why it shouldn't deterministically iterate over the channels, which would make sda, sdb, etc., the same every time.
mark
Lamar Owen wrote:
On Tuesday, December 06, 2011 03:09:42 PM Lamar Owen wrote:
I've seen that happen before. It was a tad disconcerting at first, but, yes, in the case I saw it a reboot made it back to sda.
Right, forgot to mention: I need to partition the drive before I use it anyway, and I have already prepped the replacement drive and toggled its bootable flag, so *maybe* I just need to rerun grub-install.
mark
Lamar Owen wrote:
On Tuesday, December 06, 2011 02:46:24 PM m.roth@5-cent.us wrote:
Booting purposes is the point: /dev/md0 is /boot. And as the slot's ATA0, it should come up as sda. You mention getting the bootloader sectors over - do you mean, after it's rebuilt and active, to then rerun grub-install?
Either that, or dd the stage1.5 sectors over. See https://en.wikipedia.org/wiki/GNU_GRUB#GRUB_version_1
for more information on where the stage1.5 is located on the disk.
Well, I dd'd the first meg or so of both /dev/sdc2 and /dev/sdc3 with /dev/zero (what happened to /dev/true and /dev/false?), re-added them, and it's happily rebuilding.
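(For the archives: the tidier way to wipe old RAID metadata before re-adding, rather than dd'ing zeros, is apparently mdadm's own option - placeholder partition name:)

    # clear any stale md superblock so the partition re-adds cleanly
    mdadm --zero-superblock /dev/sdc2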
Remember, the drive that's failing is /dev/sda. I also don't see why, if I replace the original drive in the original slot (I'm doing this on the clone, whose drive is just fine, thankyouveddymuch), it isn't recognized as /dev/sda.
I've seen that happen before. It was a tad disconcerting at first, but, yes, in the case I saw it a reboot made it back to sda.
I'll be *very* interested about rebooting.
Another question, for you, or Johnny - where's the source code for udevadm? The documentation doesn't say, and I was just trying to yum install udev-devel so I could look at the source code, but there's no such package.
The udev source RPM should contain this. As CentOS 5 doesn't include udevadm, I'm assuming this is CentOS 6, which does, so you'd want to get
http://vault.centos.org/6.0/cr/SRPMS/Packages/udev-147-2.35.el6.src.rpm
which is the latest CentOS source package. (6.2 has a newer one; you could, if you wanted to, go to ftp.redhat.com and grab the src.rpm there).
Thanks a lot - that's much appreciated.
mark
On Tue, Dec 6, 2011 at 1:46 PM, m.roth@5-cent.us wrote:
Booting purposes is the point: /dev/md0 is /boot. And as the slot's ATA0, it should come up as sda. You mention getting the bootloader sectors over
- do you mean, after it's rebuilt and active, to then rerun grub-install?
Remember, the drive that's failing is /dev/sda. I also don't see why, if I replace the original drive in the original slot (I'm doing this on the clone, whose drive is just fine, thankyouveddymuch), it isn't recognized as /dev/sda.
Booting software RAID1 is kind of an oddball case. You actually boot on the disk that bios considers your boot drive only, and it works because the drives are mirrored and happen to look the same whether you look at the partition or the raid device. You need to install grub separately onto each disk of the pair, though. If you get this part wrong, you can boot from an install/rescue disk and fix it.
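In practice, once the mirror is back in sync, that means something like the following - a sketch, assuming the two members are sda and sdb and that grub-install maps them sanely; if it doesn't, the manual grub-shell setup works as a fallback:

    # put the grub boot code on both halves of the mirror
    grub-install /dev/sda
    grub-install /dev/sdb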
Les Mikesell wrote:
On Tue, Dec 6, 2011 at 1:46 PM, m.roth@5-cent.us wrote:
Booting purposes is the point: /dev/md0 is /boot. And as the slot's ATA0, it should come up as sda. You mention getting the bootloader sectors over - do you mean, after it's rebuilt and active, to then rerun grub-install?

Remember, the drive that's failing is /dev/sda. I also don't see why, if I replace the original drive in the original slot (I'm doing this on the clone, whose drive is just fine, thankyouveddymuch), it isn't recognized as /dev/sda.
Booting software RAID1 is kind of an oddball case. You actually boot on the disk that bios considers your boot drive only, and it works because the drives are mirrored and happen to look the same whether you look at the partition or the raid device. You need to install grub separately onto each disk of the pair, though. If you get this part wrong, you can boot from an install/rescue disk and fix it.
Ack! So I actually need to run grub-install, and do it for both; didn't know that (always had /boot as a plain vanilla primary partition)?!
Thanks, Les.
On Tuesday, December 06, 2011 03:39:06 PM m.roth@5-cent.us wrote:
Ack! So I actually need to run grub-install, and do it for both; didn't know that (always had /boot as a plain vanilla primary partition)?!
If /dev/sdb doesn't have the stage1.5 in the first 60 or so sectors after the MBR (which it may not) then yes you'll need to re-run grub-install, or you can do a manual setup inside the grub shell. I've seen a HOWTO out there, but its name (or other greppable info) escapes me at the moment.
The stage1.5 portion of grub is not located in /boot, nor is it located in any partition; it's between the MBR and the start of the first partition. Copying the MBR is not enough if you have a stage1.5.
This is assuming that your boot is actually using a stage1.5 at all, and it might not be.
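The manual setup inside the grub shell goes roughly like this - placeholders again, assuming the second disk is /dev/sdb and its first partition holds /boot. The device line temporarily maps that disk in as hd0, root names the /boot partition, and setup writes stage1 (plus stage1.5 if one is used) pointing at the stage2 there:

    # run the grub shell and install onto the second disk
    grub
    grub> device (hd0) /dev/sdb
    grub> root (hd0,0)
    grub> setup (hd0)
    grub> quit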
On Tue, Dec 6, 2011 at 2:39 PM, m.roth@5-cent.us wrote:
Booting software RAID1 is kind of an oddball case. You actually boot on the disk that bios considers your boot drive only, and it works because the drives are mirrored and happen to look the same whether you look at the partition or the raid device. You need to install grub separately onto each disk of the pair, though. If you get this part wrong, you can boot from an install/rescue disk and fix it.
Ack! So I actually need to run grub-install, and do it for both; didn't know that (always had /boot as a plain vanilla primary partition)?!
Just keep in mind that you can fix it from an install disk booted in rescue mode if you get it wrong, so don't panic if it won't boot. But, if you have a similar box it would be a good idea to practice - and to be sure someone is around the production box when it is rebooted the first time.
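The rescue-mode fix is roughly: boot the install media with 'linux rescue', let it find and mount the installed system, then something like:

    # switch into the installed system and reinstall the boot loader
    chroot /mnt/sysimage
    grub-install /dev/sda     # repeat for the other mirror member if needed
    exit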
Les Mikesell wrote:
On Tue, Dec 6, 2011 at 2:39 PM, m.roth@5-cent.us wrote:
Booting software RAID1 is kind of an oddball case. You actually boot on the disk that bios considers your boot drive only, and it works because the drives are mirrored and happen to look the same whether you look at the partition or the raid device. You need to install grub separately onto each disk of the pair, though. If you get this part wrong, you can boot from an install/rescue disk and fix it.
Ack! So I actually need to run grub-install, and do it for both; didn't know that (always had /boot as a plain vanilla primary partition)?!
Just keep in mind that you can fix it from an install disk booted in rescue mode if you get it wrong, so don't panic if it won't boot.
Oh, I know that - I'm not a newbie at this. On the other hand, I *am* new to Linux software RAID with /boot RAIDed.
But, if you have a similar box it would be a good idea to practice - and to be sure someone is around the production box when it is rebooted the first time.
I think you missed where I was doing this on the clone box, and we will reboot it tomorrow, then I'll do it to the live one Thursday. I don't play around without knowing what the numbers are on production.
mark
On 06.12.2011 20:21, m.roth@5-cent.us wrote:
Reindl Harald wrote:
On 06.12.2011 19:28, m.roth@5-cent.us wrote:
We're just using Linux software RAID for the first time - RAID1, and the other day, a drive failed. We have a clone machine to play with, so it's not that critical, but....
I partitioned a replacement drive. On the clone, I marked the RAID partitions on /dev/sda as failed, removed them, and pulled the drive.
<snip>
After several iterations, I waited a minute or two, until all messages had stopped and there was only /dev/sdb*, and then put the new one in... and it appears as /dev/sdc.

the device name is totally uninteresting; the UUIDs are what matter:
mdadm /dev/mdX --add /dev/sdXY
No, it's not uninteresting. I can't be sure that when it reboots, it won't come back as /dev/sda. And the few places I find that have howtos on replacing failed RAID drives don't seem to have run into this issue with udev (I assume) and /dev/sda.
IT IS UNINTERESTING
try it!
you can move the disks of a software RAID even between different device types, because they are identified by UUID, so /dev/sdX does not matter
it does not even matter on a non-RAID setup, as long as the filesystems are referenced by UUID in /etc/fstab
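For a plain filesystem that looks like this (the UUID is whatever blkid reports, not a real value):

    blkid /dev/sda1
    # then in /etc/fstab, reference the UUID instead of /dev/sdXN:
    # UUID=<uuid-from-blkid>   /boot   ext4   defaults   1 2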
On Tue, Dec 6, 2011 at 12:28 PM, m.roth@5-cent.us wrote:
We're just using Linux software RAID for the first time - RAID1, and the other day, a drive failed. We have a clone machine to play with, so it's not that critical, but....
I partitioned a replacement drive. On the clone, I marked the RAID partitions on /dev/sda as failed, removed them, and pulled the drive. After several iterations, I waited a minute or two, until all messages had stopped and there was only /dev/sdb*, and then put the new one in... and it appears as /dev/sdc. I don't want to reboot the box, and after googling a bit, it looks as though I *might* be able to use udevadm to change that... but the manpage leaves something to be desired... like a man page. The one that looks interesting to me is udevadm --test --action=<string>, which shows the actual command that would run, but there is *ZERO* information as to what actions are available, other than the default of "add".
Clues for the poor, folks?
If your drive controllers support hot-swap, a freshly swapped drive should appear at the lowest available sd? letter, and a removed one should disappear fairly quickly, leaving its identifier available for re-use. But the disk name does not matter at all.

Put the disk in, do a 'dmesg' to see the name the kernel picks for it, add the matching partition and mark it as type 'FD' for future autoassembly. Do a 'cat /proc/mdstat' to see the current raid status. You probably need to 'mdadm --remove /dev/md? /dev/sd?' to remove the failed partition from the running array. Then use 'mdadm --add /dev/md? /dev/sd?' with the raid device and new partition names. This will start the mirror sync and should be all you need to do. Then, assuming you are using kernel autodetect to assemble at boot time, it won't matter if the disk is recognized as the same name at bootup; it will still be paired correctly.
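Pulling that together as commands - a sketch with placeholder names, where md0 is the array, sdb1 the failed member, and sdc the replacement disk:

    cat /proc/mdstat                       # current array status
    mdadm --remove /dev/md0 /dev/sdb1      # drop the failed member (fail it first if mdadm complains)
    fdisk /dev/sdc                         # partition the new disk; set the type to 'fd' (Linux raid autodetect)
    mdadm --add /dev/md0 /dev/sdc1         # add the new partition; the mirror resync starts here
    watch cat /proc/mdstat                 # watch the rebuild progress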