Kai Schaetzl spake the following on 6/8/2006 8:36 AM:
William L. Maltby wrote on Tue, 06 Jun 2006 15:59:13 -0400:
I can't be much help at all
Well, you made the "mistake" of replying ;-) Thanks, anyway. Read down before you start commenting; the "solution" will be at the end :-)
Is it correct that the raid must come into play after initial boot *if* one is to be able to boot on the remaining disk after primary fails?
Not sure what exactly you mean by that, but more or less, yes. The problem wasn't the RAID but grub, which either wouldn't boot automatically or wouldn't boot at all.
My thinking is that if raid is in play when grub is installed, does it mess up where things are put? If not, does it mess up when there's a failure (raid 0,....5... I guess would affect that).
You would want to boot from the disk that did *not* fail, or at least resync in the right direction. I think that's also the reason why the resync isn't done automatically: if you happen to boot from the "wrong" disk (which still works, it's just a bit "different"), it might then resync in the wrong direction. I'm not quite sure how to determine which is the "correct" disk, though.

dmesg (or was it boot.log?) shows a lot of lines like "considering /dev/hda1" and so on from the RAID detection. It binds the correct disk to the array and then "considers" the next partition. When it encounters the "dropped" partition it won't use it as the "master" for an md partition, because that md partition is already active. So all is well and I can resync then. However, if it were to do this in another order, e.g. first "consider" the failed disk, I don't know what would happen. Maybe it goes by some timestamp; in that case the failed disk should always have an older timestamp, since it dropped out of the array. Or it uses another method to mark that disk as "stale" or whatever. In that case I think one doesn't have to worry about booting from the wrong disk.
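If it helps: as far as I know md keeps an event counter and an update time in each member's superblock, and the member that dropped out ends up with the lower/older values, so the kernel shouldn't pick the stale disk as master. You can compare the two yourself before resyncing, something like this (device names are just examples, adjust to your layout):

# mdadm --examine /dev/hda1 | grep -E 'Update Time|Events'
# mdadm --examine /dev/hdc1 | grep -E 'Update Time|Events'

The partition with the higher event count / newer update time is the current one, so that's the side you want to resync *from*.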
Is it boot from boot partition, install grub, copy to second disk.
No, I installed a normal RAID setup with the CentOS install CD on both disks (no LVM) and then additionally installed grub to the second disk after setup had finished. That could be done from the setup screen as well, but it's not supported: you can only choose which disk to install to, not both of them. A serious flaw in the setup screen, I think, but one made at the upstream vendor.

I think your scenario above would fit very well if one of the drives failed and got replaced. I'd then need to copy the partition structure to it, resync and then set up grub on it (or set up grub first and then resync). I'm not sure how I should copy the partition structure. The grub+RAID howto (3 years old) gives the following for copying the structure:

# sfdisk - /dev/hdx | sfdisk /dev/hdy

Does this sound like the right way?
Pardon my chiming in, but this is an adequate way to copy the partition data, and that is what I used on my software raid systems. Don't forget that the first part is sfdisk -d, not just sfdisk -. Just be careful that you specify the correct devices in the right places. Otherwise you will have 2 blank disks! You could use "sfdisk -d /dev/hdx >somefile" and you would have a backup of the partition structure just in case.
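For example (hda/hdc here are only placeholders, double-check your own device names before running anything):

# sfdisk -d /dev/hda > hda-partitions.txt    (dump the good disk's table to a file)
# sfdisk /dev/hdc < hda-partitions.txt       (write it to the replacement disk)

That way the dump file also doubles as the backup I mentioned.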
Is grub (stage 1?) installed in *partition* or in mbr? If in partition, partition must be made active? Both drives?
That's all done by the setup (I checked later). I'm not sure "where" grub gets installed. AFAIK there's a boot loader put into the MBR that will then load grub from the partition it was installed to (which is mounted as /boot).
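If you ever have to redo that by hand on the second disk (say, after swapping in a replacement), the grub shell can do it; the usual trick is to map the second disk as hd0 and write stage1 to its MBR. Roughly like this (hdc and (hd0,0) are examples, match them to your /boot partition):

# grub
grub> device (hd0) /dev/hdc
grub> root (hd0,0)
grub> setup (hd0)
grub> quit

The "device" line makes grub install itself on the second disk as if it were the first one, which is what you want when the real first disk is gone.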
If raid is active when you try to write initial loader to mbr/partition, does it mess it up?
Doesn't look like it. I think that's because the MBR isn't part of the RAID.
Can you get to where you can boot one disk w/o raid active?
It seems that it gets automatically detected that the disk is part of a RAID array. So, if you just boot one disk it is still booted as RAID, just with one active disk.
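You can see that on the running system, too. /proc/mdstat lists each md device with its members and a status like [UU]; with only one disk present the missing member shows up as an underscore, e.g. [U_]:

# cat /proc/mdstat

(The exact output obviously depends on your arrays, but the underscore is the quick way to spot a degraded one.)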
It would be aggravating if it was nothing raid related, but some hardware/user/software error making you think it was your raid doing it to you.
I was sure it wasn't the RAID, the problem was grub.
I'd appreciate it if anyone reading my comments who sees some need for correction or improvement would do so. Thanks.
Ok, what I did was start over. First I installed again by just formatting the partitions but leaving the partition structure in place. That gave me more or less the same result; it seems the grub stuff didn't get properly overwritten that way. So I removed the partitions, cleaned the whole disk and did another installation. This time everything seems to be fine. However, I tested only by shutting down, removing one of the disks and then booting up. I can pull out either of the disks, boot from the other and resync the pulled-out one. I haven't tested dropping one of the disks in the middle of operation yet.
Don't do that! Don't test by pulling a running disk unless it is in hotplug-capable hardware. Test by using mdadm to remove that drive.
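Simulating a failure that way looks roughly like this (md1/hdc1 are just examples, use your own array and member names):

# mdadm /dev/md1 --fail /dev/hdc1      (mark the member faulty)
# mdadm /dev/md1 --remove /dev/hdc1    (take it out of the array)
  ... reboot, test whatever you want to test ...
# mdadm /dev/md1 --add /dev/hdc1       (put it back and let it resync)

You can watch the resync progress in /proc/mdstat while it rebuilds.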
There are two things I realized:
- The disks need to run on different IDE controllers, it seems: one as primary master, the other as secondary master. At least that's how it works here for the boot disks. I first had them hanging as master and slave on the same controller, and then I got the boot disk failure I mentioned in my first posting when the master was pulled out. The good thing is that I was able to change this without setting up the system again: it recognized the disk moving from hdb to hdc and I just needed to resync.
That info is in the software raid howto. Some systems are more tolerant than others, but usually a failed drive will lock the entire channel, so the primary and the slave would go down.
- the grub.conf doesn't seem to matter much. Excerpt:
default=0
fallback=2

title CentOS (2.6.9-34.0.1.EL)
        root (hd0,0)
        kernel /vmlinuz-2.6.9-34.0.1.EL ro root=/dev/md1
        initrd /initrd-2.6.9-34.0.1.EL.img
title CentOS 2
        root (hd1,0)
        kernel /vmlinuz-2.6.9-34.0.1.EL ro root=/dev/md1
        initrd /initrd-2.6.9-34.0.1.EL.img
Here it always boots entry 0, no matter whether disk 0 is powered down, or powered up but no longer active in the array. It somehow finds the correct setup anyway.
Kai