Kai Schaetzl spake the following on 6/8/2006 8:36 AM:
> William L. Maltby wrote on Tue, 06 Jun 2006 15:59:13 -0400:
>
>> I can't be much help at all
>
> Well, you made the "mistake" of replying ;-) Thanks, anyway. Read down
> before you start commenting, the "solution" will be at the end :-)
>
>> Is it correct that the raid must come into play after initial boot *if*
>> one is to be able to boot on the remaining disk after primary fails?
>
> Not sure what you mean exactly by that, but more or less, yes. The problem
> wasn't the RAID but grub not wanting to boot automatically, or not at all.
>
>> My thinking is that if raid is in play when grub is installed, does it
>> mess up where things are put? If not, does it mess up when there's a
>> failure (raid 0,....5... I guess would affect that).
>
> You would want to boot from the disk that did *not* fail, or at least
> resync in the right direction. I think that's also the reason why the
> resync isn't done automatically. If you happen to boot from the "wrong"
> disk (given that it still works but is just a bit "different") it might
> then resync in the wrong direction ...
> I'm not quite sure how to determine which is the "correct" disk, though.
> dmesg (or was it boot.log) shows a lot of lines like "considering
> /dev/hda1" and so on from the RAID detection. It binds the correct disk
> to the array and then "considers" the next partition. When it encounters
> the "dropped" partition it won't use that as the "master" for an md
> partition, because that md partition is already active. So all is well
> and I can resync then. However, if it were to do this in another order,
> e.g. first "consider" the failed disk, I don't know what would happen.
> Maybe it goes by some timestamp; in that case the failed disk should
> always have an older timestamp, since it dropped out of the array. Or it
> uses some other method to mark that disk as "stale" or whatever. In that
> case I think one doesn't have to worry about booting from the wrong disk.
>
>> Is it boot from boot partition, install grub, copy to second disk.
>
> No, I installed a normal RAID setup with the CentOS install CD on both
> disks (no LVM) and then installed grub to the second disk additionally
> after setup had finished. That could be done from the setup screen as
> well, but it's not supported: you can only choose which disk to install
> to, not both of them. A serious flaw in the setup screen, I think, but
> one made at the upstream vendor.
> I think your scenario above would fit very well if one of the drives
> failed and got replaced. I'd then need to copy the partition structure to
> it, resync, and then set up grub on it (or set up grub first and then
> resync). I'm not sure how I should copy the partition structure.
> The grub+RAID howto (3 years old) gives the following for copying the
> structure:
> # sfdisk - /dev/hdx | sfdisk /dev/hdy
> Does this sound to be the right way?

Pardon my chiming in, but this is an adequate way to copy the partition
data, and it is what I used on my software RAID systems. Don't forget that
the first part is "sfdisk -d", not just "sfdisk -". Just be careful that
you specify the correct devices in the right places, otherwise you will end
up with two blank disks! You could also run "sfdisk -d /dev/hdx >somefile"
so that you have a backup of the partition structure, just in case.
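For a replaced drive, the whole sequence would look roughly like this. It is
only a sketch: I'm assuming /dev/hda is the surviving disk, /dev/hdc is the
blank replacement, and md0/md1 with those partition numbers match your
layout (check /proc/mdstat), so adjust the names before running anything:

  # save a copy of the good disk's partition table, just in case
  sfdisk -d /dev/hda > /root/hda-partitions.txt

  # copy the partition layout from the good disk (hda) to the new one (hdc)
  sfdisk -d /dev/hda | sfdisk /dev/hdc

  # add the new partitions back into the arrays and let them resync
  # (array and partition names are examples only)
  mdadm /dev/md0 --add /dev/hdc1
  mdadm /dev/md1 --add /dev/hdc2

  # then, inside the grub shell, reinstall grub to the new disk's MBR
  # (assumes /boot is the first partition, as in your grub.conf)
  grub
  grub> device (hd0) /dev/hdc
  grub> root (hd0,0)
  grub> setup (hd0)
  grub> quit

Whether you add the partitions back before or after the grub step shouldn't
matter much, as long as the resync has a disk to copy from.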
>> Is grub (stage 1?) installed in *partition* or in mbr? If in partition,
>> partition must be made active? Both drives?
>
> That's all done by the setup (I checked later). I'm not sure "where" grub
> gets installed. AFAIK there's a boot loader put into the MBR that will
> then load grub from the partition it was installed to (which is mounted
> as /boot).
>
>> If raid is active when you try to write initial loader to mbr/partition,
>> does it mess it up?
>
> Doesn't look like it. I think that's because the MBR isn't part of the
> RAID.
>
>> Can you get to where you can boot one disk w/o raid active?
>
> It seems to be automatically detected that the disk is part of a RAID
> array. So if you boot just one disk, it is still booted as RAID, just
> with one active disk.
>
>> It would be aggravating if it was nothing raid related, but some
>> hardware/user/software error making you think it was your raid doing it
>> to you.
>
> I was sure it wasn't the RAID; the problem was grub.
>
> I'd appreciate it if anyone reading my comments who sees a need for
> corrections or improvements would make them, thanks.
>
> Ok, what I did is start over. First I installed again by just formatting
> the partitions but leaving the partition structure in place. That gave me
> more or less the same result; it seems the grub stuff didn't get properly
> written over that way. So I removed the partitions, cleaned the whole
> disk, and did another installation. This time everything seems to be
> doing fine. However, I tested only by shutting down, removing one of the
> disks and then booting up. I can pull out either of the disks, boot from
> the other and resync the pulled-out one. I didn't test dropping one of
> the disks in the middle of operation yet.

Don't do that! Don't test by pulling a running disk unless it is in
hotplug-capable hardware. Test by using mdadm to fail and remove that
drive instead (see the sketch at the end of this message).

> There are two things I realized:
>
> 1. The disks need to run on different IDE controllers, it seems. So one
> has to be primary master, the other secondary master, at least for the
> boot disks and how it works here. I first had them hanging as master and
> slave on the same controller, and then I got the boot disk failure I
> mentioned in my first posting when the master was pulled out. The good
> thing is that I was able to change this without setting up the system
> again: it recognized the disk moving from hdb to hdc and I just needed to
> resync.

That info is in the Software-RAID howto. Some systems are more tolerant
than others, but usually a failed drive will lock the entire channel, so
both the master and the slave on that channel would go down.

> 2. The grub.conf doesn't seem to matter much. Excerpt:
>
> default=0
> fallback=2
>
> title CentOS (2.6.9-34.0.1.EL)
>         root (hd0,0)
>         kernel /vmlinuz-2.6.9-34.0.1.EL ro root=/dev/md1
>         initrd /initrd-2.6.9-34.0.1.EL.img
> title CentOS 2
>         root (hd1,0)
>         kernel /vmlinuz-2.6.9-34.0.1.EL ro root=/dev/md1
>         initrd /initrd-2.6.9-34.0.1.EL.img
>
> Here it always boots entry 0, no matter whether disk 0 is powered down or
> powered up but no longer active in the array. It somehow finds the
> correct setup anyway.
>
> Kai

--
MailScanner is like deodorant...
You hope everybody uses it, and you notice quickly if they don't!!!!
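P.S. Here is roughly what I mean by testing with mdadm rather than pulling
a live disk. The array and member names below are only examples; substitute
whatever your own /proc/mdstat shows:

  # mark one member as failed and drop it from the array
  # (/dev/md1 and /dev/hdc2 are placeholders)
  mdadm /dev/md1 --fail /dev/hdc2
  mdadm /dev/md1 --remove /dev/hdc2

  # the system should keep running on the remaining disk; reboot here if
  # you also want to verify that it comes up in degraded mode

  # when you're done testing, add the member back and let it resync
  mdadm /dev/md1 --add /dev/hdc2

  # watch the rebuild progress
  cat /proc/mdstat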