[CentOS] Re: Problem with dual-booting soft-RAID
Scott Silva
ssilva at sgvwater.com
Thu Jun 8 16:29:12 UTC 2006
Kai Schaetzl spake the following on 6/8/2006 8:36 AM:
> William L. Maltby wrote on Tue, 06 Jun 2006 15:59:13 -0400:
>
>> I can't be much help at all
>
> Well, you made the "mistake" of replying ;-) Thanks, anyway. Read down
> before you start commenting; the "solution" will be at the end :-)
>
>> Is it correct that the raid must come into play after initial boot *if*
>> one is to be able to boot on the remaining disk after primary fails?
>
> Not sure what you mean exactly by that, but more or less, yes. The problem
> wasn't the RAID but grub, which either wouldn't boot automatically or
> wouldn't boot at all.
>
>>
>> My thinking is: if RAID is in play when grub is installed, does it
>> mess up where things are put? If not, does it mess up when there's a
>> failure (the RAID level, 0 ... 5, I guess would affect that)?
>
> You would want to boot from the disk that did *not* fail, or at least
> resync in the right direction. I think that's also the reason why the
> resync isn't done automatically. If you happen to boot from the "wrong"
> disk (given that it still works but is just a bit "different"), it may
> then resync in the wrong direction ...
> I'm not quite sure how to determine which is the "correct" disk, though.
> dmesg (or was it boot.log) shows a lot of lines like "considering
> /dev/hda1" and so on from the RAID detection. It binds the correct disk
> to the array and then "considers" the next partition. When it encounters
> the "dropped" partition it won't use that as the "master" for an md
> partition, because that md partition is already active. So all is well
> and I can resync then. However, if it were to do this in another order,
> e.g. first "consider" the failed disk ... - I don't know what would
> happen. Maybe it goes by some timestamp; in that case the failed disk
> should always have an older timestamp, since it dropped out of the
> array. Or it uses another method to mark that disk as "stale" or
> whatever. In that case I think one doesn't have to worry about booting
> from the wrong disk.
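As far as I understand it, the md superblock on each member carries an
event counter and an update time, and the member with the newer data wins
at assembly time, so the dropped disk should be recognized as stale on its
own. If in doubt you can compare the two with something like this (hda1
and hdc1 are only example partitions, substitute your own):

# mdadm --examine /dev/hda1 | grep -E 'Events|Update Time'
# mdadm --examine /dev/hdc1 | grep -E 'Events|Update Time'

The member with the lower event count and older update time is the stale
one and will be the resync target.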
>
>>
>> Is it: boot from the boot partition, install grub, copy to the second disk?
>
> No, I installed a normal RAID setup with the CentOS install CD on both
> disks (no LVM) and then installed grub on the second disk additionally
> after setup had finished. That could be done from the setup screen as
> well, but it's not supported; you can only choose which disk to install
> to, not both of them. A serious flaw in the setup screen, I think, but
> one made by the upstream vendor.
> I think your scenario above would fit very well if one of the drives
> failed and got replaced. I'd then need to copy the partition structure to
> it, resync, and then set up grub on it (or set up grub first and then
> resync). I'm not sure how I should copy the partition structure.
> The grub+RAID howto (3 years old) gives the following for copying the
> structure:
> # sfdisk - /dev/hdx | sfdisk /dev/hdy
> Does this sound like the right way?
Pardon my chiming in, but this is an adequate way to copy the partition data,
and it is what I used on my software RAID systems. Don't forget that the
first part is sfdisk -d, not just sfdisk -. Just be careful that you specify
the correct devices in the right places. Otherwise you will have 2 blank
disks! You could use "sfdisk -d /dev/hdx >somefile" to keep a backup of the
partition structure, just in case.
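For example, something along these lines, with your own device names
substituted (here hda stands for the surviving disk and hdc for the
replacement, purely as an illustration):

# sfdisk -d /dev/hda > /root/hda-partitions.txt   (backup of the good layout)
# sfdisk -d /dev/hda | sfdisk /dev/hdc            (copy good disk -> new disk)

Double-check which device is which before running the second command; the
disk on the right-hand side is the one whose partition table gets
overwritten.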
>
>> Is grub (stage 1?) installed in *partition* or in mbr? If in partition,
>> partition must be made active? Both drives?
>
> That's all done by the setup (I checked later). I'm not sure "where" grub
> gets installed. AFAIK there's a boot loader put into the MBR that will then
> load grub from the partition it was installed to (which is mounted as
> /boot).
>
>> If raid is active when you try to write initial loader to mbr/partition,
>> does it mess it up?
>
> Doesn't look like it. I think that's because the MBR isn't part of the RAID.
>
>>
>> Can you get to where you can boot one disk w/o raid active?
>
> It seems that the disk is automatically detected as being part of a RAID
> array. So, if you boot just one disk it is still booted as RAID, just
> with one active disk.
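If you want to confirm the degraded state after such a boot, /proc/mdstat
shows it; the output looks roughly like this (device names and sizes here
are only illustrative):

# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 hda2[0]
      10482304 blocks [2/1] [U_]

The [2/1] and [U_] mean that only one of the two mirror halves is present.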
>
>> It would be aggravating if it was nothing raid related, but some
>> hardware/user/software error making you think it was your raid doing it
>> to you.
>
> I was sure it wasn't the RAID; the problem was grub.
>
> I'd appreciate it if anyone reading my comments who sees some need for
> correction or improvement would say so, thanks.
>
> Ok, what I did is start over. First I installed again by just formatting
> the partitions but letting the partition structure live on. That gave me
> more or less the same result; it seems the grub stuff didn't get properly
> written over that way. So I removed the partitions, cleaned the whole
> disk and did another installation. This time everything seems to be doing
> fine. However, I tested only by shutting down, removing one of the disks
> and then booting up. I can pull out either of the disks, boot from the
> other and resync the pulled-out one. I didn't test dropping one of the
> disks in the middle of operation yet.
Don't do that! Don't test by pulling a running disk unless it is in
hotplug-capable hardware. Test by using mdadm to remove that drive.
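For example, something like this marks one member of md1 as failed and
pulls it out of the array without touching the hardware (hdc2 is just a
placeholder for whichever partition you want to drop):

# mdadm /dev/md1 --fail /dev/hdc2     (mark the member as faulty)
# mdadm /dev/md1 --remove /dev/hdc2   (remove it from the array)
  ... run your boot tests ...
# mdadm /dev/md1 --add /dev/hdc2      (add it back; md will resync it)

That exercises the same failure handling without risking a locked IDE
channel.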
> There are two things I realized:
>
> 1. disks need to run on different IDE controllers, it seems. So one has
> to be primary master, the other secondary master, at least for the boot
> disks and the way it works here. I first had them hanging as master and
> slave on the same controller, and then I got the boot disk failure I
> mentioned in my first posting when the master was pulled out. The good
> thing is that I was able to make this change without setting up the
> system again: it recognized the disk moving from hdb to hdc and I just
> needed to resync.
That info is in the Software RAID HOWTO. Some systems are more tolerant
than others, but usually a failed drive will lock the entire IDE channel,
so both the master and the slave on it would go down.
>
> 2. the grub.conf doesn't seem to matter much. Excerpt:
>
> default=0
> fallback=2
>
> title CentOS (2.6.9-34.0.1.EL)
> root (hd0,0)
> kernel /vmlinuz-2.6.9-34.0.1.EL ro root=/dev/md1
> initrd /initrd-2.6.9-34.0.1.EL.img
> title CentOS 2
> root (hd1,0)
> kernel /vmlinuz-2.6.9-34.0.1.EL ro root=/dev/md1
> initrd /initrd-2.6.9-34.0.1.EL.img
>
> Here it always boots entry 0, no matter whether disk 0 is powered down,
> or powered up but no longer active in the array. It somehow finds the
> correct setup anyway.
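I believe that's because (hd0,0) just means whatever the BIOS presents as
the first disk, so when a disk is missing the surviving one becomes hd0
and entry 0 still points at a valid /boot. Along the same lines, if you
ever need to (re)install grub on the second disk by hand, the usual trick
is something like this from the grub shell (assuming the second disk is
/dev/hdc and its /boot is the first partition; adjust to your layout):

# grub
grub> device (hd0) /dev/hdc
grub> root (hd0,0)
grub> setup (hd0)
grub> quit

The "device" line maps the second disk to hd0 so that the MBR written to
it points at its own /boot and the disk can boot standalone.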
>
>
> Kai
>
--
MailScanner is like deodorant...
You hope everybody uses it, and
you notice quickly if they don't!!!!