[CentOS] Problem with dual-booting soft-RAID

Thu Jun 8 15:36:02 UTC 2006
Kai Schaetzl <maillists at conactive.com>

William L. Maltby wrote on Tue, 06 Jun 2006 15:59:13 -0400:

> I can't be much help at all

Well, you made the "mistake" of replying ;-) Thanks, anyway. Read down 
before you start commenting, the "solution" is at the end :-)

> Is it correct that the raid must come into play after initial boot *if* 
> one is to be able to boot on the remaining disk after primary fails? 

Not sure what you mean exactly by that, but more or less, yes. The problem 
wasn't the RAID but grub not booting automatically, or not booting at all.

>  
> My thinking is that if raid is in play when grub is installed, does it 
> mess up where things are put? If not, does it mess up when there's a 
> failure (raid 0,....5... I guess would affect that). 

You would want to boot from the disk that did *not* fail, or at least 
resync in the right direction. I think that's also the reason why the 
resync isn't done automatically. If you happen to boot from the "wrong" 
disk (given that it still works but is just a bit "different") it would 
then resync in the wrong direction ...

I'm not quite sure how to determine which is the "correct" disk, though. 
dmesg (or was it boot.log?) shows a lot of lines like "considering 
/dev/hda1" and so on from the RAID detection. It binds the correct disk to 
the array and then "considers" the next partition. When it encounters the 
"dropped" partition it won't use that as the "master" for an md partition 
because that md partition is already active. So, all is well and I can 
resync then. However, if it were to do this in another order, e.g. first 
"consider" the failed disk ... I don't know what would happen. Maybe it 
goes by some timestamp; in that case the failed disk should always have an 
older timestamp since it dropped out of the array. Or it uses another 
method to mark that disk as "stale" or whatever. In that case I think one 
doesn't have to worry about booting from the wrong disk.
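Just a thought: mdadm can print the superblock of each member, including 
an update time and an event counter, and as far as I understand the member 
with the higher event count is the one md treats as current. Something 
like this should show it (hda1/hdc1 are just the partition names on my 
box):

# mdadm --examine /dev/hda1 | grep -E 'Update Time|Events'
# mdadm --examine /dev/hdc1 | grep -E 'Update Time|Events'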

>  
> Is it boot from boot partition, install grub, copy to second disk.

No, I installed a normal RAID setup with the CentOS install CD on both 
disks (no LVM) and then installed grub to the second disk additionally 
after setup had finished. That could be done from the setup screen as 
well, but it's not supported: you can only choose which single disk to 
install to, not both of them. A serious flaw in the setup screen I think, 
but one made at the upstream vendor.
I think your scenario above would fit very well if one of the drives 
failed and got replaced. I'd then need to copy the partition structure to 
it, resync and then set up grub on it (or set up grub first and then 
resync). I'm not sure how I should copy the partition structure.
The grub+RAID howto (3 years old) gives the following for copying the 
structure:
# sfdisk -d /dev/hdx | sfdisk /dev/hdy
Does this sound to be the right way?
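If it is, the whole replacement procedure would presumably boil down to 
something like this (untested here; hdc stands in for the new disk and 
md0/md1 for the /boot and root arrays, adjust to your device names):

# sfdisk -d /dev/hda | sfdisk /dev/hdc    (copy the partition table over)
# mdadm /dev/md0 --add /dev/hdc1          (re-add the /boot member, starts the resync)
# mdadm /dev/md1 --add /dev/hdc2          (same for the root array)

plus installing grub on the new disk, see further down.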

> Is grub (stage 1?) installed in *partition* or in mbr? If in partition, 
> partition must be made active? Both drives? 

That's all done by the setup (I checked later). I'm not sure "where" grub 
gets installed. AFAIK there's a boot loader put into the MBR that will then 
load grub from the partition it was installed to (which is mounted as 
/boot).
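Putting grub into the second disk's MBR can be done by hand from the grub 
shell; something like this should do it (hdc again being my second disk; 
the device line makes grub treat it as hd0, so the code in its MBR looks 
for /boot on that same disk once the BIOS boots from it):

# grub
grub> device (hd0) /dev/hdc
grub> root (hd0,0)
grub> setup (hd0)
grub> quit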

> If raid is active when you try to write initial loader to mbr/partition, 
> does it mess it up? 

Doesn't look like it. I think that's because the MBR isn't part of the RAID.

>  
> Can you get to where you can boot one disk w/o raid active?

It seems it is automatically detected that the disk is part of a RAID 
array. So, if you boot from just one disk it still comes up as RAID, just 
with one active disk.
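/proc/mdstat shows it; with one disk missing it looks roughly like this 
(a made-up excerpt from memory, the [U_] marks the missing member):

# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 hda2[0]
      39061952 blocks [2/1] [U_]
md0 : active raid1 hda1[0]
      104320 blocks [2/1] [U_]
unused devices: <none>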

> It would be aggravating if it was nothing raid related, but some 
> hardware/user/software error making you think it was your raid doing it 
> to you.

I was sure it wasn't the RAID, the problem was grub.

I'd appreciate it if anyone reading my comments who sees some need for 
corrections or improvements would point them out, thanks.

Ok, what I did is start over. First I installed again by just formatting 
the partitions but leaving the partition structure in place. That gave me 
more or less the same result; it seems the grub stuff didn't get properly 
overwritten that way. So I removed the partitions, cleaned the whole disk 
and did another installation. This time everything seems to be doing fine. 
However, I tested only by shutting down, removing one of the disks and 
then booting up. I can pull out either of the disks, boot from the other 
and resync the pulled-out one. I didn't test dropping one of the disks in 
the middle of operation yet.

There are two things I realized:

1. disks need to run on different IDE controllers, it seems: one as 
primary master, the other as secondary master. At least for the boot disks 
and the way it works here. I first had them hanging as master and slave on 
the same controller and then got the boot disk failure I mentioned in my 
first posting when the master was pulled out. The good thing is that I 
could change this without touching the setup or installing the system 
again: it recognized the disk moving from hdb to hdc and I just needed to 
resync.

2. the grub.conf doesn't seem to matter much. Excerpt:

default=0 
fallback=2 

title CentOS (2.6.9-34.0.1.EL) 
   root (hd0,0) 
   kernel /vmlinuz-2.6.9-34.0.1.EL ro root=/dev/md1 
   initrd /initrd-2.6.9-34.0.1.EL.img 
title CentOS 2
   root (hd1,0) 
   kernel /vmlinuz-2.6.9-34.0.1.EL ro root=/dev/md1 
   initrd /initrd-2.6.9-34.0.1.EL.img

Here it always boots entry 0, no matter whether disk 0 is powered down or 
powered up but no longer active in the array. It finds the correct setup 
anyway, presumably because whatever disk the BIOS ends up booting from 
shows up as hd0 and both disks carry a copy of /boot, so entry 0 always 
points at something usable.


Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com