[CentOS] Re: Problem with dual-booting soft-RAID
Scott Silva
ssilva at sgvwater.com
Thu Jun 8 16:29:12 UTC 2006
Kai Schaetzl spake the following on 6/8/2006 8:36 AM:
> William L. Maltby wrote on Tue, 06 Jun 2006 15:59:13 -0400:
>
>> I can't be much help at all
>
> Well, you made the "mistake" of replying ;-) Thanks, anyway. Read down
> before you start commenting; the "solution" will be at the end :-)
>
>> Is it correct that the raid must come into play after initial boot *if*
>> one is to be able to boot on the remaining disk after primary fails?
>
> Not sure what you mean exactly by that, but more or less, yes. The problem
> wasn't the RAID but grub, which either wouldn't boot automatically or
> wouldn't boot at all.
>
>>
>> My thinking is: if RAID is in play when grub is installed, does it
>> mess up where things are put? If not, does it mess up when there's a
>> failure (the RAID level, 0 ... 5, I guess would affect that)?
>
> You would want to boot from the disk that did *not* fail, or at least
> resync in the right direction. I think that's also the reason why the
> resync isn't done automatically. If you happen to boot from the "wrong"
> disk (given that it still works but is just a bit "different"), it may
> then resync in the wrong direction ...
> I'm not quite sure how to determine which is the "correct" disk, though.
> dmesg (or was it boot.log) shows a lot of lines like "considering
> /dev/hda1" and so on from the RAID detection. It binds the correct disk
> to the array and then "considers" the next partition. When it encounters
> the "dropped" partition it won't use that as the "master" for an md
> partition, because that md partition is already active. So all is well
> and I can resync then. However, if it were to do this in another order,
> e.g. first "consider" the failed disk ... - I don't know what would
> happen. Maybe it goes by some timestamp; in that case the failed disk
> should always have an older timestamp, since it dropped out of the
> array. Or it uses another method to mark that disk as "stale" or
> whatever. In that case I think one doesn't have to worry about booting
> from the wrong disk.
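As far as I understand it, the md superblock on each member carries an
event counter and an update time, and the member with the newer data wins
at assembly time, so the dropped disk should be recognized as stale on its
own. If in doubt you can compare the two with something like this (hda1
and hdc1 are only example partitions, substitute your own):

# mdadm --examine /dev/hda1 | grep -E 'Events|Update Time'
# mdadm --examine /dev/hdc1 | grep -E 'Events|Update Time'

The member with the lower event count and older update time is the stale
one and will be the resync target.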
>
>>
>> Is it: boot from the boot partition, install grub, copy to the second disk?
>
> No, I installed a normal RAID setup with the CentOS install CD on both
> disks (no LVM) and then installed grub on the second disk additionally
> after setup had finished. That could be done from the setup screen as
> well, but it's not supported; you can only choose which disk to install
> to, not both of them. A serious flaw in the setup screen, I think, but
> one made by the upstream vendor.
> I think your scenario above would fit very well if one of the drives
> failed and got replaced. I'd then need to copy the partition structure to
> it, resync, and then set up grub on it (or set up grub first and then
> resync). I'm not sure how I should copy the partition structure.
> The grub+RAID howto (3 years old) gives the following for copying the
> structure:
> # sfdisk - /dev/hdx | sfdisk /dev/hdy
> Does this sound like the right way?
Pardon my chiming in, but this is an adequate way to copy the partition data,
and it is what I used on my software RAID systems. Don't forget that the
first part is sfdisk -d, not just sfdisk -. Just be careful that you specify
the correct devices in the right places. Otherwise you will have 2 blank
disks! You could use "sfdisk -d /dev/hdx >somefile" to keep a backup of the
partition structure, just in case.
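For example, something along these lines, with your own device names
substituted (here hda stands for the surviving disk and hdc for the
replacement, purely as an illustration):

# sfdisk -d /dev/hda > /root/hda-partitions.txt   (backup of the good layout)
# sfdisk -d /dev/hda | sfdisk /dev/hdc            (copy good disk -> new disk)

Double-check which device is which before running the second command; the
disk on the right-hand side is the one whose partition table gets
overwritten.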
>
>> Is grub (stage 1?) installed in *partition* or in mbr? If in partition,
>> partition must be made active? Both drives?
>
> That's all done by the setup (I checked later). I'm not sure "where" grub
> gets installed. AFAIK there's a boot loader put into the MBR that will then
> load grub from the partition it was installed to (which is mounted as
> /boot).
>
>> If raid is active when you try to write initial loader to mbr/partition,
>> does it mess it up?
>
> Doesn't look like it. I think that's because the MBR isn't part of the RAID.
>
>>
>> Can you get to where you can boot one disk w/o raid active?
>
> It seems that the disk is automatically detected as being part of a RAID
> array. So, if you boot just one disk it is still booted as RAID, just
> with one active disk.
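If you want to confirm the degraded state after such a boot, /proc/mdstat
shows it; the output looks roughly like this (device names and sizes here
are only illustrative):

# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 hda2[0]
      10482304 blocks [2/1] [U_]

The [2/1] and [U_] mean that only one of the two mirror halves is present.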
>
>> It would be aggravating if it was nothing raid related, but some
>> hardware/user/software error making you think it was your raid doing it
>> to you.
>
> I was sure it wasn't the RAID; the problem was grub.
>
> I'd appreciate it if anyone reading my comments who sees some need for
> correction or improvement would say so, thanks.
>
> Ok, what I did is start over. First I installed again by just formatting
> the partitions but letting the partition structure live on. That gave me
> more or less the same result; it seems the grub stuff didn't get properly
> written over that way. So I removed the partitions, cleaned the whole
> disk and did another installation. This time everything seems to be doing
> fine. However, I tested only by shutting down, removing one of the disks
> and then booting up. I can pull out either of the disks, boot from the
> other and resync the pulled-out one. I didn't test dropping one of the
> disks in the middle of operation yet.
Don't do that! Don't test by pulling a running disk unless it is in
hotplug-capable hardware. Test by using mdadm to remove that drive.
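For example, something like this marks one member of md1 as failed and
pulls it out of the array without touching the hardware (hdc2 is just a
placeholder for whichever partition you want to drop):

# mdadm /dev/md1 --fail /dev/hdc2     (mark the member as faulty)
# mdadm /dev/md1 --remove /dev/hdc2   (remove it from the array)
  ... run your boot tests ...
# mdadm /dev/md1 --add /dev/hdc2      (add it back; md will resync it)

That exercises the same failure handling without risking a locked IDE
channel.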
> There are two things I realized:
>
> 1. disks need to run on different IDE controllers, it seems. So one has
> to be primary master, the other secondary master, at least for the boot
> disks and the way it works here. I first had them hanging as master and
> slave on the same controller, and then I got the boot disk failure I
> mentioned in my first posting when the master was pulled out. The good
> thing is that I was able to make this change without setting up the
> system again: it recognized the disk moving from hdb to hdc and I just
> needed to resync.
That info is in the Software RAID HOWTO. Some systems are more tolerant
than others, but usually a failed drive will lock the entire IDE channel,
so both the master and the slave on it would go down.
>
> 2. the grub.conf doesn't seem to matter much. Excerpt:
>
> default=0
> fallback=2
>
> title CentOS (2.6.9-34.0.1.EL)
> root (hd0,0)
> kernel /vmlinuz-2.6.9-34.0.1.EL ro root=/dev/md1
> initrd /initrd-2.6.9-34.0.1.EL.img
> title CentOS 2
> root (hd1,0)
> kernel /vmlinuz-2.6.9-34.0.1.EL ro root=/dev/md1
> initrd /initrd-2.6.9-34.0.1.EL.img
>
> Here it always boots entry 0, no matter whether disk 0 is powered down,
> or powered up but no longer active in the array. It somehow finds the
> correct setup anyway.
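I believe that's because (hd0,0) just means whatever the BIOS presents as
the first disk, so when a disk is missing the surviving one becomes hd0
and entry 0 still points at a valid /boot. Along the same lines, if you
ever need to (re)install grub on the second disk by hand, the usual trick
is something like this from the grub shell (assuming the second disk is
/dev/hdc and its /boot is the first partition; adjust to your layout):

# grub
grub> device (hd0) /dev/hdc
grub> root (hd0,0)
grub> setup (hd0)
grub> quit

The "device" line maps the second disk to hd0 so that the MBR written to
it points at its own /boot and the disk can boot standalone.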
>
>
> Kai
>
--
MailScanner is like deodorant...
You hope everybody uses it, and
you notice quickly if they don't!!!!