[CentOS] Boot from degraded sw RAID 1

Wed Aug 13 18:07:18 UTC 2008
Eduardo Grosclaude <eduardo.grosclaude at gmail.com>

OK, this is probably long, and your answer will surely make me slap my
forehead really hard... please help me understand what is going on.

I intend to install CentOS 5.1 afresh over software RAID level 1. SATA
drives are in AHCI mode.

I basically follow [1], though I have made some mistakes, as will be
explained. AFAIK GRUB does not boot off LVM, so I do the following (a
rough command-line sketch comes right after the list):

1. Build a 100MB RAID-type partition on each disk
2. Build a second RAID-type partition taking the remaining space on each disk
3. Build a RAID 1 device over the small partitions
4. Build a second RAID 1 device over the bigger ones
5. Declare /boot as ext3 to live on the smaller RAID 1 device
6. Declare an LVM PV to live on the bigger one
7. Build a VG on the PV, then build LVs for swap, / and /data on the VG
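
For what it's worth, done by hand instead of through Disk Druid, I
believe the above would look roughly like this; partition sizes, LV
names and ordering are placeholders (the installer actually used its
own LogVolNN names), so take it as a sketch only:

# 1-2. Two RAID (0xfd) partitions per disk: ~100MB plus the rest
sfdisk -uM /dev/sda <<EOF
,100,fd
,,fd
EOF
sfdisk -uM /dev/sdb <<EOF
,100,fd
,,fd
EOF

# 3-4. RAID 1 over the small partitions, and over the big ones
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2

# 5. /boot as ext3 on the small array
mkfs.ext3 /dev/md0

# 6-7. PV, VG and LVs (swap, / and /data) on the big array
pvcreate /dev/md1
vgcreate VolGroup00 /dev/md1
lvcreate -L 2G  -n swap VolGroup00 && mkswap /dev/VolGroup00/swap
lvcreate -L 20G -n root VolGroup00 && mkfs.ext3 /dev/VolGroup00/root
lvcreate -L 100G -n data VolGroup00 && mkfs.ext3 /dev/VolGroup00/data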

The only problem is that I numbly failed to follow [1]: I let Disk
Druid make partitions wherever it chose, so now I have cross-named
partitions... md0 is the bigger RAID 1 device, with /dev/sda2 AND
/dev/sdb1, and md1 is the smaller one, with /dev/sda1 AND /dev/sdb2.
Oh well, how complicated can that get, I tell myself.
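
This is how I double-check what actually went where (output omitted):

cat /proc/mdstat
mdadm --detail /dev/md0
mdadm --detail /dev/md1
fdisk -l /dev/sda /dev/sdb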

Installation goes well and the system boots. I update the system. Now
I want to be able to boot from whichever disk survives an accident. If
I take out sdb, the system boots. If I take out sda, the system
refuses to work. Aha: GRUB is not installed in sdb's MBR. Reconnect
sda, reboot. Prepare for GRUB device juggling as in [1].

In the GRUB console I do
> find /grub/stage1
(hd0,0)
(hd1,1)
> device (hd0) /dev/sdb
> root (hd0,1)
Filesystem type is ext2fs, partition type 0xfd
> setup (hd0)
 Checking if "/boot/grub/stage1" exists... no
 Checking if "/grub/stage1" exists... yes
 Checking if "/grub/stage2" exists... yes
 Checking if "/grub/e2fs_stage1_5" exists... yes
 Running "embed /grub/e2fs_stage1_5 (hd0)"...  15 sectors are embedded.
succeeded
 Running "install /grub/stage1 (hd0) (hd0)1+15 p (hd0,1)/grub/stage2 /grub/grub.conf"... succeeded
Done.
> quit

The rationale for this is that, when the faulty disk is removed at
boot time, the remaining one (currently /dev/sdb) will be addressed as
/dev/sda (i.e. hd0 in GRUB parlance).
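
(Aside: as far as I know, the only on-disk record GRUB keeps of the
BIOS-to-Linux disk mapping is /boot/grub/device.map, which I assume
reads as below on my box; the 'device (hd0) /dev/sdb' line above just
overrides it for that session.)

(hd0)   /dev/sda
(hd1)   /dev/sdb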

Now I prune /dev/sda and reboot. I see:

root (hd0,0) <----------------------------------------------- interesting
Filesystem type unknown, partition type 0xfd
kernel /vmlinuz-2.6........ ro root=/dev/VolGroup00/LogVol01 rhgb quiet
Error 17: Cannot mount selected partition
Press any key to continue...

Not quite what I expected. I enter the GRUB console at boot and repeat
the device juggling above.

> find /grub/stage1
(hd0,1)
> root (hd0,1)
> setup (hd0)
> quit

However, 'quit' seems to fail, as GRUB keeps prompting me without
really quitting. After a forced reboot I get the very same error
message as above.

I edit the GRUB configuration entry at boot (with the 'e' command) and see
root (hd0,0)
as the first line.

It should be root (hd0,1), so apparently GRUB did not write down my
modifications. I edit it to read so ('e' command again) and then boot
('b' command). Now it works. I rebuild the arrays successfully.
However, I have only made a one-time edit, and the problem is still
there.
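
For reference, rebuilding the arrays after plugging /dev/sda back in
is just a matter of re-adding its components (device names follow my
cross-named layout above, so treat this as a sketch):

mdadm /dev/md0 --add /dev/sda2
mdadm /dev/md1 --add /dev/sda1
cat /proc/mdstat      # watch the resync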

I understand the error message from the boot process was reasonable:
(hd0,0) carries a filesystem GRUB does not know -- an LVM physical
volume. [1] was right: you definitely want exact disk duplicates to
keep your life simple.

However, I can't see why it shouldn't work the way it is. I can
probably rebuild the secondary disk to mimic the primary's partition
numbering and "fix" my problem...
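
Something along these lines, I suppose (this wipes /dev/sdb's
partition table, so it is only a sketch of the idea):

# pull sdb's components out of both arrays
mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
mdadm /dev/md1 --fail /dev/sdb2 --remove /dev/sdb2
# clone sda's partition table onto sdb
sfdisk -d /dev/sda | sfdisk /dev/sdb
# re-add the now-renumbered partitions, matching sda
mdadm /dev/md1 --add /dev/sdb1
mdadm /dev/md0 --add /dev/sdb2
# then redo the GRUB setup on sdb as above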

But am I right about the GRUB console commands I was issuing? How can
I make them permanent, then? I KNOW I did 'quit' from the GRUB console
the first time, from inside bash, while the system was running. What
am I missing?

Thank you in advance

[1] http://lists.us.dell.com/pipermail/linux-poweredge/2003-July/008898.html
-- 
Eduardo Grosclaude
Universidad Nacional del Comahue
Neuquen, Argentina