OK, this is probably long, and your answer will surely make me slap my forehead really hard... please help me understand what is going on.
I intend to do a fresh install of CentOS 5.1 on software RAID level 1. The SATA drives are in AHCI mode.
I basically follow [1], though I have made some mistakes, as explained below. AFAIK GRUB cannot boot from LVM, so I:
1. Build a 100MB RAID-type partition on each disk
2. Build a second RAID-type partition taking the remaining space on each disk
3. Build a RAID 1 device over the small partitions
4. Build a second RAID 1 device over the bigger ones
5. Declare /boot as ext3 to live on the smaller RAID 1 device
6. Declare an LVM PV to live on the bigger one
7. Build a VG on the PV, then build LVs for swap, / and /data on the VG
The only problem is that I dumbly failed to follow [1]: I let Disk Druid place the partitions wherever it chose, so now the partition numbers are crossed. md0 is the bigger RAID 1 device, built from /dev/sda2 AND /dev/sdb1, and md1 is the smaller one, built from /dev/sda1 AND /dev/sdb2. Oh well, nothing can get complicated because of this, I tell myself.
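To make the crossed numbering concrete, here is a small Python sketch (an illustration only, not something from [1]) that translates the Linux partition names above into GRUB legacy notation. It assumes sda is hd0 and sdb is hd1 while both disks are attached, and relies on GRUB legacy counting partitions from 0; the names LAYOUT, DEVICE_MAP and to_grub are hypothetical.

```python
import re

# Hypothetical model of the layout described above.
# Linux partition numbers are 1-based.
LAYOUT = {
    "md0": ["/dev/sda2", "/dev/sdb1"],  # bigger array, holds the LVM PV
    "md1": ["/dev/sda1", "/dev/sdb2"],  # smaller array, holds /boot
}

# Assumption: with both disks attached, sda is hd0 and sdb is hd1.
DEVICE_MAP = {"/dev/sda": "hd0", "/dev/sdb": "hd1"}


def to_grub(partition: str, device_map: dict) -> str:
    """Translate e.g. /dev/sdb2 into GRUB legacy notation, e.g. (hd1,1)."""
    m = re.fullmatch(r"(/dev/sd[a-z])(\d+)", partition)
    if m is None:
        raise ValueError(f"unsupported partition name: {partition}")
    disk, number = m.group(1), int(m.group(2))
    # GRUB legacy numbers partitions from 0, Linux from 1.
    return f"({device_map[disk]},{number - 1})"


# Where the two /boot (md1) members live, in GRUB terms.
for member in LAYOUT["md1"]:
    print(member, "->", to_grub(member, DEVICE_MAP))
```

Under these assumptions the two /boot members come out as (hd0,0) and (hd1,1), which are the two locations GRUB's find reports.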
Installation goes well and the system boots. I update it. Now I want to be able to boot from whichever disk survives a failure. If I take out sdb, the system boots. If I take out sda, it refuses to boot. Aha: GRUB is not installed in sdb's MBR. I reconnect sda, reboot, and prepare for the GRUB device juggling described in [1].
In the GRUB console I run:
find /grub/stage1
(hd0,0)
(hd1,1)
device (hd0) /dev/sdb
root (hd0,1)
Filesystem type is ext2fs, partition type 0xfd
setup (hd0)
Checking if "/boot/grub/stage1" exists... no
Checking if "/grub/stage1" exists... yes
Checking if "/grub/stage2" exists... yes
Checking if "/grub/e2fs_stage1_5" exists... yes
Running "embed /grub/e2fs_stage1_5 (hd0)"... 15 sectors are embedded.
succeeded
Running "install /grub/stage1 (hd0) (hd0)1+15 p (hd0,1)/grub/stage2 /grub/grub.conf"... succeeded
Done.
quit
The rationale is that when the faulty disk is removed, the remaining one (currently /dev/sdb) will be seen as the first disk, /dev/sda (i.e. hd0 in GRUB parlance).
Now I disconnect /dev/sda and reboot. I see:
root (hd0,0)   <-- interesting
Filesystem type unknown, partition type 0xfd
kernel /vmlinuz-2.6........ ro root=/dev/VolGroup00/LogVol01 rhgb quiet
Error 17: Cannot mount selected partition
Press any key to continue...
Not quite what I expected. I enter the GRUB console at boot and repeat the device juggling above.
find /grub/stage1
(hd0,1)
root (hd0,1)
setup (hd0)
quit
However, 'quit' seems to fail as grub keeps prompting me without really quitting. After a forced reboot I get the very same error message as above.
I edit the GRUB boot entry at boot time (with the e command) and see root (hd0,0) as the first line.
It should be root (hd0,1), so GRUB probably did not save my changes. I change it accordingly (e again) and then boot (b). Now it works, and I rebuild the arrays successfully. However, that was a one-time edit, and the problem is still there.
I understand that the error message at boot was reasonable: (hd0,0), i.e. sdb1, does not hold a filesystem GRUB recognizes, because it is a member of md0, which holds the LVM PV. [1] was right: you definitely want exact disk duplicates to keep your life simple.
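The mismatch can be spelled out in a few lines of Python. This is a hypothetical illustration, assuming that whichever disk survives is presented by the BIOS as hd0 and that the boot entry keeps root (hd0,0); BOOT_MEMBER, CONFIGURED_ROOT and needed_root are made-up names. Only the sdb-alone case disagrees with the configured root, which is exactly the case that fails with Error 17.

```python
# Hypothetical sketch of the single-disk case.
# Assumption: whichever disk survives is presented by the BIOS as hd0.

# Linux partition number (1-based) of each disk's /boot member (md1).
BOOT_MEMBER = {"sda": 1, "sdb": 2}

# The root line found in the boot entry when editing it with 'e'.
CONFIGURED_ROOT = "(hd0,0)"


def needed_root(surviving_disk: str) -> str:
    """GRUB root that /boot needs when only surviving_disk is attached."""
    # GRUB legacy numbers partitions from 0, so Linux partition N is N-1.
    return f"(hd0,{BOOT_MEMBER[surviving_disk] - 1})"


for disk in ("sda", "sdb"):
    root = needed_root(disk)
    verdict = "matches" if root == CONFIGURED_ROOT else "does NOT match"
    print(f"only {disk} attached: /boot is {root}, {verdict} {CONFIGURED_ROOT}")
```

With mirrored (identical) partition numbering, both disks would need the same root and a single entry would serve either one.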
However, I can't see why it shouldn't work the way it is. I can probably rebuild the secondary disk to mimic the primary's partition numbering and "fix" my problem...
But am I right about the GRUB console commands I was issuing? If so, how can I make them permanent? I KNOW I did 'quit' from the GRUB console the first time, when I ran it from bash on the running system. What am I missing?
Thank you in advance.
[1] http://lists.us.dell.com/pipermail/linux-poweredge/2003-July/008898.html