I set up IDE software RAID on a test machine to test the recovery functionality. The problem is that it somehow won't boot after removing the first disk, although grub etc. seems to be set up okay. Here's some background:
2 disks, hda and hdb, RAID 1 setup:
/dev/md0 (hda1 + hdb1)  /boot
/dev/md1 (hda2 + hdb2)  /
/dev/md2 (hda3 + hdb3)  swap
This was done with Disk Druid during the initial setup from the Server CD, so far so good. Then I installed grub to hdb. grub-install complained all the time (I may have given it the wrong parameter), so I used grub directly:
root (hd1,0)
setup (hd1)
That's correct, isn't it? (Is there a way to determine if grub is installed on the correct disk?) I added a third boot entry to grub.conf (see below).
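(For reference, two rough checks I know of for whether a GRUB stage1 actually ended up in a disk's MBR; the device names just match my hda/hdb setup and the output is not guaranteed to be conclusive:)

# dump the first sector (MBR) of each disk and look for the GRUB marker string
dd if=/dev/hda bs=512 count=1 2>/dev/null | strings | grep -i grub
dd if=/dev/hdb bs=512 count=1 2>/dev/null | strings | grep -i grub

# from within the grub shell: list every partition that contains the stage1 file
grub> find /grub/stage1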
I then removed disk 1 (hda) from the array by unpowering it and rebooted. Unfortunately, that sent me right into the BIOS PXE boot instead of trying the other disk; well, that's a BIOS problem. So I repowered disk 1 and rebooted. The grub on it is still working somehow, but it just fills the screen with "grub". So I moved disk 2 before disk 1 as a boot device in the BIOS and tried again. This time I get a grub prompt, so grub is obviously installed on it correctly. However, why doesn't it then boot?
When I type in the exact same data that I have in grub.conf it boots. I resynced the disks now, but still can't boot normally.
grub.conf is like this (taken from another non-RAID, full setup grub.conf and adjusted, so it differs cosmetically):
default=0
fallback=2
timeout=10
#splashimage=(hd0,0)/grub/splash.xpm.gz
hiddenmenu
title CentOS (2.6.9-34.0.1.EL)
        root (hd0,0)
        kernel /vmlinuz-2.6.9-34.0.1.EL ro root=/dev/md1
        initrd /initrd-2.6.9-34.0.1.EL.img
title CentOS-4 Server_CD (2.6.9-34.EL)
        root (hd0,0)
        kernel /vmlinuz-2.6.9-34.EL ro root=/dev/md1
        initrd /initrd-2.6.9-34.EL.img
title CentOS 2 (2.6.9-34.0.1.EL)
        root (hd1,0)
        kernel /vmlinuz-2.6.9-34.0.1.EL ro root=/dev/md1
        initrd /initrd-2.6.9-34.0.1.EL.img
When I change default to 2 it still doesn't boot, it just gets me grub. I then reinstalled grub on the resynced hda with the same commands used earlier (adjusted for it, of course) and rebooted with hda as the first boot disk. There's still the same behavior as before: booting from hda gives "grub grub grub ...".
So, neither of the two boots like it should: disk 1 cannot boot at all, although resynced, and disk 2 boots only manually. What could be the problem?
Kai
Kai Schaetzl wrote on Mon, 05 Jun 2006 01:29:25 +0200:
done with disk druid during initial setup from server disk, so far so good. Then I installed grub to hdb.
Unfortunately, no one replied yet. Either because I'm doing something *really* stupid or no one knows a good answer? Am I supposed to install grub on both disks *before* RAIDing them together?
Kai
On Tue, 2006-06-06 at 15:31 +0200, Kai Schaetzl wrote:
Kai Schaetzl wrote on Mon, 05 Jun 2006 01:29:25 +0200:
done with disk druid during initial setup from server disk, so far so good. Then I installed grub to hdb.
Unfortunately, no one replied yet. Either because I'm doing something *really* stupid or no one knows a good answer?
I have never used it, so I can't help. But I *do* remember this being discussed several times on the list and, IIRC, it was always make boot partitions on each drive and install grub, test that both are bootable, install all else, "kill" a drive and test again.
But, as I said, I've never done it and this is from memory.
Google with the right keywords and "site:centos" should get the postings quickly.
Am I supposed to install grub on both disks *before* RAIDing them together?
Kai
HTH
On Tue, 2006-06-06 at 12:40 -0400, William L. Maltby wrote:
On Tue, 2006-06-06 at 15:31 +0200, Kai Schaetzl wrote:
Kai Schaetzl wrote on Mon, 05 Jun 2006 01:29:25 +0200:
<snip>
Am I supposed to install grub on both disks *before* RAIDing them together?
Almost forgot! There were also some instances of raided drives failing and notification not being posted to the admins. I don't remember if it was hard/software raid or branding, but you might want to look at that while you're googling.
<snip sig stuff>
William L. Maltby wrote on Tue, 06 Jun 2006 12:40:29 -0400:
I *do* remember this being discussed several times on the list
Yeah, and I followed them all with interest and read the referenced how-tos. That's why I think I did it all correctly, but it still doesn't work. Mostly, it bothers me that the installation of grub to both disks seems to work but nothing changes.
it was always make boot partitions on each drive and install grub, test that both are bootable, install all else, "kill" a drive and test again.
I set up the RAID mirror during the initial installation, then installed grub to both disks following a how-to, and then tested. The problem is that installing grub doesn't seem to change anything on either of the drives. I did it several times now and there is no change: the first one is "grub grub grub"ing and the second puts me at the grub prompt and boots only "manually".
Kai
On Tue, 2006-06-06 at 21:31 +0200, Kai Schaetzl wrote:
William L. Maltby wrote on Tue, 06 Jun 2006 12:40:29 -0400:
I *do* remember this being discussed several times on the list
Yeah, and I followed them all with interest and read the referenced how-tos. That's why I think I did it all correctly, but it still doesn't work. Mostly, it bothers me that the installation of grub to both disks seems to work but nothing changes.
it was always make boot partitions on each drive and install grub, test that both are bootable, install all else, "kill" a drive and test again.
I set up the RAID mirror during the initial installation, then installed grub to both disks following a how-to, and then tested. The problem is that installing grub doesn't seem to change anything on either of the drives. I did it several times now and there is no change: the first one is "grub grub grub"ing and the second puts me at the grub prompt and boots only "manually".
I can't be much help at all unless my background (several years back - memory may have aged out and been permanently "swapped") using LILO for some stuff might give clues.
I continue only in the hope that the answer might come to you while awaiting knowledgeable participation.
Is it correct that the raid must come into play after initial boot *if* one is to be able to boot on the remaining disk after primary fails?
My thinking is that if raid is in play when grub is installed, does it mess up where things are put? If not, does it mess up when there's a failure (raid 0,....5... I guess would affect that).
Is it boot from boot partition, install grub, copy to second disk. Is grub (stage 1?) installed in *partition* or in mbr? If in partition, partition must be made active? Both drives?
If raid is active when you try to write initial loader to mbr/partition, does it mess it up?
Can you get to where you can boot one disk w/o raid active? It would be aggravating if it was nothing raid related, but some hardware/user/software error making you think it was your raid doing it to you.
Kai
HTH
William L. Maltby wrote on Tue, 06 Jun 2006 15:59:13 -0400:
I can't be much help at all
Well, you made the "mistake" of replying ;-) Thanks, anyway. Read down before you start commenting, "solution" will be at the end :-)
Is it correct that the raid must come into play after initial boot *if* one is to be able to boot on the remaining disk after primary fails?
Not sure what you mean exactly by that, but more or less, yes. The problem wasn't the RAID but grub not wanting to boot automatically or not at all.
My thinking is that if raid is in play when grub is installed, does it mess up where things are put? If not, does it mess up when there's a failure (raid 0,....5... I guess would affect that).
You would want to boot from the disk that did *not* fail, or at least resync in the right direction. I think that's also the reason why the resync isn't done automatically: if you happen to boot from the "wrong" disk (given that it still works but is just a bit "different"), it might then resync in the wrong direction.

I'm not quite sure how to determine which is the "correct" disk, though. dmesg (or was it boot.log?) shows a lot of lines like "considering /dev/hda1" and so on from the RAID detection. It binds the correct disk to the array and then "considers" the next partition. When it encounters the "dropped" partition it won't use that as the "master" for an md partition, because that md partition is already active. So all is well and I can resync then. However, if it were to do this in another order, e.g. first "consider" the failed disk, I don't know what would happen. Maybe it goes by some timestamp; in that case the failed disk should always have an older timestamp, since it dropped out of the array. Or it uses another method to mark that disk as "stale" or whatever. In that case I think one doesn't have to worry about booting from the wrong disk.
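(One way that should work, if I read the mdadm man page correctly, is to compare the RAID superblocks of the two mirror halves; the member that dropped out should show an older update time and a lower event count. The partition names below are just my layout:)

# print the md superblock of each mirror half and compare "Update Time" / "Events"
mdadm --examine /dev/hda1
mdadm --examine /dev/hdb1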
Is it boot from boot partition, install grub, copy to second disk.
No, I installed a normal RAID setup with the CentOS install CD on both disks (no LVM) and then installed grub to the second disk additionally after setup had finished. That could be done from the setup screen as well, but it's not supported: you can only choose which disk to install to, not both of them. A serious flaw in the setup screen I think, but one made at the upstream vendor. I think your scenario above would fit very well if one of the drives failed and got replaced. I'd then need to copy the partition structure to it, resync and then set up grub on it (or set up grub first and then resync). I'm not sure how I should copy the partition structure. The grub+RAID howto (3 years old) gives the following for copying the structure:
# sfdisk - /dev/hdx | sfdisk /dev/hdy
Does this sound like the right way?
Is grub (stage 1?) installed in *partition* or in mbr? If in partition, partition must be made active? Both drives?
That's all done by the setup (I checked later). I'm not sure "where" grub gets installed. AFAIK there's a boot loader put into the MBR that will then load grub from the partition it was installed to (which is mounted as /boot).
If raid is active when you try to write initial loader to mbr/partition, does it mess it up?
Doesn't look like so. I think that's because MBR isn't part of the RAID.
Can you get to where you can boot one disk w/o raid active?
It seems that it gets automatically detected that the disk is part of a RAID array. So, if you just boot one disk it is still booted as RAID, just with one active disk.
It would be aggravating if it was nothing raid related, but some hardware/user/software error making you think it was your raid doing it to you.
I was sure it wasn't the RAID, the problem was grub.
I'd appreciate if anyone reading my comments and sees some need for correction or improvements do so, thanks.
Ok, what I did is start over. First I installed again by just formatting the partitions but keeping the partition structure. That gave me more or less the same result; it seems the grub stuff didn't get properly overwritten that way. So I removed the partitions, cleaned the whole disk, and did another installation. This time everything seems to be fine. However, I tested only by shutting down, removing one of the disks and then booting up. I can pull out either of the disks, boot from the other and resync the pulled out one. I didn't test dropping one of the disks in the middle of operation yet.
There are two things I realized:
1. disks need to run on different IDE controllers, it seems. So one has to be primary master, the other secondary master; at least for the boot disks, and that's how it works here. I first had them hanging as master and slave on the same controller, and then I got the boot disk failure I mentioned in my first posting when the master was pulled out. The good thing is that I was able to make this change without reinstalling or setting up the system again: it recognized the disk moving from hdb to hdc and I just needed to resync.
2. the grub.conf doesn't seem to matter much. Excerpt:
default=0 fallback=2
title CentOS (2.6.9-34.0.1.EL)
        root (hd0,0)
        kernel /vmlinuz-2.6.9-34.0.1.EL ro root=/dev/md1
        initrd /initrd-2.6.9-34.0.1.EL.img
title CentOS 2
        root (hd1,0)
        kernel /vmlinuz-2.6.9-34.0.1.EL ro root=/dev/md1
        initrd /initrd-2.6.9-34.0.1.EL.img
Here it always boots entry 0, no matter whether disk 0 is powered down, or powered up but no longer active in the array. It somehow finds the correct setup anyway.
Kai
On Thu, 2006-06-08 at 17:36 +0200, Kai Schaetzl wrote:
Is it boot from boot partition, install grub, copy to second disk.
No, I installed a normal RAID setup with the CentOS install CD on both disks (no LVM) and then installed grub to the second disk additionally after setup had finished. That could be done from the setup screen as well but it's not supported, you can only choose which disk to install to, but not to both of them. A serious flaw in the setup screen I think, but one made at the upstream vendor.
There are two problems. One is that you have to figure out what physical disks make up the raid mirror. That's not too hard because you have the partition device names somewhere and can probably figure out the underlying disk device from that and write the grub loader into the MBRs.

The other problem is more complicated. Once the bios does the initial boot for you, loading the first part of grub, grub needs to know how to tell bios to load the rest of the kernel. That means that grub -when installed- needs to know which bios device will hold the /boot partition. Scsi controllers generally map the drives in the order detected and will 'shift up' the second drive in the bios perspective if the first one fails. That means you can install an identically-configured grub on both drives. IDE, on the other hand does not shift the bios view of the drives so some of the HOWTO instructions you'll find for hand-installing grub take this into account.

However, most of the ways that IDE drives fail will make the machine unbootable until you open the case and unplug it. Then you may have to adjust the bios settings to boot from the other position - but while you have the case open it is probably easier to shift the jumper or cable position so the working drive is the primary. Then if you followed one of the sets of instructions, grub will load but will be looking for the now-missing 2nd drive for the /boot partition to load the kernel. SATA probably has its own way of doing things too.
Anyway, the fix is to boot the install/rescue CD with 'linux rescue' at the boot prompt, do the chroot it suggests, then reinstall grub making sure that the device mapping is right for your current disk configuration.
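Roughly, something like this from the rescue environment (the /dev/hdc below is only an example - use whatever disk is actually left, and map it to the position the bios will boot from):

# after 'linux rescue' has found the installation:
chroot /mnt/sysimage
grub
# tell grub which physical disk the bios will present as the first drive,
# then point it at the /boot partition on it and write stage1 to that disk's MBR
grub> device (hd0) /dev/hdc
grub> root (hd0,0)
grub> setup (hd0)
grub> quit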
I think your scenario above would very well fit if one of the drives failed and got replaced. I'd then need to copy the partition structure to it, resync and then setup grub on it (or setup grub first and then resync). I'm not sure how I should copy the partition structure. The grub+RAID howto (3 years old) gives the following for copying the structure: # sfdisk - /dev/hdx | sfdisk /dev/hdy Does this sound to be the right way?
Maybe... It's not that hard to do it by hand. Do an fdisk -l on the existing drive, then an interactive fdisk of the new mate, creating the same sized partitions with type FD.
cat /proc/mdstat will show you the current raid setup. Use mdadm --add /dev/mdxx /dev/hdxx to add a partition (replacing the xx's with your real identifiers); then 'cat /proc/mdstat' will show the progress of the re-sync. And you can re-install grub, but you'll have the same issue with drive positions next time around.
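For example, with member names that are only placeholders for whatever your mirrors actually contain:

# show the current state of all md arrays
cat /proc/mdstat
# put the replacement partitions back into their mirrors
mdadm --add /dev/md0 /dev/hdb1
mdadm --add /dev/md1 /dev/hdb2
mdadm --add /dev/md2 /dev/hdb3
# watch the rebuild progress
watch cat /proc/mdstat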
Les Mikesell wrote on Thu, 08 Jun 2006 11:17:18 -0500:
There are two problems. One is that you have to figure out what physical disks make up the raid mirror. That's not too hard because you have the partition device names somewhere and can probably figure out the underlying disk device from that and write the grub loader into the MBRs.
Well, my point was that the setup should do this. I could do it, but it doesn't. That's all ;-)
The other problem is more complicated. Once the bios does the initial boot for you, loading the first part of grub, grub needs to know how to tell bios to load the rest of the kernel. That means that grub -when installed- needs to know which bios device will hold the /boot partition. Scsi controllers generally map the drives in the order detected and will 'shift up' the second drive in the bios perspective if the first one fails. That means you can install an identically-configured grub on both drives.
I see.
IDE, on the other hand does not shift the bios view of the drives so some of the HOWTO instructions you'll find for hand-installing grub take this into account.
It still seems to boot the same OS entry from grub.conf, though, no matter which of the two disks is powered down. It always boots my default and not the fallback, which it theoretically should boot once the first disk is gone.
However, most of the ways that IDE drives fail will make the machine unbootable until you open the case and unplug it.
Yes, I feared this. It's a bit hard to establish such a situation for testing, though ;-) How do I make a disk fail without damaging it? How can I nuke the grub on the first disk to see what happens?
Then you may have to adjust the bios settings to boot from the other position - but while you have the case open it is probably easier to shift the jumper or cable position so the working drive is the primary. Then if you followed one of the sets of instructions, grub will load but will be looking for the now-missing 2nd drive for the /boot partition to load the kernel. SATA probably has its own way of doing things too.
Didn't have these problems here. I just made the mistake of putting both on the same controller channel first. After moving one to ide0 and one to ide1 it works just fine. This may be different with other controllers, of course, I understand. The machine I'm testing this on is a few years old, including the disks.
Maybe... It's not that hard to do it by hand. Do an fdisk -l on the existing drive, then an interactive fdisk of the new mate, creating the same sized partitions with type FD.
I hate to use fdisk, haven't done this for a long time. If there are GUI ways I much prefer them in a few cases over command line :-)
cat /proc/mdstat will show you the current raid setup
Thanks, I wasn't aware of that. It's quicker than using mdadm --detail.
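(For anyone reading along, the format is roughly like the following - the numbers are made up, only the layout matters. [UU] means both mirror halves are active, [U_] means the second member is missing or failed:)

Personalities : [raid1]
md0 : active raid1 hdb1[1] hda1[0]
      104320 blocks [2/2] [UU]
md1 : active raid1 hda2[0]
      9735232 blocks [2/1] [U_]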
use mdadm --add /dev/mdxx /dev/hdxx
yes, that's what I used (-a).
Kai
On Fri, 2006-06-09 at 18:31 +0200, Kai Schaetzl wrote:
IDE, on the other hand does not shift the bios view of the drives so some of the HOWTO instructions you'll find for hand-installing grub take this into account.
It still seems to boot the same OS entry from grub.conf, though, no matter which of the two disks is powered down. It always boots my default and not the fallback, which it theoretically should boot once the first disk is gone.
If you followed the HOWTO you may have accounted for the different drive in the grub setup.
However, most of the ways that IDE drives fail will make the machine unbootable until you open the case and unplug it.
Yes, I feared this. It's a bit hard to establish such a situation for testing, though ;-) How do I make a disk fail without damaging it? How can I nuke the grub on the first disk to see what happens?
Power down and pull the cable from the drive.
Then you may have to adjust the bios settings to boot from the other position - but while you have the case open it is probably easier to shift the jumper or cable position so the working drive is the primary. Then if you followed one of the sets of instructions, grub will load but will be looking for the now-missing 2nd drive for the /boot partition to load the kernel. SATA probably has its own way of doing things too.
Didn't have these problems here.
You can't emulate that problem without a broken drive. Some failure modes just hang forever and the machine will never go on to the next one as long as the bad one is connected.
Maybe... It's not that hard to do it by hand. Do an fdisk -l on the existing drive, then an interactive fdisk of the new mate, creating the same sized partitions with type FD.
I hate to use fdisk, haven't done this for a long time. If there are GUI ways I much prefer them in a few cases over command line :-)
And I hate to use things where I can't look at the results before the final save.
Les Mikesell wrote on Fri, 09 Jun 2006 14:00:54 -0500:
If you followed the HOWTO you may have accounted for the different drive in the grub setup.
I don't know what you mean. I followed the howto and everything works. The point is: in my opinion it should *not* work. No matter which drive I remove it always boots successfully from the default 0 label, although that is on hd0 which is gone if I remove that drive. Theoretically, it should fail to boot from hd0 and fallback to hd1. This is IDE, not SCSI.
(I can also boot fine from hd1 if I interrupt the automatic boot, so grub is setup correctly.)
Kai
On Sat, 2006-06-10 at 19:31 +0200, Kai Schaetzl wrote:
Les Mikesell wrote on Fri, 09 Jun 2006 14:00:54 -0500:
If you followed the HOWTO you may have accounted for the different drive in the grub setup.
I don't know what you mean. I followed the howto and everything works. The point is: in my opinion it should *not* work. No matter which drive I remove it always boots successfully from the default 0 label, although that is on hd0 which is gone if I remove that drive. Theoretically, it should fail to boot from hd0 and fallback to hd1. This is IDE, not SCSI.
(I can also boot fine from hd1 if I interrupt the automatic boot, so grub is setup correctly.)
Kai
Kai,
One thing that I think you may be missing is this ...
In grub ... hd0 is the first found hard drive ... hd1 is the second found hard drive, etc.
IF ... you remove the primary hard drive (ie power it off, and unplug the cable) ... then the drive that is left (that used to be hd1) is now hd0
So ... when booting, it will be seen as hd0 and you won't have an hd1.
Well ... at least that is what I have experienced in the past ... maybe someone else who is smarter than me would care to comment / verify this behavior.
Thanks, Johnny Hughes
On Sat, 2006-06-10 at 13:03 -0500, Johnny Hughes wrote:
On Sat, 2006-06-10 at 19:31 +0200, Kai Schaetzl wrote:
Les Mikesell wrote on Fri, 09 Jun 2006 14:00:54 -0500:
If you followed the HOWTO you may have accounted for the different drive in the grub setup.
I don't know what you mean. I followed the howto and everything works. The point is: in my opinion it should *not* work. No matter which drive I remove it always boots successfully from the default 0 label, although that is on hd0 which is gone if I remove that drive. Theoretically, it should fail to boot from hd0 and fallback to hd1. This is IDE, not SCSI.
(I can also boot fine from hd1 if I interrupt the automatic boot, so grub is setup correctly.)
Kai
Kai,
One thing that I think you may be missing is this ...
In grub ... hd0 is the first found hard drive ... hd1 is the second found hard drive, etc.
IF ... you remove the primary hard drive (ie power it off, and unplug the cable) ... then the drive that is left (that used to be hd1) is now hd0
So ... when booting, it will be seen as hd0 and you won't have an hd1.
Well ... at least that is what I have experienced in the past ... maybe someone else who is smarter than me would care to comment / verify this behavior.
Further, in spite of my recent embarrassing confusion, *if* BIOS is still involved (with RAID I don't know) and if you have "fail-over" enabled in the BIOS (try C:, if that fails, D:,..., floppy, CD,... etc.), the BIOS should change the device ID so that what was 0x81 becomes 0x80 (D: becomes C:).
Although that is in "Winspeak", the benefits to Linux accrue.
This presumes no earth-shattering changes in basic BIOS operations in the last 6/7 years. We had EBDA then and Extended System Configuration, so hopefully things are still similar.
Thanks, Johnny Hughes
Johnny Hughes wrote on Sat, 10 Jun 2006 13:03:49 -0500:
In grub ... hd0 is the first found hard drive ... hd1 is the second found hard drive, etc.
Yeah, you are right, thanks. I just reread parts of the grub documentation; it enumerates drives in the order the BIOS finds them.
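(That also matches the /boot/grub/device.map file grub keeps; on a two-IDE-disk box like this one it would typically contain something like the following - shown from memory, so treat it as an example:)

(hd0)   /dev/hda
(hd1)   /dev/hdb

When the first disk is gone and the BIOS boots the survivor, that survivor simply becomes (hd0), which is presumably why the identical (hd0,0) default entry keeps working.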
Kai
One thing that I think you may be missing is this ...
In grub ... hd0 is the first found hard drive ... hd1 is the second found hard drive, etc.
IF ... you remove the primary hard drive (ie power it off, and unplug the cable) ... then the drive that is left (that used to be hd1) is now hd0
So ... when booting, it will be seen as hd0 and you won't have an hd1.
Well ... at least that is what I have experienced in the past ... maybe someone else who is smarter than me would care to comment / verify this behavior.
I won't say that I am smarter than Johnny but I will verify that this is the behaviour of grub and is so documented.
"Note that GRUB does not distinguish IDE from SCSI - it simply counts the drive numbers from zero, regardless of their type. Normally, any IDE drive number is less than any SCSI drive number, although that is not true if you change the boot sequence by swapping IDE and SCSI drives in your BIOS."
found at http://www.gnu.org/software/grub/manual/html_node/Naming-convention.html
So having the default menu item load from the first disk works in this case.
On Sun, 2006-06-11 at 10:32, Feizhou wrote:
Well ... at least that is what I have experienced in the past ... maybe someone else who is smarter than me would care to comment / verify this behavior.
I won't say that I am smarter than Johnny but I will verify that this is the behaviour of grub and is so documented.
"Note that GRUB does not distinguish IDE from SCSI - it simply counts the drive numbers from zero, regardless of their type. Normally, any IDE drive number is less than any SCSI drive number, although that is not true if you change the boot sequence by swapping IDE and SCSI drives in your BIOS."
found at http://www.gnu.org/software/grub/manual/html_node/Naming-convention.html
So having the default menu item load from the first disk works in this case.
Hmmm, maybe the machines where I've had odd behavior have been mixed ide/scsi with bios set to boot from the 1st scsi. But some of the newer ones make you specify in the bios setup the complete order of boot devices and don't like it if you swap in a different sized device of the same type without running the setup again.
On Sat, 2006-06-10 at 12:31, Kai Schaetzl wrote:
If you followed the HOWTO you may have accounted for the different drive in the grub setup.
I don't know what you mean. I followed the howto and everything works. The point is: in my opinion it should *not* work. No matter which drive I remove it always boots successfully from the default 0 label, although that is on hd0 which is gone if I remove that drive.
I think this depends on your bios - they may or may not all map the first hd they find into the first bios slot. But it also depends on it not detecting the failed drive at all, which probably won't happen until you open the case and unplug it.
Kai Schaetzl spake the following on 6/8/2006 8:36 AM:
William L. Maltby wrote on Tue, 06 Jun 2006 15:59:13 -0400:
I can't be much help at all
Well, you made the "mistake" of replying ;-) Thanks, anyway. Read down before you start commenting, "solution" will be at the end :-)
Is it correct that the raid must come into play after initial boot *if* one is to be able to boot on the remaining disk after primary fails?
Not sure what you mean exactly by that, but more or less, yes. The problem wasn't the RAID but grub not wanting to boot automatically or not at all.
My thinking is that if raid is in play when grub is installed, does it mess up where things are put? If not, does it mess up when there's a failure (raid 0,....5... I guess would affect that).
You would want to boot from the disk that did *not* fail, or at least resync in the right direction. I think that's also the reason why the resync isn't done automatically: if you happen to boot from the "wrong" disk (given that it still works but is just a bit "different"), it might then resync in the wrong direction.

I'm not quite sure how to determine which is the "correct" disk, though. dmesg (or was it boot.log?) shows a lot of lines like "considering /dev/hda1" and so on from the RAID detection. It binds the correct disk to the array and then "considers" the next partition. When it encounters the "dropped" partition it won't use that as the "master" for an md partition, because that md partition is already active. So all is well and I can resync then. However, if it were to do this in another order, e.g. first "consider" the failed disk, I don't know what would happen. Maybe it goes by some timestamp; in that case the failed disk should always have an older timestamp, since it dropped out of the array. Or it uses another method to mark that disk as "stale" or whatever. In that case I think one doesn't have to worry about booting from the wrong disk.
Is it boot from boot partition, install grub, copy to second disk.
No, I installed a normal RAID setup with the CentOS install CD on both disks (no LVM) and then installed grub to the second disk additionally after setup had finished. That could be done from the setup screen as well, but it's not supported: you can only choose which disk to install to, not both of them. A serious flaw in the setup screen I think, but one made at the upstream vendor. I think your scenario above would fit very well if one of the drives failed and got replaced. I'd then need to copy the partition structure to it, resync and then set up grub on it (or set up grub first and then resync). I'm not sure how I should copy the partition structure. The grub+RAID howto (3 years old) gives the following for copying the structure:
# sfdisk - /dev/hdx | sfdisk /dev/hdy
Does this sound like the right way?
Pardon my chiming in, but this is an adequate way to copy the partition data, and that is what I used on my software raid systems. Don't forget that the first part is sfdisk -d, not just sfdisk -. Just be careful that you specify the correct devices in the right places, otherwise you will have 2 blank disks! You could use "sfdisk -d /dev/hdx > somefile" and you would have a backup of the partition structure just in case.
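Spelled out, with hdx as the surviving disk and hdy as the blank replacement (placeholders as in the howto - double-check the device names before running anything):

# keep a backup of the good disk's partition table first
sfdisk -d /dev/hdx > partition-table.hdx
# then replay the same layout onto the replacement disk
sfdisk -d /dev/hdx | sfdisk /dev/hdy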
Is grub (stage 1?) installed in *partition* or in mbr? If in partition, partition must be made active? Both drives?
That's all done by the setup (I checked later). I'm not sure "where" grub gets installed. AFAIK there's a boot loader put into the MBR that will then load grub from the partition it was installed to (which is mounted as /boot).
If raid is active when you try to write initial loader to mbr/partition, does it mess it up?
Doesn't look like so. I think that's because MBR isn't part of the RAID.
Can you get to where you can boot one disk w/o raid active?
It seems that it gets automatically detected that the disk is part of a RAID array. So, if you just boot one disk it is still booted as RAID, just with one active disk.
It would be aggravating if it was nothing raid related, but some hardware/user/software error making you think it was your raid doing it to you.
I was sure it wasn't the RAID, the problem was grub.
I'd appreciate if anyone reading my comments and sees some need for correction or improvements do so, thanks.
Ok, what I did is start over. First I installed again by just formatting the partitions but keeping the partition structure. That gave me more or less the same result; it seems the grub stuff didn't get properly overwritten that way. So I removed the partitions, cleaned the whole disk, and did another installation. This time everything seems to be fine. However, I tested only by shutting down, removing one of the disks and then booting up. I can pull out either of the disks, boot from the other and resync the pulled out one. I didn't test dropping one of the disks in the middle of operation yet.
Don't do that! Don't test by pulling a running disk unless it is in hotplug capable hardware. Test by using mdadm to remove that drive.
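For example (the md/partition names are only placeholders for your actual mirrors):

# mark one half of the mirror as failed, then pull it out of the array
mdadm /dev/md1 --fail /dev/hdb2
mdadm /dev/md1 --remove /dev/hdb2
# later, add it back and let it re-sync
mdadm /dev/md1 --add /dev/hdb2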
There are two things I realized:
- disks need to run on different IDE controllers, it seems. So one has to be primary master, the other secondary master; at least for the boot disks, and that's how it works here. I first had them hanging as master and slave on the same controller, and then I got the boot disk failure I mentioned in my first posting when the master was pulled out. The good thing is that I was able to make this change without reinstalling or setting up the system again: it recognized the disk moving from hdb to hdc and I just needed to resync.
That info is in the software raid howto. Some systems are more tolerant than others, but usually a failed drive will lock the entire channel, so the primary and the slave would go down.
- the grub.conf doesn't seem to matter much. Excerpt:
default=0 fallback=2
title CentOS (2.6.9-34.0.1.EL)
        root (hd0,0)
        kernel /vmlinuz-2.6.9-34.0.1.EL ro root=/dev/md1
        initrd /initrd-2.6.9-34.0.1.EL.img
title CentOS 2
        root (hd1,0)
        kernel /vmlinuz-2.6.9-34.0.1.EL ro root=/dev/md1
        initrd /initrd-2.6.9-34.0.1.EL.img
Here it always boots entry 0, no matter whether disk 0 is powered down, or powered up but no longer active in the array. It somehow finds the correct setup anyway.
Kai
Scott Silva wrote on Thu, 08 Jun 2006 09:29:12 -0700:
Pardon my chiming in,
why should I take offense? Thanks!
but this is an adequate way to copy the partition data, and that is what I used on my software raid systems. Don't forget that the first part is sfdisk -d, not just sfdisk -.
Yeah, my typing! Thanks for the confirmation, I'll put it in my basket of valuable snippets.
the pulled out one. I didn't test dropping one of the disks in the middle of operation yet.
Don't do that! Don't test by pulling a running disk unless it is in hotplug capable hardware. Test by using mdadm to remove that drive.
That's not a real test ;-) I can test and learn quite a few things in less harmful ways, but I don't know what happens if I rip it out in the middle of operation. After all, that's what's going to happen when it really fails. I did it once already and the drive survived; I'll do it again. I use two old 10 GB drives for testing. I'd regret it if I lost one of them, since after that I have only *very* old drives for further testing, but it's not a real problem.
There are two things I realized:
- disks need to run on different IDE controllers it seems.
That info is in the software raid howto. Some systems are more tolerant than others, but usually a failed drive will lock the entire channel, so the primary and the slave would go down.
Yeah, I guess I didn't read that part of the how-to. On a non-testing machine I wouldn't have put the drives on one channel anyway, but in this case it was the easiest and fastest option. And I learned something from that :-) Actually, they didn't both go down, but the bootup failed. There were a whole lot of IDE errors on the console, though, after I pulled the cable.
Kai
Kai Schaetzl spake the following on 6/9/2006 9:31 AM:
Scott Silva wrote on Thu, 08 Jun 2006 09:29:12 -0700:
Pardon my chiming in,
why should I take offense? Thanks!
but this is an adequate way to copy the partition data, and that is what I used on my software raid systems. Don't forget that the first part is sfdisk -d, not just sfdisk -.
Yeah, my typing! Thanks for the confirmation, I'll put it in my basket of valuable snippets.
the pulled out one. I didn't test dropping one of the disks in the middle of operation yet.
Don't do that! Don't test by pulling a running disk unless it is in hotplug capable hardware. Test by using mdadm to remove that drive.
That's not a real test ;-) I can test and learn quite a few things in less harmful ways, but I don't know what happens if I rip it out in the middle of operation. After all, that's what's going to happen when it really fails. I did it once already and the drive survived; I'll do it again. I use two old 10 GB drives for testing. I'd regret it if I lost one of them, since after that I have only *very* old drives for further testing, but it's not a real problem.
There are two things I realized:
- disks need to run on different IDE controllers it seems.
That info is in the software raid howto. Some systems are more tolerant than others, but usually a failed drive will lock the entire channel, so the primary and the slave would go down.
Yeah, I guess I didn't read that part of the how-to. On a non-testing machine I wouldn't have put the drives on one channel anyway, but in this case it was the easiest and fastest option. And I learned something from that :-) Actually, they didn't both go down, but the bootup failed. There were a whole lot of IDE errors on the console, though, after I pulled the cable.
Kai
Actually, a failed drive will time out and error, but yanking it out while running could "potentially" create a spike that could theoretically kill the entire machine. Why don't you at least get a couple of removable IDE trays with keylocks? They run about $10 US each, and turning the key will just kill the power to the drive and not do more damage.
On Thu, 2006-06-08 at 17:36 +0200, Kai Schaetzl wrote:
William L. Maltby wrote on Tue, 06 Jun 2006 15:59:13 -0400:
I can't be much help at all
Well, you made the "mistake" of replying ;-) Thanks, anyway. Read down before you start commenting, "solution" will be at the end :-)
Nothing to say but thanks. I now have something for when I try my first raid. And I see that Les and others have now replied, so knowledge is propagated.
<snip>
A copy of this is now in my "Valuable Posts" file. Thanks!
Kai