We have Xeon IA32 dual-processor servers running CentOS 3.5 in an HPC batch-only compute grid configuration. We have yum update operating automatically, with default updates being applied weekly. Because of the workload pattern of long-running jobs, the servers tend to stay up without a reboot for very long periods.
Recently, yum installed an updated kernel 2.4.21-32.0.1.ELsmp; when we got around to rebooting, we found that some of the machines were running the uniprocessor kernel 2.4.21-32.0.1.EL, showing only a single CPU. The grub.conf file had been modified in the usual pushdown manner but the default kernel had been set at #2 instead of #0.
Bizarrely, some of the systems DID boot the upgraded SMP kernel as expected.
Here is the grub.conf from an affected server:
default=2
timeout=10
splashimage=(hd0,0)/grub/splash.xpm.gz
title CentOS (2.4.21-32.0.1.ELsmp)
        root (hd0,0)
        kernel /vmlinuz-2.4.21-32.0.1.ELsmp ro root=LABEL=/
        initrd /initrd-2.4.21-32.0.1.ELsmp.img
title CentOS (2.4.21-32.0.1.EL)
        root (hd0,0)
        kernel /vmlinuz-2.4.21-32.0.1.EL ro root=LABEL=/
        initrd /initrd-2.4.21-32.0.1.EL.img
title CentOS-3 (2.4.21-32.ELsmp)
        root (hd0,0)
        kernel /vmlinuz-2.4.21-32.ELsmp ro root=LABEL=/
        initrd /initrd-2.4.21-32.ELsmp.img
title CentOS-3-up (2.4.21-32.EL)
        root (hd0,0)
        kernel /vmlinuz-2.4.21-32.EL ro root=LABEL=/
        initrd /initrd-2.4.21-32.EL.img
Changing the default back to 0 has no effect: it still boots the 2.4.21-32.0.1.EL kernel and not the required SMP one. However, if we use the interactive GRUB boot menu and select the correct kernel by hand, it then boots SMP OK with both processors and all memory available.
I tried the obvious ploy of removing the last three kernel entries in grub.conf and setting default=0, but it still manages to boot the 2.4.21-32.0.1.EL UP kernel even though it is no longer in the kernel menu list.
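One way to cross-check the symptoms above is to ask which menu entry a given grub.conf actually selects and compare that with the running kernel. Note that in the file quoted above, default=2 counts from zero and so points at the third entry (2.4.21-32.ELsmp), not at the UP kernel that was observed, which may hint that the file being edited is not the one GRUB reads. The following is a minimal sketch, not a definitive tool: the function name grub_default_title is made up for illustration, and it assumes a GRUB legacy config where default= appears before the title lines.

```shell
#!/bin/sh
# Hypothetical helper: print the title of the entry selected by "default=N"
# in a GRUB legacy config. Entries are numbered from 0 in file order.
# Usage: grub_default_title [path]   (path defaults to /boot/grub/grub.conf)
grub_default_title() {
    conf="${1:-/boot/grub/grub.conf}"
    awk '
        BEGIN { n = 0; def = 0 }                 # no default= line means entry 0
        /^[ \t]*#/ { next }                      # ignore comment lines
        /^[ \t]*default[ \t]*=/ {                # remember the default index
            sub(/^[ \t]*default[ \t]*=[ \t]*/, "")
            def = $0 + 0
            next
        }
        /^[ \t]*title[ \t]/ {                    # each title starts a new entry
            if (n == def) {
                sub(/^[ \t]*title[ \t]+/, "")
                print
                found = 1
                exit
            }
            n++
        }
        END { if (!found) exit 1 }               # non-zero status if no such entry
    ' "$conf"
}

# Example: compare what the config selects with what is actually running.
#   grub_default_title /boot/grub/grub.conf
#   uname -r
```

If the title printed here names a different kernel from the one `uname -r` reports after a plain reboot, the config on disk and the config GRUB uses are probably not the same thing.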
We think we will disable automatic yum kernel updates in future, but meanwhile, has anyone any suggestions or experiences to share on this, apart from a complete re-install of each affected node?
Les Oswald
Dr R L Oswald wrote:
Recently, yum installed an updated kernel 2.4.21-32.0.1.ELsmp; when we got around to rebooting, we found that some of the machines were running the uniprocessor kernel 2.4.21-32.0.1.EL, showing only a single CPU. The grub.conf file had been modified in the usual pushdown manner but the default kernel had been set at #2 instead of #0.
I had an extremely similar thing happen to me on a RHEL3 box when I applied update 5. I didn't notice it for a few days, so I just assumed I had made a mistake, but your post inclines me to believe it's a bug in the upgrade scripts.
-jim
Jim Bartus wrote:
Dr R L Oswald wrote:
Recently, yum installed an updated kernel 2.4.21-32.0.1.ELsmp; when we got around to rebooting, we found that some of the machines were running the uniprocessor kernel 2.4.21-32.0.1.EL, showing only a single CPU. The grub.conf file had been modified in the usual pushdown manner but the default kernel had been set at #2 instead of #0.
I had an extremely similar thing happen to me on a RHEL3 box when I applied update 5. I didn't notice it for a few days, so I just assumed I had made a mistake, but your post inclines me to believe it's a bug in the upgrade scripts.
It happened to me too. I figured I was the one who had made an error and manually fixed the "broken" ones. Mine were a mixture of HT-enabled P4s (which would normally run the SMP kernel) and some dual Opteron machines. These are all rack servers configured identically, so I'm not sure what triggered the behaviour in some of them without affecting them all. Weird.
Cheers,
I'd recommend comparing the BIOS versions with something like: "dmidecode | grep -A4 BIOS" to see if there's a correlation between the ones that got the UP kernel instead of the SMP kernel.
-Jay
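To look for the correlation suggested above across many nodes, it may help to reduce the dmidecode output to just the BIOS version so it can be tabulated next to the running kernel. This is only a sketch: bios_version is a made-up helper name, the host names in the comment are hypothetical, and it assumes dmidecode prints a "BIOS Information" block containing a "Version:" line.

```shell
#!/bin/sh
# Hypothetical helper: read `dmidecode` output on stdin and print the
# BIOS version string from the "BIOS Information" block.
bios_version() {
    awk '
        /BIOS Information/ { in_bios = 1; next }   # start of the BIOS block
        in_bios && /^[ \t]*Version:/ {             # first Version: line after it
            sub(/^[ \t]*Version:[ \t]*/, "")
            print
            exit
        }
    '
}

# Example on one machine (dmidecode needs root):
#   dmidecode | bios_version
#
# Example across nodes (node01 and node02 are placeholder host names):
#   for h in node01 node02; do
#       printf '%s %s %s\n' "$h" "$(ssh "$h" uname -r)" \
#           "$(ssh "$h" /usr/sbin/dmidecode | bios_version)"
#   done
```

Sorting the resulting lines by kernel version should make it obvious whether the nodes that came up on the UP kernel share a BIOS revision.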
Good Day Les,
First off, I tried a reply off list, but your mailbox is not responding to direct input. My apologies to the list for an off-topic reply...
I'm sorry I can't help with your question, but I did want to ask, since you appear to be running a similar setup to the one I have here with the dual Xeons. Might I ask what software you are running on the machines? I'm looking for someone with a handle on the WRF or MM5 numerical model who has gotten either or both models to compile with either the Intel compiler or the Portland Group compiler. I have not switched to yum for automatic software updates, rather staying with up2date as it came on installation. I'm very new to CentOS and have not done much in the way of updating except what comes down the tubes from RH. So far, I've not seen any kernel updates to date...
Regards,
Sam
rather staying with up2date as it came on installation. I'm very new to CentOS and have not done much in the way of updating except what comes down the tubes from RH. So far, I've not seen any kernel updates to date...
Hi Sam,
I believe the default for up2date is to exclude kernel updates. Run "up2date -config" and check the second tab.
-Jay
Hi Jay,
The command is --configure :-) Close, though. I see from that that the kernels are in fact excluded. Is this a "good thing" (tm), or should I let it update the kernels too?
Sam Drinkard wrote:
Hi Jay,
The command is --configure :-) Close, though. I see from that that the kernels are in fact excluded. Is this a "good thing" (tm), or should I let it update the kernels too?
If things are working for you, you don't necessarily have to upgrade the kernel. If you're running kernel 2.6.9-11, then you're most likely already running the latest (at least until update 2 is released).
You could always remove that kernel exclusion, then opt whether or not to put a checkmark next to the kernel when you run up2date.
Keep in mind that if you use nvidia's graphics driver, or other kernel modules added after the fact, you'll have to re-install/re-build them.
-Jay
I'm glad you mentioned that, Jay. I've got the ATI graphics drivers installed, and that is a kernel module or patch, whichever the case may be. You are correct: if it ain't broke, I sure am not going to try to fix it!
hey,
Dr R L Oswald wrote:
Recently, yum installed an updated kernel 2.4.21-32.0.1.ELsmp; when we got around to rebooting, we found that some of the machines were running the uniprocessor kernel 2.4.21-32.0.1.EL, showing only a single CPU.
Easy workaround: just yum erase the UP kernel package; that way the system can only come back with an SMP kernel.
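Before erasing anything, it is worth listing exactly which uniprocessor kernel packages are installed. A minimal sketch follows, assuming the usual Red Hat naming where the UP kernel package is called kernel and the SMP one kernel-smp; list_up_kernels is a made-up helper name, and the removal itself is deliberately left as a manual step.

```shell
#!/bin/sh
# Hypothetical helper: from installed kernel package names on stdin, keep only
# the uniprocessor ones, i.e. "kernel-<version>" and not "kernel-smp-<version>".
list_up_kernels() {
    grep '^kernel-[0-9]'
}

# Example: show UP kernel packages next to the kernel currently running.
#   rpm -q kernel kernel-smp | list_up_kernels
#   uname -r
#
# Each name printed can then be removed as suggested above (yum erase), or with
# "rpm -e <name>". On an affected node the running kernel is itself the UP one,
# so check that an SMP entry is bootable from the GRUB menu first.
```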
The grub.conf file had been modified in the usual pushdown manner but the default kernel had been set at #2 instead of #0.
Do you have a sample from before the update? Also, what version of mkinitrd do you have installed on these machines? Was that updated at the same time as the kernel? And what does /etc/redhat-release say?
Here is the grub.conf from an affected server:
default=2
timeout=10
splashimage=(hd0,0)/grub/splash.xpm.gz
Changing the default back to 0 has no effect, it still boots the 2.4.21-32.0.1.EL kernel and not the required SMP one. However, if we use
Disable the splashimage and reboot the machine with default=0. What kernel version is highlighted as the default?
I tried the obvious ploy of removing the last three kernel entries in grub.conf & setting default=0 but it still manages to boot the 2.4.21-32.0.1.EL UP kernel even though it is no longer in the kernel menu list.
Are you sure the grub.conf you are editing is indeed the one that is being used? (It should be the /boot/grub/grub.conf file.) What does 'parted <bootdev> print' say?
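A quick way to test the "wrong file" theory is to check whether the usual config paths all resolve to one and the same file. The sketch below is illustrative only: check_grub_conf_links is a made-up name, and it assumes the common Red Hat layout in which /etc/grub.conf and /boot/grub/menu.lst are symlinks to /boot/grub/grub.conf, which may not hold on every install.

```shell
#!/bin/sh
# Hypothetical helper: report whether each candidate path is the same file
# (same device and inode, symlinks followed) as the reference config.
# Usage: check_grub_conf_links REFERENCE CANDIDATE...
check_grub_conf_links() {
    ref="$1"
    shift
    for p in "$@"; do
        if [ ! -e "$p" ]; then
            echo "$p: missing"
        elif [ "$p" -ef "$ref" ]; then
            echo "$p: same file as $ref"
        else
            echo "$p: DIFFERENT file from $ref"
        fi
    done
}

# Example (paths are the assumed Red Hat defaults):
#   check_grub_conf_links /boot/grub/grub.conf /etc/grub.conf /boot/grub/menu.lst
```

A "DIFFERENT" or "missing" line would be consistent with edits landing in a file GRUB never reads. It does not cover the case where GRUB was installed from a different partition, which is what the parted output is meant to reveal.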
We think we will disable automatic yum kernel updates in future, but meanwhile, has anyone any suggestions or experiences to share on this, apart from a complete re-install of each affected node?
I would suggest you provide some more info, and also try to reinstall grub. At the very least, grub should accommodate changes being made in the /boot/grub/grub.conf file.
fwiw, I've tried to reproduce this issue here on a CentOS3/i386 SMP machine [1] and am unable to do so. The kernel update installs and sets up grub.conf fine.
- K
[1] CentOS 3.4 install and yum update from there.