[CentOS] Upgrading to CentOS 4.7 on HP DL580G5 caused problems

Sun Oct 26 21:09:18 UTC 2008
Dr Les Oswald <L.Oswald at cranfield.ac.uk>

While patching a cluster that has two DL580G5 login nodes (4x Intel 
7300 DC CPUs) and 24 HP DL160G5 compute nodes (2x Intel 5272 DC CPUs), 
we encountered an issue that I would like to record:

I upgraded both DL580s to CentOS 4.7 via yum but initially rebooted 
only one. This node, previously bomb-proof, started to hang randomly, 
with no obvious messages logged to help with diagnosis.

In the dmesg output I found this sequence, which I had never seen before:

Uhhuh. NMI received for unknown reason 20.
Dazed and confused, but trying to continue
Do you have a strange power saving mode enabled?
Uhhuh. NMI received for unknown reason 30.
Dazed and confused, but trying to continue

(repeated several times)
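
For anyone wanting to check how often these messages occur on their own 
nodes, here is a minimal Python sketch. It assumes the dmesg output has 
been saved as text; the function name count_nmi_reasons and the SAMPLE 
string are illustrative only, with SAMPLE copied from the lines quoted 
above. It only counts the reason codes as printed and does not attempt 
to interpret them.

```python
import re
from collections import Counter

# Matches the kernel's "NMI received for unknown reason NN" line and
# captures the reason code exactly as printed.
NMI_RE = re.compile(r"NMI received for unknown reason ([0-9a-fA-F]+)")


def count_nmi_reasons(log_text):
    """Return a Counter mapping each reason code to its number of occurrences."""
    return Counter(m.group(1) for m in NMI_RE.finditer(log_text))


# Sample text copied from the messages quoted in this post.
SAMPLE = """\
Uhhuh. NMI received for unknown reason 20.
Dazed and confused, but trying to continue
Do you have a strange power saving mode enabled?
Uhhuh. NMI received for unknown reason 30.
Dazed and confused, but trying to continue
"""

if __name__ == "__main__":
    for reason, n in sorted(count_nmi_reasons(SAMPLE).items()):
        print(f"reason {reason}: {n}")
```

To run it against a real node, one could replace SAMPLE with the 
contents of a file produced by "dmesg > dmesg.txt".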

Googling revealed many different scenarios for this boot-time error 
message, some suggesting a memory error. Oh joy, these two machines 
have 64GB RAM each.

I then changed grub.conf to boot the previous kernel, 
2.6.9-67.0.15.ELsmp, instead of the updated 2.6.9-78.0.5.ELsmp.
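
In GRUB legacy, the "default=" line in grub.conf holds the zero-based 
index of the "title" stanza to boot, so falling back to an older kernel 
usually means pointing it at that stanza. The Python sketch below works 
out the index for a given kernel version. The grub.conf text in 
EXAMPLE_CONF is a hypothetical layout, not copied from the affected 
machines, and grub_default_index is an illustrative helper name.

```python
def grub_default_index(grub_conf_text, kernel_version):
    """Return the zero-based index of the first 'title' stanza whose
    kernel line references vmlinuz-<kernel_version>, or None if absent."""
    index = -1  # no title stanza seen yet
    wanted = "vmlinuz-" + kernel_version
    for line in grub_conf_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("title"):
            index += 1
        elif stripped.startswith("kernel") and index >= 0:
            # Compare whole tokens so a shorter version string cannot
            # match a longer one by prefix.
            if any(tok.endswith(wanted) for tok in stripped.split()):
                return index
    return None


# Hypothetical grub.conf contents, for illustration only.
EXAMPLE_CONF = """\
default=0
timeout=5
title CentOS (2.6.9-78.0.5.ELsmp)
        root (hd0,0)
        kernel /vmlinuz-2.6.9-78.0.5.ELsmp ro root=LABEL=/
        initrd /initrd-2.6.9-78.0.5.ELsmp.img
title CentOS (2.6.9-67.0.15.ELsmp)
        root (hd0,0)
        kernel /vmlinuz-2.6.9-67.0.15.ELsmp ro root=LABEL=/
        initrd /initrd-2.6.9-67.0.15.ELsmp.img
"""

if __name__ == "__main__":
    idx = grub_default_index(EXAMPLE_CONF, "2.6.9-67.0.15.ELsmp")
    print(f"set default={idx} to boot the older kernel")
```

With the example layout above, the older kernel is the second stanza, 
so the value to set would be default=1.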

The boot-time error messages immediately went away, and so far the 
systems have been reliable.

Has anyone an explanation or confirmation that they have seen or 
overcome the above issue? I should mention that the DL160 compute nodes 
have not exhibited this behaviour at all.

Les Oswald
