[CentOS] Kernel Panic on HP/Compaq ProLiant G7

Fri Apr 1 17:44:31 UTC 2011
Windsor Dave L. (AdP/TEF7) <Dave.Windsor at us.bosch.com>




On 3/24/2011 11:03 AM, Windsor Dave L. (AdP/TEF7.1) wrote:
> Hello Everyone,
>
> I recently installed CentOS 5.5 x86_64 on a brand new ProLiant DL380 G7.  I have identical OS software running rock-solid on two other DL380 ProLiant servers, but they are G6 models, not G7.  On the G7, the installation went perfectly and the machine ran great for about 2 weeks, when it just seemed to "stop".  The system stopped responding on the network, and there was no video on the console (or remote console via iLO).  It would not reboot or cold boot through iLO; I actually had to hold the power button to turn it off and then press it again to power up.
>
> This happened several times within a few days of each other.  Each time, there was no evidence in any logs of a problem - the system just seemed to stop or lock up.   We did have a CPU problem light appear on the front, so HP came in and replaced the one 4-core CPU.  Since then, it has run as long as two weeks, but still crashes randomly.  After the last reboot, I left the console in text mode on vt1, and when it crashed again this morning this was displayed on the screen:
>
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: ffff8100dc435cf0  CR3: 000000008a6ca000 CR4: 00000000000006e0
> Process smbd (pid: 18970, threadinfo ffff81001529e000, task ffff81011f5347a0)
<snipped>
>   <0>Kernel panic - not syncing: Fatal exception

OK everyone, here is an update:

The server crashed again overnight. This time, the following error 
messages were on the console:

     HARDWARE ERROR
     CPU 3: Machine Check Exception: 4 Bank 5: ba00000000400405
     TSC 5172b45d44f0a MISC 80
     This is not a software problem!
     Run through mcelog --ascii to decode and contact your hardware vendor

     HARDWARE ERROR
     CPU 7: Machine Check Exception: 4 Bank 5: ba00000000400405
     TSC 5172b45d45bba MISC 80
     This is not a software problem!
     Run through mcelog --ascii to decode and contact your hardware vendor

     HARDWARE ERROR
     CPU 5: Machine Check Exception: 4 Bank 8: 0000000000000000
     TSC 0
     This is not a software problem!
     Run through mcelog --ascii to decode and contact your hardware vendor
     Kernel panic - not syncing: Uncorrected machine check

After reboot, running the first error through mcelog --ascii gives

     CPU 3: Machine Check Exception: 4 Bank 5: ba00000000400405
     HARDWARE ERROR. This is *NOT* a software problem!
     Please contact your hardware vendor
     mcelog: Unknown Intel CPU type family 6 model 2c

     CPU 3 BANK 5 MCG status:MCIP
     MCi status:
     Uncorrected error
     Error enabled
     MCi_MISC register valid
     Processor context corrupt
     MCA: Internal unclassified error: 405
     STATUS ba00000000400405 MCGSTATUS 4

The second error gives

     CPU 7: Machine Check Exception: 4 Bank 5: ba00000000400405
     HARDWARE ERROR. This is *NOT* a software problem!
     Please contact your hardware vendor
     mcelog: Unknown Intel CPU type family 6 model 2c

     CPU 7 BANK 5 MCG status:MCIP
     MCi status:
     Uncorrected error
     Error enabled
     MCi_MISC register valid
     Processor context corrupt
     MCA: Internal unclassified error: 405
     STATUS ba00000000400405 MCGSTATUS 4

And the third gives

     CPU 3: Machine Check Exception: 4 Bank 5: ba00000000400405
     HARDWARE ERROR. This is *NOT* a software problem!
     Please contact your hardware vendor
     mcelog: Unknown Intel CPU type family 6 model 2c

     CPU 3 BANK 5 MCG status:MCIP
     MCi status:
     Uncorrected error
     Error enabled
     MCi_MISC register valid
     Processor context corrupt
     MCA: Internal unclassified error: 405
     STATUS ba00000000400405 MCGSTATUS 4
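
As a cross-check on the mcelog output above, the STATUS and MCGSTATUS
values can also be decoded by hand. The following is only a minimal
sketch: the bit positions are assumed from the architectural
IA32_MCi_STATUS and IA32_MCG_STATUS layouts in the Intel manuals, the
function and table names are made up for illustration, and nothing here
interprets the model-specific bits for this particular CPU.

```python
# Sketch: decode the architectural flag bits of a machine-check status word.
# Assumption: bit positions follow the generic IA32_MCi_STATUS layout.
STATUS_FLAGS = {
    63: "VAL (status register valid)",
    62: "OVER (error overflow)",
    61: "UC (uncorrected error)",
    60: "EN (error enabled)",
    59: "MISCV (MCi_MISC register valid)",
    58: "ADDRV (MCi_ADDR register valid)",
    57: "PCC (processor context corrupt)",
}

# Assumption: generic IA32_MCG_STATUS layout (low three bits only).
MCG_FLAGS = {0: "RIPV", 1: "EIPV", 2: "MCIP"}


def decode_status(status):
    """Split a 64-bit MCi_STATUS value into flag names and error codes."""
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,              # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


def decode_mcgstatus(value):
    """Return the names of the flags set in an MCG_STATUS value."""
    return [name for bit, name in sorted(MCG_FLAGS.items()) if (value >> bit) & 1]


if __name__ == "__main__":
    decoded = decode_status(0xBA00000000400405)
    for flag in decoded["flags"]:
        print(flag)
    print("MCA error code: %#x" % decoded["mca_error_code"])
    print("Model-specific code: %#x" % decoded["model_specific_code"])
    print("MCG status:", ", ".join(decode_mcgstatus(4)))
```

For ba00000000400405 this reports the UC, EN, MISCV and PCC flags that
mcelog printed, plus VAL, and the low 16 bits come out as 405, the same
code mcelog shows in its "Internal unclassified error" line. The all-zero
status from the Bank 8 console message decodes to no flags at all, so
VAL is clear for that entry.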

I have been able to move all workloads onto other servers.  As at least 
two people suggested, I booted from the HP SmartStart CD and ran 100 
loops of the system diagnostics and tests, concentrating on the memory 
and CPU.  No problems were found.  I think I will run memtest86 over the 
weekend.

We have placed a hardware support call with HP.

Best Regards,

Dave Windsor

Robert Bosch LLC
Team Leader, MES Database Infrastructure Group (AdP/TEF7.1)
4421 Highway 81 North
Anderson, SC 29621 USA
www.bosch.us