On 3/24/2011 11:03 AM, Windsor Dave L. (AdP/TEF7.1) wrote:
Hello Everyone,
I recently installed CentOS 5.5 x86_64 on a brand new ProLiant DL380 G7. I have identical OS software running rock-solid on two other DL380 ProLiant servers, but they are G6 models, not G7. On the G7, the installation went perfectly and the machine ran great for about 2 weeks, when it just seemed to "stop". The system stopped responding on the network, and there was no video on the console (or on the remote console via iLO). It would not reboot or cold boot through iLO; I actually had to hold the power button to turn it off and then press it again to power up.
This happened several times within a few days of each other. Each time, there was no evidence in any logs of a problem - the system just seemed to stop or lock up. We did have a CPU problem light appear on the front, so HP came in and replaced the one 4-core CPU. Since then, it has run as long as two weeks, but still crashes randomly. After the last reboot, I left the console in text mode on vt1, and when it crashed again this morning this was displayed on the screen:
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff8100dc435cf0 CR3: 000000008a6ca000 CR4: 00000000000006e0
Process smbd (pid: 18970, threadinfo ffff81001529e000, task ffff81011f5347a0)
<snipped>
<0>Kernel panic - not syncing: Fatal exception
OK everyone, here is an update:
The server crashed again overnight. This time, the following error messages were on the console:
HARDWARE ERROR
CPU 3: Machine Check Exception: 4 Bank 5: ba00000000400405
TSC 5172b45d44f0a MISC 80
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
HARDWARE ERROR
CPU 7: Machine Check Exception: 4 Bank 5: ba00000000400405
TSC 5172b45d45bba MISC 80
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
HARDWARE ERROR
CPU 5: Machine Check Exception: 4 Bank 8: 0000000000000000
TSC 0
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
Kernel panic - not syncing: Uncorrected machine check
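For anyone who wants to compare crashes side by side, the three console messages above can be pulled apart mechanically. The following is only a rough sketch in Python: the function name parse_mce_lines and the regular expression are my own, and the pattern assumes the "CPU n: Machine Check Exception: x Bank n: status" wording shown above, which may differ on other kernel versions.

```python
import re

# Hypothetical pattern matching the console wording quoted above.
# \s+ is used because the kernel may pad the fields with extra spaces.
MCE_LINE = re.compile(
    r"CPU\s+(?P<cpu>\d+):\s+Machine Check Exception:\s+(?P<mcg>[0-9a-fA-F]+)"
    r"\s+Bank\s+(?P<bank>\d+):\s+(?P<status>[0-9a-fA-F]+)"
)


def parse_mce_lines(text):
    """Return one dict per machine check message found in text."""
    records = []
    for match in MCE_LINE.finditer(text):
        records.append({
            "cpu": int(match.group("cpu")),
            "mcgstatus": int(match.group("mcg"), 16),
            "bank": int(match.group("bank")),
            "status": int(match.group("status"), 16),
        })
    return records


# The three messages from the console, reduced to the lines that matter.
console = """
CPU 3: Machine Check Exception: 4 Bank 5: ba00000000400405
CPU 7: Machine Check Exception: 4 Bank 5: ba00000000400405
CPU 5: Machine Check Exception: 4 Bank 8: 0000000000000000
"""

for rec in parse_mce_lines(console):
    print("CPU %d bank %d status %016x" % (rec["cpu"], rec["bank"], rec["status"]))
```

Keeping the status as an integer makes it easy to check whether later crashes report the same bank and status value.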
After reboot, running the first error through mcelog --ascii gives:
CPU 3: Machine Check Exception: 4 Bank 5: ba00000000400405
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
mcelog: Unknown Intel CPU type family 6 model 2c
CPU 3 BANK 5
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
Processor context corrupt
MCA: Internal unclassified error: 405
STATUS ba00000000400405 MCGSTATUS 4
The second error gives:
CPU 7: Machine Check Exception: 4 Bank 5: ba00000000400405
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
mcelog: Unknown Intel CPU type family 6 model 2c
CPU 7 BANK 5
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
Processor context corrupt
MCA: Internal unclassified error: 405
STATUS ba00000000400405 MCGSTATUS 4
And the third gives:
CPU 3: Machine Check Exception: 4 Bank 5: ba00000000400405
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
mcelog: Unknown Intel CPU type family 6 model 2c
CPU 3 BANK 5
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
Processor context corrupt
MCA: Internal unclassified error: 405
STATUS ba00000000400405 MCGSTATUS 4
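Since mcelog does not recognize this CPU model, it may help to check its generic decoding by hand. The sketch below, in Python, splits the STATUS and MCGSTATUS values into their architectural bit fields. The bit positions are my reading of the machine check architecture layout in the Intel manuals (VAL 63, OVER 62, UC 61, EN 60, MISCV 59, ADDRV 58, PCC 57; MCIP is bit 2 of MCG_STATUS) and should be treated as an assumption to verify; the function names are hypothetical.

```python
# Assumed architectural flag bits of IA32_MCi_STATUS (verify against the Intel SDM).
MCI_STATUS_FLAGS = {
    "VAL": 63,    # register holds a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting enabled
    "MISCV": 59,  # MCi_MISC register valid
    "ADDRV": 58,  # MCi_ADDR register valid
    "PCC": 57,    # processor context corrupt
}

# Assumed flag bits of IA32_MCG_STATUS.
MCG_STATUS_FLAGS = {"RIPV": 0, "EIPV": 1, "MCIP": 2}


def set_flags(value, flag_bits):
    """Return the names of all flags whose bit is set in value."""
    return [name for name, bit in flag_bits.items() if (value >> bit) & 1]


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error codes."""
    return {
        "flags": set_flags(status, MCI_STATUS_FLAGS),
        "mca_error_code": status & 0xFFFF,              # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


decoded = decode_mci_status(0xBA00000000400405)
print(decoded["flags"])
print(hex(decoded["mca_error_code"]), hex(decoded["model_specific_code"]))
print(set_flags(0x4, MCG_STATUS_FLAGS))
```

Under these assumptions the flags line up with what mcelog printed (uncorrected, enabled, MISC valid, processor context corrupt), and the low 16 bits give the 405 error code. An all-zero status, as reported for bank 8, has the VAL bit clear.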
I have been able to move all workloads onto other servers. As at least two people suggested, I booted from the HP SmartStart CD and ran 100 loops of systems diagnostics and tests, especially for the memory and CPU. No problems were found. I think I will run memtest86 over the weekend.
We have placed a hardware support call in to HP.
Best Regards,
Dave Windsor
Robert Bosch LLC
Team Leader, MES Database Infrastructure Group (AdP/TEF7.1)
4421 Highway 81 North
Anderson, SC 29621 USA
www.bosch.us