On 3/24/2011 11:03 AM, Windsor Dave L. (AdP/TEF7.1) wrote:
> Hello Everyone,
>
> I recently installed CentOS 5.5 x86_64 on a brand new ProLiant DL380 G7. I have identical OS software running rock-solid on two other DL380 ProLiant servers, but they are G6 models, not G7. On the G7, the installation went perfectly and the machine ran great for about 2 weeks, when it just seemed to "stop". The system stopped responding on the network, and there was no video on the console (or remote console via iLO). It would not reboot or cold boot through iLO; I actually had to hold the power button to turn it off and then hit it again to power up.
>
> This happened several times within a few days of each other. Each time, there was no evidence in any logs of a problem - the system just seemed to stop or lock up. We did have a CPU problem light appear on the front, so HP came in and replaced the one 4-core CPU. Since then, it has run as long as two weeks, but still crashes randomly. After the last reboot, I left the console in text mode on vt1, and when it crashed again this morning this was displayed on the screen:
>
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: ffff8100dc435cf0 CR3: 000000008a6ca000 CR4: 00000000000006e0
> Process smbd (pid: 18970, threadinfo ffff81001529e000, task ffff81011f5347a0)
<snipped>
> <0>Kernel panic - not syncing: Fatal exception

OK everyone, here is an update:

The server crashed again overnight. This time, the following error messages were on the console:

HARDWARE ERROR
CPU 3: Machine Check Exception: 4 Bank 5: ba00000000400405
TSC 5172b45d44f0a MISC 80
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
HARDWARE ERROR
CPU 7: Machine Check Exception: 4 Bank 5: ba00000000400405
TSC 5172b45d45bba MISC 80
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
HARDWARE ERROR
CPU 5: Machine Check Exception: 4 Bank 8: 0000000000000000
TSC 0
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
Kernel panic - not syncing: Uncorrected machine check

After reboot, running the first error through mcelog --ascii gives

CPU 3: Machine Check Exception: 4 Bank 5: ba00000000400405
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
mcelog: Unknown Intel CPU type family 6 model 2c
CPU 3 BANK 5
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
Processor context corrupt
MCA: Internal unclassified error: 405
STATUS ba00000000400405 MCGSTATUS 4

The second error gives

CPU 7: Machine Check Exception: 4 Bank 5: ba00000000400405
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
mcelog: Unknown Intel CPU type family 6 model 2c
CPU 7 BANK 5
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
Processor context corrupt
MCA: Internal unclassified error: 405
STATUS ba00000000400405 MCGSTATUS 4

And the third gives

CPU 3: Machine Check Exception: 4 Bank 5: ba00000000400405
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
mcelog: Unknown Intel CPU type family 6 model 2c
CPU 3 BANK 5
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
Processor context corrupt
MCA: Internal unclassified error: 405
STATUS ba00000000400405 MCGSTATUS 4

I have been able to move all workloads onto other servers. As at least two people suggested, I booted from the HP SmartStart CD and ran 100 loops of system diagnostics and tests, especially for the memory and CPU. No problems were found. I think I will run memtest86 over the weekend.

We have placed a hardware support call in to HP.
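
As a cross-check on the mcelog decode above, the STATUS and MCGSTATUS values can also be unpacked by hand. What follows is only a minimal Python sketch, assuming the architectural bit layout of the IA32_MCi_STATUS and IA32_MCG_STATUS registers as documented in Intel's Software Developer's Manual. The function and table names are made up for illustration and are not part of mcelog, and the model-specific bits are extracted but not interpreted.

```python
# Hypothetical helpers (not part of mcelog): unpack the architectural bits of
# the machine-check status values printed on the console.

# Flag bits in IA32_MCi_STATUS as (bit position, short name).
MCI_STATUS_FLAGS = [
    (63, "VAL"),    # register contains valid data
    (62, "OVER"),   # an earlier error was overwritten
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error reporting enabled
    (59, "MISCV"),  # MCi_MISC register valid
    (58, "ADDRV"),  # MCi_ADDR register valid
    (57, "PCC"),    # processor context corrupt
]

# Flag bits in IA32_MCG_STATUS.
MCG_STATUS_FLAGS = [
    (0, "RIPV"),    # restart IP valid
    (1, "EIPV"),    # error IP valid
    (2, "MCIP"),    # machine check in progress
]


def set_flags(value, table):
    """Return the names of all flags in `table` whose bit is set in `value`."""
    return [name for bit, name in table if (value >> bit) & 1]


def decode_mci_status(status):
    """Split a 64-bit bank status into flags and the two 16-bit error codes."""
    return {
        "flags": set_flags(status, MCI_STATUS_FLAGS),
        "mca_error_code": status & 0xFFFF,               # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


def decode_mcg_status(status):
    """Return the flag names set in the global machine-check status."""
    return set_flags(status, MCG_STATUS_FLAGS)


if __name__ == "__main__":
    bank5 = decode_mci_status(0xBA00000000400405)
    print(bank5["flags"])
    print(hex(bank5["mca_error_code"]))
    print(decode_mcg_status(0x4))
```

For the Bank 5 value this reports VAL, UC, EN, MISCV and PCC set with MCA error code 0x405, and MCIP for MCGSTATUS 4, which agrees with the mcelog text (uncorrected error, error enabled, MCi_MISC register valid, processor context corrupt).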

Best Regards,

Dave Windsor
Robert Bosch LLC
Team Leader, MES Database Infrastructure Group (AdP/TEF7.1)
4421 Highway 81 North
Anderson, SC 29621
USA
www.bosch.us