On 4/1/2011 1:44 PM, Windsor Dave L. (AdP/TEF7) wrote:
On 3/24/2011 11:03 AM, Windsor Dave L. (AdP/TEF7.1) wrote:
Hello Everyone,
I recently installed CentOS 5.5 x86_64 on a brand-new ProLiant DL380 G7. I have identical OS software running rock-solid on two other ProLiant DL380 servers, but they are G6 models, not G7. On the G7, the installation went perfectly and the machine ran great for about 2 weeks, and then it just seemed to "stop". The system stopped responding on the network, and there was no video on the console (or on the remote console via iLO). It would not reboot or cold boot through iLO; I actually had to hold the power button to turn it off and then press it again to power up.
OK everyone, here is an update:
The server crashed again overnight. This time, the following error messages were on the console:
HARDWARE ERROR
CPU 3: Machine Check Exception: 4 Bank 5: ba00000000400405
TSC 5172b45d44f0a MISC 80
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
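For anyone who wants to look at the status word in the message above by hand, here is a minimal Python sketch that splits it into its fields. It assumes the architectural IA32_MCi_STATUS bit layout described in the Intel SDM (flag bits 63 to 57, MCA error code in bits 15:0, model-specific code in bits 31:16). The function name decode_mci_status is made up for illustration, and this is not a substitute for running mcelog, which also knows the model-specific meanings.

```python
# Flag bits of IA32_MCi_STATUS, assuming the architectural layout in the Intel SDM.
FLAGS = {
    63: "VAL",    # register contains valid error information
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context may be corrupt
}


def decode_mci_status(status):
    """Split a 64-bit machine-check status word into flags and error codes.

    Hypothetical helper for illustration only.
    """
    flags = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,              # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


# Status word taken from the console message (Bank 5).
decoded = decode_mci_status(0xBA00000000400405)
print(decoded["flags"])                     # ['VAL', 'UC', 'EN', 'MISCV', 'PCC']
print(hex(decoded["mca_error_code"]))       # 0x405
print(hex(decoded["model_specific_code"]))  # 0x40
```

If that layout holds for this CPU, the UC and PCC flags say the error was uncorrected and the processor context may be corrupt, which would be consistent with a hard hang. What the 0x405 code means on this particular processor is something I would leave to mcelog or HP support.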
<snipped>
I have been able to move all workloads onto other servers. As at least two people suggested, I booted from the HP SmartStart CD and ran 100 loops of the system diagnostics and tests, especially for the memory and CPU. No problems were found. I think I will run memtest86 over the weekend.
Best Regards,
Dave Windsor
This is interesting... I tried to load memtest86 from the CentOS 5.5 install DVD, and the system immediately rebooted. I eventually loaded memtest86 from an OpenSUSE 11.4 install DVD I had lying around, and that ran OK.
I ran memtest86+ starting Friday about 6 pm and stopping Monday morning at 10:45 am. Almost 70 full passes were completed, and no errors were found.
Best Regards,
Dave Windsor
Robert Bosch LLC
Team Leader, MES Database Infrastructure Group (AdP/TEF7.1)
4421 Highway 81 North
Anderson, SC 29621 USA