[CentOS] Kernel Panic on HP/Compaq ProLiant G7

Thu Mar 24 18:01:26 UTC 2011
Windsor Dave L. (AdP/TEF7) <Dave.Windsor at us.bosch.com>

On 3/24/2011 1:44 PM, Alain Péan wrote:
> Le 24/03/2011 18:30, Dave Windsor a écrit :
>> On 3/24/2011 12:37 PM, Alain Péan wrote:
>>> Le 24/03/2011 16:03, Windsor Dave L. (AdP/TEF7.1) a écrit :
>>>> <snipped>
>>>> Code: 00 00 00 00 00 00 00 00 70 4d 4f 9d 00 81 ff ff 98 e4 4b dc
>>>> RIP  [<ffff8100dc435cf0>]
>>>>     RSP<ffff81001529fd18>
>>>> CR2: ffff8100dc435cf0
>>>>     <0>Kernel panic - not syncing: Fatal exception
>>>>
>>>> <snipped>
>>>> I am trying to determine if this is pointing to a hardware or software issue.  Some of the Google results suggested using a CentOSPlus kernel - is this a good idea?
>>>>
>>>> The server is an HP DL380 G7 with 4 GB RAM (1 DIMM, 1333 MHz), one 4-core CPU (2133 MHz), 4 built-in Broadcom "NetXtreme II BCM5709 Gigabit Ethernet" NICs, and a P410 Smart Array Controller.  Both the P410 firmware and the system BIOS have been updated to the latest levels, but the crashes continue.
>>>>
>>>> Any idea where I should look next?
>>>>
>>>> Thanks for any help anyone can provide!
>>>>
>>> The fact that it appears after two weeks or so reminds me of a bug I
>>> saw on the Linux PowerEdge mailing list, the "blocked for more than 120
>>> seconds" timeout bug.
>>> I don't know if your problem is related, but if it is the case you
>>> should see the message in your logs.
>>>
>>> Do you have any high I/O load, at least at some moments, on your server?
>>>
>>> See:
>>> http://lists.us.dell.com/pipermail/linux-poweredge/2011-March/044515.html
>>>
>>> In that case, moving to a newer kernel does indeed seem like a good idea.
>>>
>>> See if it can help...
>>>
>>> Alain
>> Alain,
>>
>> Today, there are not high I/O loads.  This server was intended to
>> replace two older HP-UX servers.  I had just begun to migrate the
>> workload to the new server when the crashes began to occur.  There are
>> some minor, sporadic I/O loads but nothing that I would think could
>> trigger the bug discussed in your link.  However, I haven't measured the
>> workload closely yet, so there could be spikes.
>>
>> Best Regards,
>>
>> *Dave Windsor*
>
> Your error message, "Kernel panic - not syncing: Fatal exception", is
> too generic to give any clue. Do you see other error messages in your
> logs?
>
> Did you run any hardware tests (Dell ships such utilities on DVD, and I
> think HP has them too) to see whether some hardware is failing, for
> example RAM?
>
> Alain
>

There are no error messages in any logs.  For example, in
/var/log/messages, everything looks normal until you see the kernel
restart messages after the reboot, although there seems to be a long gap
in time between the last entry and the time when the system actually
stopped and was restarted.  Whatever is happening, the system doesn't
seem to be in a state where the problem can be recorded.
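
One way to put a number on that gap is to scan the log for long silences between consecutive entries. The following is only a rough sketch, not something that was run on this server: it assumes the classic syslog "Mon DD HH:MM:SS" prefix on each line, and the function name find_gaps, the 30-minute threshold, and the year argument are all made up for illustration (syslog lines carry no year, so one has to be supplied).

```python
from datetime import datetime, timedelta


def find_gaps(lines, min_gap=timedelta(minutes=30), year=2011):
    """Return (previous_time, next_time, gap) for each silence >= min_gap.

    Assumes every log entry starts with a 15-character syslog timestamp
    such as "Mar 24 18:01:26". Lines that do not parse are skipped.
    """
    gaps = []
    prev = None
    for line in lines:
        stamp = line[:15]
        try:
            # The year is an assumption passed in by the caller; the
            # log line itself does not contain one.
            ts = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # continuation line or non-syslog text
        if prev is not None and ts - prev >= min_gap:
            gaps.append((prev, ts, ts - prev))
        prev = ts
    return gaps
```

Passing the open file to it, for example find_gaps(open("/var/log/messages")), and printing the returned tuples would list each silence and its length. A log that spans a year boundary would need extra handling, which this sketch leaves out.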

By the way, I forgot to list my kernel version.  uname -rmi gives:
2.6.18-194.32.1.el5 x86_64 x86_64

--
Best Regards,

Dave Windsor

Robert Bosch LLC
Team Leader, MES Database Infrastructure Group (AdP/TEF7.1)
4421 Highway 81 North
Anderson, SC 29621 USA
www.bosch.us

Tel: 1 (864) 260-8459
Fax: 1 (864) 260-8422
Dave.Windsor at us.bosch.com