 
On 3/24/2011 1:44 PM, Alain Péan wrote:
On 24/03/2011 18:30, Dave Windsor wrote:
On 3/24/2011 12:37 PM, Alain Péan wrote:
On 24/03/2011 16:03, Windsor Dave L. (AdP/TEF7.1) wrote:
<snipped>
Code: 00 00 00 00 00 00 00 00 70 4d 4f 9d 00 81 ff ff 98 e4 4b dc
RIP [<ffff8100dc435cf0>]
RSP <ffff81001529fd18>
CR2: ffff8100dc435cf0
<0>Kernel panic - not syncing: Fatal exception
<snipped>
I am trying to determine if this is pointing to a hardware or software issue. Some of the Google results suggested using a Centosplus kernel - is this a good idea?
The server is an HP DL380 G7 Server with 4 GB RAM (1 DIMM 1333 MHz), one 4-core CPU (2133 MHz), 4 built-in Broadcom "NetXtreme II BCM5709 II Gigabit Ethernet" NICs, and a P410 Smart Array Controller. The P410 and the system BIOS have both been updated to the latest levels to see if that fixes the crashes, with no change.
Any idea where I should look next?
Thanks for any help anyone can provide!
The fact that it appears after two weeks or so reminds me of a bug I saw on the linux-poweredge mailing list, the "blocked for more than 120 seconds" timeout bug. I don't know if your problem is related, but if it is, you should see the message in your logs.
Do you have any high I/O load on your server, at least at some moments?
See: http://lists.us.dell.com/pipermail/linux-poweredge/2011-March/044515.html
In that case, using a newer kernel does indeed seem like a good idea.
See if it can help...
Alain
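A minimal sketch of the log check suggested above, assuming the kernel messages end up in /var/log/messages (the file mentioned later in this thread). The function name count_hung_tasks is hypothetical, chosen only for illustration; the search string is the one quoted in the message above.

```shell
# Hypothetical helper: count "blocked for more than 120 seconds" warnings
# in a log file. The default path is an assumption; pass another file
# (for example a rotated log) as the first argument.
count_hung_tasks() {
    log_file="${1:-/var/log/messages}"
    # grep -c prints the number of matching lines (0 if there are none)
    grep -c 'blocked for more than 120 seconds' "$log_file"
}

# Usage (run as a user allowed to read the log):
#   count_hung_tasks /var/log/messages
```

A count of 0 would suggest that this particular bug is not being logged before the crash, although it does not rule out other causes.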
Alain,
Today, there are no high I/O loads. This server was intended to replace two older HP-UX servers. I had just begun to migrate the workload to the new server when the crashes began to occur. There are some minor, sporadic I/O loads but nothing that I would think could trigger the bug discussed in your link. However, I haven't measured the workload closely yet, so there could be spikes.
Best Regards,
*Dave Windsor*
Your error message, "Kernel panic - not syncing: Fatal exception", is too generic to give any clue. Do you see other error messages in your logs?
Did you run any hardware tests to see if some hardware is failing, for example the RAM? With Dell you have such utilities on DVD, and I think they also exist for HP.
Alain
There are no error messages in any logs. For example, in /var/log/messages, everything looks normal until you see the kernel restart messages after the reboot, although there seems to be a long gap in time between the last entry and the time when the system actually stopped and was restarted. Whatever is happening, the system doesn't seem to be in a state where the problem can be recorded.
By the way, I forgot to list my kernel version. uname -rmi gives: 2.6.18-194.32.1.el5 x86_64 x86_64
-- Best Regards,
Dave Windsor
Robert Bosch LLC Team Leader, MES Database Infrastructure Group (AdP/TEF7.1) 4421 Highway 81 North Anderson, SC 29621 USA www.bosch.us
Tel: 1 (864) 260-8459 Fax: 1 (864) 260-8422 Dave.Windsor@us.bosch.com