On 3/24/2011 1:44 PM, Alain Péan wrote:
> On 24/03/2011 18:30, Dave Windsor wrote:
>> On 3/24/2011 12:37 PM, Alain Péan wrote:
>>> On 24/03/2011 16:03, Windsor Dave L. (AdP/TEF7.1) wrote:
>>>> <snipped>
>>>> Code: 00 00 00 00 00 00 00 00 70 4d 4f 9d 00 81 ff ff 98 e4 4b dc
>>>> RIP [<ffff8100dc435cf0>]
>>>> RSP<ffff81001529fd18>
>>>> CR2: ffff8100dc435cf0
>>>> <0>Kernel panic - not syncing: Fatal exception
>>>>
>>>> <snipped>
>>>> I am trying to determine if this is pointing to a hardware or
>>>> software issue. Some of the Google results suggested using a
>>>> Centosplus kernel - is this a good idea?
>>>>
>>>> The server is a HP DL380 G7 Server with 4 GB RAM (1 DIMM 1333 MHz),
>>>> one 4-core CPU (2133 MHz), 4 built-in Broadcom "NetExtreme II
>>>> BCM5709 II Gigabit Ethernet" NICs, and a P410 Smart Array
>>>> Controller. The P410 and the system BIOS have both been updated to
>>>> the latest levels to see if that fixes the crashes, with no change.
>>>>
>>>> Any idea where I should look next?
>>>>
>>>> Thanks for any help anyone can provide!
>>>>
>>> The fact that it appears after two weeks or so reminds me of a bug I
>>> saw on the linux-poweredge mailing list, the "blocked for more than
>>> 120 seconds" timeout bug.
>>> I don't know if your problem is related, but if it is, you should
>>> see that message in your logs.
>>>
>>> Do you have any high I/O load on your server, at least at some
>>> moments?
>>>
>>> See:
>>> http://lists.us.dell.com/pipermail/linux-poweredge/2011-March/044515.html
>>>
>>> In that case, using a newer kernel would indeed seem to be a good
>>> idea.
>>>
>>> See if it can help...
>>>
>>> Alain
>>>
>> Alain,
>>
>> Currently, there are no high I/O loads. This server was intended to
>> replace two older HP-UX servers. I had just begun to migrate the
>> workload to the new server when the crashes began to occur. There are
>> some minor, sporadic I/O loads but nothing that I would think could
>> trigger the bug discussed in your link. However, I haven't measured the
>> workload closely yet, so there could be spikes.
>>
>> Best Regards,
>>
>> *Dave Windsor*
>
> Your error message, "Kernel panic - not syncing: Fatal exception", is too
> generic to give any clue. Do you see other error messages in your log?
>
> Did you run any hardware test (with Dell you have such utilities on DVD,
> I think they exist also on HP), to see if some hardware is failing, for
> example RAM?
>
> Alain
>

There are no error messages in any logs. For example, in
/var/log/messages, everything looks normal until you see the kernel
restart messages after the reboot, although there seems to be a long gap
in time between the last entry and the time when the system actually
stopped and was restarted. Whatever is happening, the system doesn't seem
to be in a state where the problem can be recorded.

By the way, I forgot to list my kernel version. uname -rmi gives:

    2.6.18-194.32.1.el5 x86_64 x86_64

-- 
Best Regards,

Dave Windsor
Robert Bosch LLC
Team Leader, MES Database Infrastructure Group (AdP/TEF7.1)
4421 Highway 81 North
Anderson, SC 29621
USA
www.bosch.us
Tel: 1 (864) 260-8459
Fax: 1 (864) 260-8422
Dave.Windsor at us.bosch.com
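
For anyone who wants to repeat the two log checks discussed in this thread (searching for the "blocked for more than 120 seconds" message, and measuring the silent gap before the reboot), here is a minimal Python sketch. It is only an illustration, not a tool from the thread: the function names are hypothetical, it assumes the usual syslog "Mon DD HH:MM:SS" stamp at the start of each line, and because those stamps carry no year, the year has to be supplied by the caller.

```python
from datetime import datetime

# Text quoted earlier in the thread for the hung-task warning.
HUNG_TASK_MARKER = "blocked for more than 120 seconds"


def parse_stamp(line, year):
    """Parse a leading 'Mon DD HH:MM:SS' stamp; return None if absent.

    Assumption: classic syslog format with no year, so the caller
    supplies one.
    """
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def scan_messages(lines, year):
    """Return (hung_task_lines, largest_gap).

    largest_gap is (timedelta, line_before, line_after), or None when
    fewer than two timestamped lines were seen.
    """
    hung = []
    largest = None
    prev_time = prev_line = None
    for raw in lines:
        line = raw.rstrip("\n")
        if HUNG_TASK_MARKER in line:
            hung.append(line)
        stamp = parse_stamp(line, year)
        if stamp is None:
            continue  # continuation line or unexpected format
        if prev_time is not None:
            gap = stamp - prev_time
            if largest is None or gap > largest[0]:
                largest = (gap, prev_line, line)
        prev_time, prev_line = stamp, line
    return hung, largest


def scan_file(path, year):
    """Convenience wrapper: scan a log file on disk."""
    with open(path, errors="replace") as fh:
        return scan_messages(fh, year)
```

Calling `scan_file("/var/log/messages", 2011)` as root would return any hung-task lines plus the two entries that bracket the longest silence. A long gap ending at the restart messages suggests the system hung well before it was power-cycled. The sketch does not handle logs that span a year boundary.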