[CentOS] how to debug hardware lockups?

Tue Nov 18 13:02:55 UTC 2008

Rudi Ahlers wrote:
> On Sun, Nov 16, 2008 at 1:14 AM, John R Pierce <pierce at hogranch.com> wrote:
>> Rudi Ahlers wrote:
>>> Well, on a standard CentOS 5.2, /var/log/messages will be the
>>> place to log problems like this, or where else can I get more info?
>>>
>> tough to write to the disk when the kernel is crashing.  ditto the network.
>>   that leaves VGAs and serial ports, which can be written to by self
>> contained emergency-crash routines...
>>
>> IIRC, you said this was a Q9something quad core... that's a desktop
>> processor... does this server have ECC memory?  (I ask, because few desktop
>> platforms do, while ECC is fairly standard on servers).    Without ECC, the
>> system has no way of knowing it read in bad data from the ram, and if the
>> bad data happens to be code and that code happens to be in the kernel,
>> ka-RASH, without any detection or warning, it leaps off into never-land, and
>> you get a kernel fault, almost always resulting in...
>>
>>   kernel panic
>>   system halted
>>
>> with no additional useful information available.     with ECC memory, single
>> bit errors get corrected on the fly, and log an ECC error event, while
>> double bit errors result in a system halt with a message indicating such.
>>
>>
> 
> 
> No, the motherboard doesn't support ECC RAM. The motherboard is an
> Intel DG35EC - http://www.intel.com/products/desktop/motherboards/DG35EC/DG35EC-overview.htm

I had a machine that would crash about once every week or two in normal 
operation. Memtest86+ found an error on the 2nd day of running.  The 
worst part was that it left the raid mirrors in a strange state that 
caused occasional problems for months even after replacing the RAM.
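
For anyone wondering what John's "single bit errors get corrected, double
bit errors result in a halt" means in practice, here is a toy sketch in
Python. It is only an illustration, under my own assumptions: it protects
4 data bits with a Hamming(7,4) code plus one overall parity bit, whereas
real ECC DIMMs typically protect 64 data bits with 8 check bits, and the
memory controller does the work in hardware. The function names encode()
and decode() are made up for this example.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0 = overall parity, indices 1..7 = Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # makes the total number of 1s even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    # XOR of the positions of all set bits; 0 for a valid Hamming codeword.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    parity_bad = sum(code) % 2 == 1

    if syndrome == 0 and not parity_bad:
        status = "ok"
    elif parity_bad:
        # Odd number of flips, assumed to be one. The syndrome is the
        # position of the bad bit (0 means the overall parity bit itself).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Parity looks fine but the syndrome is non-zero: two bits flipped.
        # This is the case where a real system would halt.
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    word[6] ^= 1            # one flipped bit
    print(decode(word))
    word[2] ^= 1            # a second flipped bit
    print(decode(word))
```

Without the check bits (the non-ECC case), the flipped nibble would simply
be handed to the CPU as if it were good data, which is the silent-crash
scenario John describes.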

-- 
   Les Mikesell
     lesmikesell at gmail.com