[CentOS] how to debug hardware lockups?
Les Mikesell
lesmikesell at gmail.com
Tue Nov 18 13:02:55 UTC 2008
Rudi Ahlers wrote:
> On Sun, Nov 16, 2008 at 1:14 AM, John R Pierce <pierce at hogranch.com> wrote:
>> Rudi Ahlers wrote:
>>> Well, on a standard CentOS 5.2, /var/log/messages will be the
>>> place to log problems like this, or where else can I get more info?
>>>
>> tough to write to the disk when the kernel is crashing. ditto the network.
>> that leaves VGAs and serial ports, which can be written to by self
>> contained emergency-crash routines...
>>
>> IIRC, you said this was a Q9something quad core... that's a desktop
>> processor... does this server have ECC memory? (I ask, because few desktop
>> platforms do, while ECC is fairly standard on servers). Without ECC, the
>> system has no way of knowing it read in bad data from the ram, and if the
>> bad data happens to be code and that code happens to be in the kernel,
>> ka-RASH, without any detection or warning, it leaps off into never-land, and
>> you get a kernel fault, almost always resulting in...
>>
>> kernel panic
>> system halted
>>
>> with no additional useful information available. with ECC memory, single
>> bit errors get corrected on the fly, and log an ECC error event, while
>> double bit errors result in a system halt with a message indicating such.
>>
>>
>
>
> No, the motherboard doesn't support ECC RAM. The motherboard is an
> Intel DG35EC - http://www.intel.com/products/desktop/motherboards/DG35EC/DG35EC-overview.htm
I had a machine that would crash about once every week or two in normal
operation. Memtest86+ found an error on the second day of running. The
worst part was that it left the RAID mirrors in a strange state that
caused occasional problems for months, even after replacing the RAM.
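
For anyone wondering why ECC can correct a single flipped bit but only
detect a double one, as John describes in the quoted text above, here is a
minimal sketch in Python. It is a toy: it protects 4 data bits with a
Hamming code plus one overall parity bit (SECDED), whereas real ECC DIMMs
typically protect 64 data bits with 8 check bits in hardware. The names
encode() and decode() are made up for this illustration and do not
correspond to any kernel or library API.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    # Index 0 holds the overall parity bit; indices 1..7 hold the
    # Hamming(7,4) code with check bits at positions 1, 2 and 4.
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    # XOR of the positions of all set bits is 0 for a valid codeword,
    # otherwise it is the position of a single flipped bit.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    parity_broken = sum(code) % 2 == 1

    if syndrome == 0 and not parity_broken:
        status = "ok"
    elif parity_broken:
        # Odd number of flips: assume exactly one and repair it.
        # syndrome == 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome but overall parity intact: two bits flipped.
        # This is the case where a real machine halts with an ECC message.
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

Flipping any one bit of encode(n) and passing the result to decode() gives
back n with status 'corrected'; flipping any two bits gives
'uncorrectable'. Without the check bits, neither case is visible to the
system at all, which is why bad non-ECC RAM tends to show up only as
unexplained crashes.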
--
Les Mikesell
lesmikesell at gmail.com