[CentOS] how to debug hardware lockups?

Rob Lines rlinesseagate at gmail.com
Tue Nov 18 15:53:34 UTC 2008


On Tue, Nov 18, 2008 at 9:47 AM, Les Mikesell <lesmikesell at gmail.com> wrote:
>> Did you leave memtest86+ running for 2 days? I thought 1 or 2 cycles
>> would be good enough?
>>
>> I'm hoping to pick-up the server in the next 2 hours then I can see
>> what happens when I run memtest86+ or other tests
>
> Yes, apparently RAM errors can be subtle and only appear when certain
> adjacent bit patterns are stored - or when the moon is in a certain phase or
> something.
>
> --
>  Les Mikesell
>   lesmikesell at gmail.com

When we burn in machines to find errors, we also go with a run of a
day or two.  The one fun thing we found was that the failures were
often temperature related.  A machine would crash in the rack, but
once it was moved to a test bench it would no longer exhibit the
issue.  This is especially true when the production load taxes both
the CPU and the memory, while the existing test tools can only really
tax one or the other at a time.  So blocking a bit of the air flow in
the lab to heat up the case, or testing in the same data center
environment, helped a lot.
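
To illustrate the idea of taxing the CPU and the memory at the same
time, here is a minimal sketch in Python.  It is not one of the tools
we used and it is no replacement for memtest86+: it runs in userspace,
cannot choose physical addresses, and only touches the memory the OS
hands it.  All names (count_primes, memory_pattern_pass, burn_in) and
all sizes are made up for the example.

```python
import math
import multiprocessing as mp
import time


def count_primes(limit):
    """CPU load: count the primes below limit by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, math.isqrt(n) + 1)):
            count += 1
    return count


def memory_pattern_pass(size_bytes, pattern):
    """Memory load: fill a buffer with one byte pattern, read it back,
    and return how many bytes no longer hold that pattern."""
    buf = bytearray([pattern]) * size_bytes
    return size_bytes - buf.count(pattern)


def cpu_worker(deadline):
    # Keep one core busy with prime counting until the deadline.
    while time.time() < deadline:
        count_primes(20000)


def memory_worker(deadline, size_bytes, errors):
    # Cycle through a few classic bit patterns until the deadline.
    # Only this one process writes to errors, so no extra locking.
    while time.time() < deadline:
        for pattern in (0x00, 0xFF, 0x55, 0xAA):
            errors.value += memory_pattern_pass(size_bytes, pattern)


def burn_in(seconds, cpu_workers, mem_bytes):
    """Run the CPU and memory workers side by side for the given time
    and return the number of mismatched bytes that were seen."""
    deadline = time.time() + seconds
    errors = mp.Value("q", 0)
    procs = [mp.Process(target=cpu_worker, args=(deadline,))
             for _ in range(cpu_workers)]
    procs.append(mp.Process(target=memory_worker,
                            args=(deadline, mem_bytes, errors)))
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return errors.value
```

Something like burn_in(2 * 24 * 3600, 4, 1 << 30) would approximate a
two-day run with four CPU workers and a 1 GiB buffer (illustrative
values only; call it from under an if __name__ == "__main__": guard).
A hard lockup will of course never return a count, so the useful
signal is often just whether the box is still up afterwards.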

Most errors show up either in the first 2 minutes of running a memory
test or one of the prime number calculations, or they take a day or a
few to show up.

Rob


