[CentOS] how to debug hardware lockups?

Tue Nov 18 15:53:34 UTC 2008

When we burn in machines to try to find errors, we also go with a run
of a day or two.  The one fun thing we found is that many failures are
temperature related.  A machine would crash in the rack, but once it
was moved to a test bench it would no longer exhibit the issue.  This
is especially true when the production load taxes both the CPU and the
memory at once, while the existing test tools can only really tax one
or the other.  So blocking a bit of the air flow in the lab to heat up
the case, or testing in the same data center environment, helped a
lot.
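
To tax the CPU and the memory at the same time, something along these
lines could be run alongside (or instead of) the usual single-purpose
tools.  This is only a rough sketch, not the tool we actually used: the
function names, the process counts and the 64 MB buffer size are all
made up for illustration, and a Python script will not push the
hardware as hard as a native burn-in program would.

```python
import math
import multiprocessing as mp
import time


def count_primes(limit):
    """Count primes below `limit` by trial division (deliberately CPU-heavy)."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, math.isqrt(n) + 1)):
            count += 1
    return count


def memory_pass(size_bytes, pattern=0xA5):
    """Fill a buffer with `pattern`, read it back, return the mismatch count."""
    buf = bytearray([pattern]) * size_bytes
    return size_bytes - buf.count(pattern)


def cpu_worker(deadline, errors):
    # There are 1229 primes below 10000; any other answer is a failure.
    while time.time() < deadline:
        if count_primes(10000) != 1229:
            with errors.get_lock():
                errors.value += 1


def memory_worker(deadline, size_bytes, errors):
    # Repeatedly allocate, fill and verify a buffer until the deadline.
    while time.time() < deadline:
        if memory_pass(size_bytes) != 0:
            with errors.get_lock():
                errors.value += 1


def run_combined(seconds, cpu_procs=2, mem_procs=2,
                 size_bytes=64 * 1024 * 1024):
    """Run CPU and memory workers side by side; return the error count.

    All defaults are illustrative assumptions; size them to the machine.
    """
    errors = mp.Value("i", 0)  # shared counter, guarded by its own lock
    deadline = time.time() + seconds
    procs = [mp.Process(target=cpu_worker, args=(deadline, errors))
             for _ in range(cpu_procs)]
    procs += [mp.Process(target=memory_worker,
                         args=(deadline, size_bytes, errors))
              for _ in range(mem_procs)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return errors.value
```

Calling run_combined(2 * 24 * 3600) from under an
`if __name__ == "__main__":` guard would give roughly the two day run
described above.  A crash or lockup during the run is of course a
result in itself; the returned count only catches the quieter case
where the machine stays up but computes a wrong answer.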

Most errors show up either in the first 2 minutes of running a memory
test or one of the prime number calculations, or they take a day or
more to show up.


