[CentOS] how to debug hardware lockups?

Tue Nov 18 23:56:36 UTC 2008
Ross Walker <rswwalker at gmail.com>

On Nov 18, 2008, at 6:05 PM, Les Mikesell <lesmikesell at gmail.com> wrote:

> nate wrote:
>> Les Mikesell wrote:
>>> Yes, apparently RAM errors can be subtle and only appear when certain
>>> adjacent bit patterns are stored - or when the moon is in a certain
>>> phase or something.
>> Don't forget cosmic rays
>> http://adsabs.harvard.edu/abs/1978ITNS...25.1166P
>
> Yeah, but those don't stop when you replace the faulty RAM...  Mine  
> did, but the errors committed to disk kept randomly re-appearing  
> mysteriously as the reads from the RAID1 alternated afterwards.

Ah, memory-mapped files: another very good reason to use ECC on
large-memory machines. A bit that flips in a dirty page gets written
back to disk as if it were good data.

Also, if you identify bad memory and use software RAID1, it's better to
break the mirror, fsck and fix the remaining half, then rebuild the
mirror from it. RAID1 has no data integrity check, so when the two
copies differ it cannot tell which one is correct.
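
To illustrate why the errors come and go as reads alternate between
the halves, here is a toy sketch in Python. It is not how md is
implemented: the ToyMirror class, its method names, and the strict
round-robin read policy are all assumptions made up for the example
(real md picks the member to read from by its own heuristics).

```python
class ToyMirror:
    """Toy two-way mirror: reads alternate between copies, no checksums."""

    def __init__(self, blocks):
        # Both halves start out identical.
        self.copies = [list(blocks), list(blocks)]
        self._next = 0  # which copy serves the next read

    def read(self, index):
        # Assumed round-robin read balancing, purely for illustration.
        data = self.copies[self._next][index]
        self._next = 1 - self._next
        return data

    def mismatches(self):
        # Comparing the halves shows WHERE they differ,
        # but not which half holds the correct data.
        return [i for i, (a, b) in enumerate(zip(*self.copies)) if a != b]

    def rebuild_from(self, good):
        # Overwrite the other half with the half chosen as good.
        self.copies[1 - good] = list(self.copies[good])


mirror = ToyMirror([b"alpha", b"bravo", b"charlie"])
mirror.copies[1][1] = b"brAvo"  # bad RAM corrupted one half only

reads = [mirror.read(1) for _ in range(4)]
print(reads)                # [b'bravo', b'brAvo', b'bravo', b'brAvo']
print(mirror.mismatches())  # [1]

mirror.rebuild_from(0)      # keep one half, resync the other from it
print(mirror.mismatches())  # []
```

The toy simply assumes copy 0 is the good one. On a real system that
choice is the half you keep, fsck and fix before re-adding the other
disk.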

-Ross