[CentOS] kernel: Machine check events logged

Wed Jul 7 14:38:04 UTC 2010
m.roth at 5-cent.us <m.roth at 5-cent.us>

Alexander Farber wrote:
> I've only found this Solaris blog, but don't understand it well enough:
> http://blogs.sun.com/gavinm/entry/amd_opteron_athlon64_turion64_fault
> I can't provide more details, because my dedicated server
> has been under the hoster's "hardware tests" for 5 hours :-(
> (and I guess everyone will run home for the Germany-Spain game soon)

First, that's Solaris (or OpenSolaris), so it's not the same. Second,
you'll notice that the diagram and the table do *not* mention L3 caches,
so the architecture's a bit different.

Finally, note where the article says, "If an error is recoverable then it
does not raise a Machine Check Exception (MCE or mc#) when detected. The
recoverable errors, broadly speaking, are single-bit ECC errors from
ECC-protected arrays and parity errors on clean parity- <snip>
If an error is irrecoverable then detection of that error will raise a
machine check exception (if the bit that controls mc# for that error type
is set; if not you'll either never know or you pick it up by polling). The
mc# handler can extract information about the error from the machine check
architecture registers as before, but has the additional responsibility of
deciding what further actions (which may include panic and reboot) are
required. A machine check exception is a form of interrupt which allows
immediate notification of an error condition - you can't afford to wait to
poll for the error since that could result in the use of bad data and
associated data corruption.
--- end excerpt ---

So it is, in fact, serious and non-recoverable, which means they have a
problem with their hardware. You've paid for a service that they provide,
including hardware that's supposed to be up 99.<whatever you paid for>% of
the time. If they don't get it back up, there should be penalties against
them, or at least money rebates to *you*.

There may also be downtime limits in the contract; if they exceed those,
they've broken the contract and are liable.

> Regards
> Alex
>>> > MCE 0
>>> > HARDWARE ERROR. This is *NOT* a software problem!
>>> > Please contact your hardware vendor
>>> > CPU 0 4 northbridge TSC 111a60c5584d4 [at 2500 Mhz 1 days 9:25:51
>>> > uptime (unreliable)]
>>> > MISC c008000001000000 ADDR 1148f5940
>>> >   Northbridge NB Array Error
>>> >        bit35 = err cpu3
>>> >        bit42 = L3 subcache in error bit 0
>>> >        bit43 = L3 subcache in error bit 1
>>> >        bit46 = corrected ecc error
>>> >        bit59 = misc error valid
>>> >   memory/cache error 'generic read mem transaction, generic
>>> > transaction, level generic'
>>> > STATUS 9c1f4cf8001c011b MCGSTATUS 0
>>> > No DIMM found for 1148f5940 in SMBIOS
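
As a rough cross-check of the quoted log, here is a minimal Python sketch that
decodes the flag bits of the STATUS value shown above. It is not part of
mcelog; the names STATUS_FLAGS, LOGGED_BITS and decode_status are made up for
illustration. The positions of VAL, OVER, UC, EN, MISCV, ADDRV and PCC are the
standard x86 machine check architecture ones; the other bit numbers are
simply the ones mcelog printed in the log.

```python
# Architectural flag bits of an x86 MCi_STATUS register (bit -> name).
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error record
    62: "OVER",   # an earlier error record was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting was enabled for this error type
    59: "MISCV",  # MISC register holds valid data
    58: "ADDRV",  # ADDR register holds valid data
    57: "PCC",    # processor context may be corrupt
}

# Bit numbers that mcelog named in the quoted output (copied from the log,
# not taken from a vendor manual).
LOGGED_BITS = (35, 42, 43, 46, 59)


def decode_status(status: int) -> dict:
    """Return {flag name: bool} for the architectural bits of `status`."""
    return {name: bool((status >> bit) & 1) for bit, name in STATUS_FLAGS.items()}


if __name__ == "__main__":
    status = 0x9C1F4CF8001C011B  # STATUS value from the quoted log

    for name, value in decode_status(status).items():
        print(f"{name:5} = {int(value)}")

    # Confirm that every bit mcelog reported is really set in STATUS.
    for bit in LOGGED_BITS:
        print(f"bit{bit} set: {bool((status >> bit) & 1)}")
```

For the value in the log this prints VAL, EN, MISCV and ADDRV as 1 and OVER,
UC and PCC as 0, and all five bits that mcelog listed come out as set.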