Alexander Farber wrote:
> I've only found this Solaris blog, but don't understand it well enough:
> http://blogs.sun.com/gavinm/entry/amd_opteron_athlon64_turion64_fault
>
> Can't provide you more details, because my dedicated server
> is under hoster's "hardware tests" since 5 hours :-(
> (and I guess everyone will run home for the Germany-Spain game soon)
>

First, that's solaris (or opensolaris), so it's not the same. Second,
you'll notice that the diagram and the table do *not* mention L3 caches,
so the architecture's a bit different. Finally, note where the article
says,

"If an error is recoverable then it does not raise a Machine Check
Exception (MCE or mc#) when detected. The recoverable errors, broadly
speaking, are single-bit ECC errors from ECC-protected arrays and parity
errors on clean parity-
<snip>
If an error is irrecoverable then detection of that error will raise a
machine check exception (if the bit that controls mc# for that error type
is set; if not you'll either never know or you pick it up by polling).
The mc# handler can extract information about the error from the machine
check architecture registers as before, but has the additional
responsibility of deciding what further actions (which may include panic
and reboot) are required. A machine check exception is a form of
interrupt which allows immediate notification of an error condition -
you can't afford to wait to poll for the error since that could result
in the use of bad data and associated data corruption.
--- end excerpt ---

So, it is, in fact, serious, and non-recoverable, so they have a problem
with their hardware, and you've paid for a service that they provide,
including hardware that's supposed to be up 99.<whatever you paid for>%
of the time. If they don't get it up, there should be penalties against
them, or at least money rebates to *you*. There may also be limits that
would mean they've broken the contract, and are liable.

    mark

> Regards
> Alex
>
>>> > MCE 0
>>> > HARDWARE ERROR. This is *NOT* a software problem!
>>> > Please contact your hardware vendor
>>> > CPU 0 4 northbridge TSC 111a60c5584d4 [at 2500 Mhz 1 days 9:25:51
>>> > uptime (unreliable)]
>>> > MISC c008000001000000 ADDR 1148f5940
>>> > Northbridge NB Array Error
>>> > bit35 = err cpu3
>>> > bit42 = L3 subcache in error bit 0
>>> > bit43 = L3 subcache in error bit 1
>>> > bit46 = corrected ecc error
>>> > bit59 = misc error valid
>>> > memory/cache error 'generic read mem transaction, generic
>>> > transaction, level generic'
>>> > STATUS 9c1f4cf8001c011b MCGSTATUS 0
>>> > No DIMM found for 1148f5940 in SMBIOS
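
For anyone who wants to cross-check the quoted report: the "bitNN" lines appear to be the set bits of the STATUS value printed near the end of the log. Below is a minimal Python sketch of that check. It assumes only the five labels the log itself prints; the names LOGGED_BITS and decode_status are made up for illustration and are not part of mcelog.

```python
# Bit labels copied verbatim from the quoted mcelog output.
# Nothing else about the register layout is assumed here.
LOGGED_BITS = {
    35: "err cpu3",
    42: "L3 subcache in error bit 0",
    43: "L3 subcache in error bit 1",
    46: "corrected ecc error",
    59: "misc error valid",
}


def decode_status(status):
    """Return {bit: label} for every labelled bit that is set in status."""
    return {
        bit: label
        for bit, label in LOGGED_BITS.items()
        if (status >> bit) & 1
    }


# STATUS value as printed in the log (hexadecimal).
STATUS = 0x9c1f4cf8001c011b

decoded = decode_status(STATUS)
for bit, label in sorted(decoded.items()):
    print(f"bit{bit} = {label}")
```

Running it against the STATUS value from the log lists the same five bits that mcelog reported, so the decoded lines and the raw register agree with each other.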