[CentOS] ECC memory errors

Mon Apr 29 11:59:44 UTC 2013
mark <m.roth at 5-cent.us>

On 04/29/13 04:17, Peter Peltonen wrote:
> I started to receive this kind of messages a few days ago on one of my
> servers:
>
> Message from syslogd@ at Mon Apr 29 08:02:55 2013 ...
> server1 kernel: EDAC MC0: UE row 0, channel-a= 0 channel-b= 1 labels "-":
> (Branch=0 DRAM-Bank=0 RDWR=Read RAS=0 CAS=0, UE Err=0x2 (Aliased
> Uncorrectable Non-Mirrored Demand Data ECC))
>
> I've never had ECC memory fail on me before, so now I am wondering the
> following:
>
> * The server is running CentOS 5.7 and is acting as Xen dom0. Is there any
> possibility this could be a kernel issue and upgrading would help, or would
> upgrading at this point just cause more trouble?

Not in my experience.
>
> * Is there now a possibility that my data could become corrupted: should I
> shut down the server as soon as possible or can I keep running until I
> replace the memory?

Maybe; I'm just not sure. You need to replace the memory ASAP: order 
it, and schedule a maintenance window with all your users *now*.
>
> * This server has been running for several years in a datacenter without
> problems: what are your experiences, are these kinds of problems most likely
> caused by a failing motherboard or by the memory?

DIMM went bad. No big thing. Your only problem may be to identify which 
one, he says, about to go into work to do just that.
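A minimal Python 3 sketch for narrowing it down, assuming the kernel exposes the legacy EDAC sysfs counters under /sys/devices/system/edac/mc/mcN/csrowN/ (the ue_count and ce_count files). The helper names below are hypothetical and not from the original post. The "UE row 0" in the quoted log should correspond to csrow0 under mc0, but how a csrow maps to a physical slot depends on the board (the log's labels "-" means no DIMM labels were set), so check the motherboard manual before pulling anything.

```python
import glob
import os

# Assumed legacy EDAC sysfs layout: mcN/csrowN/{ue_count,ce_count}
EDAC_MC_ROOT = "/sys/devices/system/edac/mc"


def read_int(path):
    """Return the integer stored in a sysfs attribute file, or None if unreadable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def edac_error_counts(root=EDAC_MC_ROOT):
    """Map (controller, csrow) -> (ue_count, ce_count) for every csrow found."""
    counts = {}
    for row_dir in sorted(glob.glob(os.path.join(root, "mc*", "csrow*"))):
        mc = os.path.basename(os.path.dirname(row_dir))
        row = os.path.basename(row_dir)
        ue = read_int(os.path.join(row_dir, "ue_count"))
        ce = read_int(os.path.join(row_dir, "ce_count"))
        counts[(mc, row)] = (ue, ce)
    return counts


if __name__ == "__main__":
    for (mc, row), (ue, ce) in edac_error_counts().items():
        # A non-zero uncorrectable count marks the row holding the suspect DIMM(s)
        flag = "  <-- check this row" if ue else ""
        print("%s/%s: UE=%s CE=%s%s" % (mc, row, ue, ce, flag))
```

Run it as root on the dom0; rows showing UE=None simply lack that counter file on this kernel.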

	mark


-- 
"Stock traders are a superstitious and cowardly lot", to paraphrase the 
Batman