Hello community.
We are running CentOS 4.8 on a SuperMicro SYS-6026T-3RF with 2x Intel Xeon E5630 and 8x Kingston KVR1333D3D4R9S/4G.
For some time we have been getting lots of MCEs in mcelog and we can't find the reason. An "ordinary" MCE message looks like:
CPU 51 BANK 8 TSC 8511e3ca77dc
MISC 274d587f00006141 ADDR 807044840
STATUS cc0055000001009f MCGSTATUS 0
Decoded with mcelog --ascii --cpu p4 (because there is no xeon56xx in the list):
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 53 BANK 8 TSC 1982d8f72b1f
MISC e1742eac00006242 ADDR 7ffd78a80
MCG status:
MCi status:
Error overflow
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
Transaction: Memory read error
STATUS cc0002000001009f MCGSTATUS 0
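For what it's worth, the interesting bits of that STATUS value can be checked by hand. This is only a rough sketch, assuming the standard Intel MCi_STATUS layout (valid/overflow/MISCV/ADDRV in the top bits, the MCA error code in the low 16 bits) and a 64-bit shell; it is not a replacement for a current mcelog:

# decode a few architectural MCi_STATUS bits from one of the records above
s=0xcc0002000001009f
for f in 63:VAL 62:OVER 61:UC 59:MISCV 58:ADDRV; do
    printf '%s=%d ' "${f#*:}" $(( (s >> ${f%%:*}) & 1 ))
done
printf '\nMCA error code=0x%04x\n' $(( s & 0xffff ))

That prints VAL=1 OVER=1 UC=0 MISCV=1 ADDRV=1 and error code 0x009f, which lines up with what mcelog printed above (error overflow, MISC/ADDR valid, memory read error).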
The main question: is it possible to find out the exact piece of hardware that causes those messages? At first we thought that, according to
/* A machine check record */
struct mce {
        __u64 status;      /* bank status register */
        __u64 misc;        /* misc register (always 0 right now) */
        __u64 addr;        /* address or 0 */
        __u64 mcgstatus;   /* global MC status register */
        __u64 rip;         /* Program counter or 0 for silent error */
        __u64 tsc;         /* cpu time stamp counter */
        __u64 res1;        /* for future extension */
        __u64 res2;        /* dito. */
        __u8  cs;          /* code segment */
        __u8  bank;        /* machine check bank */
        __u8  cpu;         /* cpu that raised the error */
        __u8  finished;    /* entry is valid */
        __u32 pad;
};
the cpu field is the CPU that raised the exception. But we have two quad-core CPUs with HT, so the maximum CPU number should be 16, and in the logs we see 53 etc. So now we are not sure what the cpu value really is :) Does anyone know what the CPU number means exactly?
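One way to see how this particular kernel numbers the logical CPUs is /proc/cpuinfo; a quick check (assuming the usual x86_64 fields are present on this kernel version; fields that are missing simply won't show up):

# show logical CPU number, physical socket, core and APIC ID for each CPU
grep -E '^(processor|physical id|core id|apicid)' /proc/cpuinfo

If the CPU numbers in mcelog don't match the "processor" values shown there, the field in the record is probably something other than the kernel's logical CPU number (an APIC-style ID, for instance).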
One more interesting thing is the following output:
[root@zuno]# cat /var/log/mcelog |grep CPU|sort|awk '{print $2}'|uniq
32
33
34
35
50
51
52
53
Those numbers are always the same.
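The same pipeline with a count per CPU number makes the pattern a bit easier to read (just a variation of the command above):

# how many records each CPU number has logged
grep CPU /var/log/mcelog | awk '{print $2}' | sort -n | uniq -c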
OK. Suppose we have a problem in RAM. Since I don't really know what those CPU numbers mean, we assume that cpu+bank can point to the problem hardware. Is that possible? According to our "broken RAM" theory, we suppose that those numbers 32,33,34,35 and 50,51,52,53 indicate some symmetric problem with the RAM, the slots, or something else. Is that correct?
Thanks in advance.
Vladimir Budnev wrote:
Hello community.
We are running, Centos 4.8 on SuperMicro SYS-6026T-3RF with 2xIntel Xeon E5630 and 8xKingston KVR1333D3D4R9S/4G
For some time we have lots of MCE in mcelog and we cant find out the reason.
The only thing that shows there (when it shows, since sometimes it doesn't seem to) is a hardware error. You *WILL* be replacing hardware, sometime soon, like yesterday.
"Normal" is not: *ANYTHING* here is Bad News. First, you've got DIMMs failing. CPU 53, assuming this system doesn't have 53+ physical CPUs, means that you have x-core systems, so you need to divide by x, so that if it's a 12-core system with 6 physical chips, that would make it DIMM 8 associated with that physical CPU. <snip>
One more interesting thins is the following output:
[root@zuno]# cat /var/log/mcelog |grep CPU|sort|awk '{print $2}'|uniq
32
33
34
35
50
51
52
53
Those numbers are always the same.
Bad news: you have *two* DIMMs failing, one associated with the physical CPU that has core 53, and another associated with the physical CPU that has cores 32-35.
Talk to your OEM support to help identify which banks need replacing, and/or find a motherboard diagram.
mark, who has to deal *again* with one machine with the same problem....
2011/3/21 m.roth@5-cent.us
Vladimir Budnev wrote:
Hello community.
We are running, Centos 4.8 on SuperMicro SYS-6026T-3RF with 2xIntel Xeon E5630 and 8xKingston KVR1333D3D4R9S/4G
For some time we have lots of MCE in mcelog and we cant find out the reason.
The only thing that shows there (when it shows, since sometimes it doesn't seem to) is a hardware error. You *WILL* be replacing hardware, sometime soon, like yesterday.
"Normal" is not: *ANYTHING* here is Bad News. First, you've got DIMMs failing. CPU 53, assuming this system doesn't have 53+ physical CPUs, means that you have x-core systems, so you need to divide by x, so that if it's a 12-core system with 6 physical chips, that would make it DIMM 8 associated with that physical CPU.
<snip>
> One more interesting thins is the following output:
> [root@zuno]# cat /var/log/mcelog |grep CPU|sort|awk '{print $2}'|uniq
> 32
> 33
> 34
> 35
> 50
> 51
> 52
> 53
>
> Those numbers are always the same.
Bad news: you have *two* DIMMs failing, one associated with the physical CPU that has core 53, and another associated with the physical CPU that has cores 32-35.
Talk to your OEM support to help identify which banks need replacing, and/or find a motherboard diagram.
mark, who has to deal *again* with one machine with the same
problem....
Thanks for the answer!
Last night we did some research to find out which RAM modules are bad.
Note that we have 8 modules, 4 GB each.
First we removed the modules in the a3 and b1 slots for each CPU, and there was no change in behaviour: errors appeared after boot.
Then we removed a1 and a2 (yes, I know that "for high performance" we should populate modules starting from a1, but it was our mistake, and in any case the server started) and... there were no errors for an hour. Usually we observe errors coming roughly every 5 minutes.
Then we put 2 modules back. At that step we had the a1, a3 and b1 slots occupied for each CPU. No errors.
Finally we put the last 2 modules back... and still no errors. Note that at that step we had exactly the same module placement as before the experiment.
Sounds strange, but at first glance it looks like something was wrong with the module seating. But we can't figure out why the problem didn't show up for the first days, even months, of the server running. No one touched the server hardware, so I have no idea what that was.
Now we are just waiting to see whether the errors come back.
On Tue, Mar 22, 2011 at 7:33 AM, Vladimir Budnev vladimir.budnev@gmail.com wrote:
2011/3/21 m.roth@5-cent.us
Vladimir Budnev wrote:
Hello community.
We are running, Centos 4.8 on SuperMicro SYS-6026T-3RF with 2xIntel Xeon E5630 and 8xKingston KVR1333D3D4R9S/4G
For some time we have lots of MCE in mcelog and we cant find out the reason.
The only thing that shows there (when it shows, since sometimes it doesn't seem to) is a hardware error. You *WILL* be replacing hardware, sometime soon, like yesterday.
"Normal" is not: *ANYTHING* here is Bad News. First, you've got DIMMs failing. CPU 53, assuming this system doesn't have 53+ physical CPUs, means that you have x-core systems, so you need to divide by x, so that if it's a 12-core system with 6 physical chips, that would make it DIMM 8 associated with that physical CPU.
<snip>
> One more interesting thins is the following output:
> [root@zuno]# cat /var/log/mcelog |grep CPU|sort|awk '{print $2}'|uniq
> 32
> 33
> 34
> 35
> 50
> 51
> 52
> 53
>
> Those numbers are always the same.
Bad news: you have *two* DIMMs failing, one associated with the physical CPU that has core 53, and another associated with the physical CPU that has cores 32-35.
Talk to your OEM support to help identify which banks need replacing, and/or find a motherboard diagram.
mark, who has to deal *again* with one machine with the same problem....
Tnx for the asnwer!
Last night we'v made some research to find out which RAM modules bugged.
To be noticed we have 8 modules 4G each.
First we'v removed a3,b1 slots for each cpu, and there were no changes in HW behaviour. Errors appeared after boot.
Then we'v removed a1,a2 (yes i know that "for hight performance" we should place modules starting from a1 but it was our mistake and in any case server started) and ...and there were no errors during 1h. Usually we can observer errors coming ~every 5 mins.
Then we'v placed back 2 modules. At that step we had a1,a3,b1 slots occupied for each cpu. No errors.
Finally we'v placed last 2 modules...and no errors. It should be noticed that at that step we have exactly the same modules placement as before experiment.
Sounds strange, but at first glance looks like smthg was wrong with modules placement. But we cant realise why the problem didnt show for the first days, even month of server running. Noone touched server HW, so i have no idea what was that.
Now we are just waiting will there be errors again.
You know......
I once had a *whole rack* of blade servers, running CentOS, where someone decided to "save money" by buying the memory separately and replacing it in-house. Slews of memory errors started up pretty soon, and I wound up having to reseat all of it, run some memory testing tools against them, juggle the good memory with the bad memory to get working systems, replace DIMMs, etc., etc. We kept seeing failures over the next few months as part of the falling part of a bathtub curve.
I was furious that we'd "saved" perhaps two thousand bucks on RAM, overall, completely burned a month of my time, made our clients *VERY* unhappy, and came out looking like fools for not having this very expensive piece of kit working from day one.
In the process, though, some of the systems were repaired "permanently" by simply reseating the RAM. I did handle them carefully, cleaning the filters, removing any dust (of which there was very little, they were new) and checking all the cabling. I also cleaned up the airflow a bit by doing some recabling and relabeling, normal practice when I have a rack down and a chance to make sure things go where they should.
And I *carefully* cleaned up the blood where I cut my hand on the heat sink on the one system. Maybe it was the blood sacrifice that appeased the gods on that server?
Vladimir Budnev wrote:
2011/3/21 m.roth@5-cent.us
Vladimir Budnev wrote:
Hello community.
We are running, Centos 4.8 on SuperMicro SYS-6026T-3RF with 2xIntel Xeon E5630 and 8xKingston KVR1333D3D4R9S/4G
For some time we have lots of MCE in mcelog and we cant find out the reason.
The only thing that shows there (when it shows, since sometimes it doesn't seem to) is a hardware error. You *WILL* be replacing hardware,
sometime
soon, like yesterday.
<snip>
Bad news: you have *two* DIMMs failing, one associated with the physical CPU that has core 53, and another associated with the physical CPU that has cores 32-35.
<snip>
Last night we'v made some research to find out which RAM modules bugged.
To be noticed we have 8 modules 4G each.
<snip>
Finally we'v placed last 2 modules...and no errors. It should be noticed that at that step we have exactly the same modules placement as before experiment.
Sounds strange, but at first glance looks like smthg was wrong with modules placement. But we cant realise why the problem didnt show for
the first
days, even month of server running. Noone touched server HW, so i have no idea what was that.
Now we are just waiting will there be errors again.
I'm sure there will be. Reseating the memory may have done something, but there will be more errors, I'll wager.
Here's a question out of left field: who was the manufacturer of the 4G DIMMs? Not Supermicro, but the DIMMs themselves?
mark
2011/3/22 m.roth@5-cent.us
Vladimir Budnev wrote:
2011/3/21 m.roth@5-cent.us
Vladimir Budnev wrote:
Hello community.
We are running, Centos 4.8 on SuperMicro SYS-6026T-3RF with 2xIntel Xeon E5630 and 8xKingston KVR1333D3D4R9S/4G
For some time we have lots of MCE in mcelog and we cant find out the reason.
The only thing that shows there (when it shows, since sometimes it doesn't seem to) is a hardware error. You *WILL* be replacing hardware,
sometime
soon, like yesterday.
<snip>
>> Bad news: you have *two* DIMMs failing, one associated with the physical
>> CPU that has core 53, and another associated with the physical CPU that
>> has cores 32-35.
<snip>
> Last night we'v made some research to find out which RAM modules bugged.
>
> To be noticed we have 8 modules 4G each.
<snip>
> Finally we'v placed last 2 modules...and no errors. It should be noticed
> that at that step we have exactly the same modules placement as before
> experiment.
>
> Sounds strange, but at first glance looks like smthg was wrong with
> modules placement. But we cant realise why the problem didnt show for the first
> days, even month of server running. Noone touched server HW, so i have no
> idea what was that.
>
> Now we are just waiting will there be errors again.
I'm sure there will. Reseating the memory may have done something, but there will, I'll wager.
mark, you are absolutely right :) Approximately an hour ago the errors appeared. They have appeared only once since the reboot, but they are back. Hi there :(
The good news is that the CPU numbers changed, so now we have CPUs 1,2,3 and 18,19,20,21. We definitely moved the "broken" modules to other slots. Anyway, a bad DIMM is really good news for us compared to, say, the motherboard.
We are going to continue the party tonight or tomorrow morning and determine which two modules are broken.
Is it possible to determine which physical DIMMs correspond to the CPUs mentioned in the MCE messages? We have two rows of slots (6 slots per row), one for cpu1 and the second for cpu2. The used slots are marked cpu1-a1, cpu1-a2, cpu1-a3, cpu1-b1 and cpu2-a1, cpu2-a2, cpu2-a3, cpu2-b1.
I remember you advised dividing the CPU number by the physical core count. We have 2 quad-core processors, so 8 CPUs. 1/8 = 0; is that the cpu-a1 slot, or does it depend on the situation? I hope we will find those bastards ourselves, but a hint would be great.
And one more thing I can't understand... if there are, say, 8 "CPU numbers" per memory module (in our situation), why do we see only 4 numbers and not 8, e.g. 0,1,2,3,4,5,6,7?
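In case it helps anyone: if dmidecode is installed and new enough to take -t, it can at least list which slots the BIOS thinks are populated, so the board's slot labels can be matched against what the OS reports (field names vary by BIOS, so this is only a rough check):

# SMBIOS type 17 = Memory Device: slot locator, size and, sometimes, part/serial numbers
dmidecode -t 17 | grep -E 'Locator|Size|Part Number|Serial Number'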
Here's a question out of left field: who was the manufacturer of the 4G DIMMs? Not Supermicro, but the DIMMs themselves?
They are Kingston KVR1333D3D4R9S/4G, if I understood the question correctly.
Vladimir Budnev wrote:
2011/3/22 m.roth@5-cent.us
Vladimir Budnev wrote:
2011/3/21 m.roth@5-cent.us
Vladimir Budnev wrote:
We are running, Centos 4.8 on SuperMicro SYS-6026T-3RF with 2xIntel Xeon E5630 and 8xKingston KVR1333D3D4R9S/4G
For some time we have lots of MCE in mcelog and we cant find out the reason.
The only thing that shows there (when it shows, since sometimes it doesn't seem to) is a hardware error. You *WILL* be replacing hardware, sometime soon, like yesterday.
<snip> >> Bad news: you have *two* DIMMs failing, one associated with the >> physical CPU that has core 53, and another associated with the
physical CPU
that has cores 32-35.
<snip, memory reseating>
Now we are just waiting will there be errors again.
I'm sure there will. Reseating the memory may have done something, but there will, I'll wager.
mark, you are absolutely right :) Approximetely 1h ago errors appeared. They appeared only once since reboot, but they r back. Hi there :(
The good idea is that CPU numbers changed, so now we have cpu 1,2,3 and 18,19,20,21.We definetely moved "broken" modules to another slots. Anyway bad dimm is really a good news for us instead of e.g. motherboard.
<snip>
Is it possible to determine which physical dimms correspond to those cpus noticed in mce messagees? We have two rows of slots(6 slot for each row) one for cpu1 and second for cpu2. Used slots marked as cpu1-a1,cpu1-a2,cpu1-a3,cpu1-b1 and cpu2-a1,cpu2-a2,cpu2-a3,cpu2-b1.
I remeber that you adviced to divide cpu number on physical core count. We have 2 quad core proc, so 8 cpu. 1/8=0 Is it cpu-a1 slot or depends on situation? I hope we will find those bustards ourselvs but hint would be great.
And one more thing i cant funderstand ... if there is,say, 8 "cpu numbers" per each memory module(in our situation), why we see only 4 numbers and not 8 e.g. 0,1,2,3,4,5,6,7 ?
I'm now confused about a lot: originally, you mentioned 53 - 57, was it? That doesn't add up, since you say you have 2 quad-core processors, for a total of 8 CPUs, and each of those processors has 6 banks, which would mean each processor should only see six (directly). Where I'm confused is how you could have cores 32-35, or 53-whatsit, when you only have 8 cores in two processors.
Here's a question out of left field: who was the manufacturer of the 4G DIMMs? Not Supermicro, but the DIMMs themselves?
This is Kingston KVR1333D3D4R9S/4G if i got the question
Oh, ok. I was wondering if they were Hynix - I've seen a good number of bad 4G and 8G DIMMs from them recently, and that across three different OEMs and DIMM models.
mark
2011/3/22 m.roth@5-cent.us
Vladimir Budnev wrote:
2011/3/22 m.roth@5-cent.us
Vladimir Budnev wrote:
2011/3/21 m.roth@5-cent.us
Vladimir Budnev wrote:
We are running, Centos 4.8 on SuperMicro SYS-6026T-3RF with 2xIntel Xeon E5630 and 8xKingston KVR1333D3D4R9S/4G
For some time we have lots of MCE in mcelog and we cant find out the reason.
The only thing that shows there (when it shows, since sometimes it doesn't seem to) is a hardware error. You *WILL* be replacing hardware, sometime soon, like yesterday.
<snip> >> Bad news: you have *two* DIMMs failing, one associated with the >> physical CPU that has core 53, and another associated with the
physical CPU
that has cores 32-35.
<snip, memory reseating>
Now we are just waiting will there be errors again.
I'm sure there will. Reseating the memory may have done something, but there will, I'll wager.
mark, you are absolutely right :) Approximetely 1h ago errors appeared. They appeared only once since reboot, but they r back. Hi there :(
The good idea is that CPU numbers changed, so now we have cpu 1,2,3 and 18,19,20,21.We definetely moved "broken" modules to another slots. Anyway bad dimm is really a good news for us instead of e.g.
motherboard.
<snip>
> Is it possible to determine which physical dimms correspond to those cpus
> noticed in mce messagees? We have two rows of slots(6 slot for each row)
> one for cpu1 and second for cpu2. Used slots marked as
> cpu1-a1,cpu1-a2,cpu1-a3,cpu1-b1 and cpu2-a1,cpu2-a2,cpu2-a3,cpu2-b1.
>
> I remeber that you adviced to divide cpu number on physical core count. We
> have 2 quad core proc, so 8 cpu. 1/8=0 Is it cpu-a1 slot or depends on
> situation? I hope we will find those bustards ourselvs but hint would be
> great.
>
> And one more thing i cant funderstand ... if there is,say, 8 "cpu numbers"
> per each memory module(in our situation), why we see only 4 numbers and
> not 8 e.g. 0,1,2,3,4,5,6,7 ?
I'm now confused about a lot: originally, you mentioned 53 - 57, was it? That doesn't add up, since you say you have 2 quad core processors, for a total of 8 cpus, and each of those processors have 6 banks, which would mean each processor should only see six (directly). Where I'm confused is how you could have cores 32-35, or 53-whatsit, when you only have 8 cores in two processors.
2 CPUs, each with 8 cores and HT support. So 16 at max, I think. Is that reasoning OK? I have really lost the thread with those CPU-to-memory-bank mappings...
Here's a question out of left field: who was the manufacturer of the 4G DIMMs? Not Supermicro, but the DIMMs themselves?
This is Kingston KVR1333D3D4R9S/4G if i got the question
Oh, ok. I was wondering if they were Hynix - I've seen a good number of bad 4G and 8G DIMMs from them recently, and that across three different OEMs and model DIMMs.
mark
Vladimir Budnev wrote:
2011/3/22 m.roth@5-cent.us
Vladimir Budnev wrote:
2011/3/22 m.roth@5-cent.us
Vladimir Budnev wrote:
2011/3/21 m.roth@5-cent.us
Vladimir Budnev wrote: > > We are running, Centos 4.8 on SuperMicro SYS-6026T-3RF with > 2xIntel Xeon E5630 and 8xKingston KVR1333D3D4R9S/4G > > For some time we have lots of MCE in mcelog and we cant find out > the reason.
The only thing that shows there (when it shows, since sometimes it doesn't seem to) is a hardware error. You *WILL* be replacing hardware, sometime soon, like yesterday.
<snip>
We have 2 quad core proc, so 8 cpu. 1/8=0 Is it cpu-a1 slot or
depends on
situation? I hope we will find those bustards ourselvs but hint would be great.
And one more thing i cant funderstand ... if there is,say, 8 "cpu numbers" per each memory module(in our situation), why we see only 4
numbers
and not 8 e.g. 0,1,2,3,4,5,6,7 ?
I'm now confused about a lot: originally, you mentioned 53 - 57, was it? That doesn't add up, since you say you have 2 quad core processors, for a total of 8 cpus, and each of those processors have 6 banks, which would mean each processor should only see six (directly). Where I'm confused is how you could have cores 32-35, or 53-whatsit, when you only have 8 cores in two processors.
2 cpu each 8 cores and HT support. So 16 at max i think. for such way is it ok?
Huh? Above, you say "2 quad core proc" - that's 8 cores over two processor chips. HT support doesn't figure into it; if you use dmidecode or lshw, I believe it will show you 8 cores, not 16.
I really lost the idea line with those cpu to memory bank mappings...
Each processor will directly see the DIMMs associated with it, so the banks associated with each processor are what directly affect its cores. So, if you see something like
Mar 20 05:01:35 <system name> kernel: Northbridge Error, node 0, core: 5
(these processors are 8-core), it means that one of the DIMMs in bank 0, 0-3, is bad. You should see
 __
|_0| 0 1 2 3
 __
|_1| 0 1 2 3
or whatever on the m/b, so one of the top ones there is affected. Is that any clearer?
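If you want to see how your kernel words it, something like this should pull out whatever made it into syslog (stock log locations assumed; this is just a rough check):

# anything the kernel itself logged about machine checks
grep -iE 'machine check|hardware error|northbridge' /var/log/messages | tail -20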
mark
2011/3/22 m.roth@5-cent.us
Vladimir Budnev wrote:
2011/3/22 m.roth@5-cent.us
Vladimir Budnev wrote:
2011/3/22 m.roth@5-cent.us
Vladimir Budnev wrote:
2011/3/21 m.roth@5-cent.us > Vladimir Budnev wrote: > > > > We are running, Centos 4.8 on SuperMicro SYS-6026T-3RF with > > 2xIntel Xeon E5630 and 8xKingston KVR1333D3D4R9S/4G > > > > For some time we have lots of MCE in mcelog and we cant find out > > the reason. > > The only thing that shows there (when it shows, since sometimes it > doesn't seem to) is a hardware error. You *WILL* be replacing > hardware, sometime soon, like yesterday.
<snip>
We have 2 quad core proc, so 8 cpu. 1/8=0 Is it cpu-a1 slot or
depends on
situation? I hope we will find those bustards ourselvs but hint would be great.
And one more thing i cant funderstand ... if there is,say, 8 "cpu numbers" per each memory module(in our situation), why we see only 4
numbers
and not 8 e.g. 0,1,2,3,4,5,6,7 ?
I'm now confused about a lot: originally, you mentioned 53 - 57, was it? That doesn't add up, since you say you have 2 quad core processors, for a total of 8 cpus, and each of those processors have 6 banks, which
would
mean each processor should only see six (directly). Where I'm confused is how you could have cores 32-35, or 53-whatsit, when you only have 8 cores in two processors.
2 cpu each 8 cores and HT support. So 16 at max i think. for such way is it ok?
Huh? Above, you say "2 quad core proc" - that's 8 cores over two processor chips. HT support doesn't figure into it; if you use dmidecode or lshw, I believe it will show you 8 cores, not 16.
That was a typo, sorry. 2 CPUs, each with 4 cores, so 8 cores total.
I really lost the idea line with those cpu to memory bank mappings...
Each processor will directly see the DIMMs associate with it, so that the banks associated with each processor will be what directly affects the cores. So, if you see something like
Mar 20 05:01:35 <system name> kernel: Northbridge Error, node 0, core: 5
(these processors are 8-core), it means that one of the DIMMs in bank 0, 0-3, is bad. You should see
 __
|_0| 0 1 2 3
 __
|_1| 0 1 2 3
or whatever on the m/b, so one of the top ones there is affected. Is that any clearer?
First of all, big thanks for helping, mark.
In your example everything is OK, but I am lost with what we have. Previously we received messages like the one I posted in the first mail:
CPU 51 BANK 8 TSC 8511e3ca77dc
MISC 274d587f00006141 ADDR 807044840
STATUS cc0055000001009f MCGSTATUS 0
And there were always the same CPU numbers. I really don't know why mcelog shows such numbers, but that's what we have: always BANK 8, and the numbers 32,33,34,35 and 50,51,52,53 in the CPU field.
You convinced us that it is a DIMM problem, and we decided to do the little experiment I described up the thread. During that we moved the DIMM modules between slots, so now we have BANK 8 and CPUs 1,2,3 and 18,19,20,21. It really does seem that those numbers are somehow connected with the RAM modules.
But... as I said, we have the following slots:
CPU1: cpu1-a1 cpu1-a2 cpu1-a3 cpu1-b1 cpu1-b2 cpu1-b3
CPU2: cpu2-a1 cpu2-a2 cpu2-a3 cpu2-b1 cpu2-b2 cpu2-b3
We have the modules placed this way:
+------+---------+---------+---------+---------+---------+---------+
|      |    V    |    V    |    V    |    V    |  free   |  free   |
+------+---------+---------+---------+---------+---------+---------+
| CPU1 | cpu1-a1 | cpu1-a2 | cpu1-a3 | cpu1-b1 | cpu1-b2 | cpu1-b3 |
+------+---------+---------+---------+---------+---------+---------+

+------+---------+---------+---------+---------+---------+---------+
|      |    V    |    V    |    V    |    V    |  free   |  free   |
+------+---------+---------+---------+---------+---------+---------+
| CPU2 | cpu2-a1 | cpu2-a2 | cpu2-a3 | cpu2-b1 | cpu2-b2 | cpu2-b3 |
+------+---------+---------+---------+---------+---------+---------+
Definitely there is something with the memory banks, because moving the modules changed the MCE messages, but what exactly... or have I interpreted it all wrong?
Hi :)
On Tue, Mar 22, 2011 at 3:59 PM, Vladimir Budnev vladimir.budnev@gmail.com wrote:
[...]
But... as I said, we have the following slots:
CPU1: cpu1-a1 cpu1-a2 cpu1-a3 cpu1-b1 cpu1-b2 cpu1-b3
CPU2: cpu2-a1 cpu2-a2 cpu2-a3 cpu2-b1 cpu2-b2 cpu2-b3

We have the modules placed this way:
+------+---------+---------+---------+---------+---------+---------+
|      |    V    |    V    |    V    |    V    |  free   |  free   |
+------+---------+---------+---------+---------+---------+---------+
| CPU1 | cpu1-a1 | cpu1-a2 | cpu1-a3 | cpu1-b1 | cpu1-b2 | cpu1-b3 |
+------+---------+---------+---------+---------+---------+---------+

+------+---------+---------+---------+---------+---------+---------+
|      |    V    |    V    |    V    |    V    |  free   |  free   |
+------+---------+---------+---------+---------+---------+---------+
| CPU2 | cpu2-a1 | cpu2-a2 | cpu2-a3 | cpu2-b1 | cpu2-b2 | cpu2-b3 |
+------+---------+---------+---------+---------+---------+---------+
Definetely there is something with memory banks,becasue replacinbg moudels changed the mce messages, but what exactly...or iv interpreted all wrong?
This isn't an optimal setup (performance-wise). You should always populate the slots in multiples of 3 to get the full memory bandwidth. In your case, you've got cpu1-b[2|3] and cpu2-b[2|3] with no DIMMs, so that will affect your performance.
HTH
Rafa
Vladimir Budnev wrote:
2011/3/22 m.roth@5-cent.us
Vladimir Budnev wrote:
2011/3/22 m.roth@5-cent.us
Vladimir Budnev wrote:
2011/3/22 m.roth@5-cent.us
Vladimir Budnev wrote: > 2011/3/21 m.roth@5-cent.us >> Vladimir Budnev wrote: >> > >> > We are running, Centos 4.8 on SuperMicro SYS-6026T-3RF with >> > 2xIntel Xeon E5630 and 8xKingston KVR1333D3D4R9S/4G >> >
The next thing you should do, if you don't have them, is go to http://www.supermicro.com/support/manuals/ and d/l the manual, and see what it says about DIMMs.
mark
2011/3/22 m.roth@5-cent.us
Vladimir Budnev wrote:
2011/3/22 m.roth@5-cent.us
Vladimir Budnev wrote:
2011/3/22 m.roth@5-cent.us
Vladimir Budnev wrote:
2011/3/22 m.roth@5-cent.us > Vladimir Budnev wrote: > > 2011/3/21 m.roth@5-cent.us > >> Vladimir Budnev wrote: > >> > > >> > We are running, Centos 4.8 on SuperMicro SYS-6026T-3RF with > >> > 2xIntel Xeon E5630 and 8xKingston KVR1333D3D4R9S/4G > >> >
The next thing you should do, if you don't have them, is go to http://www.supermicro.com/support/manuals/ and d/l the manual, and see what it says about DIMMs.
If you meant checking whether those DIMM modules are compatible with the motherboard, that's OK. Kingston KVR1333D3D4R9S is in the tested list: http://www.supermicro.com/support/resources/memory/display.cfm?sz=4.0&ms...
And can you say something about the wild CPU numbers and determining which DIMMs are bad? Didn't you say a few posts ago that on an x-core system we must divide the CPU value by the number of cores to get the DIMM slot? E.g. CPU 32 / 8 cores -> slot 4?
At the moment we have removed 2 modules and are monitoring the result.
Vladimir Budnev wrote:
2011/3/22 m.roth@5-cent.us
Vladimir Budnev wrote:
2011/3/22 m.roth@5-cent.us
Vladimir Budnev wrote:
2011/3/22 m.roth@5-cent.us
Vladimir Budnev wrote: > 2011/3/22 m.roth@5-cent.us >> Vladimir Budnev wrote: >> > 2011/3/21 m.roth@5-cent.us >> >> Vladimir Budnev wrote: >> >> > >> >> > We are running, Centos 4.8 on SuperMicro SYS-6026T-3RF
with
>> >> > 2xIntel Xeon E5630 and 8xKingston KVR1333D3D4R9S/4G >> >> >
The next thing you should do, if you don't have them, is go to http://www.supermicro.com/support/manuals/ and d/l the manual, and see what it says about DIMMs.
If you meaned to check whether those DIMM modules a compatible with mother board , its ok. Kingstin KVR1333D3D4R9S is in tested list http://www.supermicro.com/support/resources/memory/display.cfm?sz=4.0&ms...
No, what you need to see is a) whether what you did was valid (for the Supermicro m/b on the server I'm working on right now, the manual says the a-banks must *ALWAYS* be populated...), and b) you might find some troubleshooting info to help you identify which DIMMs are the problem.
And can you say something about cpu wild numbers and determing which dimms are bugged? didnt you mean some post ago that on x core system we must divide cpu value on core numbers to get DIMM slot? e.g. CPU 32/8 cores ->4 slot?
Nope. From your original post:
One more interesting thins is the following output:
[root@zuno]# cat /var/log/mcelog |grep CPU|sort|awk '{print $2}'|uniq
32
33
34
35
50
51
52
53
So with 2 4-core Xeons, I don't understand how you can get 3x and 5x. Could you post some raw messages, either from /var/log/message or from /var/log/mcelog?
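Something as simple as this would do (assuming the stock location):

# last few raw records exactly as mcelog wrote them
tail -40 /var/log/mcelog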
mark
2011/3/22 m.roth@5-cent.us
Vladimir Budnev wrote:
2011/3/22 m.roth@5-cent.us
Vladimir Budnev wrote:
2011/3/22 m.roth@5-cent.us
Vladimir Budnev wrote:
2011/3/22 m.roth@5-cent.us > Vladimir Budnev wrote: > > 2011/3/22 m.roth@5-cent.us > >> Vladimir Budnev wrote: > >> > 2011/3/21 m.roth@5-cent.us > >> >> Vladimir Budnev wrote: > >> >> > > >> >> > We are running, Centos 4.8 on SuperMicro SYS-6026T-3RF
with
> >> >> > 2xIntel Xeon E5630 and 8xKingston KVR1333D3D4R9S/4G > >> >> >
The next thing you should do, if you don't have them, is go to http://www.supermicro.com/support/manuals/ and d/l the manual, and
see
what it says about DIMMs.
If you meaned to check whether those DIMM modules a compatible with
mother
board , its ok. Kingstin KVR1333D3D4R9S is in tested list
http://www.supermicro.com/support/resources/memory/display.cfm?sz=4.0&ms...
No, what you need to see is a) whether what you did was valid (for the Supermicro m/b on the server I'm working on right now, the manual says the a-banks must *ALWAYS* be populated...), and b) you might find some troubleshooting info to help you identify which DIMMs are the problem.
Roger that. Our bad :(
And can you say something about cpu wild numbers and determing which
dimms
are bugged? didnt you mean some post ago that on x core system we must divide cpu value on core numbers to get DIMM slot? e.g. CPU 32/8 cores
->4
slot?
Nope. From your original post:
One more interesting thins is the following output:
[root@zuno]# cat /var/log/mcelog |grep CPU|sort|awk '{print $2}'|uniq
32
33
34
35
50
51
52
53
So with 2 4-core Xeons, I don't understand how you can get 3x and 5x. Could you post some raw messages, either from /var/log/message or from /var/log/mcelog?
Sure, here they are. Before the "night party":
MCE 24
CPU 52 BANK 8 TSC 372a290717a
MISC 68651f800001186 ADDR 7dd2ad840
STATUS cc0002800001009f MCGSTATUS 0
MCE 25
CPU 32 BANK 8 TSC 372a29073cb
MISC 68651f800001186 ADDR 7dd2ad840
STATUS cc0002800001009f MCGSTATUS 0
MCE 26
CPU 50 BANK 8 TSC 372a29064ca
MISC 68651f800001186 ADDR 7dd2ad840
STATUS cc0002800001009f MCGSTATUS 0
MCE 27
CPU 33 BANK 8 TSC 372a2907e5c
MISC 68651f800001186 ADDR 7dd2ad840
STATUS cc0002800001009f MCGSTATUS 0
MCE 28
CPU 35 BANK 8 TSC 372a29088f1
MISC 68651f800001186 ADDR 7dd2ad840
STATUS cc0002800001009f MCGSTATUS 0
MCE 29
CPU 53 BANK 8 TSC 372a2908e82
MISC 68651f800001186 ADDR 7dd2ad840
STATUS cc0002800001009f MCGSTATUS 0
MCE 30
CPU 51 BANK 8 TSC 372a290899f
MISC 68651f800001186 ADDR 7dd2ad840
STATUS cc0002800001009f MCGSTATUS 0
MCE 31
CPU 34 BANK 8 TSC 423243c7aa5
MISC 2275a96d0000098f ADDR 7e7540ac0
STATUS cc001f000001009f MCGSTATUS 0
And here, after:
MCE 0
CPU 18 BANK 8 TSC 608709adcc62
MISC c6673a0400001181 ADDR 2f4cf4f40
STATUS cc0000800001009f MCGSTATUS 0
MCE 1
CPU 2 BANK 8 TSC 608709adcbcb
MISC c6673a0400001181 ADDR 2f4cf4f40
STATUS cc0000800001009f MCGSTATUS 0
MCE 2
CPU 20 BANK 8 TSC 608709adcb59
MISC c6673a0400001181 ADDR 2f4cf4f40
STATUS cc0000800001009f MCGSTATUS 0
MCE 3
CPU 1 BANK 8 TSC 608709add9b0
MISC c6673a0400001181 ADDR 2f4cf4f40
STATUS cc0000800001009f MCGSTATUS 0
MCE 4
CPU 3 BANK 8 TSC 608709ade3ab
MISC c6673a0400001181 ADDR 2f4cf4f40
STATUS cc0000800001009f MCGSTATUS 0
MCE 5
CPU 19 BANK 8 TSC 608709ade850
MISC c6673a0400001181 ADDR 2f4cf4f40
STATUS cc0000800001009f MCGSTATUS 0
MCE 6
CPU 21 BANK 8 TSC 608709ade4ea
MISC c6673a0400001181 ADDR 2f4cf4f40
STATUS cc0000800001009f MCGSTATUS 0
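One more thing that may be worth noting: within each burst the ADDR value is identical. A quick way to count the distinct failing addresses in the raw log, without depending on the exact line layout (just a rough check):

# distinct physical addresses reported in the MCE records
grep -Eo 'ADDR [0-9a-f]+' /var/log/mcelog | sort | uniq -c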
Vladimir Budnev wrote:
2011/3/22 m.roth@5-cent.us
Vladimir Budnev wrote:
2011/3/22 m.roth@5-cent.us
Vladimir Budnev wrote:
2011/3/22 m.roth@5-cent.us
Vladimir Budnev wrote: > 2011/3/22 m.roth@5-cent.us >> Vladimir Budnev wrote: >> > 2011/3/22 m.roth@5-cent.us >> >> Vladimir Budnev wrote: >> >> > 2011/3/21 m.roth@5-cent.us >> >> >> Vladimir Budnev wrote: >> >> >> > >> >> >> > We are running, Centos 4.8 on SuperMicro SYS-6026T-3RF << >> >> > with 2xIntel Xeon E5630 and 8xKingston KVR1333D3D4R9S/4G >> >> >> >
The next thing you should do, if you don't have them, is go to http://www.supermicro.com/support/manuals/ and d/l the manual, and see what it says about DIMMs.
If you meaned to check whether those DIMM modules a compatible with motherboard , its ok. Kingstin KVR1333D3D4R9S is in tested list
http://www.supermicro.com/support/resources/memory/display.cfm?sz=4.0&ms...
No, what you need to see is a) whether what you did was valid (for the Supermicro m/b on the server I'm working on right now, the manual says the a-banks must *ALWAYS* be populated...), and b) you might find some troubleshooting info to help you identify which DIMMs are the problem.
Roger that. Our bad :(
Std. sysadmin reply: RTFM! <g>
And can you say something about cpu wild numbers and determing which dimms are bugged? didnt you mean some post ago that on x core system
we must
divide cpu value on core numbers to get DIMM slot? e.g. CPU 32/8 cores
->4 slot?
<snip>
So with 2 4-core Xeons, I don't understand how you can get 3x and 5x. Could you post some raw messages, either from /var/log/message or from /var/log/mcelog?
sure here they are before "night party": MCE 24 CPU 52 BANK 8 TSC 372a290717a MISC 68651f800001186 ADDR 7dd2ad840 STATUS cc0002800001009f MCGSTATUS 0 MCE 25
<snip> At this point, I throw up my hands. I have *no* idea how they could get numbers like CPU 52, unless something's wrong in the o/s - I mean, you are running 64 bit, right?
mark
2011/3/22 m.roth@5-cent.us
Vladimir Budnev wrote:
2011/3/22 m.roth@5-cent.us
Vladimir Budnev wrote:
2011/3/22 m.roth@5-cent.us
Vladimir Budnev wrote:
2011/3/22 m.roth@5-cent.us > Vladimir Budnev wrote: > > 2011/3/22 m.roth@5-cent.us > >> Vladimir Budnev wrote: > >> > 2011/3/22 m.roth@5-cent.us > >> >> Vladimir Budnev wrote: > >> >> > 2011/3/21 m.roth@5-cent.us > >> >> >> Vladimir Budnev wrote: > >> >> >> > > >> >> >> > We are running, Centos 4.8 on SuperMicro SYS-6026T-3RF > << >> >> > with 2xIntel Xeon E5630 and 8xKingston
KVR1333D3D4R9S/4G
> >> >> >> >
The next thing you should do, if you don't have them, is go to http://www.supermicro.com/support/manuals/ and d/l the manual, and see what it says about DIMMs.
If you meaned to check whether those DIMM modules a compatible with motherboard , its ok. Kingstin KVR1333D3D4R9S is in tested list
http://www.supermicro.com/support/resources/memory/display.cfm?sz=4.0&ms...
No, what you need to see is a) whether what you did was valid (for the Supermicro m/b on the server I'm working on right now, the manual says the a-banks must *ALWAYS* be populated...), and b) you might find some troubleshooting info to help you identify which DIMMs are the problem.
Roger that. Our bad :(
Std. sysadmin reply: RTFM! <g>
And can you say something about cpu wild numbers and determing which dimms are bugged? didnt you mean some post ago that on x core system
we must
divide cpu value on core numbers to get DIMM slot? e.g. CPU 32/8 cores
->4 slot?
<snip>
>> So with 2 4-core Xeons, I don't understand how you can get 3x and 5x.
>> Could you post some raw messages, either from /var/log/message or from
>> /var/log/mcelog?
>
> sure here they are before "night party":
> MCE 24
> CPU 52 BANK 8 TSC 372a290717a
> MISC 68651f800001186 ADDR 7dd2ad840
> STATUS cc0002800001009f MCGSTATUS 0
> MCE 25
<snip>
At this point, I throw up my hands. I have *no* idea how they could get numbers like CPU 52, unless something's wrong in the o/s - I mean, you are running 64 bit, right?
Yeah, x86_64. I have an idea, dunno... the thing is, we are running CentOS 4.8. It's old enough, and the mcelog version is old enough too; maybe it decodes something completely wrong. Anyway, thanks so much for your time and answers. Hope we will find those DIMMs in our experiments.
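For the record, a quick way to see exactly which builds we are on (the package that owns the mcelog binary should show its version; this assumes mcelog is in root's PATH):

# kernel, architecture, distro release and the package owning mcelog
uname -rm
cat /etc/redhat-release
rpm -qf "$(which mcelog)"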
Vladimir Budnev wrote:
2011/3/22 m.roth@5-cent.us
<CHOMP>
So with 2 4-core Xeons, I don't understand how you can get 3x and 5x. Could you post some raw messages, either from /var/log/message or from /var/log/mcelog?
sure here they are before "night party": MCE 24 CPU 52 BANK 8 TSC 372a290717a MISC 68651f800001186 ADDR 7dd2ad840 STATUS cc0002800001009f MCGSTATUS 0 MCE 25
<snip> At this point, I throw up my hands. I have *no* idea how they could get numbers like CPU 52, unless something's wrong in the o/s - I mean, you are running 64 bit, right?
Yeah, x86_64 I have an idea dunno....the thing is we r runngin 4.8 centos. Its old enough and mcelog version is old enough also, mb it decodes something
completely
wrong.
It could be that 4.8 doesn't really understand the CPU.
Anyway thanks so much for your time and answers. Hope we will find those dimms in experiments.
Seriously - how old is this? I think you should call your vendor: some will give you phone or email support, even after the end of warranty.
mark
On 03/22/11 19:00, m.roth@5-cent.us wrote:
Vladimir Budnev wrote:
2011/3/22m.roth@5-cent.us
<CHOMP>
So with 2 4-core Xeons, I don't understand how you can get 3x and 5x. Could you post some raw messages, either from /var/log/message or from /var/log/mcelog?
sure here they are before "night party": MCE 24 CPU 52 BANK 8 TSC 372a290717a MISC 68651f800001186 ADDR 7dd2ad840 STATUS cc0002800001009f MCGSTATUS 0 MCE 25
<snip> At this point, I throw up my hands. I have *no* idea how they could get numbers like CPU 52, unless something's wrong in the o/s - I mean, you are running 64 bit, right?
Yeah, x86_64 I have an idea dunno....the thing is we r runngin 4.8 centos. Its old enough and mcelog version is old enough also, mb it decodes something
completely
wrong.
It could be that 4.8 doesn't really understand the CPU.
Anyway thanks so much for your time and answers. Hope we will find those dimms in experiments.
Seriously - how old is this? I think you should call your vendor: some will give you phone or email support, even after the end of warranty.
mark
Forgot to write up our solution; maybe it will be useful for someone. In our case the problem was (as expected) in the DIMM modules. After replacing them, no more scary mcelog entries etc.
m.roth@5-cent.us wrote:
Vladimir Budnev wrote:
2011/3/22 m.roth@5-cent.us
Vladimir Budnev wrote:
2011/3/21 m.roth@5-cent.us
Vladimir Budnev wrote:
<snip, memory reseating>
Now we are just waiting will there be errors again.
I'm sure there will. Reseating the memory may have done something, but there will, I'll wager.
mark, you are absolutely right :) Approximetely 1h ago errors appeared. They appeared only once since reboot, but they r back. Hi there :(
Here's a guess why you're having this problem: http://lmgtfy.com/?q=RAM+latent+junction+failure I suspect you're going to have problems again in a month or so. I hope I'm wrong.