We have been seeing failures with CentOS 4.4 i386 (not x86_64) running compute-intensive programs on Tyan K8SRE (S2891) motherboards, running Opteron 265's. This motherboard is used in the Tyan barebones box GT24 (B2881). We have these boards populated with 8GB of RAM, consisting of mixed 2GB and 1GB sticks.
The symptom is that CPU-bound programs (may or may not be related to floating point) fail randomly and intermittently, with wrong answers or segfaults. Running several in parallel seems to make the failures more likely. We have not seen any kernel crashes. It is not hard to reproduce the problem with some internal programs we have; it takes only a few minutes.
This is using a completely-up-to-date-as-of-yesterday CentOS 4.4 i386; running the hugemem kernel or not doesn't make a difference. We have seen this on many boxes, so it's not bad memory. We do NOT see this problem if we run CentOS 4.4 x86_64 on the same boxes, using the same 32-bit test executables. We also don't see this problem on some slightly older boxes with Tyan K8SD motherboards running CentOS 4.4 i386 (also Opteron 265's, with 8GB of 1GB DIMMs).
We have been looking at BIOS settings, but haven't seen anything that stands out. memtest86 does not show errors.
Thanks for any suggestions of what this issue might be, Dan
Dan Halbert wrote:
We have been seeing failures with CentOS 4.4 i386 (not x86_64) running compute-intensive programs on Tyan K8SRE (S2891) motherboards, running Opteron 265's. This motherboard is used in the Tyan barebones box GT24 (B2881). We have these boards populated with 8GB of RAM, consisting of mixed 2GB and 1GB sticks.
The symptom is that CPU-bound programs (may or may not be related to floating point) fail randomly and intermittently, with wrong answers or segfaults. Running several in parallel seems to make the failures more likely. We have not seen any kernel crashes. It is not hard to reproduce the problem with some internal programs we have; it takes only a few minutes.
as a FPU test, try this... from a user account...
    mkdir mprime
    cd mprime
    wget ftp://mersenne.org/gimps/mprime2414.tar.gz
    tar xzvf mprime2414.tar.gz
    ./mprime -A0 -t &
    ./mprime -A1 -t &
(if you have two dual core opterons, do this twice more with -A2 and -A3)
this will HAMMER the cpu/cache/memory bus with intensive FPU operations. let it run all night on an otherwise idle box, note any errors spewed to the terminal. each instance will use about 16MB of ram, and will be executing near peak speed FPU/SSE operations. it auto-nice's itself to minimize the impact on the rest of the system. your CPUs will run hotter than they've ever run before :)
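For boxes with more cores, the launch step above could be scripted. This is only a rough sketch: it assumes the mprime binary sits in the current directory and that the -A<n> and -t flags behave as described in this post. The helper names build_commands and launch are made up for illustration and are not part of mprime.

```python
import os
import subprocess


def build_commands(instances, binary="./mprime"):
    """Return one torture-test command line per instance.

    The -A<n> and -t flags are copied from the post above; nothing else
    about mprime's command line is assumed here.
    """
    return [[binary, "-A%d" % n, "-t"] for n in range(instances)]


def launch(instances=None, binary="./mprime"):
    """Start the instances in the background and return their Popen handles."""
    if instances is None:
        # Assumption: one instance per logical CPU; fall back to 1 if unknown.
        instances = os.cpu_count() or 1
    return [subprocess.Popen(cmd) for cmd in build_commands(instances, binary)]
```

Calling launch(4) on a box with two dual-core Opterons would start the -A0 through -A3 instances in one go; the returned handles can be used to terminate() them in the morning.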
hey, I thought mixing dimm sizes was verboten on opterons?
chrism@imntv.com wrote:
John R Pierce wrote:
hey, I thought mixing dimm sizes was verboten on opterons?
Me too. Though maybe he means some boxes are using one type and other boxes are using the other type.
Cheers,
me three,
i've watched the boxes go into full rebellion doing that even with all same sizes and just mixing ram vendors, like sneaking a couple microns onto the board when it's all kingstons. nothing like a nice 2 hour return trip to the colo to fix a whiny javagroups build.
CentOS mailing list CentOS@centos.org http://lists.centos.org/mailman/listinfo/centos
Karl R. Balsmeier wrote:
me three,
i've watched the boxes go into full rebellion doing that even with all same sizes and just mixing ram vendors, like sneaking a couple microns onto the board when it's all kingstons. nothing like a nice 2 hour return trip to the colo to fix a whiny javagroups build.
Ain't that the truth....I just pay the extra $$$ and stick with Micron/Crucial these days. I've been bitten by mixed memory more than once....sigh.
Cheers,
hey, I thought mixing dimm sizes was verboten on opterons?
Thanks for everyone's comments. I consulted with the person who has actually been working on the hardware. He says these motherboards have six slots for DIMM's, not 8, but they are supposedly in GT24 chassis, which come with K8SRE boards. So it's not clear what's going on; I may be misreporting the motherboards. There are mixed sizes per board: 4 1GB and 2 2GB. During testing, we did try removing the 2GB DIMM's, but did not rebalance the memory. We are physically away from the machines, so we'll take a closer look in the morning, and try some other configurations. The x86_64 vs. i386 difference is odd; maybe they treat the memory controllers differently?
Where would I find official warnings about memory-size mixing? We can bring it up with the vendor. The motherboard manual doesn't have such caveats. Is this oral tradition?
Dan
Dan Halbert wrote:
Where would I find official warnings about memory-size mixing? We can bring it up with the vendor. The motherboard manual doesn't have such caveats. Is this oral tradition?
my HP DL series Opteron servers have a caveat that you CAN mix memory sizes, as long as all the memory on one CPU is the same. Also, on the boards that have 8 dimms per CPU, if you load more than 4 dimms (2 banks of dual channel) on a single CPU's memory banks, you have to slow the memory bus down a notch. Now, I just realized, this is an Opteron 8xx server, 4 CPU sockets, 4 sets of 8 dimms each; it's possible the Opteron 2xx series (dual CPU sockets) have different memory rules. As far as I know, these rules are from the CPU chips themselves, as the CPUs have the memory controllers integrated.
               +-----+                  +-----+
cpu0     ======| CPU |--hypertransport--| CPU |=====cpu1
membanks ======|  0  |                  |  1  |=====membanks
               +-----+                  +-----+
                  |
                  |hypertransport to IO controllers
                  |
               +-----+
               | IO  |
               | bus |
               +-----+
(where the 'IO bus' is the main board chipset that manages all IO busses on the system)
anyways... perusing the manual on the k8sre... hmmm. it's not specified at all, and there's 2 x 2 slots on each processor, which does look like the 2xx have a different sort of memory controller than the 8xx
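The two rules of thumb quoted above for the HP DL boxes (same DIMM size per CPU, slower bus past 4 DIMMs on one CPU) can be written down as a quick sanity check. This is a sketch of those rules only, not an official AMD or Tyan specification; the function name check_population and the dictionary layout are assumptions made for illustration.

```python
def check_population(dimms_by_cpu, max_full_speed_dimms=4):
    """Check a DIMM layout against the rules of thumb quoted in this post.

    dimms_by_cpu maps a CPU label to the list of DIMM sizes (in GB)
    attached to that CPU. Returns a list of warning strings, empty if
    neither rule is tripped.
    """
    warnings = []
    for cpu, sizes in sorted(dimms_by_cpu.items()):
        distinct = sorted(set(sizes))
        if len(distinct) > 1:
            # Rule 1: all the memory on one CPU should be the same size.
            warnings.append("%s: mixed DIMM sizes %s" % (cpu, distinct))
        if len(sizes) > max_full_speed_dimms:
            # Rule 2: more than 4 DIMMs on one CPU means a slower memory bus.
            warnings.append("%s: %d DIMMs loaded, memory bus may need to be slowed"
                            % (cpu, len(sizes)))
    return warnings
```

For example, check_population({"cpu0": [1, 1, 1, 1], "cpu1": [2, 2]}) comes back empty, while putting 1GB and 2GB sticks on the same CPU produces a warning. Whether the Opteron 2xx parts follow the same rules is exactly the open question here.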
The x86_64 vs. i386 difference is odd; maybe they treat the memory controllers differently?
Maybe lack of AMD IOMMU support in the i386/i686 kernel?
Where would I find official warnings about memory-size mixing? We can bring it up with the vendor. The motherboard manual doesn't have such caveats. Is this oral tradition?
I had problems POSTing on a Tyan 2881 with two 242 Opterons using RAM from the same vendor (4 sticks of 512MB) because two of them were from a different batch. Only after getting those replaced with verified modules did I get my 2GB worth of RAM.
It turns out we were misinformed by an incorrect packing slip. I apologize for the red herring. These are actually Tyan GT20 boxes with Tyan K8SSA (S3870) motherboards. They have 6 DIMM slots and use a ServerWorks BCM5785 (HT-1000) chipset. The config of DIMM slots is kind of weird: there are four nominally supporting the first processor and two supporting the second, though the processors can read from each other's slots.
As for mixing RAM, we did try 4x1GB of identical DIMM's, and got the same problems with i386 CentOS 4.4. x86_64 is fine.
Dan
I don't have any experience with that particular board or chipset. Sorry. Tyan tech support is going to be pretty worthless so I hope you're able to track down the cause of the problem. I'd be curious to know what the root problem is when you get things sorted.
Cheers,
Really? I got pretty good support from Tyan to diagnose the incompatible RAM modules I got...that was three years ago. Have they gone bad?
Feizhou wrote:
Really? I got pretty good support from Tyan to diagnose the incompatible RAM modules I got...that was three years ago. Have they gone bad?
I had a really hard time with them when trying to troubleshoot issues with the 2895 board and the Highpoint Rocketraid 2224. Highpoint would email me replies within the hour and returned a few phone calls. Emails and phone calls to Tyan went unanswered for about 10 days at which point I got an extremely unhelpful canned email response (install the latest firmware and try again). I won't be buying any additional Tyan boards and sent all but 2 of that batch back (I'd deployed 2 as desktops while waiting for answers from Tyan and was too lazy to swap them out).
Back in the 1990-1998 era, I used lots and lots of Tyan tomcat boards and was happy with support. So perhaps they are dropping (or have dropped) the ball.
Cheers,
:(
Where else can one go for solid, high I/O whitebox motherboards that support AMD processors?
I'm pretty happy with our latest batch of Supermicro rackmounts. When I get to work, I'll pop the lid off one and check the board model #. With a 3ware 9550 and 8 x 500gig and 8 x 750gig barracudas, I was able to get several hundred mb/sec disk I/O according to bonnie++. I think I posted the numbers to the list a while back.
Cheers,
The boards we've got in our newer servers are Supermicro H8DA8.
http://www.supermicro.com/Aplus/motherboard/Opteron/8131/H8DA8.cfm
So far, I've been pretty happy with them. They're all being used for storing lots of uncompressed video or fondling uncompressed video into H.264/mpeg4 streams.
Cheers,
:(
Nobody distributes Supermicro in Hong Kong anymore. Last I checked anyway. In fact, even Supermicro confirms it too:
http://www.supermicro.com/wheretobuy/asia.cfm?rgn=134
The remaining available Supermicro boards are all Intel ones. Oh well.
Well, I will be in HK in a few weeks but maybe these boards are too bulky to bring in my hand luggage. :)
We have been using the Tyan Thunder S2882GNR-D motherboard with AMD Opteron 270's and Chenbro chassis with much success, since with the Supermicro stuff it looks like Intel is holding sway... With Zippy brand power supplies, LSI-Logic RAID 2x, Broadcom NICs, Crucial memory, and Fujitsu Ultra320 SCSI disks we see very little breakage, like none, on the app cluster.
-krb
Those older AMD chipset boards are solid, aren't they? I installed a dual Opteron 242 box with 2GB of RAM (4x512MB) using the 2881 board I was talking about. This was about two years ago. A 3ware 7508 + 6 200GB IDE disks + 2 scsi disks as the system disks + 350 watt redundant, ATX sized, 2themax power supply :D. I never got a complaint about its stability.
Another relatively new Tyan box (Tyan 2865) that I built (but running Windows XP 64-bit) has also been rock solid once I replaced the incredibly expensive and incredibly unstable 3dlabs Wildcat Realizm 500 with an el cheapo ATI card. If I had known that 3dlabs was leaving the market...
So if Tyan is losing its edge with its new stuff...that is a real pity.
have you tried single sourcing the ram in one of those machines? I think the mixed capacities are causing issues.
Dan Halbert wrote:
We have been seeing failures with CentOS 4.4 i386 (not x86_64) running compute-intensive programs on Tyan K8SRE (S2891) motherboards...
I talked with a friend who's been using a lot of Thunder T8WE boards with dual opterons for structural engineering systems software, he says ...
They are VERY PICKY about memory, I have learned. Do *NOT*, I repeat, do *NOT* cheap out on your memory.
I have hosed a perfectly good system, which took almost a week of troubleshooting, due to sh**ty memory corrupting the RAID.
We now only buy memory that is certified for Supermicro and Tyan systems. It doesn't cost a whole lot more than the cheap stuff.
memtest might not catch issues that are multiprocessor related.
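One way to poke at multiprocessor-related faults from user space is to run the same deterministic job on several processes at once and compare the answers, which mirrors the "wrong answers when several run in parallel" symptom reported at the top of the thread. The following is a minimal sketch under assumptions, not a replacement for memtest or mprime: the function names, the SHA-256 workload and the default counts are all picked for illustration.

```python
import hashlib
from multiprocessing import Pool


def checksum(seed, rounds=20000):
    """Deterministic CPU-bound job: hash a seed-derived value over and over."""
    data = str(seed).encode("ascii")
    for _ in range(rounds):
        data = hashlib.sha256(data).digest()
    return data.hex()


def compare_runs(seeds, reference, candidate):
    """Return (seed, expected, got) for every job whose result changed."""
    return [(s, r, c) for s, r, c in zip(seeds, reference, candidate) if r != c]


def stress(seeds, workers=4, repeats=10):
    """Run the jobs once as a reference, then `repeats` more times in parallel.

    Any pass that disagrees with the reference is reported. Caveat: the
    reference pass itself could be the corrupted one, so a mismatch only
    says that two passes disagreed, not which of them was wrong.
    """
    seeds = list(seeds)
    mismatches = []
    with Pool(workers) as pool:
        reference = pool.map(checksum, seeds)
        for _ in range(repeats):
            mismatches.extend(compare_runs(seeds, reference, pool.map(checksum, seeds)))
    return mismatches
```

Saved as a script, it would be driven with something like stress(range(64), workers=4) under an `if __name__ == "__main__":` guard. An empty list is what sound hardware should give; anything else is worth chasing.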