On 08/12/13 18:16, Warren Young wrote:
On 8/12/2013 12:54, m.roth@5-cent.us wrote:
Well, *all* of these are rackmount servers, with no moving-the-server wear.
Our servers are all rack-mounted, too, and pretty much never get moved after being installed.
In any case, I was referring to wear in the electromechanical components of a server. HDDs and fans, primarily. In olden days, optical disks, too. These are expected to fail over time.
We start seeing userspace compute-intensive processes crashing the system a number of times a day.
Define "crash the system".
The whole system reboots. <snip>
I don't suppose you've gathered continuous temp data, say with Cacti?
No, I haven't. It's a thought, though the HVAC's good (too good, he says, when he needs a long-sleeved shirt, and sometimes a sweater). ipmitool sel list isn't showing a problem.
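For anyone wanting continuous temperature data without setting up Cacti first, here is a minimal sketch that polls the BMC with ipmitool sdr type Temperature and appends the readings to a log. It is an illustration, not something anyone in this thread ran. The assumptions: the pipe-separated output shape shown in the comment (sensor names and layout vary by board and firmware), the hypothetical log path temps.log, and the 60-second interval.

```python
import re
import subprocess
import time

# Assumed line shape from `ipmitool sdr type Temperature` (varies by BMC):
#   "CPU Temp         | 01h | ok  |  3.1 | 45 degrees C"
READING = re.compile(r"(-?\d+(?:\.\d+)?)\s+degrees C")


def parse_temps(text):
    """Return {sensor_name: degrees_C} for lines that carry a reading."""
    temps = {}
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 5:
            continue  # not a sensor line in the assumed layout
        match = READING.search(fields[-1])
        if match:  # skips sensors that report "No Reading" and the like
            temps[fields[0]] = float(match.group(1))
    return temps


def sample():
    """Run ipmitool once and return the parsed temperature readings."""
    result = subprocess.run(
        ["ipmitool", "sdr", "type", "Temperature"],
        capture_output=True, text=True, check=True,
    )
    return parse_temps(result.stdout)


def log_forever(path="temps.log", interval=60):
    """Append timestamped readings to `path` every `interval` seconds."""
    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        with open(path, "a") as fh:
            for name, deg in sorted(sample().items()):
                fh.write(f"{stamp}\t{name}\t{deg}\n")
        time.sleep(interval)
```

Calling log_forever() as root on the affected box would leave a tab-separated history to line up against the reboot times. The SEL only records events the BMC considers threshold crossings, so a log like this can show a slow climb that ipmitool sel list would never report.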
They replace the m/b, and it doesn't happen again.
Oh, except for the one or two that we sent back a *second* time, and they replaced the m/b again....
Okay, so either this one motherboard product from Supermicro has a QC problem, or Penguin has an application or design problem with it. Or, your environment is somehow pushing them past their design limits. (e.g. insufficient cooling)
That's certainly not the problem.
You're painting with far too broad a brush here to say Supermicro is bad, period.
You like them, fine. We really don't, and the only things we were buying that had their m/b, etc., were honkin' hot servers.
mark