On 8/12/2013 12:54, m.roth at 5-cent.us wrote:
>
> Well, *all* of these are rackmount servers, with no moving-the-server
> wear.

Our servers are all rack-mounted, too, and pretty much never get moved
after being installed.

In any case, I was referring to wear in the electromechanical components
of a server. HDDs and fans, primarily. In olden days, optical disks,
too. These are expected to fail over time.

> We start seeing userspace compute-intensive processes crashing the
> system a number of times a day.

Define "crash the system". Hard lock-up, requiring a power toggle or
Reset press? Server unresponsive to keyboard, except for Ctrl-Alt-Del?
Kernel panic? X11 unresponsive but you can still ssh in? User program
dies mysteriously, but other programs still run? Keyboard lights blink
in patterns, monitor won't wake on mouse wiggle? Box reboots
spontaneously? BIOS beeps?

I don't suppose you've gathered continuous temp data, say with Cacti?

> They replace the m/b, and it doesn't happen again.

Okay, so either this one motherboard product from Supermicro has a QC
problem, or Penguin has an application or design problem with it.

Or, your environment is somehow pushing them past their design limits
(e.g. insufficient cooling).

You're painting with far too broad a brush here to say Supermicro is
bad, period.
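
On the continuous temperature data point: if no history exists yet, something as small as the following could start one. This is only a sketch and not a substitute for Cacti. It assumes a Linux kernel that exposes thermal zones under /sys/class/thermal (many server boards report temperatures through lm_sensors or IPMI instead, in which case this finds nothing). The function names and the CSV layout are made up for illustration.

```python
import csv
import glob
import os
import time


def read_zone_temps(base="/sys/class/thermal"):
    """Return {zone_name: degrees_C} for each readable thermal zone under base."""
    temps = {}
    for zone in sorted(glob.glob(os.path.join(base, "thermal_zone*"))):
        try:
            with open(os.path.join(zone, "temp")) as f:
                millideg = int(f.read().strip())  # sysfs reports millidegrees C
        except (OSError, ValueError):
            continue  # zone unreadable or not a number: skip it
        temps[os.path.basename(zone)] = millideg / 1000.0
    return temps


def log_temps(path, interval_s=60, samples=1, base="/sys/class/thermal"):
    """Append `samples` rounds of readings to a CSV file, `interval_s` apart."""
    with open(path, "a", newline="") as out:
        writer = csv.writer(out)
        for i in range(samples):
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            for name, deg_c in read_zone_temps(base).items():
                writer.writerow([stamp, name, f"{deg_c:.1f}"])
            # Hand rows to the OS promptly; a hard lock-up can still
            # lose whatever has not reached the disk yet.
            out.flush()
            if i + 1 < samples:
                time.sleep(interval_s)
```

Calling, say, log_temps("temps.csv", interval_s=60, samples=1440) would cover roughly a day. Lining the timestamps up against the times of the crashes would show whether they follow temperature spikes or happen regardless of them.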