[CentOS] server specifications

Mon Feb 14 09:49:41 UTC 2011
John Hodrien <J.H.Hodrien at leeds.ac.uk>

On Mon, 14 Feb 2011, Nico Kadel-Garcia wrote:

> But the accumulated costs of the higher end motherboard, memory,
> shortage of space for upgrades in the same unit, the downtime at the
> BIOS to reset the "disabled by default" ECC settings in the BIOS, and
> the system monitoring to detect and manage such errors add up *really
> fast* in a moderate sized shop.

Really?  Tweaking a BIOS setting is a silly argument.  You'll typically find
ECC is enabled by default, and if you can't get the BIOS settings right when
you set the machine up, that's your own fault.

Buy a Dell server with ECC.  Don't install any software at all.  Come ECC
error, you'll have an orange blinky light immediately warning you of impending
doom, and it'll even tell you on the front display details of the fault,
including which DIMM needs replacing.  If you can be bothered to install OMSA
(run one command, one yum install), it'll drop you an email when it fails.

Compare that with running without ECC: you wait until your machine randomly
reboots.  You ponder whether it's RAM/CPU/Motherboard.  You just ignore it.
It does it again.  You then have a fun game of running memtest while pulling
DIMMs out to try to work out which of the 16 is causing the issue.  Joy
unbounded.

And what do you mean about shortage of space for upgrades?  What that has to
do with ECC I have no idea.

> Please, name a single instance in the last 10 years where ECC
> demonstrably saved you work, especially if you made sure to burn in
> the system components on servers upon their first bootup...

I've had plenty of HPC nodes that have warned of corrected memory errors.
I've been able to drop them out of the queues, get the memory fixed, and put
them back into service without anyone noticing.  Without ECC, I'd potentially
have introduced errors into users' results, and you're much more likely to get
the first random reboot without warning, costing them time.  I've had memory
errors creep in after 4 years; it's not something that always bites at the
beginning.
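
As a rough illustration of that kind of check (not the tooling described
above): on Linux with an EDAC driver loaded, the kernel exposes corrected and
uncorrected error counts per memory controller under
/sys/devices/system/edac/mc/.  A minimal sketch follows, assuming that sysfs
layout is present on the node.  The function names and the threshold are
hypothetical, and hooking the result into a batch scheduler is left out.

```python
from pathlib import Path

# Standard EDAC sysfs location; only present when an EDAC driver is loaded.
EDAC_ROOT = "/sys/devices/system/edac/mc"


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller_name: (corrected, uncorrected)} from EDAC counters."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters are missing or unreadable.
            continue
        counts[mc.name] = (ce, ue)
    return counts


def controllers_needing_attention(counts, ce_threshold=1):
    """List controllers with any uncorrected error, or with corrected
    errors at or above ce_threshold (a hypothetical policy knob)."""
    return [
        name
        for name, (ce, ue) in counts.items()
        if ue > 0 or ce >= ce_threshold
    ]
```

Calling controllers_needing_attention(read_edac_counts()) from cron and
mailing any non-empty result would be one way to hear about a failing DIMM
before the node falls over.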

Equally I've had file servers do the same.  Running a file server without ECC
is a recipe for disaster, as you're risking silent data corruption.

jh