On Mon, Feb 14, 2011 at 4:49 AM, John Hodrien <J.H.Hodrien at leeds.ac.uk> wrote:

> On Mon, 14 Feb 2011, Nico Kadel-Garcia wrote:
>
>> But the accumulated costs of the higher end motherboard, memory,
>> shortage of space for upgrades in the same unit, the downtime at the
>> BIOS to reset the "disabled by default" ECC settings in the BIOS, and
>> the system monitoring to detect and manage such errors add up *really
>> fast* in a moderate sized shop.
>
> Really? Tweaking a BIOS setting is a silly argument, you'll typically find
> it's configured by default, and if you can't get BIOS settings right when you
> setup that's your own fault.

Trust me, it's a pain in the keister in production. If ECC is now enabled by
default, good: I haven't had my hands inside a server in a year, I admit it.
(My current role doesn't call for it.) It *didn't* used to be standard. Are
you sure it is? I'm still seeing notes that the motherboards that support it
are significantly more expensive, "server grade". Unfortunately, I've worked
for a manufacturer that repackaged consumer grade components for cheap pizza
box servers, and we had some disagreements about where they cut corners.

It's very awkward to preserve BIOS settings across BIOS updates (read:
impossible without a manual checklist) unless your environment is
sophisticated enough to be running LinuxBIOS. And unless you've *really*
invested in remote KVM boxes, or in Dell's DRAC or HP's remote console
tools, *and set them up correctly at install time, and kept their network
setups up to date*, BIOS changes are a nightmare to make without someone
putting hands and eyes on the server.
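For what it's worth, DRAC/iLO-class controllers generally speak IPMI
underneath, so while the host is still up you can at least audit (and fix)
the BMC's network settings from userland before you need them in an
emergency. A sketch, assuming ipmitool is installed and the BMC sits on the
usual LAN channel 1; the addresses in the comments are examples only:

```shell
#!/bin/sh
# Sketch: audit a BMC's LAN settings from the running OS with ipmitool, so the
# remote console is reachable *before* the machine is down. Channel 1 is the
# common default; adjust for your hardware.
audit_bmc() {
    chan=${1:-1}
    if ! command -v ipmitool >/dev/null 2>&1; then
        echo "ipmitool not installed; cannot query the BMC" >&2
        return 0
    fi
    # Stale values here are exactly what turns "remote management" into a
    # drive to the data center.
    ipmitool lan print "$chan" || echo "no BMC answered on channel $chan" >&2
    # To fix a stale setup (illustrative values):
    #   ipmitool lan set "$chan" ipsrc static
    #   ipmitool lan set "$chan" ipaddr 192.0.2.10
    #   ipmitool lan set "$chan" netmask 255.255.255.0
    #   ipmitool lan set "$chan" defgw ipaddr 192.0.2.1
}
audit_bmc 1
```

Run from cron and diffed against a known-good copy, this catches the
"nobody updated the DRAC's gateway" problem while the box is still healthy.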
And the remote tools are *awful* at giving you BIOS access, often because
the changes in screen resolution during different parts of the boot process
confuse the remote console tools. That's especially true if you're stuck
with the standard VGA-style access because the text console was never
enabled, and enabling it *often requires someone in the BIOS*, which is a
serious circular dependency. Now scale that by a stack of slightly different
models of servers with different interfaces for their BIOS management, and
you have a mess to manage.

I *LOVE* environments where the admins have been able to insist on, or
install, LinuxBIOS, because this problem is *solved* there. You can get at
the settings from Linux userland as necessary, the machines reboot *much*
faster, and you can download and back up the configurations for system
reporting. It's my friend.

> Buy a Dell server with ECC. Don't install any software at all. Come ECC
> error, you'll have an orange blinky light immediately warning you of impending
> doom, and it'll even tell you on the front display details of the fault,
> including which DIMM needs replacing. If you can be bothered to install OMSA
> (run one command, one yum install), it'll drop you an email when it fails.

Dells are solid, server class machines. I've seen HP oversold with a lot of
promises about management tools that don't work that well, for tasks better
integrated and managed by userland tools that *have to be done anyway*, and
sold with a lot of genuinely unnecessary features. (Whose bright idea was it
to switch servers to laptop hard drives? E-e-e-e-e-w-w-w-w-w!!!)

> Compared with not running with ECC, you wait until your machine randomly
> reboots. You ponder whether it's RAM/CPU/Motherboard. You just ignore it.
> It does it again. You then have a fun game of running memtest while pulling
> DIMMs out to try to work out which of the 16 are causing the issue. Joy
> unbounded.

ECC has a point, which I've acknowledged. But the overall "server class"
hardware costs add up fast.
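As an aside on the ECC warnings: the corrected-error counts don't need a
vendor agent at all, since Linux's EDAC layer exposes them in sysfs. A
sketch, using the standard EDAC sysfs paths; the output format is just my
own:

```shell
#!/bin/sh
# Sketch: read corrected/uncorrected ECC error counts from the kernel's EDAC
# sysfs interface, no vendor management agent required.
report_edac() {
    found=0
    for mc in /sys/devices/system/edac/mc/mc*; do
        [ -d "$mc" ] || continue
        found=1
        ce=$(cat "$mc/ce_count")   # corrected errors: ECC did its job
        ue=$(cat "$mc/ue_count")   # uncorrected errors: data was at risk
        echo "$(basename "$mc"): corrected=$ce uncorrected=$ue"
    done
    [ "$found" -eq 1 ] || echo "no EDAC memory controllers (no ECC, or driver not loaded)"
}
report_edac
```

A cron job diffing those counters over time is a poor man's version of
OMSA's email alerts, and it works the same across vendors.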
SAS hard drives instead of SATA, 10Gig ethernet ports, dual power supplies,
built-in remote KVM, expensive racking hardware, 15,000 RPM drives instead
of 10,000 RPM, etc. all start adding up really fast when all you need is a
so-called "pizza box".

This is one reason I've gotten fond of virtualization. (VMware or VirtualBox
for CentOS 5; we'll see about KVM for RHEL and CentOS 6.) Amortizing the
costs of a stack of modest servers with such server class features across
one central, overpowered server and doling out environments as necessary is
very efficient and avoids a lot of the hardware management problems.

> And what do you mean about shortage of space for upgrades? What that has to
> do with ECC I'll have no idea.

It's the overall "enterprise class hardware" meme that I'm concerned about
for a one-off CentOS grade server.

>> Please, name a single instance in the last 10 years where ECC
>> demonstrably saved you work, especially if you made sure to burn in
>> the system components on servers upon their first bootup...
>
> I've had plenty of HPC nodes that have warned of corrected memory errors.
> I've been able to drop them out of the queues, get the memory fixed, and put
> them back into service without anyone noticing. Without ECC, I've potentially
> introduced errors into their results, and you're much more likely to get the
> first random reboot without warning, costing them time. I've had memory
> errors creep in after 4 years, it's not something that always bites at the
> beginning.

Are you sure it was fixed by the memory replacement? Because I've seen most
of my ECC reports as one-offs, never to recur.

> Equally I've had file servers do the same. Running a file server without ECC
> is a recipe for disaster, as you're risking silent data corruption.

For core file servers, I'd agree, although a lot of the more common problems
(such as the failure of a single very expensive fileserver, or the lack of
user-accessible snapshots) are ameliorated by other approaches.
(Multiple cheap SATA external hard drives for snapshot backups, NFS access
so the users can recover their own deleted files, working around single
points of failure in upstream connectivity, etc.)
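The snapshot-plus-NFS idea above can be sketched with rsync's hard-link
trick: each snapshot looks like a full copy of the tree, but unchanged files
are hard-linked against the previous snapshot, so N snapshots cost little
more than one. The directory layout and names here are illustrative, not any
standard:

```shell
#!/bin/sh
# Sketch: rotating hard-link snapshots onto a cheap external drive. Exported
# read-only over NFS, users can fish their own deleted files back out.
snapshot() {
    src=$1
    dest=$2
    stamp=$(date +%Y%m%d-%H%M%S)
    mkdir -p "$dest"
    # Most recent previous snapshot, if any (timestamped dirs sort lexically)
    prev=$(ls -1d "$dest"/20* 2>/dev/null | tail -n 1)
    if [ -n "$prev" ] && command -v rsync >/dev/null 2>&1; then
        # --link-dest hard-links files unchanged since the previous snapshot
        rsync -a --link-dest="$prev" "$src"/ "$dest/$stamp"/
    else
        cp -a "$src" "$dest/$stamp"   # first snapshot (or no rsync): full copy
    fi
    echo "$dest/$stamp"
}
```

A nightly cron job calling this, plus a read-only NFS export of the snapshot
directory, gets you most of the "users restore their own files" workflow
without enterprise filer prices.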