On Mon, Feb 14, 2011 at 4:49 AM, John Hodrien <J.H.Hodrien@leeds.ac.uk> wrote:
On Mon, 14 Feb 2011, Nico Kadel-Garcia wrote:
But the accumulated costs of the higher end motherboard, memory, shortage of space for upgrades in the same unit, the downtime at the BIOS to reset the "disabled by default" ECC settings in the BIOS, and the system monitoring to detect and manage such errors add up *really fast* in a moderate sized shop.
Really? Tweaking a BIOS setting is a silly argument. You'll typically find it's configured by default, and if you can't get BIOS settings right when you set up, that's your own fault.
Trust me, it's a pain in the keister in production. If the standard is now enabled, good: I haven't had my hands inside a server in a year, I admit it. (My current role doesn't call for it.) It *didn't* use to be standard. Are you sure it is? I'm still seeing notes that the motherboards that support it are still significantly more expensive, "server grade". Unfortunately, I've worked for a manufacturer that repackaged consumer grade components for cheap pizza box servers, and we had some disagreements about where they cut corners.
It's very awkward to preserve BIOS settings across BIOS updates (read: impossible without a manual checklist) unless your environment is so sophisticated you're using LinuxBIOS. Unless you've *really* invested and gotten remote KVM boxes or invested in Dell's DRAC or HP's remote console tools, *and set them up correctly at install time, and kept their network setups up to date*, BIOS changes are a nightmare to do remotely with someone putting hands and eyes on the server. And the remote tools are *awful* at giving you BIOS access, often because the changes in screen resolution for different parts of the boot process confuse them. That's at least the case if you use the standard VGA-like access because you haven't set up console access, which *often requires someone to enable it from the BIOS*, and that leads to a serious circular dependency.
Now scale by a stack of slightly different models of servers with different interfaces for their BIOS management, and you have a mess to manage. I *LOVE* environments where the admins have been able to insist on, or install, LinuxBIOS because this is *solved* there. You can get at it from Linux userland as necessary, they reboot *much* faster, and you can download and back up the configurations for system reporting. It's my friend.
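To make the "manual checklist" a bit less manual, here's a minimal sketch. It assumes you can get each box's settings into a plain `name=value` text dump by whatever means that model offers. The dump format, the setting names, and the function names are all made up for illustration; no particular vendor tool is implied.

```python
def parse_settings(text):
    """Parse 'name=value' lines into a dict, ignoring blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip()
    return settings


def diff_settings(baseline, current):
    """Return {name: (expected, actual)} for every baseline setting that drifted.

    A setting missing from the current dump is reported with actual=None.
    Settings that appear only in the current dump are not reported.
    """
    return {name: (expected, current.get(name))
            for name, expected in baseline.items()
            if current.get(name) != expected}


if __name__ == "__main__":
    # Hypothetical dumps: a known-good baseline and the state after a BIOS update.
    baseline = parse_settings("ecc=enabled\nserial_console=enabled\n")
    after_update = parse_settings("ecc=disabled\n")
    for name, (expected, actual) in diff_settings(baseline, after_update).items():
        print(f"{name}: expected {expected!r}, found {actual!r}")
```

Keeping the baseline in version control next to the kickstart files would at least turn "what did the update reset?" into a diff instead of a memory exercise.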
Buy a Dell server with ECC. Don't install any software at all. Come ECC error, you'll have an orange blinky light immediately warning you of impending doom, and it'll even tell you on the front display details of the fault, including which DIMM needs replacing. If you can be bothered to install OMSA (run one command, one yum install), it'll drop you an email when it fails.
Dells are solid, server class machines. I've seen HP oversold with a lot of promises about management tools that don't work that well, for tasks better integrated and managed by userland tools that *have to be done anyway*, and sold with a lot of genuinely unnecessary features. (Whose bright idea was it to switch servers to laptop hard drives? E-e-e-e-e-w-w-w-w-w!!!)
Compared with not running with ECC, you wait until your machine randomly reboots. You ponder whether it's RAM/CPU/Motherboard. You just ignore it. It does it again. You then have a fun game of running memtest while pulling DIMMs out to try to work out which of the 16 are causing the issue. Joy unbounded.
ECC has a point, which I've acknowledged. But the overall "server class" hardware costs add up fast. SAS hard drives, 10Gig ethernet ports, dual power supplies, built-in remote KVM, expensive racking hardware, 15,000 RPM drives instead of 10,000 RPM, SAS instead of SATA, etc. all start adding up really fast when all you need is a so-called "pizza box".
This is one reason I've gotten fond of virtualization (VMware or VirtualBox for CentOS 5; we'll see about KVM for RHEL and CentOS 6). Amortizing the costs of a stack of modest servers with such server class features across one central, overpowered server and doling out environments as necessary is very efficient and avoids a lot of the hardware management problems.
And what do you mean about shortage of space for upgrades? What that has to do with ECC, I have no idea.
It's the overall "enterprise class hardware" meme that I'm concerned about for a one-off CentOS grade server.
Please, name a single instance in the last 10 years where ECC demonstrably saved you work, especially if you made sure to burn in the system components on servers upon their first bootup...
I've had plenty of HPC nodes that have warned of corrected memory errors. I've been able to drop them out of the queues, get the memory fixed, and put them back into service without anyone noticing. Without ECC, I've potentially introduced errors into their results, and you're much more likely to get the first random reboot without warning, costing them time. I've had memory errors creep in after 4 years; it's not something that always bites at the beginning.
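For anyone wanting to watch for those corrected errors without vendor tooling, here's a minimal sketch. It assumes a Linux kernel with an EDAC driver loaded for the memory controller, which exposes per-controller `ce_count` and `ue_count` files under /sys/devices/system/edac/mc/. The function names and the threshold policy are my own, purely for illustration.

```python
from pathlib import Path

# Present only when an EDAC driver is loaded for the memory controller.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs counters."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters are missing or unreadable
        counts[mc.name] = (ce, ue)
    return counts


def controllers_to_flag(counts, ce_threshold=1):
    """Hypothetical policy: flag a controller on any uncorrected error,
    or once its corrected errors reach ce_threshold."""
    return [name for name, (ce, ue) in counts.items()
            if ue > 0 or ce >= ce_threshold]


if __name__ == "__main__":
    counts = read_edac_counts()
    for name in controllers_to_flag(counts):
        ce, ue = counts[name]
        print(f"{name}: {ce} corrected, {ue} uncorrected memory errors")
```

Run from cron on each node, something like this could feed whatever takes a node out of the queue. Raising `ce_threshold` is one way to ignore the one-off corrected errors that never recur.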
Are you sure it was fixed by memory replacement? Because I've seen most of my ECC reports as one-offs, never to recur.
Equally I've had file servers do the same. Running a file server without ECC is a recipe for disaster, as you're risking silent data corruption.
Core file servers, I'd agree, although a lot of the more common problems (such as single very expensive fileserver failure and lack of user available snapshots) are ameliorated by other approaches. (Multiple cheap SATA external hard drives for snapshot backups, NFS access so the users can recover personally deleted files, single points of failure in upstream connectivity, etc.)