Could somebody give me some advice on hardware for building a CentOS Linux server?
--- Michel Donais
Michel Donais said the following on 13/02/11 16:26:
Could somebody give me some advice on hardware for building a CentOS Linux server?
What will you put on that server?
Ciao, luigi
-- / +--[Luigi Rosa]-- \
I used to wish the universe were fair. Then one day it hit me: What if the universe were fair? Then all the awful things that happen to us in life, would happen because we deserved them. So now I take great pleasure in the general hostility and unfairness of things. --Marcus Cole, "A Late Delivery from Avalon", Babylon 5
On 02/13/2011 10:26 AM, Michel Donais wrote:
Could somebody give me some advice on hardware for building a CentOS Linux server?
Look for vendors that specifically list Red Hat Enterprise Linux 5 as a supported operating system. Most major vendors should offer servers with this support.
If you are asking specs though, then you need to provide a lot more details on what the server will do.
What about looking in the archives? You're really not the first person to ask this.
Kai
On 02/13/2011 07:26 AM, Michel Donais wrote:
Could somebody give me some advice on hardware for building a CentOS Linux server?
Depends on what you intend to do with it.
If you are going to just run a firewall - the cheapest, slowest, smallest server you can buy will do fine. If you are going to support the newest 'Hot Thing(tm)' on the net and handle 500,000,000 web page views a day you are going to need heftier hardware.
You did the equivalent of asking 'can someone give me some advice on buying a truck?'
Before any possible answer can be given, the first question must be: What do you plan to do with it?
Before any possible answer can be given, the first question must be: What do you plan to do with it?
You're right, but I have to begin somewhere.
This hardware is intended to be a terminal server for at least 40 users, driven with LTSP, for BBx Pro-5 and BBj applications. It needs fast and large storage and 2 fast LAN connections. No intensive mail or web browsing; 1 or 2 outside (extranet) users; backup will be on an SLR-100 tape drive.
The load for 20 users over the last 6 years with RH9 has been handled by: motherboard: MSI KT3 Ultra DDR 333 (ATX form factor); CPU: AMD Athlon XP 2100+; memory: 2 x 512 MB DDR 333 for a total of 1024 MB; 1 Adaptec 29160 SCSI controller; 3 Seagate Cheetah Ultra320 SCSI LVD (68-pin) hard discs.
It has been enough until now, but the number of users is rising, so an upgrade in computing capacity will be useful.
I recently looked at an ASUS P5Q-VM (G45, PCIe, socket 775) motherboard with an Intel Core 2 Quad Q9550 (2.83 GHz / 1333 MHz FSB / 12 MB cache) and SATA hard discs, no RAID. It doesn't seem to be a server board and I'm not sure of that choice.
--- Michel Donais
2011/2/14 Michel Donais donais@telupton.com:
Before any possible answer can be given, the first question must be: What do you plan to do with it?
You're right, but I have to begin somewhere.
This hardware is intended to be a terminal server for at least 40 users, driven with LTSP, for BBx Pro-5 and BBj applications. It needs fast and large storage and 2 fast LAN connections. No intensive mail or web browsing; 1 or 2 outside (extranet) users; backup will be on an SLR-100 tape drive.
The load for 20 users over the last 6 years with RH9 has been handled by: motherboard: MSI KT3 Ultra DDR 333 (ATX form factor); CPU: AMD Athlon XP 2100+; memory: 2 x 512 MB DDR 333 for a total of 1024 MB; 1 Adaptec 29160 SCSI controller; 3 Seagate Cheetah Ultra320 SCSI LVD (68-pin) hard discs.
It has been enough until now, but the number of users is rising, so an upgrade in computing capacity will be useful.
I recently looked at an ASUS P5Q-VM (G45, PCIe, socket 775) motherboard with an Intel Core 2 Quad Q9550 (2.83 GHz / 1333 MHz FSB / 12 MB cache) and SATA hard discs, no RAID. It doesn't seem to be a server board and I'm not sure of that choice.
Maybe you just want to pick real server hardware? Pick one of the Dell RXX series servers.
-- Eero
On 02/13/11 2:42 PM, Michel Donais wrote:
I recently looked at an ASUS P5Q-VM (G45, PCIe, socket 775) motherboard with an Intel Core 2 Quad Q9550 (2.83 GHz / 1333 MHz FSB / 12 MB cache) and SATA hard discs, no RAID. It doesn't seem to be a server board and I'm not sure of that choice.
That's desktop hardware, with no ECC support. Anything running a business application for 50 users is probably mission-important or mission-critical, and undetected creeping bit errors due to lack of ECC would be, in my book, unacceptable.
I'd probably use an HP or Dell 1U or 2U server, with redundant power, ECC memory, and at least mirrored system drives.
undetected creeping bit errors due to lack of ECC would be, in my book, unacceptable.
Where can one find info or studies on this sort of thing? I use non-ECC RAM in several servers, and of course most people use it in their desktops.
Wouldn't bit errors result in crashes or data corruption? Or what would the results be?
On 2/13/11 7:55 PM, compdoc wrote:
undetected creeping bit errors due to lack of ECC would be, in my book, unacceptable.
Where can one find info or studies on this sort of thing? I use non-ECC RAM in several servers, and of course most people use it in their desktops.
Wouldn't bit errors result in crashes or data corruption? Or what would the results be?
It's very unpredictable. Since Linux tends to use all available RAM for disk buffers, the first thing is likely to be corruption in disk files. By the time you see crashes or anything visible, you may have a lot of invalid data.
On Sun, Feb 13, 2011 at 8:55 PM, compdoc compdoc@hotrodpc.com wrote:
undetected creeping bit errors due to lack of ECC would be, in my book, unacceptable.
Where can one find info or studies on this sort of thing? I use non-ECC RAM in several servers, and of course most people use it in their desktops.
Wouldn't bit errors result in crashes or data corruption? Or what would the results be?
ECC allows for single bit errors to be corrected and multiple bit errors to be noticed. All our servers run ECC memory. I've had memory go bad where the logs were showing correctable ECC errors. Since they were correctable, no data was corrupted and I was able to replace the bad memory. Had I been using regular memory, who knows what data could have been corrupted. Just as you use RAID to provide higher reliability for drives, you should use ECC memory. The cost difference is negligible.
Ryan
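For reference, the correctable-error counts Ryan describes are exposed by the kernel's EDAC subsystem, so you can check them without any vendor tools. A minimal sketch, assuming the edac-utils package is available and your memory controller has an EDAC driver (not every chipset does):

    # Install the userland reporting tool and dump per-controller counts:
    yum install edac-utils
    edac-util --report=full

    # The same counters are readable straight from sysfs:
    cat /sys/devices/system/edac/mc/mc0/ce_count   # corrected errors
    cat /sys/devices/system/edac/mc/mc0/ue_count   # uncorrected errors

A steadily climbing ce_count on one controller is exactly the "replace that DIMM before it gets worse" signal being discussed here.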
On Sun, 2011-02-13 at 19:21 -0700, compdoc wrote:
ECC allows for single bit errors to be corrected and multiple bit errors to be noticed.
I know what it is and I've used it in the past, but I just don't see many errors in desktop computers and servers that use non-ECC RAM.
I agree: ditto servers, VPSs, laptops, netbooks and desktops running C5.5.
On Mon, 14 Feb 2011, Always Learning wrote:
On Sun, 2011-02-13 at 19:21 -0700, compdoc wrote:
ECC allows for single bit errors to be corrected and multiple bit errors to be noticed.
I know what it is and I've used it in the past, but I just don't see many errors in desktop computers and servers that use non-ECC RAM.
I agree: ditto servers, VPSs, laptops, netbooks and desktops running C5.5.
I think this is the bit people really don't get with ECC.
You say you've not had any problems with lots of machines running without ECC.
I'd argue that you have to read that differently. You don't *think* you've had any problems caused by memory errors. Any unexplained crash *could* have been the result of a memory error. Any memory error *could* have changed what was written to disk on your machine. But you'd never notice, because you're not running ECC.
ECC's not all about preventing errors, it's about making it almost certain that you'll notice them when they happen. ECC will cause a machine with bad memory to crash and log an error where a machine without might have carried blissfully on while silently corrupting data, or behaving incorrectly.
jh
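As a footnote to jh's point about crashing rather than silently corrupting: on EDAC-capable kernels you can explicitly ask for a panic on uncorrectable errors. A rough sketch - module and parameter names vary by kernel generation, so treat these as examples rather than gospel:

    # Panic immediately on an uncorrectable ECC error instead of limping on:
    echo 1 > /sys/module/edac_core/parameters/edac_mc_panic_on_ue

    # On older (CentOS 5 era) kernels the knob is a module option instead,
    # e.g. in /etc/modprobe.conf:
    #   options edac_mc panic_on_ue=1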
On Sun, Feb 13, 2011 at 7:01 PM, John R Pierce pierce@hogranch.com wrote:
On 02/13/11 2:42 PM, Michel Donais wrote:
I recently looked at an ASUS P5Q-VM (G45, PCIe, socket 775) motherboard with an Intel Core 2 Quad Q9550 (2.83 GHz / 1333 MHz FSB / 12 MB cache) and SATA hard discs, no RAID. It doesn't seem to be a server board and I'm not sure of that choice.
That's desktop hardware, with no ECC support. Anything running a business application for 50 users is probably mission-important or mission-critical, and undetected creeping bit errors due to lack of ECC would be, in my book, unacceptable.
I'd probably use an HP or Dell 1U or 2U server, with redundant power, ECC memory, and at least mirrored system drives.
It's also possible to save the budget, buy *two* similarly powerful used systems with much lesser hardware specs, and have genuine failover instead of the shared vulnerability of one expensive server with high-availability components as you describe. I've done both, and encourage using less expensive hardware in pairs: that makes upgrading a lot cheaper and helps avoid the single points of failure of high end hardware. HP's older "Proliant Server Packs" and their ability to completely mishandle the Broadcom network drivers on RHEL and CentOS, in particular, come to mind.
On 02/13/11 7:06 PM, Nico Kadel-Garcia wrote:
It's also possible to save the budget, buy *two* similarly powerful used systems with much lesser hardware specs, and have genuine failover instead of the shared vulnerability of one expensive server with high-availability components as you describe. I've done both, and encourage using less expensive hardware in pairs: that makes upgrading a lot cheaper and helps avoid the single points of failure of high end hardware. HP's older "Proliant Server Packs" and their ability to completely mishandle the Broadcom network drivers on RHEL and CentOS, in particular, come to mind.
you still want ECC memory in a server... and redundant power in a 1U is really no big deal.
By doubling the hardware, you still do not overcome the potential corruption that could occur with non-ECC memory. If this is truly a mission-critical application then it really does not serve much of a purpose to shortchange yourself with substandard hardware.
-----Original Message----- From: John R Pierce Sent: Sunday, February 13, 2011 7:17 PM To: centos@centos.org Subject: Re: [CentOS] server specifications
On 02/13/11 7:06 PM, Nico Kadel-Garcia wrote:
It's also possible to save the budget, buy *two* similarly powerful used systems with much lesser hardware specs, and have genuine failover instead of the shared vulnerability of one expensive server with high-availability components as you describe. I've done both, and encourage using less expensive hardware in pairs: that makes upgrading a lot cheaper and helps avoid the single points of failure of high end hardware. HP's older "Proliant Server Packs" and their ability to completely mishandle the Broadcom network drivers on RHEL and CentOS, in particular, come to mind.
you still want ECC memory in a server... and redundant power in a 1U is really no big deal.
On Sun, Feb 13, 2011 at 10:23 PM, David Brian Chait dchait@invenda.com wrote:
By doubling the hardware, you still do not overcome the potential corruption that could occur with non-ECC memory. If this is truly a mission-critical application then it really does not serve much of a purpose to shortchange yourself with substandard hardware.
First, please don't top post in this group.
Second, you've got a historically valid point about ECC's advantages. But the accumulated costs of the higher end motherboard, memory, shortage of space for upgrades in the same unit, the downtime at the BIOS to reset the "disabled by default" ECC settings in the BIOS, and the system monitoring to detect and manage such errors add up *really fast* in a moderate sized shop.
Worse, I've seen some serious false economies with memory. People with tight budgets getting third party memory to install themselves, then losing all their "savings" in downtime because they had trouble telling the difference between "hard enough to seat the RAM" and "hard enough to crack the motherboard, cut your hand, and bleed all over important junctions".
Please, name a single instance in the last 10 years where ECC demonstrably saved you work, especially if you made sure to burn in the system components on servers upon their first bootup...
Nico Kadel-Garcia wrote:
On Sun, Feb 13, 2011 at 10:23 PM, David Brian Chait dchait@invenda.com wrote:
By doubling the hardware, you still do not overcome the potential corruption that could occur with non-ECC memory. If this is truly a mission-critical application then it really does not serve much of a purpose to shortchange yourself with substandard hardware.
First, please don't top post in this group.
Second, you've got a historically valid point about ECC's advantages. But the accumulated costs of the higher end motherboard, memory, shortage of space for upgrades in the same unit, the downtime at the BIOS to reset the "disabled by default" ECC settings in the BIOS, and the system monitoring to detect and manage such errors add up *really fast* in a moderate sized shop.
Worse, I've seen some serious false economies with memory. People with tight budgets getting third party memory to install themselves, then losing all their "savings" in downtime because they had trouble telling the difference between "hard enough to seat the RAM" and "hard enough to crack the motherboard, cut your hand, and bleed all over important junctions".
Please, name a single instance in the last 10 years where ECC demonstrably saved you work, especially if you made sure to burn in the system components on servers upon their first bootup...
Twice in the last two years my Intel server motherboard with ECC RAM showed errors (after moving the system physically), so I reseated the modules (after cleaning them) and all is now well. No data lost, complete confidence - definitely gets my vote for servers!!
On Mon, Feb 14, 2011 at 12:16 AM, Rob Kampen rkampen@kampensonline.com wrote:
Nico Kadel-Garcia wrote:
Please, name a single instance in the last 10 years where ECC demonstrably saved you work, especially if you made sure to burn in the system components on servers upon their first bootup...
Twice in the last two years my Intel server motherboard with ECC RAM showed errors (after moving the system physically), so I reseated the modules (after cleaning them) and all is now well. No data lost, complete confidence - definitely gets my vote for servers!!
Same system? Did you burn it in (running it under serious load with memory and CPU testing tools for a day or two after initial installation)? And given that you opened it up, I also assume you cleaned out accumulated dust and cleaned the filters.
On 2/14/2011 12:29 AM, Nico Kadel-Garcia wrote:
On Mon, Feb 14, 2011 at 12:16 AM, Rob Kampen rkampen@kampensonline.com wrote:
Nico Kadel-Garcia wrote:
Please, name a single instance in the last 10 years where ECC demonstrably saved you work, especially if you made sure to burn in the system components on servers upon their first bootup...
Twice in the last two years my Intel server motherboard with ECC RAM showed errors (after moving the system physically), so I reseated the modules (after cleaning them) and all is now well. No data lost, complete confidence - definitely gets my vote for servers!!
Same system? Did you burn it in (running it under serious load with memory and CPU testing tools for a day or two after initial installation)? And given that you opened it up, I also assume you cleaned out accumulated dust and cleaned the filters.
A burn-in only tests the RAM at burn-in time. Later, as parts wear (and electronic parts DO wear), bit errors can begin. There are three ways to handle this:
1. Spend maybe 5% more for ECC memory so bit errors are either fixed or alerted on automatically.
2. Save the 5% but lose more time burning in your system at regular intervals to make sure nothing is failing.
3. Do nothing. Save the 5% and go with the... it's-worked-before attitude.
When number 3 bites you in the arse, the costs of your penny-pinching laziness will be many orders of magnitude higher, due to file system corruption, backup corruption, etc. If the system is producing bit errors, those bit errors WILL show up in your backups. If the machine has been in service for years, the costs are even more drastic. Spend the 5% on ECC and number 3 won't bite you in the arse... unless you don't monitor your systems at all, in which case you are going to get hosed anyway. This is one time the 5% is worth the cost.
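If someone does choose option 2, the periodic re-testing can at least be partly automated from userland. A rough sketch using the memtester tool - note it can only exercise memory it can allocate and mlock(), so it is a supplement to, not a replacement for, an offline memtest86+ run (the admin address is a placeholder):

    # Test 1 GB for 5 passes from a cron job in a quiet window; run as root
    # and keep the size well below total RAM so the rest of the system
    # isn't forced into swap.
    memtester 1024M 5 || mail -s "memtester FAILED on $(hostname)" admin@example.com < /dev/null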
Nico Kadel-Garcia wrote:
On Mon, Feb 14, 2011 at 12:16 AM, Rob Kampen rkampen@kampensonline.com wrote:
Nico Kadel-Garcia wrote:
Please, name a single instance in the last 10 years where ECC demonstrably saved you work, especially if you made sure to burn in the system components on servers upon their first bootup...
Twice in the last two years my Intel server motherboard with ECC RAM showed errors (after moving the system physically), so I reseated the modules (after cleaning them) and all is now well. No data lost, complete confidence - definitely gets my vote for servers!!
Same system? Did you burn it in (running it under serious load with memory and CPU testing tools for a day or two after initial installation)? And given that you opened it up, I also assume you cleaned out accumulated dust and cleaned the filters.
This system was initially commissioned, after burn-in, in late 2004 - an Intel motherboard. It started with RH9, then went to FC3, then CentOS 5. As mentioned, the ECC memory has warned me when things are not well and allowed me to take remedial action before anything impacted my data. It still does great work six years later. For some reason, each time I have moved it we started getting these errors. It may be accumulated dust and dirt - so I always clean everything while it is down. Reseating the RAM after cleaning the contacts and blowing out the dust has always worked. So for me, getting a server-grade motherboard with ECC RAM is a great investment and worth the slight extra cost, not to mention that CentOS seems to have the drivers and modules in place for these boards.
On 2/14/2011 9:53 AM, Rob Kampen wrote:
This system was initially commissioned, after burn-in, in late 2004 - an Intel motherboard. It started with RH9, then went to FC3, then CentOS 5. As mentioned, the ECC memory has warned me when things are not well and allowed me to take remedial action before anything impacted my data. It still does great work six years later. For some reason, each time I have moved it we started getting these errors. It may be accumulated dust and dirt - so I always clean everything while it is down. Reseating the RAM after cleaning the contacts and blowing out the dust has always worked. So for me, getting a server-grade motherboard with ECC RAM is a great investment and worth the slight extra cost, not to mention that CentOS seems to have the drivers and modules in place for these boards.
I've seen that too, where moving a server would unseat RAM just enough to cause occasional crashes - and the crashes are better than undetected data/file errors. We mostly use IBM servers that have some diagnostic lights in the front to make the problem obvious.
On 2/14/2011 10:53 AM, Rob Kampen wrote:
Nico Kadel-Garcia wrote:
On Mon, Feb 14, 2011 at 12:16 AM, Rob Kampen rkampen@kampensonline.com wrote:
Nico Kadel-Garcia wrote:
Please, name a single instance in the last 10 years where ECC demonstrably saved you work, especially if you made sure to burn in the system components on servers upon their first bootup...
Twice in the last two years my Intel server motherboard with ECC RAM showed errors (after moving the system physically), so I reseated the modules (after cleaning them) and all is now well. No data lost, complete confidence - definitely gets my vote for servers!!
Same system? Did you burn it in (running it under serious load with memory and CPU testing tools for a day or two after initial installation)? And given that you opened it up, I also assume you cleaned out accumulated dust and cleaned the filters.
This system was initially commissioned, after burn-in, in late 2004 - an Intel motherboard. It started with RH9, then went to FC3, then CentOS 5. As mentioned, the ECC memory has warned me when things are not well and allowed me to take remedial action before anything impacted my data. It still does great work six years later. For some reason, each time I have moved it we started getting these errors. It may be accumulated dust and dirt - so I always clean everything while it is down. Reseating the RAM after cleaning the contacts and blowing out the dust has always worked. So for me, getting a server-grade motherboard with ECC RAM is a great investment and worth the slight extra cost, not to mention that CentOS seems to have the drivers and modules in place for these boards.
I'm not going to mention that I still have one old Compaq R3000 up and running. It is a 1998 model! It was up over 500 days at one point (when I finally decided a new kernel really did need to go live). It has run 24/7/365(6) since 1998. It started its life under RH5; now it's CentOS 3. It doesn't do anything really critical and is on my list to deactivate simply due to the electricity use. Yes, server class is important. I have since moved to Compaq/HP DL380s as the primary systems. Again, very much server class and worth every penny.
Also, if you don't need the latest and greatest, a lot of these units come off corporate lease after 1 to 3 years and show up on eBay. A great way to get one at a fantastic bargain. A unit that started its life as a $10K or so machine may be under $1K in 3 years. I've had fantastic service out of the ProLiant line, with the exception of the 1U units. HP makes the ProLiant line but also makes a lot of cheap home-use stuff. Fortunately, they so far seem to be following the Compaq goal of building tanks.
All of the 380s seem to come with RILO, or Remote Insight Lights-Out, which allows you to set up an alternate IP address into this separate card. From there, it is just as if you are on the local console, with even a bit more control. For instance, you can power the system down and then power it back up from your remote location. Very nice. Also redundant power supplies, cooling fans, and on and on it goes. Yes, the setup software is a bit odd; it programs the BIOS and RAID systems.
Anyway, it's an alternative method if you don't need hordes of horsepower but reliability is most important. As always, watch the rating of any seller. I've had good luck over the years.
John
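For what it's worth, the remote power-cycling John describes doesn't have to go through the web interface; later iLO and DRAC generations also speak IPMI, so a plain ipmitool session works too. A sketch, assuming an IPMI-2.0-capable management processor (hostname and user are made up for the example):

    # Query and cycle power out-of-band:
    ipmitool -I lanplus -H mgmt.example.com -U admin chassis power status
    ipmitool -I lanplus -H mgmt.example.com -U admin chassis power off
    ipmitool -I lanplus -H mgmt.example.com -U admin chassis power on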
Anyway, it's an alternative method if you don't need hordes of horsepower but reliability is most important. As always, watch the rating of any seller. I've had good luck over the years.
I really like Dell iDRAC remote management, and the really good Linux support on the hardware, like OMSA.
Fujitsu Siemens also works... but not as well :)
-- Eero
On Mon, 14 Feb 2011, Nico Kadel-Garcia wrote:
But the accumulated costs of the higher end motherboard, memory, shortage of space for upgrades in the same unit, the downtime at the BIOS to reset the "disabled by default" ECC settings in the BIOS, and the system monitoring to detect and manage such errors add up *really fast* in a moderate sized shop.
Really? Tweaking a BIOS setting is a silly argument; you'll typically find it's configured by default, and if you can't get the BIOS settings right when you set up, that's your own fault.
Buy a Dell server with ECC. Don't install any software at all. Come an ECC error, you'll have an orange blinky light immediately warning you of impending doom, and it'll even tell you on the front display the details of the fault, including which DIMM needs replacing. If you can be bothered to install OMSA (run one command, one yum install), it'll drop you an email when it fails.
Compare that with not running ECC: you wait until your machine randomly reboots. You ponder whether it's RAM/CPU/motherboard. You just ignore it. It does it again. You then have a fun game of running memtest while pulling DIMMs out to try to work out which of the 16 is causing the issue. Joy unbounded.
And what do you mean about a shortage of space for upgrades? What that has to do with ECC I have no idea.
Please, name a single instance in the last 10 years where ECC demonstrably saved you work, especially if you made sure to burn in the system components on servers upon their first bootup...
I've had plenty of HPC nodes that have warned of corrected memory errors. I've been able to drop them out of the queues, get the memory fixed, and put them back into service without anyone noticing. Without ECC, I've potentially introduced errors into their results, and you're much more likely to get the first random reboot without warning, costing them time. I've had memory errors creep in after 4 years, it's not something that always bites at the beginning.
Equally I've had file servers do the same. Running a file server without ECC is a recipe for disaster, as you're risking silent data corruption.
jh
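For the curious, the "one yum install" for OMSA that jh mentions looks roughly like the following - repo URL and package names are from memory and may differ between OMSA releases:

    # Set up Dell's hardware repo, install OMSA, start its services:
    wget -q -O - http://linux.dell.com/repo/hardware/latest/bootstrap.cgi | bash
    yum install srvadmin-all
    srvadmin-services.sh start

    # After that, per-DIMM memory health (including ECC events) is one command:
    omreport chassis memory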
On Mon, Feb 14, 2011 at 4:49 AM, John Hodrien J.H.Hodrien@leeds.ac.uk wrote:
On Mon, 14 Feb 2011, Nico Kadel-Garcia wrote:
But the accumulated costs of the higher end motherboard, memory, shortage of space for upgrades in the same unit, the downtime at the BIOS to reset the "disabled by default" ECC settings in the BIOS, and the system monitoring to detect and manage such errors add up *really fast* in a moderate sized shop.
Really? Tweaking a BIOS setting is a silly argument; you'll typically find it's configured by default, and if you can't get the BIOS settings right when you set up, that's your own fault.
Trust me, it's a pain in the keister in production. If the standard is now enabled, good: I haven't had my hands inside a server in a year, I admit it. (My current role doesn't call for it.) It *didn't* use to be standard. Are you sure it is? I'm still seeing notes that the motherboards that support it are still significantly more expensive, "server grade". Unfortunately, I've worked for a manufacturer that repackaged consumer-grade components for cheap pizza-box servers, and we had some disagreements about where they cut corners.
It's very awkward to preserve BIOS settings across BIOS updates (read: impossible without a manual checklist) unless your environment is so sophisticated you're using LinuxBIOS. Unless you've *really* invested in remote KVM boxes or in Dell's DRAC or HP's remote console tools, *and set them up correctly at install time, and kept their network setups up to date*, BIOS changes are a nightmare to do remotely without someone putting hands and eyes on the server. And the remote tools are *awful* at giving you BIOS access, often because the changes in screen resolution for different parts of the boot process confuse the remote console tools, at least if you use the standard VGA-like access - because you haven't set up serial console access, because that *often requires someone to enable it from the BIOS*, which leads to a serious circular dependency.
Now scale that by a stack of slightly different models of servers with different interfaces for their BIOS management, and you have a mess to manage. I *LOVE* environments where the admins have been able to insist on, or install, LinuxBIOS, because this is *solved* there. You can get at it from Linux userland as necessary, the machines reboot *much* faster, and you can download and back up the configurations for system reporting. It's my friend.
Buy a Dell server with ECC. Don't install any software at all. Come an ECC error, you'll have an orange blinky light immediately warning you of impending doom, and it'll even tell you on the front display the details of the fault, including which DIMM needs replacing. If you can be bothered to install OMSA (run one command, one yum install), it'll drop you an email when it fails.
Dells are solid, server-class machines. I've seen HP oversold with a lot of promises about management tools that don't work that well, for tasks better integrated and managed by userland tools that *have to be done anyway*, and sold with a lot of genuinely unnecessary features. (Whose bright idea was it to switch servers to laptop hard drives? E-e-e-e-e-w-w-w-w-w!!!)
Compare that with not running ECC: you wait until your machine randomly reboots. You ponder whether it's RAM/CPU/motherboard. You just ignore it. It does it again. You then have a fun game of running memtest while pulling DIMMs out to try to work out which of the 16 is causing the issue. Joy unbounded.
ECC has a point, which I've acknowledged. But the overall "server class" hardware costs add up fast. SAS hard drives, 10Gig ethernet ports, dual power supplies, built-in remote KVM, expensive racking hardware, 15,000 RPM drives instead of 10,000 RPM, SAS instead of SATA, etc. all start adding up really fast when all you need is a so-called "pizza box".
This is one reason I've gotten fond of virtualization. (VMWare or VirtualBox for CentOS 5, we'll see about KVM for RHEL and CentOS 6). Amortizing the costs of a stack of modest servers with such server class features across one central, overpowered server and doling out environments as necessary is very efficient and avoids a lot of the hardware management problems.
And what do you mean about a shortage of space for upgrades? What that has to do with ECC I have no idea.
It's the overall "enterprise class hardware" meme that I'm concerned about for a one-off CentOS grade server.
Please, name a single instance in the last 10 years where ECC demonstrably saved you work, especially if you made sure to burn in the system components on servers upon their first bootup...
I've had plenty of HPC nodes that have warned of corrected memory errors. I've been able to drop them out of the queues, get the memory fixed, and put them back into service without anyone noticing. Without ECC, I've potentially introduced errors into their results, and you're much more likely to get the first random reboot without warning, costing them time. I've had memory errors creep in after 4 years, it's not something that always bites at the beginning.
Are you sure it was fixed by memory replacement? Because I've seen most of my ECC reports as one-offs, never to recur again.
Equally I've had file servers do the same. Running a file server without ECC is a recipe for disaster, as you're risking silent data corruption.
Core file servers, I'd agree, although a lot of the more common problems (such as single very expensive fileserver failure and lack of user available snapshots) are ameliorated by other approaches. (Multiple cheap SATA external hard drives for snapshot backups, NFS access so the users can recover personally deleted files, single points of failure in upstream connectivity, etc.)
On Mon, 14 Feb 2011, Nico Kadel-Garcia wrote:
Trust me, it's a pain in the keister in production. If the standard is now enabled, good: I haven't had my hands inside a server in a year, I admit it. (My current role doesn't call for it.) It *didn't* use to be standard. Are you sure it is?
I buy whole machines not bits, and it's all preconfigured. I can't speak for the defaults in random motherboards.
I'm still seeing notes that the motherboards that support it are still significantly more expensive, "server grade". Unfortunately, I've worked for a manufacturer that repackaged consumer-grade components for cheap pizza-box servers, and we had some disagreements about where they cut corners.
There's a difference between high-quality motherboards and motherboards advertised as high quality. But yes, you'll pay a bit more for ECC than not, but then I'll be paying more for dual PSUs and IPMI as well. But since I then don't need an IP-KVM or a controllable PDU, it's worth the relatively small amount it costs.
It's very awkward to preserve BIOS settings across BIOS updates (read: impossible without a manual checklist) unless your environment is so sophisticated you're using LinuxBIOS.
Dell BIOS updates do not affect the settings, so it's quite easy.
Unless you've *really* invested in remote KVM boxes or in Dell's DRAC or HP's remote console tools, *and set them up correctly at install time, and kept their network setups up to date*, BIOS changes are a nightmare to do remotely without someone putting hands and eyes on the server. And the remote tools are *awful* at giving you BIOS access, often because the changes in screen resolution for different parts of the boot process confuse the remote console tools, at least if you use the standard VGA-like access - because you haven't set up serial console access, because that *often requires someone to enable it from the BIOS*, which leads to a serious circular dependency.
Speaking for Dell here:
Generally speaking, get a machine that supports IPMI. A remote Serial-over-LAN session can be initiated quite nicely for editing BIOS settings if you need human-driven remote BIOS tweaking - same as if you were stood at it. If you have a Dell, syscfg lets you edit a large number of the BIOS settings from within Linux, with an interface that doesn't vary between models. Also useful when you get a replacement motherboard / new machine, as you can script it. That's all done through SMBIOS as far as I know.
All the IPMI stuff is configurable either through IPMITool or OMSA. Through OMSA it's identical across at least the last 3 generations of servers, and nigh on identical through IPMITool.
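To make the Serial-over-LAN point concrete: with SOL enabled on the BMC, reaching the BIOS remotely is a single command. Again a sketch with placeholder host/user:

    # Attach to the redirected serial console over the network:
    ipmitool -I lanplus -H bmc.example.com -U root sol activate
    # ...reboot the box and hit the BIOS setup key on this console.
    # To detach from the session, type: ~.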
Now scale that by a stack of slightly different models of servers with different interfaces for their BIOS management, and you have a mess to manage. I *LOVE* environments where the admins have been able to insist on, or install, LinuxBIOS, because this is *solved* there. You can get at it from Linux userland as necessary, the machines reboot *much* faster, and you can download and back up the configurations for system reporting. It's my friend.
Standardisation is great, so yes, I'd love something like LinuxBIOS across the board. But without something like this, it's still something you can cope with.
Dells are solid, server-class machines. I've seen HP oversold with a lot of promises about management tools that don't work that well, for tasks better integrated and managed by userland tools that *have to be done anyway*, and sold with a lot of genuinely unnecessary features. (Whose bright idea was it to switch servers to laptop hard drives? E-e-e-e-e-w-w-w-w-w!!!)
I hope this isn't a general dig at 2.5" disks?
ECC has a point, which I've acknowledged. But the overall "server class" hardware costs add up fast. SAS hard drives, 10Gig ethernet ports, dual power supplies, built-in remote KVM, expensive racking hardware, 15,000 RPM drives instead of 10,000 RPM, SAS instead of SATA, etc. all start adding up really fast when all you need is a so-called "pizza box".
But you *are* adding on lots of extras there that don't come pre-bundled with ECC. Hey, my *desktop* has ECC memory...
This is one reason I've gotten fond of virtualization. (VMWare or VirtualBox for CentOS 5, we'll see about KVM for RHEL and CentOS 6). Amortizing the costs of a stack of modest servers with such server class features across one central, overpowered server and doling out environments as necessary is very efficient and avoids a lot of the hardware management problems.
Sure.
It's the overall "enterprise class hardware" meme that I'm concerned about for a one-off CentOS grade server.
CentOS grade?
Are you sure it was fixed by memory replacement? Because I've seen most of my ECC reports as one-offs, never to recur again.
Yes. Reset the counters, retripped the warning. Moved the DIMM, problem followed the DIMM. Replaced the DIMM, all well again.
Equally I've had file servers do the same. Running a file server without ECC is a recipe for disaster, as you're risking silent data corruption.
Core file servers, I'd agree, although a lot of the more common problems (such as single very expensive fileserver failure and lack of user available snapshots) are ameliorated by other approaches. (Multiple cheap SATA external hard drives for snapshot backups, NFS access so the users can recover personally deleted files, single points of failure in upstream connectivity, etc.)
Yes there are other requirements other than just sound hardware, but that doesn't mean sound hardware isn't a good starting point.
jh
LinuxBIOS is now coreboot: http://www.coreboot.org/