-----Original Message----- From: Jason Pyeron Sent: Sunday, August 31, 2014 18:16
-----Original Message----- From: centos-bounces@centos.org [mailto:centos-bounces@centos.org] On Behalf Of John R Pierce Sent: Sunday, August 31, 2014 17:34 To: centos@centos.org Subject: Re: [CentOS] Install Centos 6 x86_64 on Dell PowerEdge 2970 and a SSD (hardware probing issues)
On 8/31/2014 2:03 PM, Jason Pyeron wrote:
Yes. They support internal SATA drives, we are changing from spinning drives to SSD. I am working with Dell to get a BIOS patch, but I won't hold my breath.
is the SATA interface in AHCI mode or legacy IDE emulation?
Good question, I will ask Dell. The BIOS only has Off and Auto as choices. Is there a preference I should shoot for?
So the Dell tech says it only supports ATA (IDE) mode. [Sorry for the accidental forward]
Now I have to find an alternative to supporting a SSD boot device on a SATA port in IDE (ATA) mode.
-Jason
-- -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- - - - Jason Pyeron PD Inc. http://www.pdinc.us - - Principal Consultant 10 West 24th Street #100 - - +1 (443) 269-1555 x333 Baltimore, Maryland 21218 - - - -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- This message is copyright PD Inc, subject to license 20080407P00.
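The AHCI-versus-IDE question above can also be checked from a running Linux box without trusting the BIOS menu: the controller's PCI class code says which mode it came up in. A minimal sketch, assuming standard class codes (0101 = legacy IDE, 0106 = AHCI); the sample lspci line is illustrative, not captured from an actual 2970:

```shell
#!/bin/sh
# Classify a SATA controller's operating mode from `lspci -nn` output.
# PCI class [0101] = legacy IDE emulation (ata_piix/pata_* drivers),
# PCI class [0106] = AHCI (ahci driver). The parsing is factored out
# so the logic can be exercised on captured output, offline.
classify_sata_mode() {
    # reads `lspci -nn` lines on stdin, prints one mode per controller
    while IFS= read -r line; do
        case "$line" in
            *"[0106]"*) echo "AHCI" ;;
            *"[0101]"*) echo "IDE" ;;
        esac
    done
}

# Live use:  lspci -nn | classify_sata_mode
# Sample line (illustrative) from a board running in legacy mode:
echo "00:1f.2 IDE interface [0101]: Intel Corporation 82801 IDE [8086:2850]" \
    | classify_sata_mode
```

If this prints IDE while the BIOS claims "Auto", the board really is handing the OS a legacy interface.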
Jason Pyeron wrote:
So the Dell tech says it only supports ATA (IDE) mode. Now I have to find an alternative to supporting a SSD boot device on a SATA port in IDE (ATA) mode.
Ok, I see - it's an old 2970 - I see the manuals on Dell's site were last revised in 2011. We got rid of all our 2950's (except for one, I think, or two, and they're another team's). IIRC, they did have a choice of AHCI or RAID, and I think there may have been one other option. Unless this is *really* old, I can't imagine that they actually have a physical IDE or EIDE interface, so there should be some way around this.
mark
-----Original Message----- From: m.roth@5-cent.us Sent: Friday, September 05, 2014 14:50
Jason Pyeron wrote:
Now I have to find an alternative to supporting a SSD boot device on a SATA port in IDE (ATA) mode.
Ok, I see - it's an old 2970 - I see the manuals on Dell's site were last revised in 2011. We got rid of all our 2950's (except for one, I think, or two, and they're another team's). IIRC, they did have a choice of AHCI or RAID, and I think there may have been one other option. Unless this is
I think that is on the PERC controller. The onboard SATA A/B ports are the issue.
*really* old, I can't imagine that they actually have a physical IDE or EIDE interface, so there should be some way around this.
We have some with 40 pin IDE, but I am ignoring them.
Both IDE and SATA motherboards have the same BIOS version!?!?!
-Jason
Jason Pyeron wrote:
Yes. They support internal SATA drives, we are changing from spinning drives to SSD. I am working with Dell to get a BIOS patch, but I won't hold my breath.
Dumb question: these machines are getting very long in the tooth, but you're putting SSD's in them? New, or newer machines, would solve a lot of problems....
I think that is on the PERC controller. The onboard SATA A/B ports are the issue.
Nope. That's the kind of stuff that's only in the BIOS - it's certainly not on a PERC. <snip>
We have some with 40 pin IDE, but I am ignoring them.
And to that I have one response: MTBF. You need to talk to management about spending some money....
Both IDE and SATA motherboards have the same BIOS version!?!?!
Presumably from when the switchover was happening.
Hmmmm... have you spoken to Dell, or looked on their website, for a firmware update for the BIOS?
mark
-----Original Message----- From: m.roth@5-cent.us Sent: Friday, September 05, 2014 15:19 To: CentOS mailing list
Jason Pyeron wrote:
Yes. They support internal SATA drives, we are changing from spinning drives to SSD. I am working with Dell to get a BIOS patch, but I won't hold my breath.
Dumb question: these machines are getting very long in the tooth, but you're putting SSD's in them? New, or newer machines, would
32GB SSD for the boot device, not on the raid arrays.
solve a lot of problems....
Their warranties are good for another few years... And the money is not :)
I think that is on the PERC controller. The onboard SATA A/B ports are the issue.
Nope. That's the kind of stuff that's only in the BIOS - it's certainly not on a PERC.
Will go over it again with a fine tooth comb.
<snip> We have some with 40 pin IDE, but I am ignoring them.
And to that I have one response: MTBF. You need to talk to management about spending some money....
Step 1. Make more money. Step 2. Replace 40 of them when the support contract expires.
Both IDE and SATA motherboards have the same BIOS version!?!?!
Presumably from when the switchover was happening.
Hmmmm... have you spoken to Dell, or looked on their website, for a firmware update for the BIOS?
Running the latest BIOS.
mark
CentOS mailing list CentOS@centos.org http://lists.centos.org/mailman/listinfo/centos
Jason Pyeron wrote:
<snip>
Their warranties are good for another few years... And the money is not :)
<snip> Warning: danger, Will Robinson.
One of the main things that pushed us to surplus ours was interesting: inside of a month, 4? 5? more? of them had the PERC fail, fatally. Amazing quality control (and they were in about three different rooms, including one or more in the datacenter, so it wasn't the environment).
Refurbed machines are also an option....
mark
Hey, just coming into this conversation. Here is an idea: why not install a SATA card into the machine, one that supports AHCI? I'm guessing there is a free PCI or PCI-E slot.
They are made; here is a link I found quickly with a Google search. Bang for buck, it could be the cheapest option.
http://www.lycom.com.tw/PE-126.htm
http://www.lycom.com.tw/PE-125.htm (better card)
It could save a bunch of headaches.
On your 2970, which series are you running, II or III?
John
By the bye, about firmware updates: I like Dell's the best of all. HP, run it from some kind of DOS, and hope. Dell, you can do from a running CentOS system (I've done it a few times), and unlike everyone else's firmware updates, it says, "collecting information", then *tells* you that a) this update is, in fact, for this hardware (and so won't brick it), and b) whether it's newer than what's installed.
mark
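The "is it newer than what's installed" check mark describes can also be approximated by hand before running any vendor updater, by reading the installed BIOS version out of the DMI tables. A minimal sketch; `dmidecode` needs root, and the 4.2.1 version string below is an assumed sample, not necessarily what a 2970 reports:

```shell
#!/bin/sh
# Extract the installed BIOS version, to compare against a candidate
# update before flashing. The parsing is split out so it can be
# exercised on captured `dmidecode -t bios` output.
bios_version() {
    # reads `dmidecode -t bios` output on stdin, prints the Version field
    awk -F': *' '/Version:/ { print $2; exit }'
}

# Live use (as root):  dmidecode -t bios | bios_version
# Sample captured output (version string is illustrative):
printf 'BIOS Information\nVendor: Dell Inc.\nVersion: 4.2.1\n' | bios_version
```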
On Fri, September 5, 2014 2:20 pm, m.roth@5-cent.us wrote:
By the bye, about firmware updates: I like Dell's the best of all. HP, run it from some kind of DOS, and hope. Dell, you can do from a running CentOS system (I've done it a few times), and unlike everyone else's firmware updates, it says, "collecting information", then *tells* you that a) this update is, in fact, for this hardware (and so won't brick it), and b) whether it's newer than what's installed.
I was always fascinated: why are [some] people dying to upgrade firmware? It doesn't matter whether by firmware you mean system board BIOS, or firmware of some card. Why take the chance of having your machine hosed? If the current firmware version is crap, then you shouldn't buy any hardware from this manufacturer in the first place. If the current version is OK, why bother re-flashing and take the chance of killing the [whatever] board? Beats me. The only time I felt it was justified was when new firmware [of a 3ware RAID adapter] added support for hard drives above 2TB capacity.
Can anybody offer an argument that can change my mind?
Valeri
PS of course, you can be that rich and have 3 redundant machines for everything, so you wouldn't care about each particular one ;-)
mark
++++++++++++++++++++++++++++++++++++++++ Valeri Galtsev Sr System Administrator Department of Astronomy and Astrophysics Kavli Institute for Cosmological Physics University of Chicago Phone: 773-702-4247 ++++++++++++++++++++++++++++++++++++++++
On Sat, Sep 6, 2014 at 9:34 AM, Valeri Galtsev galtsev@kicp.uchicago.edu wrote:
I was always fascinated: why are [some] people dying to upgrade firmware? It doesn't matter whether by firmware you mean system board BIOS, or firmware of some card. Why take the chance of having your machine hosed?
Because BIOS updates often fix corner case issues/bugs. The BIOS release notes for this PowerEdge 2970 server: http://downloads.dell.com/bios/PE2970-040201BIOS.txt includes: * Fixed intermittent SATA Drive B not found error.
The likelihood of a BIOS upgrade going bad, if due diligence is done to verify that the upgrade is for that hardware, is practically zero.
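One quick way to apply that due diligence yourself is to pull the release notes and grep for the symptom you are chasing. A sketch using the URL posted above; the network fetch is left commented out, and the grep runs on the excerpt quoted in the message so the check is reproducible offline:

```shell
#!/bin/sh
# Scan a BIOS release-notes file for a specific symptom before deciding
# an update is worth the flash risk.
# Live fetch (URL from the message above):
#   curl -s http://downloads.dell.com/bios/PE2970-040201BIOS.txt > notes.txt
cat > notes.txt <<'EOF'
* Fixed intermittent SATA Drive B not found error.
EOF
grep -i 'sata' notes.txt
```

A hit like the line above is a concrete reason to flash; no hits means the update buys you nothing for this problem.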
On Sat, September 6, 2014 9:21 am, Steven Tardy wrote:
On Sat, Sep 6, 2014 at 9:34 AM, Valeri Galtsev galtsev@kicp.uchicago.edu wrote:
I was always fascinated: why are [some] people dying to upgrade firmware? It doesn't matter whether by firmware you mean system board BIOS, or firmware of some card. Why take the chance of having your machine hosed?
Because BIOS updates often fix corner case issues/bugs. The BIOS release notes for this PowerEdge 2970 server: http://downloads.dell.com/bios/PE2970-040201BIOS.txt includes:
- Fixed intermittent SATA Drive B not found error.
But that is exactly what I said: if the hardware was released and sold with this piece of crap BIOS, then you shouldn't be buying that junk in the first place. Or at least stop buying the crap made by _this_ manufacturer in the future. I'm still not convinced. Any better reasons?
Valeri
The likelihood of a BIOS upgrade going bad, if due diligence is done to verify that the upgrade is for that hardware, is practically zero.
On 9/6/2014 7:46 AM, Valeri Galtsev wrote:
But that is exactly what I said: if the hardware was released and sold with this piece of crap BIOS, then you shouldn't be buying that junk in the first place. Or at least stop buying the crap made by _this_ manufacturer in the future. I'm still not convinced. Any better reasons?
with that approach, you'd quickly find yourself with zero vendors left.
On Sat, September 6, 2014 10:07 am, John R Pierce wrote:
On 9/6/2014 7:46 AM, Valeri Galtsev wrote:
But that is exactly what I said: if the hardware was released and sold with this piece of crap BIOS, then you shouldn't be buying that junk in the first place. Or at least stop buying the crap made by _this_ manufacturer in the future. I'm still not convinced. Any better reasons?
with that approach, you'd quickly find yourself with zero vendors left.
No, I'm still buying Dell desktops. I gave up on their rackmount boxes (with Dell you don't have flexibility of choice of your preferred, say, RAID cards: step left, step right and you are shot ;-). I get rackmount ones assembled by a small company (companies) at about 1/2 the cost of similar hardware from Dell. Those are for the most part based on Tyan barebones. And during at least the last decade I never had a "must" situation to flash a newer BIOS on any of those boxes. If I ever flashed a new BIOS, it was only once, before I put the box in production.
Of course, I flashed the BIOS of my laptop (after unsoldering the EPROM, dumping the original BIOS content, editing the darn thing with a hex editor, flashing it on a new EPROM chip, and then sticking it into a socket I soldered to the laptop system board in place of the EPROM chip), but that is different. The world will not stop spinning if my laptop stays dead for a couple of days, or weeks, or forever. Any of the servers - it's way different.
Valeri
-- john r pierce 37N 122W somewhere on the middle of the left coast
On 2014-09-06, Valeri Galtsev galtsev@kicp.uchicago.edu wrote:
I get rackmount ones assembled by a small company (companies) at about 1/2 the cost of similar hardware from Dell. Those are for the most part based on Tyan barebones. And during at least the last decade I never had a "must" situation to flash a newer BIOS on any of those boxes.
You have been lucky, then. I agree that flashing the firmware should be a rare event, but expecting the rate to be exactly 0 is an unreasonable expectation.
I have had to flash a BIOS once, and a BMC once, in about 10 years of buying server hardware. (Yes, flashing a BMC probably wouldn't brick a box, but it'd brick getting a remote console, which for me is almost as serious.) I consider that an acceptable bug rate.
Flashing a RAID controller is actually more frightening to me--flashing the BIOS isn't likely to hose your data, but an undetected bad flash on a RAID controller could. Sadly my flash rate of my controllers is slightly higher than my BIOS flash rate.
--keith
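On the BMC point above: a cheap sanity check after a BMC flash, before you actually need the remote console, is to ask the BMC for its firmware revision. A sketch with the parsing split out so it can be run against captured output; `ipmitool mc info` assumes the ipmi drivers are loaded (or `-H`/`-U`/`-P` for LAN access), and the sample revision number is illustrative:

```shell
#!/bin/sh
# Read the BMC firmware revision back after a flash, to confirm the
# controller still answers and is running the expected version.
bmc_fw_rev() {
    # reads `ipmitool mc info` output on stdin, prints Firmware Revision
    awk -F': *' '/Firmware Revision/ { print $2; exit }'
}

# Live use:  ipmitool mc info | bmc_fw_rev
# Sample captured output (revision is illustrative):
printf 'Device ID : 32\nFirmware Revision : 1.33\n' | bmc_fw_rev
```

If the command hangs or the revision is wrong, you find out now, not during the next outage.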
On Sat, September 6, 2014 2:16 pm, Keith Keller wrote:
On 2014-09-06, Valeri Galtsev galtsev@kicp.uchicago.edu wrote:
I get rackmount ones assembled by small company (companies) and about 1/2 of cost of similar hardware from Dell. Those are for the most part based on Tyan barebones. And during last at least decade I never had a "must to" flash newer BIOS situation with any of those boxes.
You have been lucky, then.
... I've mentioned manufacturers in another reply: tyan, lsi, 3ware, ati...
I agree that flashing the firmware should be a rare event, but expecting the rate to be exactly 0 is an unreasonable expectation.
I have had to flash a BIOS once, and a BMC once, in about 10 years of buying server hardware.
Great, I'm happy to be on the same page with you!
(Yes, flashing a BMC probably wouldn't brick a box, but it'd brick getting a remote console, which for me is almost as serious.) I consider that an acceptable bug rate.
Flashing a RAID controller is actually more frightening
I meant a RAID controller off the shelf, before you start using it on the new box you are building with newly released 3TB drives... I'm still on the same page with you ;-)
Valeri
to me--flashing the BIOS isn't likely to hose your data, but an undetected bad flash on a RAID controller could. Sadly my flash rate of my controllers is slightly higher than my BIOS flash rate.
--keith
-- kkeller@wombat.san-francisco.ca.us
On 9/6/2014 1:53 PM, Valeri Galtsev wrote:
... I've mentioned manufacturers in another reply: tyan, lsi, 3ware, ati...
A few months ago, I had to flash the firmware on an LSI 2008, aka 9211-8i, because I needed the card in "IT" (Initiator Target) mode rather than "IR" (Integrated RAID), and this requires different firmware AND card BIOS. This was surprisingly difficult to accomplish, as the system had a UEFI BIOS, and on that the LSI Logic firmware flasher wouldn't operate in MS-DOS; I had to discover and utilize this bizarro-world known as the UEFI Shell to run the firmware flash utility.
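For reference, the IR-to-IT crossflash described above usually boils down to a short sas2flash sequence. This sketch only prints the commands (a dry run): the file names (2118it.bin, mptsas2.rom) and flags are typical for a 9211-8i but are assumptions here; verify them against the vendor's firmware package documentation before flashing anything, and run them from the UEFI Shell (sas2flash.efi) when the board refuses DOS:

```shell
#!/bin/sh
# Dry-run builder for a typical 9211-8i IR->IT crossflash sequence.
# It prints the commands rather than executing them, since the real
# thing must be run from the UEFI Shell against live hardware.
build_flash_cmds() {
    fw="$1"     # IT-mode firmware image (name assumed)
    bios="$2"   # card BIOS / option ROM (name assumed)
    echo "sas2flash.efi -listall"            # confirm the adapter is seen
    echo "sas2flash.efi -o -e 6"             # erase existing firmware (typical crossflash step)
    echo "sas2flash.efi -o -f $fw -b $bios"  # write IT firmware + card BIOS
}

build_flash_cmds 2118it.bin mptsas2.rom
```

Printing the plan first and pasting it step by step into the UEFI Shell keeps a typo from becoming a bricked card.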
On Sat, September 6, 2014 4:52 pm, John R Pierce wrote:
<snip>
That doesn't mean that you have to flash firmware onto an LSI controller every so often after you placed the controller into production, because the original version of the firmware is crap, and the updated version will turn out to be crap several months after its release, and so on. You had a nice thing before you flashed, which alas was different hardware from what you needed. You flashed a different version to modify the hardware. And after that the hardware was exactly what you needed it to be. And from this point on you don't need to flash it, unless you decide to change its functions back to what they were with the original firmware.
This whole thing is way different from what I originally was displeased with (i.e. the "necessity" to apply updates to firmware to fix the thing that appears to be broken with an older, crappy version of the firmware).
So: LSI still is in my list of great hardware manufacturers. (Even though my favorite is 3ware; I forgot to mention one other good one: Areca, whose place is after LSI in my book). And I don't care how hard it is to flash an LSI card (which you had to do _before_ you placed it into production). In the worst case scenario you could hire someone to do it for you. After doing it yourself you can become extremely proud of yourself: now you know that you are worthy of your salary. But certainly you knew it before that ;-)
Valeri
On 9/6/2014 4:02 PM, Valeri Galtsev wrote:
That doesn't mean that you have to flash firmware onto an LSI controller every so often after you placed the controller into production, because the original version of the firmware is crap, and the updated version will turn out to be crap several months after its release, and so on. You had a nice thing before you flashed, which alas was different hardware from what you needed. You flashed a different version to modify the hardware. And after that the hardware was exactly what you needed it to be. And from this point on you don't need to flash it, unless you decide to change its functions back to what they were with the original firmware.
for some unfathomable reason, IT (initiator-target) internal SAS cards are nearly unobtainium. The external ones cost stupid money, at least as expensive as high-end SAS RAID cards, and I really don't understand it.
ok, I do understand it... MS Windows prefers using hardware RAID, since the built-in storage management is dreadful.
On 2014-09-06, Valeri Galtsev galtsev@kicp.uchicago.edu wrote:
... I've mentioned manufacturers in another reply: tyan, lsi, 3ware, ati...
Even 3ware has had buggy firmware. I once had to flash a 3ware card years into production, because it was not until then that this particular bug was exposed by my configuration.
--keith
On Sun, September 7, 2014 1:35 am, Keith Keller wrote:
On 2014-09-06, Valeri Galtsev galtsev@kicp.uchicago.edu wrote:
... I've mentioned manufacturers in another reply: tyan, lsi, 3ware, ati...
Even 3ware has had buggy firmwares. I once had to flash a 3ware card years into production because it was not until then that this particular bug was exposed by my configuration.
It doesn't sound like you are flashing all the 3ware cards you have in production every time a new firmware release is out. It doesn't sound either like you had a fatal failure of a production box because of a bug in 3ware firmware. Correct me if I'm wrong; otherwise I see you on the same page with me: i.e. not flashing new firmware as a part of the "routine update" of a production machine (together with system/software updates).
Valeri
--keith
-- kkeller@wombat.san-francisco.ca.us
On 2014-09-07, Valeri Galtsev galtsev@kicp.uchicago.edu wrote:
It doesn't sound like you are flashing all the 3ware cards you have in production every time a new firmware release is out. It doesn't sound either like you had a fatal failure of a production box because of a bug in 3ware firmware. Correct me if I'm wrong; otherwise I see you on the same page with me: i.e. not flashing new firmware as a part of the "routine update" of a production machine (together with system/software updates).
Well, I think we are on the same page now. I think I (and some other folks) interpreted your posts as "if you have to flash the firmware, it was a crappy firmware, and you should switch vendors" which (as someone else noted) would soon leave you with no vendors.
To summarize, I think our page says "update the firmware only when necessary on production-level hardware".
FWIW, I did have a different 3ware card eat its array, though I do suspect some user (i.e., me) error. I had a 9650 card which was having problems with kernel panics. I suspected a hardware failure, so I moved the array to another 9650 in the same box, which may not have had a BBU. Unfortunately that card showed worse problems a few weeks later: not only did it kernel panic, but it also trashed the array pretty much completely. (Of course I had backups, and this was a dev box, not public-facing, but it was still frustrating.) At the time the 9650 was old enough that the 9750 series was out, and that card has been fairly solid. (Also FWIW, my last 9650 card had the same issue a few weeks ago; fortunately it did not eat its array.)
So to add a page to our book, "always have backups even if you trust your hardware!" :)
--keith
On Sun, September 7, 2014 1:04 pm, Keith Keller wrote:
On 2014-09-07, Valeri Galtsev galtsev@kicp.uchicago.edu wrote:
<snip>
Well, I think we are on the same page now. I think I (and some other folks) interpreted your posts as "if you have to flash the firmware, it was a crappy firmware, and you should switch vendors" which (as someone else noted) would soon leave you with no vendors.
Great... and my fault, I'm often a bit extreme in expressions ;-(
To summarize, I think our page says "update the firmware only when necessary on production-level hardware".
Yes. Of which during the last one and a half decades I had none.
FWIW, I did have a different 3ware card eat its array, though I do suspect some user (i.e., me) error. I had a 9650 card which was having problems with kernel panics. I suspected a hardware failure, so I moved the array to another 9650 in the same box, which may not have had a BBU. Unfortunately that card showed worse problems a few weeks later: not only did it kernel panic, but it also trashed the array pretty much completely. (Of course I had backups, and this was a dev box, not public-facing, but it was still frustrating.) At the time the 9650 was old enough that the 9750 series was out, and that card has been fairly solid. (Also FWIW, my last 9650 card had the same issue a few weeks ago; fortunately it did not eat its array.)
I guess after that I should declare myself to be lucky. None out of more than a couple of dozen 3ware cards ever did harm to me. I did once have one of them fried (my clumsiness most likely), which then just didn't come up (3ware replaced the card, no questions asked). Could yours be _slightly_ fried? If it's the internal RAM controller chip that is slightly fried (if you overheat it extremely, it may become less tolerant of high frequencies, due to impurity diffusion in the chip messing up the doping profile - I've seen things like that, not in 3ware though) - then the card's internal computer (doing the RAID function) will produce total garbage occasionally, thus potentially causing anything. And kernel panics with that card would be likely sometimes, as it will occasionally talk gibberish back to the kernel. Just a shot in the dark.
Valeri
So to add a page to our book, "always have backups even if you trust your hardware!" :)
--keith
-- kkeller@wombat.san-francisco.ca.us
On 2014-09-07, Valeri Galtsev galtsev@kicp.uchicago.edu wrote:
I guess after that I should declare myself to be lucky. None out of more than a couple of dozen 3ware cards ever did harm to me. I did once have one of them fried (my clumsiness most likely), which then just didn't come up (3ware replaced the card, no questions asked). Could yours be _slightly_ fried?
The first card could have been slightly fried; it came back up after a reboot, and would kernel panic again within a few days. Since I had what I thought was a good second card I never bothered to test the first one thoroughly. After the second card ate the array I bailed on the old cards completely; had the 9650 been easy to obtain I would have, but it was pretty much EOL by then.
The 9650 that died last month refused to be recognized on cold boot, so I think it's totally gone. It's old enough that it's not worth my time trying to figure out whether it's revivable.
--keith
On Sun, September 7, 2014 8:55 pm, Keith Keller wrote:
<snip>
The 9650 that died last month refused to be recognized on cold boot, so I think it's totally gone. It's old enough that it's not worth my time trying to figure out whether it's revivable.
Indeed, lucky me. As of this moment I have 6 of the 9650s in production boxes. For at least 6 years. During which time none of them ever failed on me (including any trouble with arrays). Knocking on wood. I must say though that I do prefer the most reliable drives. And I always have the arrays checked at least once a week through the 3ware scheduler (this causes a walk through the whole surface of each of the drives, thus ensuring bad blocks, if any, do not stay undiscovered...).
Valeri
On 2014-09-08, Valeri Galtsev galtsev@kicp.uchicago.edu wrote:
Indeed, lucky me. As of this moment I have six 9650s in production boxes, and have for at least 6 years, during which time none of them ever failed on me (including any trouble with arrays). Knocking on wood.
You totally jinxed them! You'll probably have three of them fail in the next month. ;-)
I must say, though, that I do prefer the most reliable drives. And I always have arrays checked at least once a week through the 3ware scheduler (this walks the whole surface of each of the drives, ensuring that bad blocks, if any, do not stay undiscovered...).
I decided to do verifies once a month, instead of the default once a week. My thinking was that hitting all of every drive so frequently might be a wear factor, but periodic scrubs are still important; plus, for my larger arrays, having performance slightly degraded 1/7 of the time was not so desirable. So now I do 12 verifies a year instead of 52. I think the verify has picked up maybe two errors in ~10 years. (It does sometimes expose when a drive is failing, which is handy.)
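The monthly cadence described above can also be driven from cron rather than the controller's own scheduler, which makes the schedule visible in the usual places. A sketch, not a tested production entry; the controller/unit IDs (`c0`/`u0`) are assumptions to check against `tw_cli show` on the actual box:

```
# /etc/cron.d/raid-verify -- hedged sketch, assumes 3ware tw_cli is installed
# Start a verify at 23:00 on the first Saturday of each month.
# cron has no "first Saturday" field, so test the weekday in the command;
# note % must be escaped as \% inside crontab entries.
0 23 1-7 * * root [ "$(date +\%u)" = 6 ] && tw_cli /c0/u0 start verify
```

The 1-7 day-of-month range plus the weekday test is the standard cron idiom for "first <weekday> of the month".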
The LSI MegaRAID cards have their own scheduling, and I haven't had time enough to read the manual to figure out how to set this (or indeed, even to figure out what the schedule is; the MegaRAID UI is much more arcane than the 3ware).
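For reference, the MegaRAID equivalent of the 3ware verify is the "consistency check", and `MegaCli` exposes its schedule through the `-AdpCcSched` option group. The sketch below is guarded so it degrades to a message on machines without the tool; the flag spellings should be confirmed against the installed MegaCli version before use.

```shell
#!/bin/sh
# Hedged sketch: inspect/adjust the MegaRAID consistency-check schedule.
# -SetDelay takes hours; 720 h is roughly the monthly cadence discussed above.
if command -v MegaCli >/dev/null 2>&1; then
    MegaCli -AdpCcSched -Info -aALL          # show the current CC schedule
    MegaCli -AdpCcSched -SetDelay 720 -aALL  # repeat every ~30 days
    msg="schedule updated"
else
    msg="MegaCli not installed; commands shown for reference only"
fi
echo "$msg"
```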
--keith
On Sun, September 7, 2014 9:55 pm, Keith Keller wrote:
On 2014-09-08, Valeri Galtsev galtsev@kicp.uchicago.edu wrote:
Indeed, lucky me. As of this moment I have six 9650s in production boxes, and have for at least 6 years, during which time none of them ever failed on me (including any trouble with arrays). Knocking on wood.
You totally jinxed them! You'll probably have three of them fail in the next month. ;-)
I must say, though, that I do prefer the most reliable drives. And I always have arrays checked at least once a week through the 3ware scheduler (this walks the whole surface of each of the drives, ensuring that bad blocks, if any, do not stay undiscovered...).
I decided to do verifies once a month, instead of the default once a week. My thinking was that hitting all of every drive so frequently might be a wear factor, but periodic scrubs are still important; plus, for my larger arrays, having performance slightly degraded 1/7 of the time was not so desirable. So now I do 12 verifies a year instead of 52. I think the verify has picked up maybe two errors in ~10 years.
Then you too manage to stick with reliable drives! Performance... well, I have the array verify scheduled to start at 23:00 on Saturday; also, on newer controllers you can choose a policy with highest priority for I/O and lowest for rebuild, etc. I'm kind of sceptical about extra wear to mechanical drives added by a RAID check: my drives are always spinning at full speed (no spin-down crap, thank you), and the heads are not touching the platters... So I'm not certain there is extra wear. The only wear apart from the bearings I know about is when the arm hits the stopper, which only happens when you power off the drive, which is not the case here either. What am I missing?
Valeri
(It does sometimes expose when a drive is failing, which is handy.)
The LSI MegaRAID cards have their own scheduling, and I haven't had time enough to read the manual to figure out how to set this (or indeed, even to figure out what the schedule is; the MegaRAID UI is much more arcane than the 3ware).
--keith
-- kkeller@wombat.san-francisco.ca.us
CentOS mailing list CentOS@centos.org http://lists.centos.org/mailman/listinfo/centos
On 2014-09-08, Valeri Galtsev galtsev@kicp.uchicago.edu wrote:
Then you too manage to stick with reliable drives! Performance... well, I have the array verify scheduled to start at 23:00 on Saturday; also, on newer controllers you can choose a policy with highest priority for I/O and lowest for rebuild, etc. I'm kind of sceptical about extra wear to mechanical drives added by a RAID check: my drives are always spinning at full speed (no spin-down crap, thank you), and the heads are not touching the platters... So I'm not certain there is extra wear. The only wear apart from the bearings I know about is when the arm hits the stopper, which only happens when you power off the drive, which is not the case here either. What am I missing?
I learned the "reliable drive" lesson the hard way, when I accidentally ordered, and then unwisely decided to use, crappy drives. Fortunately no data was ever lost, but the drives made a lot of extra work for me.
You may be 100% correct about wear. I have no actual evidence that extra verifies cause more wear than normal use.
For my largest arrays, even if I started at 23:00 on Saturday, it would still be going through much of Sunday. Plus I have other disk operations that occur overnight. The performance loss isn't terrible during a verify, but I have noticed it enough that I prefer to avoid it happening so frequently.
I've heard quite a few horror stories from people who never did a scrub of their arrays. But I haven't heard any from people who did scrubs less frequently than weekly. Has anyone? I'm genuinely curious; if there's a justification for weekly scrubs that's stronger than my fairly weak justifications for monthly, I'd switch back.
--keith
On Sep 7, 2014, at 1:35 AM, Keith Keller kkeller@wombat.san-francisco.ca.us wrote:
On 2014-09-06, Valeri Galtsev galtsev@kicp.uchicago.edu wrote:
... I've mentioned manufacturers in another reply: Tyan, LSI, 3ware, ATI...
Even 3ware has had buggy firmwares. I once had to flash a 3ware card years into production because it was not until then that this particular bug was exposed by my configuration.
This is why I would say that firmware updates are part of preventative maintenance in the same way kernel updates are. If the bug was already fixed, and if you had flashed this during a normal maintenance window, you never would have had an unplanned maintenance to fix or recover from the problem. Manufacturers don’t tend to update firmware without real bug reports from the field, so why wait until you’ve had a failure due to some already-fixed corner case?
— Mark Tinberg mtinberg@wisc.edu
On 2014-09-08, Mark Tinberg mtinberg@wisc.edu wrote:
On Sep 7, 2014, at 1:35 AM, Keith Keller kkeller@wombat.san-francisco.ca.us wrote:
This is why I would say that firmware updates are part of preventative maintenance in the same way kernel updates are. If the bug was already fixed, and if you had flashed this during a normal maintenance window, you never would have had an unplanned maintenance to fix or recover from the problem. Manufacturers don't tend to update firmware without real bug reports from the field, so why wait until you've had a failure due to some already-fixed corner case?
I do actually read the release notes (sporadically) for my controller firmware releases, at least (usually not BIOS, I admit). If I think I might hit one of their bug fixes then I will schedule an update.
--keith
On Sat, Sep 06, 2014 at 09:46:36AM -0500, Valeri Galtsev wrote:
But that is exactly what I said: if the hardware was released and sold with this piece of crap BIOS, then you shouldn't be buying that junk in the first place. Or at least stop buying the crap made by _this_ manufacturer in the future. I'm still not convinced. Any better reasons?
In my experience, all code has bugs. Instead of trying to find some vendor that has magically released hardware with bug-free firmware, I choose vendors that make it relatively painless to apply the firmware updates under Linux.
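As one concrete example of "painless under Linux": Dell ships firmware as self-contained Linux Update Packages (DUPs) that can be dry-run before committing to a flash. A guarded sketch; the package filename below is a placeholder, and the `-c` (version compare) option is worth confirming against the actual package's `-h` output.

```shell
#!/bin/sh
# Hedged sketch: dry-run a Dell Update Package (DUP) under Linux.
# The filename is a placeholder, not a real package.
pkg=./BIOS_LX_2.7.0.BIN
if [ -x "$pkg" ]; then
    "$pkg" -c   # compare packaged vs installed firmware version; no flash
    result="version check run"
else
    result="no update package found; command shown for reference only"
fi
echo "$result"
```

Because the check step never writes to flash, it can run in a normal maintenance window before deciding whether the update is worth scheduling.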
On Sat, September 6, 2014 2:27 pm, Jonathan Billings wrote:
On Sat, Sep 06, 2014 at 09:46:36AM -0500, Valeri Galtsev wrote:
But that is exactly what I said: if the hardware was released and sold with this piece of crap BIOS, then you shouldn't be buying that junk in the first place. Or at least stop buying the crap made by _this_ manufacturer in the future. I'm still not convinced. Any better reasons?
In my experience, all code has bugs. Instead of trying to find some vendor that has magically released hardware with bug-free firmware,
I've found a few: Tyan for system boards, 3ware and LSI for RAID controllers, ATI for video cards...
I choose vendors that make it relatively painless to apply the firmware updates under Linux.
This is only so for either the very rich, who can afford standby hardware to replace a box bricked by flashing, or those so happy-go-lucky they don't care that the box will not come back up in the next 5 minutes. I am definitely neither of the two...
Valeri
-- Jonathan Billings billings@negate.org
On Sep 6, 2014, at 3:42 PM, Valeri Galtsev galtsev@kicp.uchicago.edu wrote:
On Sat, September 6, 2014 2:27 pm, Jonathan Billings wrote:
I choose vendors that make it relatively painless to apply the firmware updates under Linux.
This is only so for either the very rich, who can afford standby hardware to replace a box bricked by flashing, or those so happy-go-lucky they don't care that the box will not come back up in the next 5 minutes. I am definitely neither of the two…
I’ve used mostly Dell and have done a thousand firmware updates in my time, and I’ve never seen a piece of hardware bricked; their update system takes all due precaution. The problem just isn’t as dire as you make it out to be; even anecdotally it is statistically improbable. Either I am a massive outlier or you are way overestimating.
— Mark Tinberg mtinberg@wisc.edu
On Mon, September 8, 2014 9:19 am, Mark Tinberg wrote:
I've used mostly Dell and have done a thousand firmware updates in my time, and I've never seen a piece of hardware bricked; their update system takes all due precaution. The problem just isn't as dire as you make it out to be; even anecdotally it is statistically improbable. Either I am a massive outlier or you are way overestimating.
Certainly the last: it is me who is scared to take a 1:10000 chance. Speaking of Dell: we use only the lowest end of their boxes (think a 32 GB quad-core-CPU _desktop_ today), which are on par with others price-wise, yet very reliable. Never had to flash the BIOS on these, and never had one fail because of me not flashing BIOS updates routinely... As far as servers are concerned, these are mostly Tyan. I do not re-flash their BIOS routinely. (And I doubt they release tons of BIOS upgrades; at the very most one per board during its lifetime, and never one that sounded like "do it or your box is dead tomorrow".) So I have maybe flashed the BIOS 3-4 times across maybe 200 machines... Never had a box bricked due to flashing. (Still...) And never had a failure due to not doing "preventive" re-flashing. But after all: maybe I'm just extremely lucky ;-) and at the same time awfully scared (to fix something that shouldn't be broken, IMHO).
Valeri
On Sep 8, 2014, at 11:57 AM, Valeri Galtsev galtsev@kicp.uchicago.edu wrote:
Certainly the last: it is me who is scared to take a 1:10000 chance. Speaking of Dell: we use only the lowest end of their boxes (think a 32 GB quad-core-CPU _desktop_ today), which are on par with others price-wise, yet very reliable. Never had to flash the BIOS on these, and never had one fail because of me not flashing BIOS updates routinely... As far as servers are concerned, these are mostly Tyan. I do not re-flash their BIOS routinely. (And I doubt they release tons of BIOS upgrades; at the very most one per board during its lifetime, and never one that sounded like "do it or your box is dead tomorrow".) So I have maybe flashed the BIOS 3-4 times across maybe 200 machines... Never had a box bricked due to flashing. (Still...) And never had a failure due to not doing "preventive" re-flashing. But after all: maybe I'm just extremely lucky ;-) and at the same time awfully scared (to fix something that shouldn't be broken, IMHO).
My sense is that the bugs being fixed by the server firmware vendors’ updates are very, very rare, but those vendors have customers big enough to demand fixes, so they spend the engineering effort on finding and fixing these subtle issues. The whitebox vendors don’t sell enough of a single model, and don’t have the high-touch relationship with customers, to really care; there are probably just as many subtle bugs in their designs, but they are focused on getting the next motherboard manufactured, not on fixing problems with last year’s model that they don’t sell anymore.
— Mark Tinberg mtinberg@wisc.edu
On Tue, September 9, 2014 9:37 am, Mark Tinberg wrote:
My sense is that the bugs being fixed by the server firmware vendors' updates are very, very rare, but those vendors have customers big enough to demand fixes, so they spend the engineering effort on finding and fixing these subtle issues. The whitebox vendors don't sell enough of a single model, and don't have the high-touch relationship with customers, to really care; there are probably just as many subtle bugs in their designs, but they are focused on getting the next motherboard manufactured, not on fixing problems with last year's model that they don't sell anymore.
Maybe. Still, I wouldn't place Tyan among the "small volume" manufacturers. My _feeling_ is that they work more on BIOS debugging, and they do not make tremendous changes in the BIOS from one model to its successor, only the necessary ones. No experimenting. Thus less chance to introduce new bugs. After all, they have been in the [small] server board business forever. But that is just my _feeling_ from my experience dealing with a variety of (their and others') hardware. FWIW.
I'm really happy to see we are on the same page about the (virtual) absence of vital flaws that needed to be fixed yesterday in decent server boards...
Valeri
On Sep 6, 2014, at 2:27 PM, Jonathan Billings billings@negate.org wrote:
In my experience, all code has bugs. Instead of trying to find some vendor that has magically released hardware with bug-free firmware, I choose vendors that make it relatively painless to apply the firmware updates under Linux.
A lack of updates can also mean that there is a lack of effort or competence in tracking down and fixing bugs, or not a large enough customer base with the same bugs to generate sufficient, actionable bug reports; it is not necessarily, or even primarily, a signal of quality.
There is little churn in firmware updates, no changes for change’s sake; pretty much every published update fixes some real corner case that someone ran into, and I’d rather fix it on my system before running into it in production. There are few stupider feelings than having something go sideways due to a bug that someone already fixed for you.
— Mark Tinberg mtinberg@wisc.edu
Mark Tinberg wrote:
A lack of updates can also mean that there is a lack of effort or competence in tracking down and fixing bugs, or not a large enough customer base with the same bugs to generate sufficient, actionable bug reports; it is not necessarily, or even primarily, a signal of quality.
I might also point out that there are really *not* a lot of BIOS manufacturers. AMI, and - is Phoenix still doing them? - and Dell claims it's got its own, but who knows what they've rebranded. Once you consider that, you need to consider the board maker. Some seem to do a much better job of QA/QC than others. For example, some folks here like SuperMicro, where I *REALLY* don't - many of our Penguins, which are rebranded SuperMicro, have a *lot* of issues with the m/b.
mark
On Mon, September 8, 2014 9:48 am, m.roth@5-cent.us wrote:
Mark Tinberg wrote:
A lack of updates can also mean that there is a lack of effort or competence in tracking down and fixing bugs, or not a large enough customer base with the same bugs to generate sufficient, actionable bug reports; it is not necessarily, or even primarily, a signal of quality.
You may be right. But in many cases you may be wrong. I'm stealing someone else's example (hm, maybe about 5-7 years old): ATI releases drivers for their boards as rarely as every 6 months, which suggests careful work on debugging each released one. NVIDIA, to the contrary, releases drivers as often as every other month, so they don't seem to put as much effort into debugging each of them. Indeed, they are buggy in my experience. You, the customer, do at least part of their job: discovering and reporting bugs ("artefacts" etc).
I might also point out that there are really *not* a lot of BIOS manufacturers. AMI, and - is Phoenix still doing them? - and Dell claims it's got its own, but who knows what they've rebranded. Once you consider that, you need to consider the board maker. Some seem to do a much better job of QA/QC than others. For example, some folks here like SuperMicro, where I *REALLY* don't - many of our Penguins, which are rebranded SuperMicro, have a *lot* of issues with the m/b.
I gave up on SuperMicro quite a while ago. Not because of the BIOS, but because of hardware engineering flaws, which at least manifest themselves in system boards for AMD CPUs. These (AMD) boards work reliably for only 2-4 years; after that they die. Not all of them, but about 50% of SuperMicro AMD server and, mostly, workstation boards (I have no experience with their low-end desktop boards, if they exist) are dead after 3-4 years - just my experience. Nothing like that with Tyan boards; I may have seen one out of 50 or 70 Tyan boards die (an event I don't even care to recollect), and the rest keep working for 10+ years (during which time the box is re-purposed twice, as I cannot throw away something that still works, so I have to find a new use for the now weaker machine). For desktops we use Dell; being just as cheap at the lower end as others, they have proved reliable for us. And as somebody mentioned, it can be any brand inside with a Dell sticker on top. One of my Linux friends, who complains that they change chipsets almost on a daily basis, was calling it (not them, but what they do, I guess) D'hell ;-)
Valeri
mark
On 2014-09-08, Valeri Galtsev galtsev@kicp.uchicago.edu wrote:
I gave up on SuperMicro quite a while ago. Not because of the BIOS, but because of hardware engineering flaws, which at least manifest themselves in system boards for AMD CPUs. These (AMD) boards work reliably for only 2-4 years; after that they die. Not all of them, but about 50% of SuperMicro AMD server and, mostly, workstation boards (I have no experience with their low-end desktop boards, if they exist) are dead after 3-4 years
Huh. I have a bunch of SuperMicro boards with AMD CPUs, and have had only one die completely, and that was a DOA that I returned before putting into production.
Are you saying dead-dead, like completely unusable, or sorta dead, where you get spurious and unexplained errors?
--keith
Keith Keller wrote:
Huh. I have a bunch of SuperMicro boards with AMD CPUs, and have had only one die completely, and that was a DOA that I returned before putting into production.
Are you saying dead-dead, like completely unusable, or sorta dead, where you get spurious and unexplained errors?
I know I've ranted before, but on Penguin's high-end compute rack-mount servers with, I think, an H8QG m/b, running 64 cores, we've gotten some where a heavy compute user process has crashed the system. We even canned a script, put it on a fresh, basic CentOS install, and sent it back, and Penguin has replaced m/b's. Several of them. Some more than once, honestly.
And then there's the engineering, where for two of the DIMMs I need to unplug the main connector from the PSU, because that's the only way to pull the ears down far enough to remove the DIMMs.
We won't mention all the DIMMs that they've replaced (he says, meaning to call them about one system *again*....)
mark
On Mon, September 8, 2014 2:45 pm, Keith Keller wrote:
Huh. I have a bunch of SuperMicro boards with AMD CPUs, and have had only one die completely, and that was a DOA that I returned before putting into production.
Are you saying dead-dead, like completely unusable, or sorta dead, where you get spurious and unexplained errors?
It begins with random occasional errors and ends up totally dead over the course of a couple of weeks to a couple of months. You pull the CPUs and RAM from the dead one and stick them into another board (I'm tempted to say "a Tyan this time" ;-), and they work. At that point you can't flash the BIOS - not in-house. My hunch is this is an engineering flaw: it looks like the board topology isn't too good around one of the CPU sockets, so it works only marginally (without much reserve) while the system board is new, then fails with the slight gradual degradation of components... Maybe the ripple on the leads is below but close to tolerance. Or the capacitances and inductances [of the board leads] involved are such. I can't offer [much] more detail on what I observed; it's been some time since I banged my head on that. And you can imagine how happy I was to forget about it after I gave up on them. Anyway, there are still SuperMicro boards in our stalls which are still kicking. So I'm not saying all of them; I just don't care to learn on my own hide which are and which are not.
Oh, BTW, all the electrolytic capacitors on these strangely dead SuperMicro boards are OK. All of us have seen those around the CPUs on dead system boards (mostly ones manufactured during a certain period), mostly bulging and leaked out - not in the case of these strangely dead boards. Some capacitors can lose capacitance to some extent without showing signs of anything, but good engineering usually takes that into account and uses larger ones to the necessary extent, so that it doesn't even come close to the margin during the equipment's life.
Valeri
On Sep 8, 2014, at 10:25 AM, Valeri Galtsev galtsev@kicp.uchicago.edu wrote:
You may be right. But in many cases you may be wrong. I'm stealing someone else's example (hm, maybe about 5-7 years old): ATI releases drivers for their boards as rarely as every 6 months, which suggests careful work on debugging each released one. NVIDIA, to the contrary, releases drivers as often as every other month, so they don't seem to put as much effort into debugging each of them. Indeed, they are buggy in my experience. You, the customer, do at least part of their job: discovering and reporting bugs ("artefacts" etc).
While that is an interesting point, I think that graphics drivers and firmware are sufficiently different in development practices that you may not be able to generalize from one to the other; graphics drivers are about cutting-edge software features and performance, while firmware is about long-term stability and low-level hardware details.
— Mark Tinberg mtinberg@wisc.edu
On Tue, September 9, 2014 9:33 am, Mark Tinberg wrote:
While that is an interesting point, I think that graphics drivers and firmware are sufficiently different in development practices that you may not be able to generalize from one to the other; graphics drivers are about cutting-edge software features and performance, while firmware is about long-term stability and low-level hardware details.
Naturally... I never said mine is not a layman's opinion ;-) Hence layman's comparison...
Valeri