[CentOS] OT, hardware: HP smart array drive issue

Tue Jul 14 18:32:14 UTC 2015
Nathan Duehr <denverpilot at me.com>

On Jul 10, 2015, at 10:47, m.roth at 5-cent.us wrote:
> 
> Trying to prevent this from happening again, I've decided to replace the
> drive that's in predictive failure. The array has a hot spare. I tried to
> remove it using hpacucli, but it refuses with "operation not permitted", and
> there doesn't *seem* to be a "mark as failed" command. *Do* I just yank the
> drive?

Hi Mark, 

I’ve never had any problem just pulling and replacing drives on HP hardware with the hardware RAID controllers (even the icky cheap one that came out around the DL360/380 Gen8 timeframe, which isn’t really hardware RAID and needs closed-source drivers in Linux).
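
If you want to see what the controller thinks before pulling anything, something like this should do it (slot 0 and bay 2I:1:6 are just example values, adjust for your box):

    # list every controller, array, and drive the tool can see
    hpacucli ctrl all show config

    # quick status of every physical drive on one controller
    hpacucli ctrl slot=0 pd all show status

    # full detail on the suspect disk, including the predictive-failure flag
    hpacucli ctrl slot=0 pd 2I:1:6 show detail

That lets you match the flagged bay to the blinking LED before you yank it.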

That said, I also *test* that behavior, long before putting anything important on the machines…
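
A rough sketch of that test, for whatever it’s worth (again, slot 0 is an example, and do this on an empty box, not production):

    # baseline: every logical drive should report "OK"
    hpacucli ctrl slot=0 ld all show status

    # physically pull one member disk, then check again; the logical
    # drive should drop to something like "Interim Recovery Mode"
    hpacucli ctrl slot=0 ld all show status

    # re-seat the disk (or insert a blank one) and watch it rebuild
    watch -n 60 'hpacucli ctrl slot=0 ld all show status'

If the array doesn’t come back to "OK" on its own after that, I don’t want it holding real data anyway.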

From past experience with HP stuff, the controller usually won’t move the data over to the hot spare (especially if it’s a “Global” hot spare rather than one dedicated to that array) until an actual failure occurs.  “Predictive failure” isn’t considered a failure in HP’s world.  I don’t think there is any setting to tell the controller to fail over to the hot spare on a mere “predictive failure”.
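
You can at least confirm the spare is attached to the array you care about (the array letter and bay here are example values):

    # show how arrays, logical drives, and spares are laid out
    hpacucli ctrl slot=0 show config

    # dedicate a spare to a specific array, if it isn't already
    hpacucli ctrl slot=0 array A add spares 2I:1:6

Whether the spare then activates on a predictive failure or only on a hard failure is up to the controller, as far as I’ve ever seen.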

I’ve also had disks that triggered a “predictive failure” under heavy load, got popped out and back in, and after the controller rebuilt them never did it again for *years*.  The error-rate threshold that trips a “predictive failure” is set pretty low.
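
After a re-seat like that, it’s worth checking whether the flag actually cleared (same example addressing as above):

    # watch the rebuild run to completion
    hpacucli ctrl slot=0 ld all show status

    # the Status line on the disk itself shows whether it went back
    # to "OK" or is still reporting "Predictive Failure"
    hpacucli ctrl slot=0 pd 2I:1:6 show detail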

That last one is more a question of policy than anything.  How much do you trust it?  At one employer the game was to pop out and re-seat any drive that showed “predictive failure” on HP systems (Dell gear we handled differently at the time, since it was less prone to false alarms, so to speak), and if a drive did it again “soonish”, we’d call for the replacement disk.  That’s how often the HP controllers did it: in a rather large farm of HP gear, I popped or replaced a drive a week, whenever I happened by the data center.

As for the question of whether you should be able to do it safely or not… if a hardware RAID controller won’t let me yank a physical drive out, shove another one in, and have it rebuild itself back to whatever level of redundancy I defined as “nominal” for that system, I don’t want it anyway.  Look at it this way… if the disk had a catastrophic electronics failure while installed in the array, the array should handle it, and yanking a drive out is technically nicer than some of the failure modes where shorted electronics take out the buses on the backplane. (GRIN)

Just sharing my thoughts… your call. :-)  YMMV.  We had a service contract at that place, so a new disk was always just a phone call away at no additional $, and even with that level of service we always did the “re-seat it once” thing first.  We’d log it, and if anyone saw that same disk flashing the next time they were at the data center (we checked the logged ones before doing a “re-seat”), they’d make the phone call and the service company would drop a drive off a few hours later.

--
Nate Duehr
denverpilot at me.com