[CentOS] Hardware raid health?

Mon Aug 25 20:08:32 UTC 2014
Digimer <lists at alteeve.ca>

On 25/08/14 04:03 PM, Les Mikesell wrote:
> I just had an IBM in a remote location with a hardware raid1 have both
> drives go bad.  With local machines I probably would have caught it
> from the drive light before the 2nd one died...  What is the state of
> the art in linux software monitoring for this?   Long ago when that
> box was set up I think the best I could have done was a Java GUI tool
> that IBM had for their servers - and that seemed like overkill for a
> simple monitor.    Is there anything more lightweight that knows about
> the underlying drives in a hardware raid set on IBM's - and also
> recent HP servers?

IBM used LSI-based controllers, I believe.

For our monitoring, we wrote a little script that calls MegaCli64 every
30 seconds and checks for changes. If anything of note changes (drive
health, BBU/FBU issues, temperature issues, etc.), it sends us an email.
It would be fairly easy to do the same for hpacucli, I would imagine.
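
To give a rough idea of the poll-and-compare approach, here is a minimal
Python sketch. It is not our actual script (that one is perl and does a
lot more). The command arguments, the field names in WATCHED, and the
"Slot Number" key are assumptions based on typical "MegaCli64 -PDList"
output, so check them against what your controller really prints. The
notify callback is a placeholder for whatever sends your email.

```python
import subprocess
import time

# Assumed command line; adjust the path and arguments for your system.
COMMAND = ["MegaCli64", "-PDList", "-aALL"]
# Assumed field names worth watching; verify against your own output.
WATCHED = ("Firmware state", "Media Error Count", "Predictive Failure Count")


def parse_status(text):
    """Map 'slot N / field' to its value for each watched field."""
    status = {}
    slot = "unknown"
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "key: value" line
        key, value = key.strip(), value.strip()
        if key == "Slot Number":
            slot = value
        elif key in WATCHED:
            status["slot %s / %s" % (slot, key)] = value
    return status


def diff_status(old, new):
    """Return one readable line for every value that changed."""
    changes = []
    for key in sorted(set(old) | set(new)):
        before, after = old.get(key), new.get(key)
        if before != after:
            changes.append("%s: %r -> %r" % (key, before, after))
    return changes


def monitor(notify, interval=30):
    """Poll forever and call notify(changes) whenever something changes."""
    previous = None
    while True:
        result = subprocess.run(COMMAND, capture_output=True, text=True)
        current = parse_status(result.stdout)
        if previous is not None:
            changes = diff_status(previous, current)
            if changes:
                notify(changes)
        previous = current
        time.sleep(interval)
```

Calling monitor(print) would just print the changes. With more than one
enclosure or adapter the slot number alone is not a unique key, so a real
script would need to track those as well.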

Unfortunately, though it's all open source, it's part of a package that 
monitors a pile of things (including IPMI sensors, APC UPSes, Red Hat HA 
stack, etc), so it wouldn't be drop-in-and-go. That said, you could 
probably fairly easily strip it down if you wanted to use it, too.

If you're curious, I show how to set it up at the link below. If you're
comfortable with perl, it'll be pretty easy to adapt, I suspect.

https://alteeve.ca/w/AN!Cluster_Tutorial_2#Setting_Up_Alerts

Cheers

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?