On 25/08/14 04:03 PM, Les Mikesell wrote:
I just had an IBM in a remote location with a hardware raid1 have both drives go bad. With local machines I probably would have caught it from the drive light before the 2nd one died... What is the state of the art in linux software monitoring for this? Long ago when that box was set up I think the best I could have done was a Java GUI tool that IBM had for their servers - and that seemed like overkill for a simple monitor. Is there anything more lightweight that knows about the underlying drives in a hardware raid set on IBM's - and also recent HP servers?
IBM used LSI-based controllers, I believe.
For our monitoring, we wrote a little script that calls MegaCli64 every 30 seconds and checks for changes. If anything of note changes (drive health, BBU/FBU issues, temperature issues, etc.), it sends us an email. It would be fairly easy to do the same with hpacucli, I would imagine.
Unfortunately, though it's all open source, it's part of a package that monitors a pile of things (including IPMI sensors, APC UPSes, Red Hat HA stack, etc), so it wouldn't be drop-in-and-go. That said, you could probably fairly easily strip it down if you wanted to use it, too.
If you're curious, I show how to set it up here. If you're comfortable with Perl, it should be pretty easy to adapt, I suspect.
https://alteeve.ca/w/AN!Cluster_Tutorial_2#Setting_Up_Alerts
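For a rough idea of the poll-and-compare approach without pulling in the whole package, here is a minimal sketch in Python (not the Perl code from the tutorial). It assumes `MegaCli64 -PDList -aALL` is on the PATH and prints `Key: value` lines with fields such as `Slot Number` and `Firmware state`; check the field names against your own controller's output. The `notify` callback is a hypothetical hook where you would plug in email, and slots are keyed by number only, so multiple enclosures would need a richer key.

```python
import subprocess
import time

# Assumed invocation; adjust the path and arguments for your controller.
COMMAND = ["MegaCli64", "-PDList", "-aALL"]
# Fields to watch; these names are assumptions based on typical output.
WATCHED = ("Firmware state", "Media Error Count", "Predictive Failure Count")


def parse_drives(text):
    """Return {slot: {field: value}} built from 'Key: value' lines."""
    drives, slot = {}, None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'Key: value' line
        key, value = key.strip(), value.strip()
        if key == "Slot Number":
            slot = value
            drives[slot] = {}
        elif slot is not None and key in WATCHED:
            drives[slot][key] = value
    return drives


def diff_states(old, new):
    """List readable changes between two parse_drives() results."""
    changes = []
    for slot in sorted(set(old) | set(new)):
        before, after = old.get(slot, {}), new.get(slot, {})
        for field in WATCHED:
            if before.get(field) != after.get(field):
                changes.append(
                    f"slot {slot}: {field} "
                    f"{before.get(field)!r} -> {after.get(field)!r}"
                )
    return changes


def monitor(notify, interval=30):
    """Poll the controller forever and call notify() when something changes."""
    previous = None
    while True:
        result = subprocess.run(COMMAND, capture_output=True, text=True)
        current = parse_drives(result.stdout)
        if previous is not None:
            changes = diff_states(previous, current)
            if changes:
                notify("\n".join(changes))
        previous = current
        time.sleep(interval)
```

Calling `monitor(print)` would just print changes to the terminal; swapping `print` for a function that sends mail gets you close to what we run. The same structure should carry over to hpacucli by changing `COMMAND` and the parsing.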
Cheers