I just had an IBM in a remote location with a hardware raid1 have both drives go bad. With local machines I probably would have caught it from the drive light before the 2nd one died... What is the state of the art in linux software monitoring for this? Long ago when that box was set up I think the best I could have done was a Java GUI tool that IBM had for their servers - and that seemed like overkill for a simple monitor. Is there anything more lightweight that knows about the underlying drives in a hardware raid set on IBM's - and also recent HP servers?
On 25/08/14 04:03 PM, Les Mikesell wrote:
> [...] Is there anything more lightweight that knows about the underlying drives in a hardware raid set on IBM's - and also recent HP servers?
IBM used LSI-based controllers, I believe.
For our monitoring, we wrote a little script that calls MegaCli64 every 30 seconds and checks for changes. If anything of note changes (drive health, BBU/FBU issues, temperature issues, etc) it sends us an email. It would be fairly easy to do the same for hpacucli, I would imagine.
Unfortunately, though it's all open source, it's part of a package that monitors a pile of things (including IPMI sensors, APC UPSes, Red Hat HA stack, etc), so it wouldn't be drop-in-and-go. That said, you could probably fairly easily strip it down if you wanted to use it, too.
If you're curious, I show how to set it up here. If you're comfortable with Perl, it should be pretty easy to adapt.
https://alteeve.ca/w/AN!Cluster_Tutorial_2#Setting_Up_Alerts
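For readers who only want the shape of such a poller, here is a minimal Python 3 sketch of the poll-and-compare idea. It is not the Perl script from the tutorial above. The names take_snapshot, changed_lines, watch and notify are hypothetical, and the example command line is borrowed from the lsi-raidinfo script later in this thread.

```python
import subprocess
import time


def take_snapshot(cmd, timeout=60):
    """Run the status command and return its non-empty lines, stripped."""
    done = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return [line.strip() for line in done.stdout.splitlines() if line.strip()]


def changed_lines(old, new):
    """Return (removed, added): lines found in only one of the two snapshots."""
    old_set, new_set = set(old), set(new)
    removed = [line for line in old if line not in new_set]
    added = [line for line in new if line not in old_set]
    return removed, added


def watch(cmd, notify, interval=30, fetch=take_snapshot, sleep=time.sleep, rounds=None):
    """Poll cmd every `interval` seconds and call notify() when the output changes.

    `notify` is a hypothetical callback (send an email, write to syslog, ...).
    `rounds` limits the number of polls; None means run forever.
    """
    previous = fetch(cmd)
    polls = 0
    while rounds is None or polls < rounds:
        sleep(interval)
        current = fetch(cmd)
        removed, added = changed_lines(previous, current)
        if removed or added:
            notify(removed, added)
        previous = current
        polls += 1


# Example (assumed path and arguments, as used by lsi-raidinfo in this thread):
# watch(["/opt/MegaRAID/MegaCli/MegaCli64", "pdlist", "a0", "nolog"],
#       notify=lambda removed, added: print(removed, added))
```

A raw line comparison like this will also fire on values that drift all the time, such as temperatures, so a real monitor would compare only the fields it cares about.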
Cheers
-----Original Message----- From: centos-bounces@centos.org [mailto:centos-bounces@centos.org] On Behalf Of Les Mikesell Sent: Monday, August 25, 2014 16:03 To: CentOS mailing list Subject: [CentOS] Hardware raid health?
> [...] Is there anything more lightweight that knows about the underlying drives in a hardware raid set on IBM's - and also recent HP servers?
We use MegaCLI, but it has the risk of hanging the box (observed only once).
Just changed out a drive last night because of it.
-Jason
On 25/08/14 04:11 PM, Jason Pyeron wrote:
> We use MegaCLI, but it has the risk of hanging the box (observed only once).
> Just changed out a drive last night because of it.
Can you share any detail on this? Controller/drive model? MegaCli version? How exactly did it lock up?
I use it extensively so this worries me. :)
-----Original Message----- From: Digimer Sent: Monday, August 25, 2014 16:23
> Can you share any detail on this? Controller/drive model? MegaCli version? How exactly did it lock up?
Locked up the OS, not the array. Power cycled after the array synced the new drive 6 hours later.
On a Dell PE2970:

Product Name : PERC 6/i Integrated
FW Package Build: 6.2.0-0013

Mfg. Data
================
Mfg. Date : 06/24/08
Rework Date : 06/24/08
Revision No :
Battery FRU : N/A

Image Versions in Flash:
================
FW Version : 1.22.02-0612
BIOS Version : 2.04.00
WebBIOS Version : 1.1-46-e_15-Rel
Ctrl-R Version : 1.02-015B
Preboot CLI Version: 01.00-023:#%00006
Boot Block Version : 1.00.00.01-0011

MegaCLI SAS RAID Management Tool Ver 8.05.71 Apr 30, 2013

$ while MegaCli64 -PDRbld -ShowProg -PhysDrv [32:1] -aALL; do sleep 1; done
The sleep 1 was abusive!
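A gentler way to watch a rebuild is to poll far less often and put a timeout on each call. The Python 3 sketch below mirrors the shell loop above under those assumptions. The names run_once and poll, the 60-second interval and the 30-second timeout are illustrative choices, not tested recommendations, and a timeout only guards against the tool itself stalling; it would not help if the OS locks up.

```python
import subprocess
import time

# Command copied from the shell loop above; [32:1] is the enclosure:slot on that box.
REBUILD_CMD = ["MegaCli64", "-PDRbld", "-ShowProg", "-PhysDrv", "[32:1]", "-aALL"]


def run_once(cmd, timeout=30):
    """Run cmd once; return (returncode, stdout), or (None, '') on timeout."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return None, ""
    return done.returncode, done.stdout


def poll(cmd, interval=60, max_polls=10, run=run_once, sleep=time.sleep):
    """Like the `while cmd; do sleep N; done` loop: poll while the command succeeds.

    Stops after max_polls calls, or as soon as a call fails or times out.
    Returns the collected outputs of the successful calls.
    """
    outputs = []
    for _ in range(max_polls):
        code, out = run(cmd)
        if code != 0:
            break
        outputs.append(out)
        sleep(interval)
    return outputs


# Example: check rebuild progress once a minute, at most 360 times (about 6 hours).
# for text in poll(REBUILD_CMD, interval=60, max_polls=360):
#     print(text)
```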
On 8/25/2014 1:03 PM, Les Mikesell wrote:
> [...] Is there anything more lightweight that knows about the underlying drives in a hardware raid set on IBM's - and also recent HP servers?
If MegaCli64 works with this RAID controller, you can use these two Python scripts, which I found online and tweaked. They live in /root/bin as they are only for root's use.
here's the typical output of the first script...
[root@server1 bin]# lsi-raidinfo
-- Controllers --
-- ID | Model
c0 | LSI MegaRAID SAS 9261-8i

-- Volumes --
-- ID | Type | Size | Status | InProgress
volume c0u0 | RAID10 1x2 | 2727G | Optimal | None
volume c0u1 | RAID60 1x8 | 16370G | Optimal | None
volume c0u2 | RAID60 1x8 | 16370G | Optimal | None

-- Disks --
-- Encl:Slot | vol-span-unit | Model | Status
disk 8:0 | 0-0-0 | Z291VTS5ST33000650NS 0003 | Online, Spun Up
disk 8:1 | 0-0-1 | Z291VTRPST33000650NS 0003 | Online, Spun Up
disk 8:2 | 1-0-0 | Z291VTKWST33000650NS 0003 | Online, Spun Up
disk 8:3 | 1-0-1 | Z291VT9YST33000650NS 0003 | Online, Spun Up
disk 8:4 | 1-0-2 | Z291VTT6ST33000650NS 0003 | Online, Spun Up
disk 8:5 | 1-0-3 | Z291VT6CST33000650NS 0003 | Online, Spun Up
disk 8:6 | 1-0-4 | Z291VTLAST33000650NS 0003 | Online, Spun Up
disk 8:7 | 1-0-5 | Z291VTK1ST33000650NS 0003 | Online, Spun Up
disk 8:8 | 1-0-6 | Z291VTNGST33000650NS 0003 | Online, Spun Up
disk 8:9 | 1-0-7 | Z291VTRAST33000650NS 0003 | Online, Spun Up
disk 8:10 | 2-0-0 | Z291VV05ST33000650NS 0003 | Online, Spun Up
disk 8:11 | 2-0-1 | Z291VTW1ST33000650NS 0003 | Online, Spun Up
disk 8:12 | 2-0-2 | Z291VTRLST33000650NS 0003 | Online, Spun Up
disk 8:13 | 2-0-3 | Z291VTRXST33000650NS 0003 | Online, Spun Up
disk 8:14 | 2-0-4 | Z291VSZGST33000650NS 0003 | Online, Spun Up
disk 8:15 | 2-0-5 | Z291VSW1ST33000650NS 0003 | Online, Spun Up
disk 8:16 | 2-0-6 | Z291VTB5ST33000650NS 0003 | Online, Spun Up
disk 8:17 | 2-0-7 | Z291VSX8ST33000650NS 0003 | Online, Spun Up
disk 8:18 | x-x-x | Z291VTS7ST33000650NS 0003 | Hotspare, Spun down
disk 8:19 | x-x-x | Z291VT3HST33000650NS 0003 | Hotspare, Spun down
first script parses megacli64's gawdawful output format....
/root/bin/lsi-raidinfo:

#!/usr/bin/python

# megaclisas-status 0.6
# renamed lsi-raidinfo
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with Pulse 2; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
# MA 02110-1301, USA.
#
# Copyright (C) 2007-2009 Adam Cecile (Le_Vert)

## modified by johnpuskar@gmail.com 08/14/11
# fixed for LSI 9285-8e on Openfiler

## modified by pierce@hogranch.com 2012-01-05
# fixed for newer version of megacli output on RHEL6/CentOS6
# output format extended to show raid span-unit and rebuild % complete

import os
import re
import sys

if len(sys.argv) > 2:
    print 'Usage: lsi-raidinfo [-d]'
    sys.exit(1)

# if argument -d, only print disk info
printarray = True
printcontroller = True
if len(sys.argv) > 1:
    if sys.argv[1] == '-d':
        printarray = False
        printcontroller = False
    else:
        print 'Usage: lsi-raidinfo [-d]'
        sys.exit(1)

# Get command output
def getOutput(cmd):
    output = os.popen(cmd)
    lines = []
    for line in output:
        if not re.match(r'^$',line.strip()):
            lines.append(line.strip())
    return lines

def returnControllerNumber(output):
    for line in output:
        if re.match(r'^Controller Count.*$',line.strip()):
            return int(line.split(':')[1].strip().strip('.'))

def returnControllerModel(output):
    for line in output:
        if re.match(r'^Product Name.*$',line.strip()):
            return line.split(':')[1].strip()

def returnArrayNumber(output):
    i = 0
    for line in output:
        if re.match(r'^Virtual (Drive|Disk).*$',line.strip()):
            i += 1
    return i

def returnArrayInfo(output,controllerid,arrayid):
    id = 'c'+str(controllerid)+'u'+str(arrayid)
    # print 'DEBUG: id = '+str(id)
    operationlinennumber = False
    linenumber = 0
    units = 1
    type = 'JBOD'
    span = 0
    size = 0
    for line in output:
        if re.match(r'^RAID Level.*$',line.strip()):
            type = line.strip().split(':')[1].strip()
            type = 'RAID' + type.split(',')[0].split('-')[1].strip()
            # print 'debug: type = '+str(type)
        if re.match(r'^Number.*$',line.strip()):
            units = line.strip().split(':')[1].strip()
        if re.match(r'^Span Depth.*$',line.strip()):
            span = line.strip().split(':')[1].strip()
        if re.match(r'^Size.*$',line.strip()):
            # Size reported in MB
            if re.match(r'^.*MB$',line.strip().split(':')[1]):
                size = line.strip().split(':')[1].strip('MB').strip()
                size = str(int(round((float(size) / 1000))))+'G'
            # Size reported in TB
            elif re.match(r'^.*TB$',line.strip().split(':')[1]):
                size = line.strip().split(':')[1].strip('TB').strip()
                size = str(int(round((float(size) * 1000))))+'G'
            # Size reported in GB (default)
            else:
                size = line.strip().split(':')[1].strip('GB').strip()
                size = str(int(round((float(size)))))+'G'
        if re.match(r'^State.*$',line.strip()):
            state = line.strip().split(':')[1].strip()
        if re.match(r'^Ongoing Progresses.*$',line.strip()):
            operationlinennumber = linenumber
        linenumber += 1
    if operationlinennumber:
        inprogress = output[operationlinennumber+1]
    else:
        inprogress = 'None'
    if span > 1:
        type = type+'0'
    type = type + ' ' + str(span) + 'x' + str(units)
    return [id,type,size,state,inprogress]

def returnDiskInfo(output,controllerid,currentarrayid):
    arrayid = False
    oldarrayid = False
    olddiskid = False
    table = []
    state = 'Offline'
    model = 'Unknown'
    enclnum = 'Unknown'
    slotnum = 'Unknown'
    enclsl = 'Unknown'

    firstDisk = True
    for line in output:
        if re.match(r'Firmware state: .*$',line.strip()):
            state = line.split(':')[1].strip()
            if re.match(r'Rebuild',state):
                cmd2 = '/opt/MegaRAID/MegaCli/MegaCli64 pdrbld showprog physdrv['+str(enclnum)+':'+str(slotnum)+'] a'+str(controllerid)+' nolog'
                ll = getOutput(cmd2)
                state += ' completed ' + re.sub(r'Rebuild Progress.*Completed', '', ll[0]).strip();
        if re.match(r'Slot Number: .*$',line.strip()):
            slotnum = line.split(':')[1].strip()
        if re.match(r'Inquiry Data: .*$',line.strip()):
            model = line.split(':')[1].strip()
            model = re.sub(' +', ' ', model)
            model = re.sub('Hotspare Information', '', model).strip() #remove bogus output from firmware 12.12
        if re.match(r"(Drive|Disk)'s postion: .*$",line.strip()):
            spans = line.split(',')
            span = re.sub(r"(Drive|Disk).*DiskGroup:", '', spans[0]).strip()+'-'
            span += spans[1].split(':')[1].strip()+'-'
            span += spans[2].split(':')[1].strip()
        if re.match(r'Enclosure Device ID: [0-9]+$',line.strip()):
            if firstDisk == True:
                firstDisk = False
            else:
                enclsl = str(enclnum)+':'+str(slotnum)
                table.append([str(enclsl), span, model, state])
            span = 'x-x-x'
            enclnum = line.split(':')[1].strip()
    # Last disk of last array
    enclsl = str(enclnum)+':'+str(slotnum)
    table.append([str(enclsl), span, model, state])
    arraytable = []
    for disk in table:
        arraytable.append(disk)
    return arraytable

cmd = '/opt/MegaRAID/MegaCli/MegaCli64 adpcount nolog'
output = getOutput(cmd)
controllernumber = returnControllerNumber(output)

bad = False

# List available controller
if printcontroller:
    print '-- Controllers --'
    print '-- ID | Model'
    controllerid = 0
    while controllerid < controllernumber:
        cmd = '/opt/MegaRAID/MegaCli/MegaCli64 adpallinfo a'+str(controllerid)+' nolog'
        output = getOutput(cmd)
        controllermodel = returnControllerModel(output)
        print 'c'+str(controllerid)+' | '+controllermodel
        controllerid += 1
    print ''

if printarray:
    controllerid = 0
    print '-- Volumes --'
    print '-- ID | Type | Size | Status | InProgress'
    # print 'controller number'+str(controllernumber)
    while controllerid < controllernumber:
        arrayid = 0
        cmd = '/opt/MegaRAID/MegaCli/MegaCli64 ldinfo lall a'+str(controllerid)+' nolog'
        output = getOutput(cmd)
        arraynumber = returnArrayNumber(output)
        # print 'array number'+str(arraynumber)
        while arrayid < arraynumber:
            cmd = '/opt/MegaRAID/MegaCli/MegaCli64 ldinfo l'+str(arrayid)+' a'+str(controllerid)+' nolog'
            # print 'DEBUG: running '+str(cmd)
            output = getOutput(cmd)
            # print 'DEBUG: output '+str(output)
            arrayinfo = returnArrayInfo(output,controllerid,arrayid)
            print 'volume '+arrayinfo[0]+' | '+arrayinfo[1]+' | '+arrayinfo[2]+' | '+arrayinfo[3]+' | '+arrayinfo[4]
            if not arrayinfo[3] == 'Optimal':
                bad = True
            arrayid += 1
        controllerid += 1
    print ''

print '-- Disks --'
print '-- Encl:Slot | vol-span-unit | Model | Status'

controllerid = 0
while controllerid < controllernumber:
    arrayid = 0
    cmd = '/opt/MegaRAID/MegaCli/MegaCli64 ldinfo lall a'+str(controllerid)+' nolog'
    output = getOutput(cmd)
    arraynumber = returnArrayNumber(output)
    while arrayid<arraynumber:
        #grab disk arrayId info
        cmd = '/opt/MegaRAID/MegaCli/MegaCli64 pdlist a'+str(controllerid)+' nolog'
        #print 'debug: running '+str(cmd)
        output = getOutput(cmd)
        arraydisk = returnDiskInfo(output,controllerid,arrayid)

        for array in arraydisk:
            print 'disk '+array[0]+' | '+array[1]+' | '+array[2]+' | '+array[3]
        arrayid += 1
    controllerid += 1

if bad:
    print '\nThere is at least one disk/array in a NOT OPTIMAL state.'
    sys.exit(1)

******************************************************************************************************
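Most of what the first script does is pull values out of MegaCli's "Key : Value" lines. A minimal Python 3 sketch of that step on its own is below. The function name parse_fields is hypothetical, and the sample lines are taken from the PERC 6/i output posted earlier in this thread. It splits on the first colon only, so values that themselves contain a colon stay whole, which line.split(':')[1] would cut short.

```python
def parse_fields(lines):
    """Turn 'Key : Value' lines into a dict; lines without a colon are skipped."""
    fields = {}
    for line in lines:
        # partition() splits on the first colon only, keeping later colons in the value
        key, sep, value = line.partition(":")
        if not sep:
            continue
        fields[key.strip()] = value.strip()
    return fields


sample = [
    "Product Name : PERC 6/i Integrated",
    "FW Package Build: 6.2.0-0013",
    "================",
    "Preboot CLI Version: 01.00-023:#%00006",
]
info = parse_fields(sample)
print(info["Product Name"])
print(info["Preboot CLI Version"])
```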
second script checks the output of that first one and summarizes errors only.
/root/bin/lsi-checkraid:
#!/usr/bin/python

# created by johnpuskar@gmail.com on 08/14/11
# rev 01

import os
import re
import sys

if len(sys.argv) > 1:
    print 'Usage: accepts stdin from lsi-raidinfo'
    sys.exit(1)

blnBadDisk = False
infile = sys.stdin
for line in infile:
    # print 'DEBUG!! checking line:'+str(line)
    if re.match(r'disk .*$',line.strip()):
        # a disk line is bad unless it contains one of the known-good states;
        # the parentheses are escaped so 'Unconfigured(good)' matches literally
        if re.match(r'^((?!Online, Spun Up|Online, Spun down|Hotspare, Spun Up|Hotspare, Spun down|Unconfigured\(good\), Spun Up).)*$',line.strip()):
            blnBadDisk = True
            badLine = line
            # print 'DEBUG!! bad disk found!'
    if re.match(r'volume ',line.strip()):
        # a volume line is bad unless it contains 'Optimal'
        if re.match(r'^((?!Optimal).)*$',line.strip()):
            # print 'DEBUG!! bad vol found!'
            blnBadDisk = True
            badLine = line

if blnBadDisk == True:
    print 'RAID ERROR'
    # print badLine
else:
    print 'RAID CLEAN'
******************************************************************************************************
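The check in lsi-checkraid can also be written without negative-lookahead regexes. The Python 3 sketch below is one way to do that, assuming the same report format as the lsi-raidinfo output shown earlier. The names OK_DISK, bad_lines and summary are hypothetical, and unlike the script above it returns the offending lines rather than only a flag.

```python
import re

# Disk states treated as healthy, taken from the list in lsi-checkraid.
OK_DISK = re.compile(
    r"Online, Spun (Up|down)|Hotspare, Spun (Up|down)|Unconfigured\(good\), Spun Up"
)


def bad_lines(report_lines):
    """Return the 'disk' and 'volume' lines that are not in a healthy state."""
    bad = []
    for raw in report_lines:
        line = raw.strip()
        if line.startswith("disk ") and not OK_DISK.search(line):
            bad.append(line)
        elif line.startswith("volume ") and "Optimal" not in line:
            bad.append(line)
    return bad


def summary(report_lines):
    """Same one-line verdict that lsi-checkraid prints."""
    return "RAID ERROR" if bad_lines(report_lines) else "RAID CLEAN"


report = [
    "-- Volumes --",
    "volume c0u0 | RAID10 1x2 | 2727G | Optimal | None",
    "-- Disks --",
    "disk 8:0 | 0-0-0 | Z291VTS5ST33000650NS 0003 | Online, Spun Up",
    "disk 8:18 | x-x-x | Z291VTS7ST33000650NS 0003 | Hotspare, Spun down",
]
print(summary(report))
```

Returning the bad lines makes it easy to put just the failing disk or volume in the alert subject or body.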
Finally, this script uses those two and sends email alerts. It's run hourly from root's crontab.
/root/bin/lsi-emailalerts
#!/bin/sh

MAILTOADDR=root
# upper-case short hostname; the tr ranges are quoted so the shell cannot glob them
HOST=$(hostname -s | tr 'a-z' 'A-Z')

#get megaraid status info
/root/bin/lsi-raidinfo | tee /tmp/lsi-raidinfo.txt | /root/bin/lsi-checkraid > /tmp/lsi-checkraid.txt

#check megaraid status info
if grep -qE "RAID ERROR" /tmp/lsi-checkraid.txt ; then
    cat /tmp/lsi-raidinfo.txt | mailx -s "$HOST Warning: failed disk or degraded array" "$MAILTOADDR"
fi

#rm -f /tmp/lsi-raidinfo.txt
#rm -f /tmp/lsi-checkraid.txt
exit 0
******************************************************************************************************
On 2014-08-25, John R Pierce pierce@hogranch.com wrote:
> [...] these live in /root/bin as they are only for root's use.
They can probably go anywhere, since a normal user won't have the permissions to open the proper devices anyway.
I use slightly modified versions of these scripts with Nagios. I haven't had a drive fail yet (so one is sure to fail in the next day or two), but the scripts worked when the chiller in the room failed and the temperature spiked--they notified me that the internal temperatures of the ROC and the drives were all too high.
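As an illustration of what such a Nagios adaptation can look like, here is a minimal Python 3 sketch that turns the one-line verdict from lsi-checkraid into a Nagios plugin result. It is not the modified script described above; the names nagios_result and main are hypothetical. The exit codes are the standard Nagios plugin ones (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def nagios_result(verdict):
    """Map lsi-checkraid's output to (exit code, status line)."""
    text = verdict.strip()
    if text == "RAID CLEAN":
        return OK, "OK - RAID CLEAN"
    if text == "RAID ERROR":
        return CRITICAL, "CRITICAL - RAID ERROR"
    return UNKNOWN, "UNKNOWN - unexpected output: %r" % text


def main(stream=sys.stdin):
    """Read the verdict from stream, print the status line, return the exit code."""
    code, message = nagios_result(stream.read())
    print(message)
    return code


# To use as a plugin, end the file with:  sys.exit(main())
# and feed it the pipeline:  lsi-raidinfo | lsi-checkraid | this_script
```

Since the MegaCli tools need root, the check would normally be run through sudo or NRPE on the monitored host.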
There is a GUI to the MegaRAID controllers available. I seldom use it so I can't give too much information about it.
If the OP's servers use a different controller there may still be scripts like these, just let us know what the hardware is. (I know they exist for 3ware, I think they may for Areca.)
--keith