Hardware raid health?

Les Mikesell

25 Aug 2014

8:03 p.m.

I just had an IBM in a remote location with a hardware raid1 have both drives go bad. With local machines I probably would have caught it from the drive light before the 2nd one died... What is the state of the art in linux software monitoring for this? Long ago when that box was set up I think the best I could have done was a Java GUI tool that IBM had for their servers - and that seemed like overkill for a simple monitor. Is there anything more lightweight that knows about the underlying drives in a hardware raid set on IBM's - and also recent HP servers?

-- Les Mikesell lesmikesell@gmail.com


Digimer

25 Aug 2014

8:08 p.m.

On 25/08/14 04:03 PM, Les Mikesell wrote:

I just had an IBM in a remote location with a hardware raid1 have both drives go bad. With local machines I probably would have caught it from the drive light before the 2nd one died... What is the state of the art in linux software monitoring for this? Long ago when that box was set up I think the best I could have done was a Java GUI tool that IBM had for their servers - and that seemed like overkill for a simple monitor. Is there anything more lightweight that knows about the underlying drives in a hardware raid set on IBM's - and also recent HP servers?

IBM used LSI-based controllers, I believe.

For our monitoring, we wrote a little script that calls MegaCli64 every 30 seconds and checks for changes. If anything of note changes (drive health, BBU/FBU issues, temperature issues, etc) it sends us an email. It would be fairly easy to do the same for hpacucli, I would imagine.
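A minimal sketch of the poll-and-compare idea described above, written in Python rather than the Perl used in the linked tutorial. It is not the actual script: `fingerprint`, `poll_once`, `fetch_status`, and `notify` are hypothetical names, and `fetch_status` stands in for whatever wraps the real CLI call (MegaCli64, hpacucli) and returns its text output.

```python
import hashlib


def fingerprint(text):
    """Hash the status text, ignoring blank lines and surrounding whitespace."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


def poll_once(fetch_status, previous, notify):
    """Fetch the status once and call notify(text) if it changed.

    fetch_status: callable returning the controller status as text
                  (hypothetical; it would wrap the real CLI call).
    previous:     fingerprint from the previous poll, or None on the first poll.
    notify:       callable that delivers the alert, e.g. sends an email.
    Returns the new fingerprint, to be passed to the next poll.
    """
    text = fetch_status()
    current = fingerprint(text)
    if previous is not None and current != previous:
        notify(text)
    return current
```

A caller would loop over `poll_once` with a `time.sleep(30)` between calls. Raw controller output that includes timestamps or live temperature readings would differ on every poll, so in practice the text probably needs filtering down to the fields of interest before it is fingerprinted.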

Unfortunately, though it's all open source, it's part of a package that monitors a pile of things (including IPMI sensors, APC UPSes, Red Hat HA stack, etc), so it wouldn't be drop-in-and-go. That said, you could probably fairly easily strip it down if you wanted to use it, too.

If you're curious, I show how to set it up here. If you're comfortable with perl, it'll be pretty easy to adapt, I suspect.

https://alteeve.ca/w/AN!Cluster_Tutorial_2#Setting_Up_Alerts

Cheers

-- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education?

Jason Pyeron

8:11 p.m.

-----Original Message----- From: centos-bounces@centos.org [mailto:centos-bounces@centos.org] On Behalf Of Les Mikesell Sent: Monday, August 25, 2014 16:03 To: CentOS mailing list Subject: [CentOS] Hardware raid health?

I just had an IBM in a remote location with a hardware raid1 have both drives go bad. With local machines I probably would have caught it from the drive light before the 2nd one died... What is the state of the art in linux software monitoring for this? Long ago when that box was set up I think the best I could have done was a Java GUI tool that IBM had for their servers - and that seemed like overkill for a simple monitor. Is there anything more lightweight that knows about the underlying drives in a hardware raid set on IBM's - and also recent HP servers?

We use MegaCLI, but it has the risk of hanging the box (observed only once).

Just changed out a drive last night because of it.

-Jason

--
Jason Pyeron, PD Inc. http://www.pdinc.us
Principal Consultant, 10 West 24th Street #100
+1 (443) 269-1555 x333, Baltimore, Maryland 21218
This message is copyright PD Inc, subject to license 20080407P00.

Digimer

8:23 p.m.

On 25/08/14 04:11 PM, Jason Pyeron wrote:

-----Original Message----- From: centos-bounces@centos.org [mailto:centos-bounces@centos.org] On Behalf Of Les Mikesell Sent: Monday, August 25, 2014 16:03 To: CentOS mailing list Subject: [CentOS] Hardware raid health?

I just had an IBM in a remote location with a hardware raid1 have both drives go bad. With local machines I probably would have caught it from the drive light before the 2nd one died... What is the state of the art in linux software monitoring for this? Long ago when that box was set up I think the best I could have done was a Java GUI tool that IBM had for their servers - and that seemed like overkill for a simple monitor. Is there anything more lightweight that knows about the underlying drives in a hardware raid set on IBM's - and also recent HP servers?

We use MegaCLI, but it has the risk of hanging the box (observed only once).

Just changed out a drive last night because of it.

-Jason

Can you share any detail on this? Controller/drive model? MegaCli version? How exactly did it lock up?

I use it extensively so this worries me. :)

-- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education?

Jason Pyeron

8:52 p.m.

-----Original Message----- From: Digimer Sent: Monday, August 25, 2014 16:23

On 25/08/14 04:11 PM, Jason Pyeron wrote:

-----Original Message----- From: centos-bounces@centos.org [mailto:centos-bounces@centos.org] On Behalf Of Les Mikesell Sent: Monday, August 25, 2014 16:03 To: CentOS mailing list Subject: [CentOS] Hardware raid health?

I just had an IBM in a remote location with a hardware raid1 have both drives go bad. With local machines I probably would have caught it from the drive light before the 2nd one died... What is the state of the art in linux software monitoring for this? Long ago when that box was set up I think the best I could have done was a Java GUI tool that IBM had for their servers - and that seemed like overkill for a simple monitor. Is there anything more lightweight that knows about the underlying drives in a hardware raid set on IBM's - and also recent HP servers?

We use MegaCLI, but it has the risk of hanging the box (observed only once).

Just changed out a drive last night because of it.

-Jason

Can you share any detail on this? Controller/drive model? MegaCli version? How exactly did it lock up?

Locked up the OS, not the array. Power cycled after the array synced the new drive 6 hours later.

On a Dell PE2970
Product Name : PERC 6/i Integrated
FW Package Build: 6.2.0-0013

Mfg. Data
================
Mfg. Date : 06/24/08
Rework Date : 06/24/08
Revision No :
Battery FRU : N/A

Image Versions in Flash:
================
FW Version : 1.22.02-0612
BIOS Version : 2.04.00
WebBIOS Version : 1.1-46-e_15-Rel
Ctrl-R Version : 1.02-015B
Preboot CLI Version: 01.00-023:#%00006
Boot Block Version : 1.00.00.01-0011

MegaCLI SAS RAID Management Tool Ver 8.05.71 Apr 30, 2013

$ while MegaCli64 -PDRbld -ShowProg -PhysDrv [32:1] -aALL; do sleep 1; done

The sleep 1 was abusive!
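For watching a rebuild without hammering the controller, a gentler polling loop might look like the sketch below. It assumes Python 3.7 or later. `run_with_timeout` and `watch` are hypothetical helpers, and the 60-second timeout, 5-minute interval, and poll cap are illustrative values, not figures anyone in this thread recommended.

```python
import subprocess
import time


def run_with_timeout(cmd, timeout_s=60):
    """Run cmd (a list of arguments) and return its stdout, or None on timeout."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return None
    return result.stdout


def watch(cmd, is_done, interval_s=300, max_polls=100, sleep=time.sleep):
    """Poll cmd every interval_s seconds until is_done(output) is true.

    is_done is supplied by the caller and decides, from the command's
    output text, whether the operation has finished.
    Returns the number of polls that were made.
    """
    for poll in range(1, max_polls + 1):
        output = run_with_timeout(cmd)
        if output is not None and is_done(output):
            return poll
        sleep(interval_s)
    return max_polls
```

With the command from this message, `cmd` would be `["MegaCli64", "-PDRbld", "-ShowProg", "-PhysDrv", "[32:1]", "-aALL"]`. The timeout only guards against the CLI process itself stalling; it cannot help if the whole OS locks up as described above.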

John R Pierce

8:39 p.m.

On 8/25/2014 1:03 PM, Les Mikesell wrote:

I just had an IBM in a remote location with a hardware raid1 have both drives go bad. With local machines I probably would have caught it from the drive light before the 2nd one died... What is the state of the art in linux software monitoring for this? Long ago when that box was set up I think the best I could have done was a Java GUI tool that IBM had for their servers - and that seemed like overkill for a simple monitor. Is there anything more lightweight that knows about the underlying drives in a hardware raid set on IBM's - and also recent HP servers?

IF megacli64 works for this raid controller, then I tweaked some python scripts I found online and use these two scripts.. these live in /root/bin as they are only for root's use.

here's the typical output of the first script...

[root@server1 bin]# lsi-raidinfo
-- Controllers --
-- ID | Model
c0 | LSI MegaRAID SAS 9261-8i

first script parses megacli64's gawdawful output format....

/root/bin/lsi-raidinfo:

#!/usr/bin/python

# megaclisas-status 0.6
# renamed lsi-raidinfo
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with Pulse 2; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
# MA 02110-1301, USA.
#
# Copyright (C) 2007-2009 Adam Cecile (Le_Vert)

## modified by johnpuskar@gmail.com 08/14/11
# fixed for LSI 9285-8e on Openfiler

## modified by pierce@hogranch.com 2012-01-05
# fixed for newer version of megacli output on RHEL6/CentOS6
# output format extended to show raid span-unit and rebuild % complete

import os
import re
import sys

if len(sys.argv) > 2:
    print('Usage: lsi-raidinfo [-d]')
    sys.exit(1)

# if argument -d, only print disk info
printarray = True
printcontroller = True
if len(sys.argv) > 1:
    if sys.argv[1] == '-d':
        printarray = False
        printcontroller = False
    else:
        print('Usage: lsi-raidinfo [-d]')
        sys.exit(1)

# Get command output
def getOutput(cmd):
    output = os.popen(cmd)
    lines = []
    for line in output:
        if not re.match(r'^$',line.strip()):
            lines.append(line.strip())
    return lines

def returnControllerNumber(output):
    for line in output:
        if re.match(r'^Controller Count.*$',line.strip()):
            return int(line.split(':')[1].strip().strip('.'))

def returnControllerModel(output):
    for line in output:
        if re.match(r'^Product Name.*$',line.strip()):
            return line.split(':')[1].strip()

def returnArrayNumber(output):
    i = 0
    for line in output:
        if re.match(r'^Virtual (Drive|Disk).*$',line.strip()):
            i += 1
    return i

def returnArrayInfo(output,controllerid,arrayid):
    id = 'c'+str(controllerid)+'u'+str(arrayid)
    # print 'DEBUG: id = '+str(id)
    operationlinennumber = False
    linenumber = 0
    units = 1
    type = 'JBOD'
    span = 0
    size = 0
    for line in output:
        if re.match(r'^RAID Level.*$',line.strip()):
            type = line.strip().split(':')[1].strip()
            type = 'RAID' + type.split(',')[0].split('-')[1].strip()
            # print 'debug: type = '+str(type)
        if re.match(r'^Number.*$',line.strip()):
            units = line.strip().split(':')[1].strip()
        if re.match(r'^Span Depth.*$',line.strip()):
            span = line.strip().split(':')[1].strip()
        if re.match(r'^Size.*$',line.strip()):
            # Size reported in MB
            if re.match(r'^.*MB$',line.strip().split(':')[1]):
                size = line.strip().split(':')[1].strip('MB').strip()
                size = str(int(round((float(size) / 1000))))+'G'
            # Size reported in TB
            elif re.match(r'^.*TB$',line.strip().split(':')[1]):
                size = line.strip().split(':')[1].strip('TB').strip()
                size = str(int(round((float(size) * 1000))))+'G'
            # Size reported in GB (default)
            else:
                size = line.strip().split(':')[1].strip('GB').strip()
                size = str(int(round((float(size)))))+'G'
        if re.match(r'^State.*$',line.strip()):
            state = line.strip().split(':')[1].strip()
        if re.match(r'^Ongoing Progresses.*$',line.strip()):
            operationlinennumber = linenumber
        linenumber += 1
    if operationlinennumber:
        inprogress = output[operationlinennumber+1]
    else:
        inprogress = 'None'
    # span is parsed as a string; compare it numerically
    if int(span) > 1:
        type = type+'0'
    type = type + ' ' + str(span) + 'x' + str(units)
    return [id,type,size,state,inprogress]

def returnDiskInfo(output,controllerid,currentarrayid):
    arrayid = False
    oldarrayid = False
    olddiskid = False
    table = []
    state = 'Offline'
    model = 'Unknown'
    enclnum = 'Unknown'
    slotnum = 'Unknown'
    enclsl = 'Unknown'
    # default for disks that report no position line
    span = 'x-x-x'

    firstDisk = True
    for line in output:
        if re.match(r'Firmware state: .*$',line.strip()):
            state = line.split(':')[1].strip()
            if re.match(r'Rebuild',state):
                cmd2 = '/opt/MegaRAID/MegaCli/MegaCli64 pdrbld showprog physdrv['+str(enclnum)+':'+str(slotnum)+'] a'+str(controllerid)+' nolog'
                ll = getOutput(cmd2)
                state += ' completed ' + re.sub(r'Rebuild Progress.*Completed', '', ll[0]).strip()
        if re.match(r'Slot Number: .*$',line.strip()):
            slotnum = line.split(':')[1].strip()
        if re.match(r'Inquiry Data: .*$',line.strip()):
            model = line.split(':')[1].strip()
            model = re.sub(' +', ' ', model)
            model = re.sub('Hotspare Information', '', model).strip() #remove bogus output from firmware 12.12
        # "postion" (sic) is the spelling the pattern has to match
        if re.match(r"(Drive|Disk)'s postion: .*$",line.strip()):
            spans = line.split(',')
            span = re.sub(r"(Drive|Disk).*DiskGroup:", '', spans[0]).strip()+'-'
            span += spans[1].split(':')[1].strip()+'-'
            span += spans[2].split(':')[1].strip()
        if re.match(r'Enclosure Device ID: [0-9]+$',line.strip()):
            if firstDisk == True:
                firstDisk = False
            else:
                enclsl = str(enclnum)+':'+str(slotnum)
                table.append([str(enclsl), span, model, state])
                span = 'x-x-x'
            enclnum = line.split(':')[1].strip()
    # Last disk of last array
    enclsl = str(enclnum)+':'+str(slotnum)
    table.append([str(enclsl), span, model, state])
    arraytable = []
    for disk in table:
        arraytable.append(disk)
    return arraytable

cmd = '/opt/MegaRAID/MegaCli/MegaCli64 adpcount nolog'
output = getOutput(cmd)
controllernumber = returnControllerNumber(output)

bad = False

# List available controller
if printcontroller:
    print('-- Controllers --')
    print('-- ID | Model')
    controllerid = 0
    while controllerid < controllernumber:
        cmd = '/opt/MegaRAID/MegaCli/MegaCli64 adpallinfo a'+str(controllerid)+' nolog'
        output = getOutput(cmd)
        controllermodel = returnControllerModel(output)
        print('c'+str(controllerid)+' | '+controllermodel)
        controllerid += 1
    print('')

if printarray:
    controllerid = 0
    print('-- Volumes --')
    print('-- ID | Type | Size | Status | InProgress')
    # print 'controller number'+str(controllernumber)
    while controllerid < controllernumber:
        arrayid = 0
        cmd = '/opt/MegaRAID/MegaCli/MegaCli64 ldinfo lall a'+str(controllerid)+' nolog'
        output = getOutput(cmd)
        arraynumber = returnArrayNumber(output)
        # print 'array number'+str(arraynumber)
        while arrayid < arraynumber:
            cmd = '/opt/MegaRAID/MegaCli/MegaCli64 ldinfo l'+str(arrayid)+' a'+str(controllerid)+' nolog'
            # print 'DEBUG: running '+str(cmd)
            output = getOutput(cmd)
            # print 'DEBUG: output '+str(output)
            arrayinfo = returnArrayInfo(output,controllerid,arrayid)
            print('volume '+arrayinfo[0]+' | '+arrayinfo[1]+' | '+arrayinfo[2]+' | '+arrayinfo[3]+' | '+arrayinfo[4])
            if not arrayinfo[3] == 'Optimal':
                bad = True
            arrayid += 1
        controllerid += 1
    print('')

print('-- Disks --')
print('-- Encl:Slot | vol-span-unit | Model | Status')

controllerid = 0
while controllerid < controllernumber:
    arrayid = 0
    cmd = '/opt/MegaRAID/MegaCli/MegaCli64 ldinfo lall a'+str(controllerid)+' nolog'
    output = getOutput(cmd)
    arraynumber = returnArrayNumber(output)
    while arrayid<arraynumber:
        #grab disk arrayId info
        cmd = '/opt/MegaRAID/MegaCli/MegaCli64 pdlist a'+str(controllerid)+' nolog'
        #print 'debug: running '+str(cmd)
        output = getOutput(cmd)
        arraydisk = returnDiskInfo(output,controllerid,arrayid)

        for array in arraydisk:
            print('disk '+array[0]+' | '+array[1]+' | '+array[2]+' | '+array[3])
        arrayid += 1
    controllerid += 1

if bad:
    print('\nThere is at least one disk/array in a NOT OPTIMAL state.')
    sys.exit(1)

******************************************************************************************************
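The core of what lsi-raidinfo does with the disk list is to split "Key: Value" lines and start a new record at each "Enclosure Device ID" line. A compact sketch of that idea follows. `parse_pdlist` is a hypothetical helper, not part of the script above, and the sample text is hand-written for illustration using only field names that the script's regexes already match; it is not captured controller output.

```python
def parse_pdlist(lines):
    """Group 'Key: Value' lines into one dict per physical disk.

    A new disk record starts at each 'Enclosure Device ID' line, the same
    boundary that lsi-raidinfo keys on.
    """
    disks = []
    current = None
    for raw in lines:
        line = raw.strip()
        if ":" not in line:
            continue
        # partition() splits on the first colon only, so values that
        # contain further colons are kept whole.
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Enclosure Device ID":
            current = {}
            disks.append(current)
        if current is not None:
            current[key] = value
    return disks


# Hand-written sample for illustration only.
sample = """
Enclosure Device ID: 32
Slot Number: 0
Firmware state: Online, Spun Up
Enclosure Device ID: 32
Slot Number: 1
Firmware state: Rebuild
""".splitlines()

for disk in parse_pdlist(sample):
    print(disk["Enclosure Device ID"] + ":" + disk["Slot Number"],
          disk["Firmware state"])
```

Keeping each disk as a dict makes it easy to add further fields later without touching the parsing loop.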

second script checks the output of that first one and summarizes errors only.

/root/bin/lsi-checkraid:

#!/usr/bin/python

# created by johnpuskar@gmail.com on 08/14/11
# rev 01

import os
import re
import sys

if len(sys.argv) > 1:
    print('Usage: accepts stdin from lsi-raidinfo')
    sys.exit(1)

blnBadDisk = False
infile = sys.stdin
for line in infile:
    # print 'DEBUG!! checking line:'+str(line)
    if re.match(r'disk .*$',line.strip()):
        # the parentheses in "Unconfigured(good)" are escaped so they match literally
        if re.match(r'^((?!Online, Spun Up|Online, Spun down|Hotspare, Spun Up|Hotspare, Spun down|Unconfigured\(good\), Spun Up).)*$',line.strip()):
            blnBadDisk = True
            badLine = line
            # print 'DEBUG!! bad disk found!'
    if re.match(r'volume ',line.strip()):
        if re.match(r'^((?!Optimal).)*$',line.strip()):
            # print 'DEBUG!! bad vol found!'
            blnBadDisk = True
            badLine = line

if blnBadDisk == True:
    print('RAID ERROR')
    # print badLine
else:
    print('RAID CLEAN')

******************************************************************************************************
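The negative-lookahead regexes in lsi-checkraid amount to "flag any disk line that contains none of the healthy states, and any volume line that lacks Optimal". A sketch of the same check written as a positive allowlist, which may be easier to read and extend, is below. `find_problems` and `HEALTHY_DISK_STATES` are hypothetical names. The state strings are taken from the script's own regex, with the parentheses in `Unconfigured(good)` escaped so they match literally.

```python
import re

# States treated as healthy, taken from the regex in lsi-checkraid.
HEALTHY_DISK_STATES = re.compile(
    r"Online, Spun Up|Online, Spun down|Hotspare, Spun Up|"
    r"Hotspare, Spun down|Unconfigured\(good\), Spun Up"
)


def find_problems(report_lines):
    """Return the 'disk' and 'volume' lines that do not look healthy."""
    problems = []
    for raw in report_lines:
        line = raw.strip()
        if line.startswith("disk ") and not HEALTHY_DISK_STATES.search(line):
            problems.append(line)
        elif line.startswith("volume ") and "Optimal" not in line:
            problems.append(line)
    return problems
```

Returning the offending lines, rather than a single flag, lets the alert say which disk or volume tripped the check.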

finally, this script uses those two and sends email alerts. it's run from crontab hourly as root.

/root/bin/lsi-emailalerts

#!/bin/sh

MAILTOADDR=root
# quote the tr sets so the shell cannot expand them as globs
HOST=$(hostname -s | tr '[a-z]' '[A-Z]')

#get megaraid status info
/root/bin/lsi-raidinfo | tee /tmp/lsi-raidinfo.txt | /root/bin/lsi-checkraid > /tmp/lsi-checkraid.txt

#check megaraid status info
if grep -qE "RAID ERROR" /tmp/lsi-checkraid.txt ; then
    cat /tmp/lsi-raidinfo.txt | mailx -s "$HOST Warning: failed disk or degraded array" $MAILTOADDR
fi

#rm -f /tmp/lsi-raidinfo.txt
#rm -f /tmp/lsi-checkraid.txt
exit 0

******************************************************************************************************

-- john r pierce 37N 122W somewhere on the middle of the left coast

Keith Keller

9:23 p.m.

On 2014-08-25, John R Pierce pierce@hogranch.com wrote:

IF megacli64 works for this raid controller, then I tweaked some python scripts I found online and use these two scripts.. these live in /root/bin as they are only for root's use.

They can probably go anywhere, since a normal user won't have the permissions to open the proper devices anyway.

I use slightly modified versions of these scripts with Nagios. I haven't had a drive fail yet (so one is sure to fail in the next day or two), but the scripts worked when the chiller in the room failed and the temperature spiked--they notified me that the internal temperatures of the ROC and the drives were all too high.
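For anyone wiring this into Nagios, the main adaptation is mapping the checker's result onto the standard plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The sketch below shows one way to do that. It is not the modified script mentioned above; `nagios_result` and `main` are hypothetical names, and it assumes the one-line "RAID CLEAN" / "RAID ERROR" summary that lsi-checkraid prints.

```python
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def nagios_result(summary):
    """Map the one-line lsi-checkraid summary to (exit_code, message)."""
    summary = summary.strip()
    if summary == "RAID CLEAN":
        return OK, "RAID OK - all volumes and disks healthy"
    if summary == "RAID ERROR":
        return CRITICAL, "RAID CRITICAL - failed disk or degraded array"
    return UNKNOWN, "RAID UNKNOWN - unexpected checker output: %r" % summary


def main(stream=None):
    """Read the summary from stream (stdin by default), print the status line,
    and return the exit code for the caller to pass to sys.exit()."""
    if stream is None:
        stream = sys.stdin
    code, message = nagios_result(stream.read())
    print(message)
    return code
```

A wrapper script would end with `sys.exit(main())` and be fed the output of `lsi-raidinfo | lsi-checkraid`. Treating unexpected output as UNKNOWN rather than OK means a broken or missing MegaCli install shows up in monitoring instead of passing silently.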

There is a GUI to the MegaRAID controllers available. I seldom use it so I can't give too much information about it.

If the OP's servers use a different controller there may still be scripts like these, just let us know what the hardware is. (I know they exist for 3ware, I think they may for Areca.)

--keith

-- kkeller@wombat.san-francisco.ca.us
