At my physics lab we have 30 servers with 1TB disk packs. I need to monitor them for disk failures. I have been reading about SMART, and it seems it can help. However, I am not sure what to look for when a drive is about to fail. Any thoughts about this? Is anyone using this method to predict disk failures?
TIA
On Sat, Aug 30, 2008 at 4:08 AM, Mag Gam magawake@gmail.com wrote:
> At my physics lab we have 30 servers with 1TB disk packs. I need to monitor them for disk failures. I have been reading about SMART, and it seems it can help. However, I am not sure what to look for when a drive is about to fail. Any thoughts about this? Is anyone using this method to predict disk failures?
Here are a few references from my archives w.r.t. SMART ...
Hope they help ...
-rak-
====
http://hardware.slashdot.org/hardware/07/02/18/0420247.shtml
Google Releases Paper on Disk Reliability
"The Google engineers just published a paper on Failure Trends in a Large Disk Drive Population (http://labs.google.com/papers/disk_failures.pdf). Based on a study of 100,000 disk drives over 5 years they find some interesting stuff. To quote from the abstract: 'Our analysis identifies several parameters from the drive's self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.'"
http://hardware.slashdot.org/hardware/07/02/21/004233.shtml
Everything You Know About Disks Is Wrong
"Google's wasn't the best storage paper at FAST '07 (http://www.usenix.org/events/fast07/). Another, more provocative paper looking at real-world results from 100,000 disk drives got the 'Best Paper' award. Bianca Schroeder, of CMU's Parallel Data Lab, submitted Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? (http://www.usenix.org/events/fast07/tech/schroeder/schroeder_html/index.html) The paper crushes a number of (what we now know to be) myths about disks such as vendor MTBF validity, 'consumer' vs. 'enterprise' drive reliability (spoiler: no difference), and RAID 5 assumptions. StorageMojo has a good summary of the paper's key points (http://storagemojo.com/?p=383)."
http://www.linuxjournal.com/article/6983?from=50&comments_per_page=50
Monitoring Hard Disks with SMART
By Bruce Allen (http://www.linuxjournal.com/user/801273) on Thu, 2004-01-01 02:00. SysAdmin (http://www.linuxjournal.com/taxonomy/term/8)
One of your hard disks might be trying to tell you it's not long for this world. Install software that lets you know when to replace it.
It's a given that all disks eventually die, and it's easy to see why. The platters in a modern disk drive rotate more than a hundred times per second, maintaining submicron tolerances between the disk heads and the magnetic media that store data. Often they run 24/7 in dusty, overheated environments, thrashing on heavily loaded or poorly managed machines. So, it's not surprising that experienced users are all too familiar with the symptoms of a dying disk. Strange things start happening. Inscrutable kernel error messages cover the console and then the system becomes unstable and locks up. Often, entire days are lost repeating recent work, re-installing the OS and trying to recover data. Even if you have a recent backup, sudden disk failure is a minor catastrophe.
http://smartmontools.sourceforge.net/
smartmontools Home Page
Welcome! This is the home page for the smartmontools package.
Rak,
Thanks! The Google paper is intense. I was hoping for something more practical, such as commands or scripts, to better monitor my SMART environment.
CentOS mailing list CentOS@centos.org http://lists.centos.org/mailman/listinfo/centos
On Saturday 30 August 2008 09:57:10 Richard Karhuse wrote:
> http://smartmontools.sourceforge.net/
> smartmontools Home Page
> Welcome! This is the home page for the smartmontools package.
I use this, and it's worth noting that it can be run on Windows boxes, too.
Anne
Thank you, Anne.
I just installed this and it seems to work. I am behind a RAID controller, so hopefully anyone with cciss drivers can shed some light. I am able to see my logical devices, but I have 6 drives per logical device. I would like to see the status of all 12 drives if possible.
On Linux, smartd runs as a daemon and watches all drives. What I've done is create a cron job that searches /var/log/messages for the word "smart" and emails me the result. If I get a blank message, there are no drive problems.
The following url has a sample of what /var/log/messages might show if smart has something to report: http://defindit.com/readme_files/smartd_smartctl.html
Scott
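Scott's cron approach could be sketched roughly as follows. This is only an illustration: the function name smartd_report, the default log path, and the mail(1) invocation with a placeholder address are assumptions, not details of his actual setup.

```shell
#!/bin/sh
# Print every line mentioning "smart" (case-insensitive) from a syslog-style
# file. Defaults to /var/log/messages, the file named in the message above.
# "|| true" keeps the exit status at 0 when nothing matches, so an empty
# report is treated as "no news" rather than as an error.
smartd_report() {
    logfile="${1:-/var/log/messages}"
    grep -i 'smart' "$logfile" || true
}

# Hypothetical cron usage (the address is a placeholder):
#   smartd_report | mail -s "smartd report for $(hostname)" admin@example.com
```

As in the original description, an empty mail body would mean nothing was logged. Note that a log that has just been rotated will also produce an empty report.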
Mag Gam wrote:
> Thank you, Anne.
> I just installed this and it seems to work. I am behind a RAID controller, so hopefully anyone with cciss drivers can shed some light. I am able to see my logical devices, but I have 6 drives per logical device. I would like to see the status of all 12 drives if possible.
The RAID controller watches the SMART status of the drives for you.
You should watch the status of the RAID controller. It should give a warning if a drive is about to fail or has failed.
I use array-info to get the status: http://sourceforge.net/projects/array-info/
# /usr/local/bin/array-info -d /dev/cciss/c0d0
Compaq Smart Array 5312
Firmware revision : 2.58
Rom revision : 2.58
1 logical drive configured.

Logical drive 0 :
Fault tolerance : RAID 5
Size : 957.10 GiB (2007181360 blocks of 512 bytes)
Status : Logical drive is ok
Mogens
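A check like this could be scripted by scanning the array-info output for any logical drive whose status is not ok. The sketch below assumes the output format shown in the message above (one "Status : ..." line per logical drive). The function name array_ok is made up for illustration, and other controllers or array-info versions may word their status lines differently.

```shell
# Read array-info output on stdin. Succeed only if at least one status line
# is present and every status line reads "Logical drive is ok".
array_ok() {
    awk '
        /Status *:/ { seen++; if ($0 !~ /Logical drive is ok/) bad++ }
        END { exit (seen > 0 && bad == 0) ? 0 : 1 }
    '
}

# Example, using the path and device from the message above:
#   /usr/local/bin/array-info -d /dev/cciss/c0d0 | array_ok || echo "check the array"
```

Treating "no status line at all" as a failure is a deliberate choice here, so that a missing tool or an unexpected output format does not pass silently.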
Thank you again.
I suppose I can look at smartd to get log files and have them forwarded to syslog-ng, unless smartd has an email feature :-)
What does your smartd config look like for an HP P400/800 controller? I would be curious to look at that.
TIA
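On the email question: smartd does have one. In smartd.conf, the -m directive names an address to mail warnings to, and -d cciss,N selects a physical drive behind a cciss controller. The sketch below prints one such line per drive. The function name, drive count, device path, and address are placeholders to adapt, and this is not anyone's actual configuration from this thread.

```shell
# Print smartd.conf lines for drives 0..count-1 behind one cciss device.
# -a turns on smartd's default set of checks; -m sets the warning address.
gen_smartd_conf() {
    dev="$1"; count="$2"; addr="$3"
    n=0
    while [ "$n" -lt "$count" ]; do
        printf '%s -d cciss,%s -a -m %s\n' "$dev" "$n" "$addr"
        n=$((n + 1))
    done
}

# Example (review the output before adding it to /etc/smartd.conf):
#   gen_smartd_conf /dev/cciss/c0d0 6 admin@example.com
```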
Mag Gam wrote:
> Thank you again.
> I suppose I can look at smartd to get log files and have them forwarded to syslog-ng, unless smartd has an email feature :-)
> What does your smartd config look like for an HP P400/800 controller? I would be curious to look at that.
I don't run smartd on the drives.
I can get the SMART status like:
smartctl -d cciss,N -a /dev/cciss/c0d0
where N is the drive number (from 0), but it will only return the status of 6 of the 8 drives.
Mogens
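To check every physical drive behind the controller, the command above can be wrapped in a loop over N. A rough sketch follows: the function name and the SMARTCTL override variable are invented here for illustration, and the drive count has to be supplied by hand. It uses smartctl's -H (health) option and simply treats any non-zero exit status as "needs a closer look".

```shell
# Run a health check on drives 0..count-1 behind a cciss device and print
# one line per drive. Set SMARTCTL to use a different smartctl binary.
# Returns non-zero if any drive's check exited non-zero.
check_cciss_drives() {
    dev="$1"; count="$2"
    n=0; failed=0
    while [ "$n" -lt "$count" ]; do
        if "${SMARTCTL:-smartctl}" -H -d "cciss,$n" "$dev" >/dev/null 2>&1; then
            echo "drive $n: ok"
        else
            echo "drive $n: non-zero exit, inspect with smartctl -a"
            failed=1
        fi
        n=$((n + 1))
    done
    return "$failed"
}

# Example: check_cciss_drives /dev/cciss/c0d0 6
```

A non-zero exit here can mean a failing drive, but also that the drive could not be opened at all, so the full smartctl -a output is still needed to tell the two apart.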
Mogens,
Correct, that's what I am using.
N=0 is the controller
N=1 is drive 1
N=2 is drive 2
N>3 is not working for me. Strange.
I have 2 logical drives: /dev/cciss/c0d1 and /dev/cciss/c0d2.
Each logical drive has 6 physical volumes, for a total of 12 physical volumes.
Are you experiencing the same thing?
Mag Gam wrote:
> Mogens,
> Correct, that's what I am using.
> N=0 is the controller
> N=1 is drive 1
> N=2 is drive 2
> N>3 is not working for me. Strange.
> I have 2 logical drives: /dev/cciss/c0d1 and /dev/cciss/c0d2.
> Each logical drive has 6 physical volumes, for a total of 12 physical volumes.
> Are you experiencing the same thing?
No. N refers to physical drives. N=0 is the first drive.
Mogens
But how would the OS know about the physical drives? I thought it would only know about the logical drives.
When I run it with N=0, I get this:
Device: HP P800 Version: 5.20
Terminate command early due to bad response to IEC mode page
Very strange...
Also, I am using:
smartctl -a -d cciss,1 -i /dev/cciss/c0d0
smartctl -a -d cciss,2 -i /dev/cciss/c0d0
smartctl -a -d cciss,3 -i /dev/cciss/c0d0
If I go above 3, I get the same type of error. I am not sure why this is occurring. Any ideas?
TIA
Mag Gam wrote:
> When I run it with N=0, I get this:
> Device: HP P800 Version: 5.20
> Terminate command early due to bad response to IEC mode page
> Very strange...
> Also, I am using:
> smartctl -a -d cciss,1 -i /dev/cciss/c0d0
> smartctl -a -d cciss,2 -i /dev/cciss/c0d0
> smartctl -a -d cciss,3 -i /dev/cciss/c0d0
> If I go above 3, I get the same type of error. I am not sure why this is occurring. Any ideas?
No. I get:
# smartctl -a -d cciss,0 -i /dev/cciss/c0d0
smartctl version 5.36 [i686-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: COMPAQ BD14685A26 Version: HPB8
Serial number: 3HY0Y2A000007345YXGK
Device type: disk
Transport protocol: Parallel SCSI (SPI-4)
Local Time is: Mon Sep 1 07:50:33 2008 CEST
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK

Current Drive Temperature: 30 C
Drive Trip Temperature: 68 C
Elements in grown defect list: 0
Vendor (Seagate) cache information
  Blocks sent to initiator = 1002192096
  Blocks received from initiator = 3013775768
  Blocks read from cache and sent to initiator = 2824497623
  Number of read and write commands whose size <= segment size = 2618352439
  Number of read and write commands whose size > segment size = 315346
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 44957.97
  number of minutes until next internal SMART test = 66

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   201675054        0         0  201675054   201675054     308793.455           0
write:          0        0         0          0           0       8566.933           0
verify:     38398        0         0      38398       38398        146.816           0

Non-medium error count: 39977

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -       2                 - [-   -    -]
# 2  Background short  Completed                   -       2                 - [-   -    -]

Long (extended) Self Test duration: 3072 seconds [51.2 minutes]
And:
# smartctl -a -d cciss,1 -i /dev/cciss/c0d0
smartctl version 5.36 [i686-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: COMPAQ BD14685A26 Version: HPB8
Serial number: 3HY0XJF300007346LYPT
Device type: disk
Transport protocol: Parallel SCSI (SPI-4)
Local Time is: Mon Sep 1 07:51:32 2008 CEST
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK

Current Drive Temperature: 30 C
Drive Trip Temperature: 68 C
Elements in grown defect list: 0
Vendor (Seagate) cache information
  Blocks sent to initiator = 2178327863
  Blocks received from initiator = 2023231997
  Blocks read from cache and sent to initiator = 822049249
  Number of read and write commands whose size <= segment size = 2230377771
  Number of read and write commands whose size > segment size = 307242
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 44604.20
  number of minutes until next internal SMART test = 66

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   239718748        0         0  239718748   239718748     252353.742           0
write:          0        0         0          0           0       8059.356           0
verify:     46661        0         0      46661       46661        146.816           0

Non-medium error count: 30252

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -       2                 - [-   -    -]
# 2  Background short  Completed                   -       2                 - [-   -    -]

Long (extended) Self Test duration: 3072 seconds [51.2 minutes]
Mogens
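As for what to look for in output like this, a few fields are natural candidates to track over time: the SMART Health Status line, the grown defect list count, the non-medium error count, and the "uncorrected errors" column of the error counter log. The sketch below pulls those out of smartctl -a output for a SCSI drive. It assumes the field labels seen in the output above, the function name smart_summary is invented, and deciding which values count as alarming is left to the reader.

```shell
# Read `smartctl -a` output for a SCSI drive on stdin and print a short
# key=value summary of the fields named in the patterns below.
smart_summary() {
    awk '
        /^SMART Health Status:/           { sub(/^[^:]*: */, ""); print "health=" $0 }
        /^Elements in grown defect list:/ { print "grown_defects=" $NF }
        /^Non-medium error count:/        { print "non_medium_errors=" $NF }
        /^(read|write|verify):/           { sub(/:$/, "", $1); print $1 "_uncorrected=" $NF }
    '
}

# Example: smartctl -a -d cciss,0 /dev/cciss/c0d0 | smart_summary
```

Running this from cron and comparing against the previous day's values would show whether the defect list or the uncorrected error counts are growing, which a single snapshot cannot.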