Rak,

Thanks! The Google paper is intense. I was hoping to get some
practical usage with commands or scripts to better monitor my SMART
environment.

On Sat, Aug 30, 2008 at 4:57 AM, Richard Karhuse <rkarhuse at gmail.com> wrote:
>
> On Sat, Aug 30, 2008 at 4:08 AM, Mag Gam <magawake at gmail.com> wrote:
>>
>> At my physics lab we have 30 servers with 1TB disk packs. I need to
>> monitor them for disk failures. I have been reading about SMART and
>> it seems it can help. However, I am not sure what to look for if a
>> drive is about to fail. Any thoughts about this? Is anyone using
>> this method to predict disk failures?
>
> Here are a few references from my archives w.r.t. SMART ...
>
> Hope they help ...
>
> -rak-
>
> ====
>
> http://hardware.slashdot.org/hardware/07/02/18/0420247.shtml
>
> Google Releases Paper on Disk Reliability
>
> "The Google engineers just published a paper on Failure Trends in a
> Large Disk Drive Population. Based on a study of 100,000 disk drives
> over five years, they find some interesting stuff. To quote from the
> abstract: 'Our analysis identifies several parameters from the
> drive's self-monitoring facility (SMART) that correlate highly with
> failures. Despite this high correlation, we conclude that models
> based on SMART parameters alone are unlikely to be useful for
> predicting individual drive failures. Surprisingly, we found that
> temperature and activity levels were much less correlated with drive
> failures than previously reported.'"
>
> http://hardware.slashdot.org/hardware/07/02/21/004233.shtml
>
> Everything You Know About Disks Is Wrong
>
> "Google's wasn't the best storage paper at FAST '07. Another, more
> provocative paper looking at real-world results from 100,000 disk
> drives got the 'Best Paper' award. Bianca Schroeder, of CMU's
> Parallel Data Lab, submitted 'Disk failures in the real world: What
> does an MTTF of 1,000,000 hours mean to you?' The paper crushes a
> number of (what we now know to be) myths about disks, such as vendor
> MTBF validity, 'consumer' vs. 'enterprise' drive reliability
> (spoiler: no difference), and RAID 5 assumptions. StorageMojo has a
> good summary of the paper's key points."
>
> http://www.linuxjournal.com/article/6983?from=50&comments_per_page=50
>
> Monitoring Hard Disks with SMART
>
> By Bruce Allen, Thu, 2004-01-01. One of your hard disks might be
> trying to tell you it's not long for this world. Install software
> that lets you know when to replace it.
>
> It's a given that all disks eventually die, and it's easy to see
> why. The platters in a modern disk drive rotate more than a hundred
> times per second, maintaining submicron tolerances between the disk
> heads and the magnetic media that store data. Often they run 24/7 in
> dusty, overheated environments, thrashing on heavily loaded or
> poorly managed machines. So, it's not surprising that experienced
> users are all too familiar with the symptoms of a dying disk.
> Strange things start happening. Inscrutable kernel error messages
> cover the console, and then the system becomes unstable and locks
> up. Often, entire days are lost repeating recent work, reinstalling
> the OS and trying to recover data. Even if you have a recent backup,
> sudden disk failure is a minor catastrophe.
>
> http://smartmontools.sourceforge.net/
>
> smartmontools Home Page
>
> Welcome! This is the home page for the smartmontools package.
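
====

On the practical side: the smartmontools package above ships two
tools, smartctl for one-shot queries (smartctl -H /dev/sda reports
the drive's overall health self-assessment; smartctl -A /dev/sda
dumps the raw attribute table) and smartd, a daemon that watches
drives continuously. Below is a minimal polling sketch in Python;
the /dev/sd? device glob and the output format are assumptions for
disk packs like those described above, so treat it as a starting
point rather than a finished monitor.

#!/usr/bin/env python3
"""Minimal SMART health poller built on smartctl (smartmontools).

Assumes smartmontools is installed, the script runs as root, and
drives appear as /dev/sda, /dev/sdb, ... -- adjust the glob to
match your hardware.
"""
import glob
import subprocess

def health_ok(dev):
    # 'smartctl -H' asks the drive for its SMART overall-health
    # self-assessment; the report contains PASSED or FAILED.
    result = subprocess.run(["smartctl", "-H", dev],
                            capture_output=True, text=True)
    return "PASSED" in result.stdout

for dev in sorted(glob.glob("/dev/sd?")):
    if health_ok(dev):
        print(f"{dev}: OK")
    else:
        print(f"{dev}: WARNING -- SMART health check failed")

For continuous monitoring, the same check belongs in smartd: a
one-line /etc/smartd.conf such as

    DEVICESCAN -a -m root

has smartd scan all drives, track every attribute, and mail root
when one crosses its failure threshold. Bruce Allen's Linux Journal
article above walks through both tools in more detail.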