[CentOS] S.M.A.R.T

Sat Aug 30 10:07:28 UTC 2008

Rak,

Thanks! The Google paper is intense. I was hoping for something more
practical: commands or scripts I could use to monitor SMART on my
servers.
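
A rough sketch of the kind of script in question, in Python, assuming
smartmontools is installed and `smartctl -A <device>` prints the usual ATA
attribute table. The three attribute names watched here and the "raw value
above zero" rule are illustrative assumptions, not recommendations taken
from the papers cited below; names and raw formats vary by drive vendor.

```python
import subprocess
import sys

# Assumption: these attribute names appear as-is in `smartctl -A` output
# for the drives being checked. Many ATA drives report them; some do not.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_smart_attributes(text):
    """Map attribute name -> integer raw value from `smartctl -A` text."""
    attrs = {}
    for line in text.splitlines():
        # Attribute rows have 10 columns and start with a numeric ID.
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        # RAW_VALUE may carry extra text, e.g. "35 (Min/Max 20/45)".
        raw = fields[9].split()[0]
        try:
            attrs[fields[1]] = int(raw)
        except ValueError:
            continue  # skip raw values that are not plain integers
    return attrs


def suspicious(attrs):
    """Return watched attributes whose raw value is above zero."""
    return [name for name in WATCHED if attrs.get(name, 0) > 0]


def check_device(device):
    """Run `smartctl -A` on one device (usually needs root)."""
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    )
    return suspicious(parse_smart_attributes(result.stdout))


if __name__ == "__main__":
    for dev in sys.argv[1:]:
        bad = check_device(dev)
        print(dev, "WARN " + ", ".join(bad) if bad else "ok")
```

Run as root with device names as arguments (for example /dev/sda), from
cron or by hand. The smartd daemon shipped with smartmontools may be a
better fit for continuous monitoring than a polling script like this one.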

On Sat, Aug 30, 2008 at 4:57 AM, Richard Karhuse <rkarhuse at gmail.com> wrote:
>
>
> On Sat, Aug 30, 2008 at 4:08 AM, Mag Gam <magawake at gmail.com> wrote:
>>
>> At my physics lab we have 30 servers with 1TB disk packs, and I need
>> to monitor them for disk failures. I have been reading about SMART and
>> it seems it can help. However, I am not sure what to look for when a
>> drive is about to fail. Any thoughts about this? Is anyone using this
>> method to predict disk failures?
>
>
> Here are a few references from my archives w.r.t. SMART ...
>
> Hope they help ...
>
>    -rak-
>
> ====
>
> http://hardware.slashdot.org/hardware/07/02/18/0420247.shtml
>
> Google Releases Paper on Disk Reliability
>
> "The Google engineers just published a paper on Failure Trends in a Large
> Disk Drive Population. Based on a study of 100,000 disk drives over 5 years
> they find some interesting stuff. To quote from the abstract: 'Our analysis
> identifies several parameters from the drive's self monitoring facility
> (SMART) that correlate highly with failures. Despite this high correlation,
> we conclude that models based on SMART parameters alone are unlikely to be
> useful for predicting individual drive failures. Surprisingly, we found that
> temperature and activity levels were much less correlated with drive
> failures than previously reported.'"
>
>
> http://hardware.slashdot.org/hardware/07/02/21/004233.shtml
>
> Everything You Know About Disks Is Wrong
>
> "Google's wasn't the best storage paper at FAST '07. Another, more
> provocative paper looking at real-world results from 100,000 disk drives got
> the 'Best Paper' award. Bianca Schroeder, of CMU's Parallel Data Lab,
> submitted 'Disk failures in the real world: What does an MTTF of 1,000,000
> hours mean to you?' The paper crushes a number of (what we now know to be)
> myths about disks such as vendor MTBF validity, 'consumer' vs. 'enterprise'
> drive reliability (spoiler: no difference), and RAID 5 assumptions.
> StorageMojo has a good summary of the paper's key points."
>
>
> http://www.linuxjournal.com/article/6983?from=50&comments_per_page=50
>
> Monitoring Hard Disks with SMART
>
> By Bruce Allen on Thu, 2004-01-01 02:00 (SysAdmin)
>
> One of your hard disks might be trying to tell you it's not long for this
> world. Install software that lets you know when to replace it.
>
> It's a given that all disks eventually die, and it's easy to see why. The
> platters in a modern disk drive rotate more than a hundred times per second,
> maintaining submicron tolerances between the disk heads and the magnetic
> media that store data. Often they run 24/7 in dusty, overheated
> environments, thrashing on heavily loaded or poorly managed machines. So,
> it's not surprising that experienced users are all too familiar with the
> symptoms of a dying disk. Strange things start happening. Inscrutable kernel
> error messages cover the console and then the system becomes unstable and
> locks up. Often, entire days are lost repeating recent work, re-installing
> the OS and trying to recover data. Even if you have a recent backup, sudden
> disk failure is a minor catastrophe.
>
> http://smartmontools.sourceforge.net/
>
> smartmontools Home Page
>
> Welcome! This is the home page for the smartmontools package.
>
>