<div dir="ltr"><br><br><div class="gmail_quote">On Sat, Aug 30, 2008 at 4:08 AM, Mag Gam <span dir="ltr"><<a href="mailto:magawake@gmail.com">magawake@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
At my physics lab we have 30 servers with 1TB disk packs. I am in need<br>
of monitoring for disk failures. I have been reading about SMART and<br>
it seems it can help. However, I am not sure what to look for if a<br>
drive is about to fail. Any thoughts about this? Is anyone using this<br>
method to predetermine disk failures?<br>
</blockquote><div><br>Here are a few references from my archives w.r.t. SMART ...<br><br>Hope they help ...<br><br> -rak-<br><br>====<br><h3><a href="http://hardware.slashdot.org/hardware/07/02/18/0420247.shtml" target="_blank">http://hardware.slashdot.org/hardware/07/02/18/0420247.shtml</a><br>
</h3><div style="margin-left: 40px;"><h3>
<span class="nfakPe">Google</span> Releases Paper on <span class="nfakPe">Disk</span> Reliability</h3><i>"The <span class="nfakPe">Google</span> engineers just published a paper on <a href="http://labs.google.com/papers/disk_failures.pdf" target="_blank">Failure Trends in a Large <span class="nfakPe">Disk</span> Drive Population</a>.
Based on a study of 100,000 <span class="nfakPe">disk</span> drives over 5 years they find some
interesting stuff. To quote from the abstract: 'Our analysis identifies
several parameters from the drive's self monitoring facility (<span class="nfakPe">SMART</span>)
that correlate highly with failures. Despite this high correlation, we
conclude that models based on <span class="nfakPe">SMART</span> parameters alone are unlikely to be
useful for predicting individual drive failures. Surprisingly, we found
that temperature and activity levels were much less correlated with
drive failures than previously reported.'"</i></div>
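<p>If you want to eyeball exactly those counters on your own drives, the smartctl tool from the smartmontools package (linked at the bottom of this mail) will print them. A rough, untested sketch -- the device list and the attribute names I watch are my own choices, and it needs root -- might look like this:</p>
<pre>
#!/usr/bin/env python3
# Rough sketch: flag the SMART counters the Google paper found most
# predictive (reallocation, offline-reallocation and pending/"probational"
# sector counts) by parsing "smartctl -A". Adjust DEVICES for your machines.
import subprocess

DEVICES = ["/dev/sda", "/dev/sdb"]
WATCH = ("Reallocated_Sector_Ct", "Reallocated_Event_Count",
         "Current_Pending_Sector", "Offline_Uncorrectable")

for dev in DEVICES:
    out = subprocess.run(["smartctl", "-A", dev],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[1] in WATCH:
            try:
                raw = int(fields[9])          # RAW_VALUE column
            except ValueError:
                continue                      # some raw values aren't plain integers
            if raw:                           # any nonzero count deserves a look
                print(f"{dev}: {fields[1]} = {raw}")
</pre>
<p>Mind the paper's own caveat, though: plenty of drives die with clean SMART numbers, so treat these counters as a warning signal rather than an all-clear.</p>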
<br><div style="margin-left: 40px;">
<h3>
Everything You Know About Disks Is Wrong</h3><i>"Google's wasn't the best storage paper at <a href="http://www.usenix.org/events/fast07/" target="_blank">FAST '07</a>.
Another, more provocative paper looking at real-world results from
100,000 disk drives got the 'Best Paper' award. Bianca Schroeder, of
CMU's Parallel Data Lab, submitted <a href="http://www.usenix.org/events/fast07/tech/schroeder/schroeder_html/index.html" target="_blank">Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?</a>
The paper crushes a number of (what we now know to be) myths about
disks such as vendor MTBF validity, 'consumer' vs. 'enterprise' drive
reliability (spoiler: no difference), and RAID 5 assumptions.
StorageMojo has <a href="http://storagemojo.com/?p=383" target="_blank">a good summary of the paper's key points</a>."</i><br></div>
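<p>To see why the datasheet MTTF numbers deserve distrust, a quick back-of-the-envelope calculation (my arithmetic, not lifted from the paper) helps:</p>
<pre>
# What a datasheet MTTF of 1,000,000 hours would imply per drive per year:
mttf_hours = 1_000_000
hours_per_year = 24 * 365                     # 8760
implied_afr = hours_per_year / mttf_hours     # about 0.009
print("Implied annual failure rate: %.2f%%" % (100 * implied_afr))   # ~0.88%
# Schroeder's field data puts annual disk *replacement* rates at roughly
# 2-4% (worse on some systems), well above the ~0.9% implied here.
</pre>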
<br><div style="margin-left: 40px;">
<h1>Monitoring Hard Disks with SMART</h1>
<span>By <a href="http://www.linuxjournal.com/user/801273" title="View user profile." target="_blank">Bruce Allen</a> on Thu, 2004-01-01 02:00.</span>
<span><a href="http://www.linuxjournal.com/taxonomy/term/8" target="_blank">SysAdmin</a></span>
One
of your hard disks might be trying to tell you it's not long for this
world. Install software that lets you know when to replace it.
<p>It's a given that all disks eventually die, and it's easy to see why. The platters in a modern disk drive
rotate more than a hundred times per second, maintaining submicron tolerances between the disk heads and the
magnetic media that store data. Often they run 24/7 in dusty, overheated environments, thrashing on heavily
loaded or poorly managed machines. So, it's not surprising that experienced users are all too familiar with
the symptoms of a dying disk. Strange things start happening. Inscrutable kernel error messages cover the
console and then the system becomes unstable and locks up. Often, entire days are lost repeating recent work,
re-installing the OS and trying to recover data. Even if you have a recent backup, sudden disk failure is a
minor catastrophe.</p>
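<p>To turn that into day-to-day monitoring across your 30 servers, the usual tool is the smartd daemon that ships with smartmontools (home page linked just below): it watches the drives, can schedule self-tests and mails you when something degrades. If you first want a quick cron-able sanity check, a minimal sketch along these lines (device names and the alerting hook are placeholders, and it assumes smartctl is installed and run as root) would do:</p>
<pre>
#!/usr/bin/env python3
# Minimal health sweep: ask each drive for its overall SMART verdict and
# report anything that doesn't pass. Wire the output into mail/Nagios/cron
# yourself; smartd does this (and much more) once configured properly.
import socket
import subprocess

DEVICES = [f"/dev/sd{c}" for c in "abcd"]     # adjust per machine

failed = []
for dev in DEVICES:
    result = subprocess.run(["smartctl", "-H", dev],
                            capture_output=True, text=True)
    # ATA drives report "...self-assessment test result: PASSED";
    # SCSI drives report "SMART Health Status: OK".
    if "PASSED" not in result.stdout and "OK" not in result.stdout:
        failed.append(dev)

if failed:
    print(f"{socket.gethostname()}: SMART health check FAILED on {', '.join(failed)}")
</pre>
<p>Note the overall health flag only trips once the drive itself thinks it is failing, so combine it with the attribute counters above rather than relying on it alone.</p>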
<p><a href="http://smartmontools.sourceforge.net/" target="_blank">http://smartmontools.sourceforge.net/</a></p><div align="center"><h1><font color="#3333ff">smartmontools Home Page</font></h1></div>
<p>Welcome! This is the home page for the smartmontools package.</p></div><br><br></div></div></div>