At my physics lab we have 30 servers with 1TB disk packs, and I need
to monitor them for disk failures. I have been reading about SMART and
it seems it can help. However, I am not sure what to look for to tell
that a drive is about to fail. Any thoughts about this? Is anyone using
this method to predict disk failures?
It's a given that all disks eventually die, and it's easy to see why. The platters in a modern disk drive rotate more than a hundred times per second, maintaining submicron tolerances between the disk heads and the magnetic media that store data. Often they run 24/7 in dusty, overheated environments, thrashing on heavily loaded or poorly managed machines. So, it's not surprising that experienced users are all too familiar with the symptoms of a dying disk. Strange things start happening. Inscrutable kernel error messages cover the console and then the system becomes unstable and locks up. Often, entire days are lost repeating recent work, re-installing the OS and trying to recover data. Even if you have a recent backup, sudden disk failure is a minor catastrophe.
http://smartmontools.sourceforge.net/
Welcome! This is the home page for the smartmontools package.
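
As for what to look for: here is a rough sketch of one way to script a check around smartctl from that package. It is not an official recipe. It assumes `smartctl -A <device>` prints the usual ATA attribute table, and the choice of attributes to watch (reallocated, pending and offline-uncorrectable sectors) is my own assumption about what tends to signal media trouble, so tune it for your drives. The function names and the sample table are made up for illustration.

```python
import subprocess

# Attributes to watch. This list is an assumption, not a smartmontools
# recommendation; adjust it for your drive models.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}

# Made-up sample in the layout of `smartctl -A` output, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   198   198   140    Pre-fail  Always       -       12
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       7021
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
"""


def parse_attributes(text):
    """Parse the attribute table from `smartctl -A` into a dict keyed by name."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        try:
            attrs[fields[1]] = {
                "value": int(fields[3]),
                "thresh": int(fields[5]),
                "when_failed": fields[8],
                "raw": int(fields[9]),
            }
        except ValueError:
            continue  # skip rows whose columns are not plain integers
    return attrs


def find_warnings(attrs):
    """Return human-readable warnings for suspicious attributes."""
    warnings = []
    for name, a in sorted(attrs.items()):
        if a["when_failed"] != "-":
            # The drive itself reports this attribute at or past its threshold.
            warnings.append(f"{name}: marked failed ({a['when_failed']})")
        elif name in WATCHED and a["raw"] > 0:
            warnings.append(f"{name}: raw value {a['raw']}")
    return warnings


def check_disk(device):
    """Run smartctl on one device (usually needs root) and return warnings."""
    # smartctl uses its exit status as a bitmask, so a nonzero status is
    # not treated as an error here.
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    )
    return find_warnings(parse_attributes(result.stdout))


if __name__ == "__main__":
    for warning in find_warnings(parse_attributes(SAMPLE)):
        print(warning)
```

On a real server you would call something like check_disk("/dev/sda") from cron on each machine and mail yourself the result. The package also ships the smartd daemon, which can do this kind of periodic checking and send mail on its own, so a script like this is mostly useful if you want to feed the numbers into monitoring you already have.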