[CentOS] Re: Question re RHEL 5.3

Thu Oct 30 13:53:54 UTC 2008
Robert Nichols <rnicholsNOSPAM at comcast.net>

MHR wrote:
> 
> The one problem I've seen and posted here was w.r.t. smartd error
> reports showing 2^32 - 1 errors on one of the disks (probably my
> system disk) every few minutes.  I thought this was more than just a
> bit suspicious, since there are only about 585,937,500 sectors on a
> 300GB disk, and 4,294,967,295 errors would be more than seven per
> sector; that is rather unlikely unless the whole system is crashing a lot (it's
> not).  It's a Seagate 300GB, so I ran Seagate's SeaTools on it in
> lightweight mode, and no problems were reported, which is good because
> the disk is only about a year and a half old and has my CentOS root,
> swap, boot and home partitions on it.
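For reference, the sector arithmetic can be checked quickly. This is a minimal sketch that assumes decimal gigabytes (1 GB = 10**9 bytes, as drive makers count) and 512-byte sectors. It shows that a 300GB drive has about 585.9 million sectors, that 2^32 - 1 is more than seven times that, and that 2^32 - 1 is 0xFFFFFFFF, the largest value an unsigned 32-bit counter can hold.

```python
# Back-of-the-envelope check of the sector arithmetic.
# Assumptions: decimal GB (10**9 bytes) and 512-byte sectors.
DISK_BYTES = 300 * 10**9
SECTOR_BYTES = 512

sectors = DISK_BYTES // SECTOR_BYTES
reported = 2**32 - 1  # the error count smartd was reporting

print(f"sectors on disk: {sectors:,}")                   # 585,937,500
print(f"reported count:  {reported:,} (0x{reported:X})")  # 4,294,967,295 (0xFFFFFFFF)
print(f"count / sectors: {reported / sectors:.2f}")      # 7.33
```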

Precisely which error counters are alarming you?  If these are the
raw values for Raw_Read_Error_Rate, Hardware_ECC_Recovered, and
Seek_Error_Rate, large numbers there are normal for Seagate drives.
Look at the normalized values for these attributes instead.  As long
as they are not approaching their failure thresholds, the drive is
OK.  For further reassurance you can run the SMART extended ("long")
self-test on the drive with "smartctl -t long /dev/whatever" (see the
smartctl manpage for details).
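The "look at the normalized value, not the raw one" check can be scripted. The sketch below assumes the ATA attribute table layout that `smartctl -A` prints (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE). The SAMPLE text is made-up illustrative output, not from any real drive, and the `margin` of 10 is an arbitrary, hypothetical choice.

```python
# Minimal sketch: flag SMART attributes whose normalized VALUE is close to THRESH.
# SAMPLE is made-up output in the `smartctl -A` table layout, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       149523408
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   038   035   030    Pre-fail  Always       -       71234567
195 Hardware_ECC_Recovered  0x001a   050   045   000    Old_age   Always       -       149523408
"""

def parse_attributes(text):
    """Return one dict per attribute row of the table."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; skip the header and blanks.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        rows.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),   # normalized current value
            "worst": int(fields[4]),   # lowest normalized value seen
            "thresh": int(fields[5]),  # failure threshold
            "raw": " ".join(fields[9:]),  # raw value may contain spaces
        })
    return rows

def near_threshold(rows, margin=10):
    """Rows whose normalized value is within `margin` of THRESH.
    A THRESH of 0 never triggers a failure, so those rows are skipped."""
    return [r for r in rows
            if r["thresh"] > 0 and r["value"] - r["thresh"] <= margin]

if __name__ == "__main__":
    rows = parse_attributes(SAMPLE)
    for r in rows:
        print(f'{r["name"]:24} value={r["value"]:3} thresh={r["thresh"]:3} raw={r["raw"]}')
    for r in near_threshold(rows):
        print("CHECK:", r["name"])
```

Note that the huge raw values in the sample do not trigger anything; only the row whose normalized value sits near its threshold does. On a real system the input would be the text printed by "smartctl -A /dev/whatever" instead of SAMPLE.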

You need to understand something about modern drives.  In the past,
drives achieved the first level of redundancy by recording each bit
in a large enough area to include many magnetic domains.  If some
percentage of the domains failed to hold the data (a highly likely
situation), that was OK because the read head would still get enough
signal from the rest of the domains to detect the bit correctly.
Fast forward to today.  That multi-domain redundancy is all but gone,
having been replaced by more advanced error-correcting codes
implemented in hardware.  Seagate has elected to have the raw value
for Raw_Read_Error_Rate count every instance of a sector needing this
level of correction, and to let the normalized value reflect whether
these corrections are occurring at a higher rate than expected.

A similar situation exists for Seek_Error_Rate.  When a drive
performs a seek, there is a trade-off between speed and accuracy.
You can make it more likely that the heads go directly to the right
track by moving them more slowly and allowing more settling time.
Performance can be improved significantly by moving the heads more
abruptly and accepting that some percentage of the time a subsequent
small adjustment will be needed to get to the right track.  Again,
it is the normalized value for Seek_Error_Rate that reports whether
these adjustments are becoming necessary more often than expected.
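The speed/accuracy trade-off can be put into rough numbers. The sketch below is a toy model: every timing and probability is hypothetical, chosen only to show the shape of the calculation, and none of it is a Seagate figure. The "aggressive" strategy needs a follow-up adjustment fifty times as often, so a raw counter of adjustments would grow fifty times as fast, yet its average seek is still much shorter.

```python
# Toy model of the seek speed/accuracy trade-off.
# All numbers are hypothetical and only illustrate the calculation.

def expected_seek_ms(move_ms, p_adjust, adjust_ms):
    """Average seek time when a fraction p_adjust of seeks need a
    follow-up adjustment costing adjust_ms on top of the initial move."""
    return move_ms + p_adjust * adjust_ms

careful = expected_seek_ms(move_ms=12.0, p_adjust=0.001, adjust_ms=1.5)
aggressive = expected_seek_ms(move_ms=8.0, p_adjust=0.05, adjust_ms=1.5)

# Adjustments a raw counter would log per million seeks under each strategy.
careful_adj = round(0.001 * 1_000_000)
aggressive_adj = round(0.05 * 1_000_000)

print(f"careful:    {careful:.4f} ms avg, {careful_adj} adjustments per 1M seeks")
print(f"aggressive: {aggressive:.4f} ms avg, {aggressive_adj} adjustments per 1M seeks")
```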


-- 
Bob Nichols     "NOSPAM" is really part of my email address.
                 Do NOT delete it.