[CentOS] OT - Offline uncorrectable sectors

Mon Aug 25 10:53:08 UTC 2008
William L. Maltby <CentOS4Bill at triad.rr.com>

On Mon, 2008-08-25 at 10:43 +0200, Lorenzo Quatrini wrote:
> William L. Maltby ha scritto:
> > 
> > Yep. Only a few copies of the superblock and the i-node tables are
> > written by the file system make process. That's why it's important for
> > file systems in critical applications to be created with the check
> > forced. Folks should also keep in mind that the default check, read
> > only, is really not sufficient for critical situations. The full
> > write/read check should be forced on *new* partitions/disks.
> > 

First, a correction. I earlier mentioned "-C" as causing the read/write
check for mke2fs. It is "-c -c". I must've been thinking of some other
FS software.
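
For reference, a minimal sketch of the forced read/write check. The helper name print_rw_check_cmds and the device argument are placeholders, not anything from the thread. It only prints the commands, because both of them destroy whatever is already on the partition.

```shell
#!/bin/sh
# print_rw_check_cmds is a hypothetical helper: it prints, but does not run,
# two alternative ways to do a destructive read/write surface check.
print_rw_check_cmds() {
    dev=$1
    # mke2fs with -c given twice runs a read-write badblocks test
    # (instead of the faster read-only test) before creating the file system.
    echo "mke2fs -c -c $dev"
    # badblocks -w runs the destructive write-mode test on its own;
    # -s shows progress and -v is verbose.
    echo "badblocks -wsv $dev"
}
```

Calling "print_rw_check_cmds /dev/sdXN" shows both forms; pick one and run it by hand only on a partition whose contents you can afford to lose.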

> 
> So again my question is:
> can I use dd to "test" the disk? what about
> 
> dd if=/dev/sda of=/dev/sda bs=512

It ought to do what you think it would. But ...

> 
> Is this safe on a full running system? Has to be done at runlevel 1 or with a
> live cd?

Safe on a full running system? Probably. I suggest a test before you do
it on an important system. I've never had the urge to do it the way you
suggest. It can be done at run level 1 or from a live CD too. But ...

> I think this is "better" than the manufacturer way, as dd is always present
> and works with any brand.

s/better/convenient/  # IMO

Now for the "buts". I presume that there are still two basic types of
media errors on HDs, "hard" and "soft". Hard errors are those that the
drive cannot recover through its normal error-correction process (CRC,
ECC, or whatever they use these days). Soft errors are those that it
can recover that way.

Hard errors are always reported to the OS; soft errors are not, IIRC. So
you could have recovered media failures that never get reported to the
OS. If these failures are early indicators of deteriorating media, you
will not be notified of them.

For this reason, hardware-specific diagnostic software is "better".
Further, the "smart" capabilities are *really* hardware specific and
will detect and report things that normal read/write activities, like
dd, cannot.
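
As an illustration of what the drive itself keeps track of, here is a small sketch that pulls one raw counter out of "smartctl -A" output. It assumes smartmontools is installed and that the drive uses the usual ten-column attribute table; smart_raw is a hypothetical helper name, and which attributes a drive reports is vendor dependent.

```shell
#!/bin/sh
# smart_raw is a hypothetical helper: it reads `smartctl -A` output on stdin
# and prints the first token of RAW_VALUE (10th column) for the named attribute.
# It prints nothing if the drive does not report that attribute.
smart_raw() {
    awk -v name="$1" '$2 == name { print $10 }'
}
```

For example, "smartctl -A /dev/sda | smart_raw Current_Pending_Sector" prints the pending-sector count where the drive exposes it. Reallocated_Sector_Ct and Offline_Uncorrectable are other attribute names smartmontools commonly shows.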

As to running on a live system, you might not want to for several
reasons. If you are using the system to do anything useful at the time,
there will be a big hit on responsiveness. Linux does schedule user
processes preemptively, and kernel preemption has been a build option
since the 2.6 series, but a process that saturates the disk can still
starve everything else that needs I/O.

Because dd is fast, it will consume all I/O capability, especially the
way you propose running it. Further, you will be causing a *LARGE*
number of system calls, further degrading system responsiveness. It
could be so slow to respond that one might think the system is "frozen".

If you insist on doing this, I would suggest something like

   nice -n <your priority here> dd if=/dev/xxxx of=/dev/xxxx bs=16384 &

See "man nice" for details. This helps a little bit. I've not tried to see
how much responsiveness can be "recovered". A larger "bs=" will reduce
system calls, but will increase buffer sizes and usage and increase I/O
load. Even if you omit the trailing "&" to run in the foreground, the
response may be so slow that a <CTRL>-<C> appears to fail and makes
you think the system is "frozen"... for a little while.
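
A hedged sketch of the same idea, wrapped so it can be tried on a scratch file before going anywhere near a real disk. The function name rewrite_in_place and its arguments are hypothetical. conv=notrunc is an addition: it is harmless on a block device but essential on a regular file, which dd would otherwise truncate to nothing before reading it.

```shell
#!/bin/sh
# rewrite_in_place is a hypothetical helper: read every block of TARGET and
# write it back to the same offset, at reduced CPU priority.
# $1 = file or device, $2 = niceness (defaults to 19, the lowest priority).
rewrite_in_place() {
    target=$1
    prio=${2:-19}
    # conv=notrunc stops dd from truncating a regular file before reading it.
    nice -n "$prio" dd if="$target" of="$target" bs=16384 conv=notrunc
}
```

Try it first with something like "rewrite_in_place /tmp/scratch.img" and compare checksums before and after. Note that nice only lowers CPU priority, and that GNU dd stops at the first read error unless conv=noerror is given, so an abort partway through is itself a result.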

The larger "bs=" would seem to negate what you want with the "bs=512".
Not so. Since the detection of failures happens on the hardware, it will
still detect failures and handle them as it normally would. The "bs=" is
only a blocking factor. Your "512" only saves doing math to figure out
what the "sector" really is. But it has a large cost. BTW, you don't
really know what the sector size is these days. It may not be 512. Back
in the old days, sector size was selectable via jumpers. Today I suspect
the drives don't have sectors in the same way/size as they used to.
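
To put a rough number on that cost, the sketch below counts how many read() calls (and as many write() calls) dd needs for a given size and "bs=". dd_calls is a hypothetical helper, and the count ignores short reads.

```shell
#!/bin/sh
# dd_calls is a hypothetical helper: approximate number of read() calls
# dd issues to copy SIZE bytes with block size BS.
# $1 = size in bytes, $2 = block size in bytes.
dd_calls() {
    size=$1
    bs=$2
    # Round up: a final partial block still costs one call.
    echo $(( (size + bs - 1) / bs ))
}
```

For 1 MiB, "dd_calls 1048576 512" gives 2048 and "dd_calls 1048576 16384" gives 64, a factor of 32 fewer calls for the same data. On a current system, "blockdev --getss /dev/xxxx" should report the logical sector size the kernel sees.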

Closing (really, they are!) arguments:
1. Any OS-level test, as opposed to a hardware-specific one, will be less rigorous.
This is "optimal" only if other factors trump reliability. Usually
"convenience" and "portability" will not trump reliability for server or
critical platforms.

2. The "smart" feature has capabilities of which you may not be aware.
One of these is to run in such a way as to minimize performance impact
on a live system. If you've run "makewhatis", then "man -k smart" or
"apropos smart" will get you started on the reading you may want to do.

3. Hardware-specific diagnostics and repair utilities from the
manufacturer (this includes the "smart" capability of the drives) will
be more rigorous and reliable than general-purpose utilities.

4. The manufacturer utilities can "repair" media failures as they are
detected. If you are taking the time to run diagnostics, why not fix
failures at the same time? If you believe that the "dd" way can
accomplish the same thing (through the alternate block assignment
process), why not grab a drive with known bad sectors and run a test to
see if it will be satisfactory to you?
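
As a starting point for points 2 and 3, a sketch of the usual smartctl sequence for a drive self-test, assuming smartmontools is installed. print_selftest_cmds is a hypothetical helper that only prints the commands; the device name is whatever you pass in.

```shell
#!/bin/sh
# print_selftest_cmds is a hypothetical helper: it prints, but does not run,
# a typical SMART self-test sequence for the given device.
print_selftest_cmds() {
    dev=$1
    echo "smartctl -t long $dev"      # start the extended self-test on the drive
    echo "smartctl -l selftest $dev"  # later: read the self-test log
    echo "smartctl -H $dev"           # overall health assessment
}
```

The extended test runs inside the drive while the system stays in use, so run the log command only after the estimated completion time that smartctl reports.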

> 
> Lorenzo
> <snip sig stuff>

-- 
Bill