[CentOS] OT - Offline uncorrectable sectors

Mon Aug 25 18:53:47 UTC 2008
Nifty Cluster Mitch <niftycluster at niftyegg.com>

On Mon, Aug 25, 2008 at 10:43:01AM +0200, Lorenzo Quatrini wrote:
> William L. Maltby wrote:
> > 
> > Yep. Only a few copies of the superblock and the i-node tables are
> > written by the file system make process. That's why it's important for
> > file systems in critical applications to be created with the check
> > forced. Folks should also keep in mind that the default check, read
> > only, is really not sufficient for critical situations. The full
> > write/read check should be forced on *new* partitions/disks.
> > 
> 
> So again my question is:
> can I use dd to "test" the disk? what about
> 
> dd if=/dev/sda of=/dev/sda bs=512
> 
> Is this safe on a fully running system? Does it have to be done at runlevel 1
> or with a live CD?
> I think this is "better" than the manufacturer's way, as dd is always present
> and works with any brand.

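It is not safe on a mounted filesystem or on a device with mounted filesystems.

The filesystem code operating on a partition has no coherency interaction
with I/O done through the raw whole-disk device, so writes made behind its
back can conflict with what it has cached.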

See the -f flag in the "badblocks" man page:
         "-f    Normally, badblocks will refuse to do a read/write or a
               non-destructive test on a device which is mounted, since either
               can cause the system to potentially crash and/or damage the
               filesystem even if ....."
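
That refusal can be approximated by hand before touching dd or anything
else.  A minimal sketch, assuming a Linux-style /proc/mounts; has_mounted_fs
is a made-up helper name, and it will not notice a disk that is in use only
through LVM, md RAID or swap:

```shell
#!/bin/sh
# has_mounted_fs DEVICE [MOUNTS_FILE]
# Exit status 0 if DEVICE, or a partition of it such as /dev/sda1,
# is listed as a mount source in MOUNTS_FILE (default: /proc/mounts).
# Hypothetical helper for illustration only.
has_mounted_fs() {
    dev=$1
    mounts=${2:-/proc/mounts}
    awk -v dev="$dev" '
        # Field 1 is the mounted device: match the disk itself ...
        $1 == dev { found = 1 }
        # ... or the disk name followed by a partition suffix (1, 2, p1, ...)
        index($1, dev) == 1 && substr($1, length(dev) + 1) ~ /^p?[0-9]+$/ { found = 1 }
        END { exit found ? 0 : 1 }
    ' "$mounts"
}
```

A test script could then start with something like
	has_mounted_fs /dev/sda && { echo "mounted, not touching it"; exit 1; }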

It is also not 100% clear to me that the kernel buffer code will not
see the paired read and write issued by "dd" as a no-op and skip the write.

Vendor tools on an unmounted disk operate at a raw level and also have
access to the vendor-specific embedded controller commands, bypassing
buffering and interacting directly with error codes, retry counts and more.

In normal operation the best opportunity to spare a sector or track is
on a write.  At that time the OS and the disk both have known good data,
so a read after write can detect the defect or error and take the necessary
action without loss of data.  Some disks have read heads that follow the
write heads to this end.  Other disks require an additional revolution.

When "mke2fs -c -c" is invoked the second -c flag is important because the
paired read/write can let the firmware on the disk map detected defects
to spares.  With a single "-c" flag the test is read-only, so the Linux
filesystem code can just keep the error blocks out of files.  A system admin
who does a dd read of a problem disk may find that the OS chokes on the
errors and takes the device offline.  That is, this command:
	dd if=/dev/sda of=/dev/sda bs=512
might not do what is expected, because the first read can take the device
offline, negating the follow-up write intended to fix things.
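
For reference, the badblocks and mke2fs tests mentioned above differ only
in their flags.  A small sketch that prints (never runs) the command for
each kind of test; surface_test_cmd and its mode names are hypothetical,
the flags are the ones documented in the badblocks and mke2fs man pages,
and every one of them assumes the device is unmounted:

```shell
#!/bin/sh
# surface_test_cmd MODE DEVICE
# Prints, but does not execute, a surface-test command for DEVICE.
# Hypothetical helper; the mode names are invented for this sketch.
surface_test_cmd() {
    mode=$1
    dev=$2
    case $mode in
        read)    echo "badblocks -sv $dev" ;;   # read-only scan (badblocks default)
        rewrite) echo "badblocks -nsv $dev" ;;  # non-destructive read-write test
        mkfs)    echo "mke2fs -c -c $dev" ;;    # DESTROYS data: read-write test, then new fs
        *)       echo "usage: surface_test_cmd read|rewrite|mkfs DEVICE" >&2
                 return 2 ;;
    esac
}
```

Printing instead of executing is deliberate: the admin gets to read the
command and confirm the device name before anything writes to the disk.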

The tool "hdparm" is rich in info, but some flags are dangerous.

Bottom line: use vendor tools.
Vendors like error reports from their tools for RMA processing and warranty claims.

BTW: smartd is a good thing.  For me, any disk that smartd has made noise
about has failed, often with weeks or months of warning.
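
To see what smartd is making noise about, the raw counters can be pulled
out of "smartctl -A" output.  A rough sketch, assuming the usual ten-column
attribute table with the raw value in the last column; attribute names vary
by vendor, and smart_sector_counts is a made-up helper name:

```shell
#!/bin/sh
# smart_sector_counts
# Reads "smartctl -A" output on stdin and prints "NAME RAW_VALUE" for the
# attributes that track remapped, pending and uncorrectable sectors.
# Hypothetical helper; assumes column 2 is the name and column 10 the raw value.
smart_sector_counts() {
    awk '$2 == "Reallocated_Sector_Ct" ||
         $2 == "Current_Pending_Sector" ||
         $2 == "Offline_Uncorrectable" { print $2, $10 }'
}
```

Run as root, roughly:
	smartctl -A /dev/sda | smart_sector_counts
Non-zero and growing numbers here are the kind of warning worth acting on.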


-- 
	T o m  M i t c h e l l 
	Got a great hat... now what.