[CentOS] HDD badblocks

Tue Jan 19 23:38:18 UTC 2016
Valeri Galtsev <galtsev at kicp.uchicago.edu>

On Tue, January 19, 2016 4:48 pm, John R Pierce wrote:
> On 1/19/2016 2:24 PM, Warren Young wrote:
>> It’s dying.  Replace it now.
> agreed
>> On a modern hard disk, you should *never* see bad sectors, because the
>> drive is busy hiding all the bad sectors it does find, then telling you
>> everything is fine.
> That's not actually true. The drive will report 'bad sector' if you
> try to read data that the drive simply can't read. You wouldn't want
> it to return bad data and say it's OK. Many (most?) drives won't
> actually remap a bad sector until you write new data over that block
> number, since they don't want to copy bad data without any way of
> telling the OS the data is invalid. These pending remaps are listed
> under SMART parameter 197, Current_Pending_Sector.
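
For anyone who wants to watch that counter, here is a minimal sketch of
pulling the sector-related attributes out of text in the layout that
`smartctl -A` prints. The SAMPLE table is made up for illustration, and the
function names parse_smart_attributes and sector_counts are my own, not part
of smartmontools.

```python
# Illustrative sample in the column layout printed by `smartctl -A`.
# The values are invented for this example, not taken from a real drive.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       21500
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""


def parse_smart_attributes(text):
    """Return {attribute_id: (name, raw_value_string)} for each attribute row."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header line starts with "ID#" and is skipped by the isdigit test.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[int(fields[0])] = (fields[1], fields[9])
    return attrs


def sector_counts(text, ids=(5, 197, 198)):
    """Return raw counts for the reallocated/pending/uncorrectable attributes."""
    attrs = parse_smart_attributes(text)
    return {i: int(attrs[i][1]) for i in ids if i in attrs}
```

In practice the text would come from running `smartctl -A` against the
device in question. A nonzero raw value for 197 would mean sectors are
waiting to be remapped; a growing value for 5 would mean remaps have
already happened.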

Apparently, you know more about modern drives than I do, but as far as I
know, what happens when a bad block is discovered is a somewhat longer
story. Here it is.

Basically, bad blocks are discovered on a read operation, when the CRC
(cyclic redundancy check) sum does not match. (In fact it is a bit more
sophisticated than just a CRC, as modern high data density drives try to
match the analog signal they get from the read head against what was
digitally encoded when the data was recorded.) When this happens, the
firmware decides that this is a bad block and adds its new location to the
bad block reallocation table (a while ago, when I learned this, the
reallocation table was located in non-volatile memory on the drive
controller board). Then the firmware holds all other tasks and tries to
recover the information stored in the bad block. It re-reads the block and
superimposes the read results until the CRC matches, and then writes the
recovered data into the reallocated place. Or it gives up after some large
number of attempts, writes whatever garbage it ends up with into the
reallocated place, and reports a fatal read error. This recovery attempt
very noticeably slows down IO on the device. So "freezing" on some IO when
accessing files may be an indication that multiple bad blocks are
developing. Time to replace the drive. The drive (even after an
irrecoverable - fatal - read error) is still considered usable; only when
the bad block reallocation table fills up does the drive start reporting
that it is "out of specs".
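
To make the re-read loop above a bit more concrete, here is a toy
simulation. It is only a sketch under my own assumptions, not actual drive
firmware: "superimposing" is modeled as a per-bit majority vote over all
reads so far, the check is zlib's CRC-32, and the names noisy_read,
majority_vote and recover_sector are invented for the illustration.

```python
import random
import zlib


def noisy_read(data, flip_prob, rng):
    """Simulate one read of a marginal sector: each bit flips with flip_prob."""
    out = bytearray(data)
    for i in range(len(out)):
        for bit in range(8):
            if rng.random() < flip_prob:
                out[i] ^= 1 << bit
    return bytes(out)


def majority_vote(reads):
    """Superimpose several reads: each bit takes the value seen most often."""
    n = len(reads)
    result = bytearray(len(reads[0]))
    for i in range(len(result)):
        for bit in range(8):
            ones = sum((r[i] >> bit) & 1 for r in reads)
            if 2 * ones > n:
                result[i] |= 1 << bit
    return bytes(result)


def recover_sector(data, stored_crc, flip_prob, max_attempts, rng):
    """Re-read until the superimposed result passes the CRC, or give up.

    Returns (candidate_bytes, attempts_used, recovered_flag). When the flag
    is False, the candidate is the "garbage" that would be written to the
    reallocated place alongside a fatal read error.
    """
    reads = []
    for attempt in range(1, max_attempts + 1):
        reads.append(noisy_read(data, flip_prob, rng))
        candidate = majority_vote(reads)
        if zlib.crc32(candidate) == stored_crc:
            return candidate, attempt, True
    return majority_vote(reads), max_attempts, False
```

Calling recover_sector with a small flip_prob and a large max_attempts
shows why IO "freezes": every extra attempt is another full read of the
same sector while everything else waits.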

On a side note: even if the CRC matches, that doesn't ensure that the
recovered data is the same as the data originally written. This is why
filesystems that keep sophisticated checksums of the stored data are
getting popular (ZFS, to name one).

Just my $0.02.



Valeri Galtsev
Sr System Administrator
Department of Astronomy and Astrophysics
Kavli Institute for Cosmological Physics
University of Chicago
Phone: 773-702-4247