On Tue, Jan 19, 2016 at 3:24 PM, Warren Young <wyml at etr-usa.com> wrote:
> On a modern hard disk, you should *never* see bad sectors, because the
> drive is busy hiding all the bad sectors it does find, then telling you
> everything is fine.

This is not a given. Misconfiguration can make persistent bad sectors very common, and that misconfiguration is the default situation for RAID setups on Linux, which is why it's so common. This, and user error, are the top causes of RAID 5 implosion on Linux (both mdadm and lvm raid).

The necessary sequence:

1. The drive needs to know the sector is bad.
2. The drive needs to be asked to read that sector.
3. The drive needs to give up trying to read that sector.
4. The drive needs to report the sector's LBA back to the OS.
5. The OS needs to write something back to that same LBA.
6. The drive writes to the sector, and if the write fails, remaps the LBA to a different (reserve) physical sector.

Where this fails on Linux is steps 3 and 4. By default consumer drives either don't support SCT ERC, as is the case in this thread, or have it disabled. In that condition the drive's timeout for deep recovery of a bad sector can be very high, two or three minutes. Usually it's less than that, but it's often more than the kernel's default SCSI command timer. When a command to the drive doesn't complete successfully within the default 30 seconds, the kernel resets the link to the drive, which obliterates the entire command queue and the work the drive was doing to recover the bad sector. So step 4 never happens, nor any step after it. Hence, bad sectors accumulate.

The consequence of this often doesn't get figured out until a user looks at the kernel messages, sees a bunch of hard link resets, has a WTF moment, and asks questions. More often they don't see those reset messages, or don't ask about them, and the next consequence is that a drive fails. When the failed drive is not the one with bad sectors, there are in effect two bad strips per stripe during reads (including rebuild), and that's when the array collapses completely even though only one drive actually failed. People use RAID 6 to mask this problem, but the same misconfiguration can cause RAID 6 failures too.

>> Why smartctl does not update Reallocated_Event_Count?
>
> Because SMART lies.

Nope. The drive isn't being asked to write to those bad sectors. If it can't successfully read a sector without error, it won't migrate the data on its own (some drives never do this), so a write to the sector is needed to make the remap happen.

The other thing is that the bad sector count on 512e AF drives is inflated. The count is reported in 512-byte sector increments, but there is no such thing as a 512-byte physical sector on an AF drive, so one bad physical sector shows up as 8 bad sectors. Fixing it requires writing all 8 of those logical sectors at once, in a single command to the drive. I've had 'dd if=/dev/zero of=/dev/sda seek=blah count=8' fail with a read error because the command was internally reinterpreted as read-modify-write. Ridiculous but true. So you have to use bs=4096 and count=1, and of course adjust the seek LBA to be based on 4096 bytes instead of 512.

So the simplest fix here is:

  echo 160 > /sys/block/sdX/device/timeout

That's needed for each member drive. Note this is not a persistent setting. And then this:

  echo repair > /sys/block/mdX/md/sync_action

That's run once. You'll see the read errors in dmesg, and md writing back to the drive with the bad sector.
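To make the 512e arithmetic above concrete, here is a rough sketch of rewriting a single bad physical sector by hand; the LBA is made up, and this zeroes whatever was stored in that sector, so only do it where the data is already known to be lost:

  # dmesg reports an unreadable sector at LBA 123456 (512-byte units);
  # 123456 / 8 = 15432 is the same offset in 4096-byte physical sectors
  dd if=/dev/zero of=/dev/sdX bs=4096 seek=15432 count=1 oflag=direct

The oflag=direct is there to bypass the page cache, which should make it more likely the drive sees one aligned 4 KiB write rather than a read-modify-write.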
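Spelled out end to end for an array, with made-up device names (substitute your actual member drives and md device), it looks roughly like this:

  # raise the kernel's SCSI command timer on every member drive;
  # not persistent, so put it in a startup script if you rely on it
  for d in sdb sdc sdd; do
      echo 160 > /sys/block/$d/device/timeout
  done

  # then have md read every sector and rewrite anything unreadable
  echo repair > /sys/block/md0/md/sync_action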
This problem affects all software raid, including btrfs raid1. The ideal scenario is to run 'smartctl -l scterc,70,70 /dev/sdX' in a startup script, so the drive gives up on marginally bad sectors and returns a read error within 7 seconds at most. The linux-raid@ list is chock full of this as a recurring theme.

-- 
Chris Murphy