On Wed, Jan 20, 2016, 7:17 AM Lamar Owen <lowen at pari.edu> wrote:

> On 01/19/2016 06:46 PM, Chris Murphy wrote:
> > Hence, bad sectors accumulate. And the consequence of this often
> > doesn't get figured out until a user looks at kernel messages and sees
> > a bunch of hard link resets....
>
> The standard Unix way of refreshing the disk contents is with badblocks'
> non-destructive read-write test (badblocks -n or as the -cc option to
> e2fsck, for ext2/3/4 filesystems).

This isn't applicable to RAID, which is what this thread is about. For
RAID, use scrub; that's what it's for.

The badblocks method fixes nothing if the sector is persistently bad and
the drive reports a read error. It fixes nothing if the command timeout
is reached before the drive either recovers or reports a read error. And
even when it works, you're relying on ECC-recovered data rather than
reading a likely good copy from mirror or parity and writing that back
to the bad block. But all of this still requires the proper
configuration.

> The remap will happen on the writeback of the contents. It's been this
> way with enterprise SCSI drives for as long as I can remember there
> being enterprise-class SCSI drives. ATA drives caught up with the SCSI
> ones back in the early 90's with this feature. But it's always been
> true, to the best of my recollection, that the remap always happens on
> a write.

Properly configured, first a read error happens, which includes the LBA
of the bad sector. The md driver needs that LBA to find a good copy of
the data from mirror or from parity. *Then* it writes that copy to the
bad LBA. In the case of misconfiguration, the command timeout expiration
and link reset prevent the kernel from ever learning the LBA of the bad
sector, and therefore repair isn't possible.

> The rationale is pretty simple: only on a write error does the drive
> know that it has the valid data in its buffer, and so that's the only
> safe time to put the data elsewhere.
>
> > This problem affects all software raid, including btrfs raid1. The
> > ideal scenario is you'll use 'smartctl -l scterc,70,70 /dev/sdX' in a
> > startup script, so the drive fails reads on marginally bad sectors
> > with an error in 7 seconds maximum.
>
> This is partly why enterprise arrays manage their own per-sector ECC
> and use 528-byte sector sizes.

Not all enterprise drives have 520/528-byte sectors. Those that do are
using T10-PI (formerly DIF), and it requires software support too. It's
pretty rare. It's 8000% easier to use ZFS on Linux or Btrfs.

> But the other fact of life of modern consumer-level hard drives is that
> *errored sectors are expected* and not exceptions. Why else would a
> drive have a TLER in the two minute range like many of the WD Green
> drives do? And with a consumer-level drive I would be shocked if
> badblocks reported the same number each time it ran through.

All drives expect bad sectors. Consumer drives assume that reporting a
read error will put the host OS into an inconsistent state, so they
avoid reporting one for as long as possible: becoming slow is better
than implosion. And neither OS X nor Windows does link resets after
merely 30 seconds either.

Chris Murphy
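
P.S. For anyone who wants to set this up, a minimal sketch of the
configuration discussed above. The device name /dev/sda is just an
example, and note that many consumer drives don't support SCT ERC at
all:

    # Does the drive support SCT ERC, and what is it set to?
    smartctl -l scterc /dev/sda

    # If supported: give up on marginal sectors after 7.0 seconds
    # (values are in deciseconds, for reads and writes respectively)
    smartctl -l scterc,70,70 /dev/sda

    # If not supported: raise the kernel's SCSI command timer instead,
    # so the link isn't reset while the drive is still in deep recovery
    echo 180 > /sys/block/sda/device/timeout

The point either way is the same: the drive must report its read error
before the kernel's command timer (30 seconds by default) expires, or
md never learns the bad LBA.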
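The scrub itself is one line, whichever stack you're on; md0 and /mnt
below are examples:

    # md: read every sector; read errors are rewritten from mirror/parity
    echo check > /sys/block/md0/md/sync_action
    cat /proc/mdstat                      # progress
    cat /sys/block/md0/md/mismatch_cnt    # mismatches found

    # Btrfs equivalent
    btrfs scrub start /mnt

If memory serves, the CentOS mdadm package already ships a weekly
raid-check cron job that does exactly this.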