On Tue, Jan 19, 2016 at 3:24 PM, Warren Young <wyml at etr-usa.com> wrote:
> On a modern hard disk, you should *never* see bad sectors, because the
> drive is busy hiding all the bad sectors it does find, then telling you
> everything is fine.

This is not a given. Misconfiguration can make persistent bad sectors very common, and that misconfiguration is the default situation for RAID setups on Linux, which is why it's so common. This, and user error, are the top causes of RAID 5 implosion on Linux (both mdadm and lvm raid).

The necessary sequence:

1. The drive needs to know the sector is bad.
2. The drive needs to be asked to read that sector.
3. The drive needs to give up trying to read that sector.
4. The drive needs to report the sector's LBA back to the OS.
5. The OS needs to write something back to that same LBA.
6. The drive writes to the sector, and if the write fails, remaps the LBA to a different (reserve) physical sector.

Where this fails on Linux is steps 3 and 4. By default consumer drives either don't support SCT ERC, as is the case in this thread, or have it disabled. In that condition the drive's timeout for deep recovery of a bad sector can be very high, two or three minutes. Usually it's less than that, but it's often more than the kernel's default SCSI command timer. When a command to the drive doesn't complete successfully within the default 30 seconds, the kernel resets the link to the drive, which obliterates the entire command queue and the work the drive was doing to recover the bad sector. So step 4 never happens, nor any step after it. Hence, bad sectors accumulate.

The consequence of this often doesn't get figured out until a user looks at the kernel messages, sees a bunch of hard link resets, has a WTF moment, and asks questions. More often they don't see those reset messages, or don't ask about them, and the next consequence is that a drive fails. When the failed drive is not the one with bad sectors, there are in effect two bad strips per stripe during reads (including rebuild), and that's when the array collapses completely even though only one drive actually failed. People use RAID 6 to mask this problem, but the same misconfiguration can cause RAID 6 failures too.

>> Why smartctl does not update Reallocated_Event_Count?
>
> Because SMART lies.

Nope. The drive isn't being asked to write to those bad sectors. If it can't successfully read a sector without error, it won't migrate the data on its own (some drives never do this), so a write to the sector is needed to make the remap happen.

The other thing is that the bad sector count on 512e AF drives is inflated. The count is reported in 512-byte sector increments, but there is no such thing as a 512-byte physical sector on an AF drive, so one bad physical sector shows up as 8 bad sectors. Fixing it requires writing all 8 of those logical sectors at once, in a single command to the drive. I've had 'dd if=/dev/zero of=/dev/sda seek=blah count=8' fail with a read error because the command was internally reinterpreted as read-modify-write. Ridiculous but true. So you have to use bs=4096 and count=1, and of course adjust the seek LBA to be based on 4096 bytes instead of 512.

So the simplest fix here is:

  echo 160 > /sys/block/sdX/device/timeout

That's needed for each member drive. Note this is not a persistent setting. And then this:

  echo repair > /sys/block/mdX/md/sync_action

That's run once. You'll see the read errors in dmesg, and md writing back to the drive with the bad sector.
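To make the 512e arithmetic above concrete, here is a rough sketch of rewriting a single bad physical sector by hand; the LBA is made up, and this zeroes whatever was stored in that sector, so only do it where the data is already known to be lost:

  # dmesg reports an unreadable sector at LBA 123456 (512-byte units);
  # 123456 / 8 = 15432 is the same offset in 4096-byte physical sectors
  dd if=/dev/zero of=/dev/sdX bs=4096 seek=15432 count=1 oflag=direct

The oflag=direct is there to bypass the page cache, which should make it more likely the drive sees one aligned 4 KiB write rather than a read-modify-write.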
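Spelled out end to end for an array, with made-up device names (substitute your actual member drives and md device), it looks roughly like this:

  # raise the kernel's SCSI command timer on every member drive;
  # not persistent, so put it in a startup script if you rely on it
  for d in sdb sdc sdd; do
      echo 160 > /sys/block/$d/device/timeout
  done

  # then have md read every sector and rewrite anything unreadable
  echo repair > /sys/block/md0/md/sync_action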
This problem affects all software raid, including btrfs raid1. The ideal scenario is to run 'smartctl -l scterc,70,70 /dev/sdX' in a startup script, so the drive gives up on marginally bad sectors and returns a read error within 7 seconds at most. The linux-raid@ list is chock full of this as a recurring theme.

-- 
Chris Murphy