[CentOS] HDD badblocks

Tue Jan 19 22:24:10 UTC 2016
Warren Young <wyml at etr-usa.com>

On Jan 17, 2016, at 9:59 AM, Alessandro Baggi <alessandro.baggi at gmail.com> wrote:
> 
> On sdb there are not problem but with sda:
> 
> 1) First run badblocks reports 28 badblocks on disk
> 2) Second run badblocks reports 32 badblocks
> 3) Third reports 102 badblocks
> 4) Last run reports 92 badblocks.

It’s dying.  Replace it now.

On a modern hard disk, you should *never* see bad sectors, because the drive is busy hiding all the bad sectors it does find, then telling you everything is fine.

Once the drive has swept so many problems under the rug that it is forced to admit to normal user space programs (e.g. badblocks) that there are bad sectors, it’s because the spare sector pool is exhausted.  At that point, the only safe remediation is to replace the disk.

> Running smartctl after the last badblocks check I've noticed that Current_Pending_Sector was 32 (not 92 as badblocks found).

SMART is allowed to lie to you.  That’s why there’s a RAW_VALUE column, yet no explanation in the manual as to what that value means: the low-level meanings of these values are documented by the drive manufacturers, not by smartctl.  “32” is not necessarily a sector count.  For all you know, it is reporting that there are currently 32 lemmings in midair off the fjords of Finland.
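If you want to log those raw values over time rather than eyeball them, a minimal sketch along these lines might do.  It assumes the usual ten-column table printed by `smartctl -A` for ATA drives; the names `ROW`, `raw_values` and `SAMPLE` are mine, and the sample row is made up apart from the raw value 32 quoted above.  Note that it keeps RAW_VALUE as an opaque string, since its meaning is up to the vendor.

```python
import re

# Assumed column layout of `smartctl -A` on ATA drives:
# ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\S+"
    r"\s+\S+\s+\S+\s+\S+\s+(.+?)\s*$"
)

def raw_values(text):
    """Map attribute name -> RAW_VALUE, kept as an uninterpreted string."""
    result = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:  # header and blank lines simply do not match
            result[m.group(2)] = m.group(3)
    return result

# Illustrative row only: everything except the raw value 32 is invented.
SAMPLE = (
    "ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE\n"
    "197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       32\n"
)
```

Calling `raw_values(SAMPLE)` yields a dictionary with the single key `Current_Pending_Sector`; what the string behind it actually counts is still the manufacturer's business.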

The only important results here are:

a) the numbers are nonzero
b) the numbers are changing

That is all.  A zero value just means it hasn’t failed *yet*, and a static nonzero value means the drive has temporarily arrested its failures-in-progress.
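Spelled out as code, that rule of thumb looks roughly like the sketch below.  The function name `verdict` and the wording of the three results are my own labels, not anything smartctl or badblocks prints; the input is just a list of successive counts from the same test, oldest first.

```python
def verdict(readings):
    """Apply the two checks above to successive counts from one test."""
    if not readings:
        raise ValueError("need at least one reading")
    nonzero = any(r != 0 for r in readings)
    changing = len(set(readings)) > 1
    if nonzero and changing:
        return "failing now: replace the disk"
    if nonzero:
        return "failures temporarily arrested"
    return "has not failed yet"

# The four badblocks runs from the original report:
print(verdict([28, 32, 102, 92]))
```

The magnitude of the numbers never enters into it; only whether they are nonzero and whether they move.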

There is no such thing as a hard drive with zero actual bad sectors, just one that has space left in its spare sector pool.  A “working” drive is one that is swapping sectors from the spare pool rarely enough that it is expected not to empty the pool before the warranty expires.

> Why each consecutive run of badblocks reports different results?

Because physics.  The highly competitive nature of the HDD business, plus the relentless drive of Moore’s Business Law (as it should be called, since it is not a physical law, just an arbitrary fiction that the tech industry has bought into as the ground rules for the game), pushes the manufacturers to design drives right up against the ragged edge of functionality.  A sector that marginal can read fine on one pass and fail on the next.

HDD manufacturers could solve all of this by making drives with 1/4 the capacity at twice the cost, and get 10x the reliability.  And they do: they’re called SAS drives. :)

> Why smartctl does not update Reallocated_Event_Count?

Because SMART lies.

> What other test I can perform to verify disks problems?

Quit poking the tiger to see if it will bite you.  Replace the bad disk and resilver that mirror before you lose the other disk, too.