[CentOS] Errors on an SSD drive

On 08/11/2017 12:16 PM, Chris Murphy wrote:
> On Fri, Aug 11, 2017 at 7:53 AM, Robert Nichols
> <rnicholsNOSPAM at comcast.net> wrote:
>> On 08/10/2017 11:06 AM, Chris Murphy wrote:
>>>
>>> On Thu, Aug 10, 2017, 6:48 AM Robert Moskowitz <rgm at htt-consult.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On 08/09/2017 10:46 AM, Chris Murphy wrote:
>>>>>
>>>>> If it's a bad sector problem, you'd write to sector 17066160 and see if
>>>>> the drive complies or spits back a write error. It looks like a bad
>>>>> sector in that the same LBA is reported each time but I've only ever
>>>>> seen this with both a read error and a UNC error. So I'm not sure it's
>>>>> a bad sector.
>>>>>
>>>>> What is DID_BAD_TARGET?
>>>>
>>>>
>>>> I have no experience on how to force a write to a specific sector and
>>>> not cause other problems.  I suspect that this sector is in the /
>>>> partition:
>>>>
>>>> Disk /dev/sda: 240.1 GB, 240057409536 bytes, 468862128 sectors
>>>> Units = sectors of 1 * 512 = 512 bytes
>>>> Sector size (logical/physical): 512 bytes / 512 bytes
>>>> I/O size (minimum/optimal): 512 bytes / 512 bytes
>>>> Disk label type: dos
>>>> Disk identifier: 0x0000c89d
>>>>
>>>>       Device Boot      Start         End      Blocks   Id  System
>>>> /dev/sda1            2048     2099199     1048576   83  Linux
>>>> /dev/sda2         2099200     4196351     1048576   82  Linux swap / Solaris
>>>> /dev/sda3         4196352   468862127   232332888   83  Linux
>>>>
>>>
>>> LBA 17066160 would be on sda3.
>>>
>>> dd if=/dev/sda skip=17066160 count=1 2>/dev/null | hexdump -C
>>>
>>> That'll read that sector and display hex and ascii. If you recognize the
>>> contents, it's probably user data. Otherwise, it's file system metadata or
>>> a system binary.
>>>
>>> If you get nothing but an I/O error, then it's lost so it doesn't matter
>>> what it is, you can definitely overwrite it.
>>>
>>> dd if=/dev/zero of=/dev/sda seek=17066160 count=1
>>
>>
>> You really don't want to do that without first finding out what file is
>> using that block. You will convert a detected I/O error into silent
>> corruption of that file, and that is a much worse situation.
> 
> Yeah he'd want to do an fsck -f and see if repairs are made, and also
> rpm -Va. There *will* be legitimately modified files, so it's going to
> be tedious to sort out exactly which ones are legitimately modified
> vs corrupt. If it's a configuration file, I'd say you could ignore it,
> but any binary that shows changes other than permissions needs to be
> replaced and is the likely culprit.
> 
> The smartmontools page has hints on how to figure out what file is
> affected by a particular sector being corrupt but the more layers are
> involved the more difficult that gets. I'm not sure there's an easy way
> to do this with LVM in between the physical device and file system.

fsck checks filesystem metadata, not the content of files. It is not going
to detect that a file has had 512 bytes replaced by zeros. If the file
is a non-configuration file installed from an RPM, then "rpm -Va" should
flag it.
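
As a rough sketch of that triage, a filter like the one below would
narrow the "rpm -Va" output to the interesting lines. It assumes the
usual verify output layout, where the third character of the attribute
string is "5" when the file digest differs and a "c" marker denotes a
configuration file. The function name is made up for illustration.

```shell
# Hypothetical helper: reads "rpm -Va" output on stdin and keeps only
# the lines whose digest differs ("5" in the third position of the
# attribute string) and that are not marked as config files ("c").
changed_non_config() {
    awk '$1 ~ /^..5/ && $2 != "c"'
}

# Usage (run as root; a full verify can take a long time):
#   rpm -Va | changed_non_config
```

Anything it prints still has to be judged by hand, since a changed
digest can also come from a legitimate local modification.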

LVM certainly makes the procedure harder. Figuring out what filesystem
block corresponds to that LBA is still possible, but you have to examine
the LV layout in /etc/lvm/backup/ and learn more than you probably wanted
to know about LVM.
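
The first step of that arithmetic can be sketched as below, assuming
the failing LBA falls inside a partition that is an LVM physical
volume. The helper name is hypothetical, and the pe_start and
extent_size values in the example are assumptions, not taken from this
system; the real ones are in the VG file under /etc/lvm/backup/.

```shell
# Hypothetical helper: map a whole-disk LBA to a physical extent (PE)
# on an LVM physical volume. All arguments are in 512-byte sectors.
#   $1 = LBA on the whole disk
#   $2 = start sector of the partition holding the PV (from fdisk)
#   $3 = pe_start from the VG backup file
#   $4 = extent_size from the VG backup file
# Prints "<PE number> <sector offset within that PE>".
lba_to_pe() {
    data_sector=$(( $1 - $2 - $3 ))
    echo "$(( data_sector / $4 )) $(( data_sector % $4 ))"
}

# Example using the LBA and the sda3 start sector from this thread,
# with ASSUMED pe_start=2048 and extent_size=8192 (4 MiB extents):
#   lba_to_pe 17066160 4196352 2048 8192
```

The PE number then has to be matched against the "stripes" entries of
each LV segment in the backup file to find the logical extent, and
from there the filesystem block inside that LV.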

-- 
Bob Nichols     "NOSPAM" is really part of my email address.
                 Do NOT delete it.