I am building a new system using a Kingston 240GB SSD drive I pulled from my notebook (when I had to upgrade to a 500GB SSD drive). The CentOS install went fine and ran for a couple of days, then I got errors on the console. Here is an example:
[168176.995064] sd 0:0:0:0: [sda] tag#14 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[168177.004050] sd 0:0:0:0: [sda] tag#14 CDB: Read(10) 28 00 01 04 68 b0 00 00 08 00
[168177.011615] blk_update_request: I/O error, dev sda, sector 17066160
[168487.534510] sd 0:0:0:0: [sda] tag#17 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[168487.543576] sd 0:0:0:0: [sda] tag#17 CDB: Read(10) 28 00 01 04 68 b0 00 00 08 00
[168487.551206] blk_update_request: I/O error, dev sda, sector 17066160
[168787.813941] sd 0:0:0:0: [sda] tag#20 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[168787.822951] sd 0:0:0:0: [sda] tag#20 CDB: Read(10) 28 00 01 04 68 b0 00 00 08 00
[168787.830544] blk_update_request: I/O error, dev sda, sector 17066160
Eventually, I could not do anything on the system. Not even a 'reboot'. I had to do a cold power cycle to bring things back.
Is there anything to do about this or trash the drive and start anew?
Thanks
If it's a bad sector problem, you'd write to sector 17066160 and see if the drive complies or spits back a write error. It looks like a bad sector in that the same LBA is reported each time, but I've only ever seen this with both a read error and a UNC error, so I'm not sure it's a bad sector.
What is DID_BAD_TARGET?
And what do you get for smartctl -x <dev>
Chris Murphy
On Wed, Aug 9, 2017, 8:03 AM Robert Moskowitz rgm@htt-consult.com wrote:
I am building a new system using a Kingston 240GB SSD drive I pulled from my notebook (when I had to upgrade to a 500GB SSD drive). The CentOS install went fine and ran for a couple of days, then I got errors on the console. Here is an example:
<snip>
Eventually, I could not do anything on the system. Not even a 'reboot'. I had to do a cold power cycle to bring things back.
Is there anything to do about this or trash the drive and start anew?
Thanks
CentOS mailing list CentOS@centos.org https://lists.centos.org/mailman/listinfo/centos
On 08/09/2017 10:46 AM, Chris Murphy wrote:
If it's a bad sector problem, you'd write to sector 17066160 and see if the drive complies or spits back a write error. It looks like a bad sector in that the same LBA is reported each time but I've only ever seen this with both a read error and a UNC error. So I'm not sure it's a bad sector.
What is DID_BAD_TARGET?
I have no experience with forcing a write to a specific sector without causing other problems. I suspect that this sector is in the / partition:
Disk /dev/sda: 240.1 GB, 240057409536 bytes, 468862128 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0x0000c89d

Device Boot Start End Blocks Id System
/dev/sda1 2048 2099199 1048576 83 Linux
/dev/sda2 2099200 4196351 1048576 82 Linux swap / Solaris
/dev/sda3 4196352 468862127 232332888 83 Linux
But I don't know where it is in relation to the way the drive was formatted in my notebook. I think it would have been in the / partition.
And what do you get for smartctl -x <dev>
About 17KB of output? I don't know how to read what it is saying, but noted in the beginning:
Write SCT (Get) XXX Error Recovery Control Command failed: scsi error badly formed scsi parameters
Don't know what this means...
BTW, the system is a Cubieboard2 armv7 SoC running CentOS7-armv7hl. This is the first time I have used an SSD on a Cubie, but I know it is frequently done. I would have to ask on the Cubie forum what others' experience with SSDs has been.
Chris Murphy
On Wed, Aug 9, 2017, 8:03 AM Robert Moskowitz rgm@htt-consult.com wrote:
I am building a new system using a Kingston 240GB SSD drive I pulled from my notebook (when I had to upgrade to a 500GB SSD drive). The CentOS install went fine and ran for a couple of days, then I got errors on the console. Here is an example:
<snip>
Eventually, I could not do anything on the system. Not even a 'reboot'. I had to do a cold power cycle to bring things back.
Is there anything to do about this or trash the drive and start anew?
Thanks
On Thu, Aug 10, 2017, 6:48 AM Robert Moskowitz rgm@htt-consult.com wrote:
On 08/09/2017 10:46 AM, Chris Murphy wrote:
If it's a bad sector problem, you'd write to sector 17066160 and see if the drive complies or spits back a write error. It looks like a bad sector in that the same LBA is reported each time but I've only ever seen this with both a read error and a UNC error. So I'm not sure it's a bad sector.
What is DID_BAD_TARGET?
I have no experience on how to force a write to a specific sector and not cause other problems. I suspect that this sector is in the / partition:
Disk /dev/sda: 240.1 GB, 240057409536 bytes, 468862128 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0x0000c89d

Device Boot Start End Blocks Id System
/dev/sda1 2048 2099199 1048576 83 Linux
/dev/sda2 2099200 4196351 1048576 82 Linux swap / Solaris
/dev/sda3 4196352 468862127 232332888 83 Linux
LBA 17066160 would be on sda3.
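You can sanity-check that with shell arithmetic (a sketch; the start/end sectors are copied from the fdisk output, all in 512-byte sectors):

```shell
# Confirm which partition holds the failing LBA, using the
# start/end sectors from the fdisk output.
LBA=17066160
SDA3_START=4196352
SDA3_END=468862127
if [ "$LBA" -ge "$SDA3_START" ] && [ "$LBA" -le "$SDA3_END" ]; then
    echo "LBA $LBA is on sda3, partition-relative sector $((LBA - SDA3_START))"
fi
```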
dd if=/dev/sda skip=17066160 count=1 2>/dev/null | hexdump -C
That'll read that sector and display hex and ascii. If you recognize the contents, it's probably user data. Otherwise, it's file system metadata or a system binary.
If you get nothing but an I/O error, then it's lost so it doesn't matter what it is, you can definitely overwrite it.
dd if=/dev/zero of=/dev/sda seek=17066160 count=1
If you want extra confirmation, you can first do 'smartctl -t long /dev/sda' and then, after the prescribed testing time (which is listed), check it again with 'smartctl -a /dev/sda' and see if the test completed, or if, under the self-test log section, it shows it was aborted and lists a number under the LBA_of_first_error column.
But I don't know where it is in relation to the way the drive was formatted in my notebook. I think it would have been in the / partition.
And what do you get for smartctl -x <dev>
About 17KB of output?
Can you attach it as a file to the list? If the list won't accept the attachment, put it up on fpaste.org or pastebin or something like that. MUAs tend to nerf the output, so don't paste it into an email.
I don't know how to read what it is saying, but noted in the beginning:
Write SCT (Get) XXX Error Recovery Control Command failed: scsi error badly formed scsi parameters
Don't know what this means...
BTW, the system is a Cubieboard2 armv7 SoC running Centos7-armv7hl. This is the first time I have used an SSD on a Cubie, but I know it is frequently done. I would have to ask on the Cubie forum what others experience with SSDs have been.
It's very common. I think this is just an ordinary bad sector, if that LBA value is consistent. If it's a new SSD, it's slightly concerning. You can either keep an eye on it, or put a little pressure on the manufacturer or place of purchase: you have a bad sector and would like to swap out the unit.
SSDs, and in particular SD cards (which you're not using; those show up as /dev/mmcblk0...), store your data as a probabilistic representation, and through a lot of magic the probability of retrieving your data correctly from an SSD is made very high. Almost deterministic.
The magic is in the firmware, so there's some possibility any given SSD problem is related to a firmware bug. So it's worth comparing the firmware version reported by smartctl with what the manufacturer has, and then checking their changelog. Most have a way to update firmware without Windows, but don't have images that will boot an ARM board; the "universal" updater is usually based on FreeDOS, funnily enough. You'd need to stick the SSD in an x86 computer to do this. Hilariously perverse: I did this with a Samsung 830 SSD a while back, sticking it into a MacBook Pro. I burned the firmware ISO onto a DVD-RW, booted the Mac with it (using the firmware's BIOS compatibility support module), and updated the SSD's firmware without a problem.
Chris Murphy
Is that because the drive is compressing the information? Is there a way to turn this off? I hate mandatory compression, as losing one bit in a compressed file tends to be a big deal compared to the same in an uncompressed file.
10. Aug 2017 10:06 by lists@colorremedies.com:
On Thu, Aug 10, 2017, 6:48 AM Robert Moskowitz rgm@htt-consult.com wrote:
SSD's, in particular SD Cards (which you're not using, which is noted as /dev/mmcblk0...) store you data as a probabilistic representation, and through a lot of magic, the probability of retrieving your data correctly from SSD is made very high. Almost deterministic.
<snip>
Chris Murphy
On Aug 10, 2017, at 10:46 AM, mad.scientist.at.large@tutanota.com wrote:
is that because the drive is compressing the information?
No. I believe by “probabilistic representation” the parent poster simply means that in any given data cell, you don’t have a hard “1” or “0”, you have some voltage potential which can be interpreted as some number of 1 or 0 bits, often 3 bits or more.
Between that fact and wear-leveling, you can’t take a simple voltage measurement on a data cell and say, “This cell contains 011.” You need more smarts about what’s going on to turn the voltage reading into the correct value.
As the drive’s data cells wear out, the drive’s ability to do that correctly and reliably degrades. Thus cell death, thus drive death, thus filesystem death, thus backups, else sadness.
And please don’t top-post.
A: Yes.
Q: Are you sure?
A: Because it makes the flow of conversation more difficult to read.
Q: Why shouldn’t I top-post?
Chris Murphy wrote:
On Thu, Aug 10, 2017, 6:48 AM Robert Moskowitz rgm@htt-consult.com wrote:
On 08/09/2017 10:46 AM, Chris Murphy wrote:
If it's a bad sector problem, you'd write to sector 17066160 and see if the drive complies or spits back a write error. It looks like a bad sector in that the same LBA is reported each time but I've only ever seen this with both a read error and a UNC error. So I'm not sure it's a bad sector.
What is DID_BAD_TARGET?
I have no experience on how to force a write to a specific sector and not cause other problems. I suspect that this sector is in the / partition:
<snip>
LBA 17066160 would be on sda3.
dd if=/dev/sda skip=17066160 count=1 2>/dev/null | hexdump -C
That'll read that sector and display hex and ascii. If you recognize the contents, it's probably user data. Otherwise, it's file system metadata or a system binary.
<snip> Yeah, I was going to suggest you find out what that's part of. Try this link https://www.gra2.com/article.php/20041015232512624, which is about identifying what an unreadable sector is part of.
mark
On 08/10/2017 11:06 AM, Chris Murphy wrote:
On Thu, Aug 10, 2017, 6:48 AM Robert Moskowitz rgm@htt-consult.com wrote:
On 08/09/2017 10:46 AM, Chris Murphy wrote:
If it's a bad sector problem, you'd write to sector 17066160 and see if the drive complies or spits back a write error. It looks like a bad sector in that the same LBA is reported each time but I've only ever seen this with both a read error and a UNC error. So I'm not sure it's a bad sector.
What is DID_BAD_TARGET?
I have no experience on how to force a write to a specific sector and not cause other problems. I suspect that this sector is in the / partition:
<snip>
LBA 17066160 would be on sda3.
dd if=/dev/sda skip=17066160 count=1 2>/dev/null | hexdump -C
That'll read that sector and display hex and ascii. If you recognize the contents, it's probably user data. Otherwise, it's file system metadata or a system binary.
If you get nothing but an I/O error, then it's lost so it doesn't matter what it is, you can definitely overwrite it.
dd if=/dev/zero of=/dev/sda seek=17066160 count=1
You really don't want to do that without first finding out what file is using that block. You will convert a detected I/O error into silent corruption of that file, and that is a much worse situation.
On Fri, Aug 11, 2017 at 7:53 AM, Robert Nichols rnicholsNOSPAM@comcast.net wrote:
On 08/10/2017 11:06 AM, Chris Murphy wrote:
On Thu, Aug 10, 2017, 6:48 AM Robert Moskowitz rgm@htt-consult.com wrote:
On 08/09/2017 10:46 AM, Chris Murphy wrote:
If it's a bad sector problem, you'd write to sector 17066160 and see if the drive complies or spits back a write error. It looks like a bad sector in that the same LBA is reported each time but I've only ever seen this with both a read error and a UNC error. So I'm not sure it's a bad sector.
What is DID_BAD_TARGET?
I have no experience on how to force a write to a specific sector and not cause other problems. I suspect that this sector is in the / partition:
<snip>
LBA 17066160 would be on sda3.
dd if=/dev/sda skip=17066160 count=1 2>/dev/null | hexdump -C
That'll read that sector and display hex and ascii. If you recognize the contents, it's probably user data. Otherwise, it's file system metadata or a system binary.
If you get nothing but an I/O error, then it's lost so it doesn't matter what it is, you can definitely overwrite it.
dd if=/dev/zero of=/dev/sda seek=17066160 count=1
You really don't want to do that without first finding out what file is using that block. You will convert a detected I/O error into silent corruption of that file, and that is a much worse situation.
Yeah, he'd want to do an fsck -f and see if repairs are made, and also rpm -Va. There *will* be legitimately modified files, so it's going to be tedious to sort out exactly which ones are legitimately modified vs. corrupt. If it's a configuration file, I'd say you could ignore it, but any binary modified in anything other than permissions needs to be replaced, and one of those is the likely culprit.
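One rough way to narrow the rpm -Va output (a sketch, based on rpm's verify format: a '5' in the third column of the attribute string means a digest mismatch, and a 'c' marker in the second field flags config files):

```shell
# List files whose digest changed, skipping config files ('c' marker),
# which are expected to differ from the packaged versions anyway.
rpm -Va 2>/dev/null | awk 'substr($1, 3, 1) == "5" && $2 != "c"'
```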
The smartmontools page has hints on how to figure out what file is affected by a particular sector being corrupt, but the more layers are involved, the more difficult that gets. I'm not sure there's an easy way to do this with LVM in between the physical device and the file system.
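For the simple case in this thread (ext4 directly on the partition, no LVM), the smartmontools recipe boils down to some sector arithmetic plus debugfs. A sketch; the 4096-byte block size is an assumption to verify with tune2fs:

```shell
# Map the failing 512-byte sector to an ext4 block number, then to a file.
LBA=17066160            # absolute sector from the kernel log
PART_START=4196352      # first sector of /dev/sda3 (from fdisk)
BLOCK_SIZE=4096         # assumed; check with: tune2fs -l /dev/sda3
BLOCK=$(( (LBA - PART_START) * 512 / BLOCK_SIZE ))
echo "filesystem block: $BLOCK"
# Then, with e2fsprogs (read-only queries):
#   debugfs -R "icheck $BLOCK" /dev/sda3     # block -> inode
#   debugfs -R "ncheck <inode>" /dev/sda3    # inode -> path
```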
On 08/11/2017 12:16 PM, Chris Murphy wrote:
On Fri, Aug 11, 2017 at 7:53 AM, Robert Nichols rnicholsNOSPAM@comcast.net wrote:
On 08/10/2017 11:06 AM, Chris Murphy wrote:
On Thu, Aug 10, 2017, 6:48 AM Robert Moskowitz rgm@htt-consult.com wrote:
On 08/09/2017 10:46 AM, Chris Murphy wrote:
If it's a bad sector problem, you'd write to sector 17066160 and see if the drive complies or spits back a write error. It looks like a bad sector in that the same LBA is reported each time but I've only ever seen this with both a read error and a UNC error. So I'm not sure it's a bad sector.
What is DID_BAD_TARGET?
I have no experience on how to force a write to a specific sector and not cause other problems. I suspect that this sector is in the / partition:
<snip>
LBA 17066160 would be on sda3.
dd if=/dev/sda skip=17066160 count=1 2>/dev/null | hexdump -C
That'll read that sector and display hex and ascii. If you recognize the contents, it's probably user data. Otherwise, it's file system metadata or a system binary.
If you get nothing but an I/O error, then it's lost so it doesn't matter what it is, you can definitely overwrite it.
dd if=/dev/zero of=/dev/sda seek=17066160 count=1
You really don't want to do that without first finding out what file is using that block. You will convert a detected I/O error into silent corruption of that file, and that is a much worse situation.
Yeah he'd want to do an fsck -f and see if repairs are made, and also rpm -Va. There *will* be legitimately modified files, so it's going to be tedious to exactly sort out the ones that are legitimately modified vs corrupt. If it's a configuration file, I'd say you could ignore it but any modified binaries other than permissions need to be replaced and is the likely culprit.
The smartmontools page has hints on how to figure out what file is affected by a particular sector being corrupt but the more layers are involved the more difficult that gets. I'm not sure there's an easy to do this with LVM in between the physical device and file system.
fsck checks filesystem metadata, not the content of files. It is not going to detect that a file has had 512 bytes replaced by zeros. If the file is a non-configuration file installed from an RPM, then "rpm -Va" should flag it.
LVM certainly makes the procedure harder. Figuring out what filesystem block corresponds to that LBA is still possible, but you have to examine the LV layout in /etc/lvm/backup/ and learn more than you probably wanted to know about LVM.
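With LVM in the middle there is one extra translation step before the filesystem arithmetic. The numbers below are hypothetical placeholders; the real pe_start and segment layout come from pvs/lvs or /etc/lvm/backup/:

```shell
# Hypothetical sketch of the extra LVM translation step.
LBA=17066160          # failing sector (absolute, 512-byte units)
PART_START=4196352    # start of the partition holding the PV (fdisk)
PE_START=2048         # offset to the PV data area, in sectors (assumed;
                      # see: pvs --units s -o pv_name,pe_start)
# Sector offset into the PV's data area:
PV_SECTOR=$(( LBA - PART_START - PE_START ))
echo "sector $PV_SECTOR into the PV data area"
# Next, match PV_SECTOR against each LV segment's extent ranges
# (lvs --units s -o lv_name,seg_start,seg_size,devices) to find the
# LV and LV-relative sector, then proceed as in the no-LVM case.
```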
On Aug 11, 2017, at 1:07 PM, Robert Nichols rnicholsNOSPAM@comcast.net wrote:
Yeah he'd want to do an fsck -f and see if repairs are made.
fsck checks filesystem metadata, not the content of files.
Chris might have been thinking of fsck -c or -k, which do various sorts of badblocks scans.
That’s still a poor alternative to strong data checksumming and Merkle tree structured filesystems, of course.
LVM certainly makes the procedure harder. Figuring out what filesystem block corresponds to that LBA is still possible, but you have to examine the LV layout in /etc/lvm/backup/ and learn more than you probably wanted to know about LVM.
Linux kernel 4.8 added a feature called reverse mapping which is intended to solve this problem.
In principle, this will let you get a list of files that are known to be corrupted due to errors at the block layer, then fix it by removing or overwriting those files. The block layer, DM, LVM2, and filesystem layers will then be able to understand that those blocks are no longer corrupt, therefore the filesystem is fine, as are all the possible layers in between.
This understanding is based on a question I asked and had answered on the Stratis-Docs GitHub issue tracker:
https://github.com/stratis-storage/stratis-docs/issues/53
We’ll see how well it works in practice. It is certainly possible in principle: ZFS does this today.
Robert Nichols wrote:
On 08/11/2017 12:16 PM, Chris Murphy wrote:
On Fri, Aug 11, 2017 at 7:53 AM, Robert Nichols rnicholsNOSPAM@comcast.net wrote:
On 08/10/2017 11:06 AM, Chris Murphy wrote:
On Thu, Aug 10, 2017, 6:48 AM Robert Moskowitz rgm@htt-consult.com wrote:
On 08/09/2017 10:46 AM, Chris Murphy wrote:
If it's a bad sector problem, you'd write to sector 17066160 and see if the drive complies or spits back a write error. It looks like a bad sector in that the same LBA is reported each time but I've only ever seen this with both a read error and a UNC error. So I'm not sure it's a bad sector.
<snip>
That'll read that sector and display hex and ascii. If you recognize the contents, it's probably user data. Otherwise, it's file system metadata or a system binary.
If you get nothing but an I/O error, then it's lost so it doesn't matter what it is, you can definitely overwrite it.
dd if=/dev/zero of=/dev/sda seek=17066160 count=1
You really don't want to do that without first finding out what file is using that block. You will convert a detected I/O error into silent corruption ofthat file, and that is a much worse situation.
Yeah he'd want to do an fsck -f and see if repairs are made, and also
<snip>
fsck checks filesystem metadata, not the content of files. It is not going to detect that a file has had 512 bytes replaced by zeros. If the file is a non-configuration file installed from an RPM, then "rpm -Va" should flag it.
LVM certainly makes the procedure harder. Figuring out what filesystem block corresponds to that LBA is still possible, but you have to examine the LV layout in /etc/lvm/backup/ and learn more than you probably wanted to know about LVM.
I posted a link yesterday - let me know if you want me to repost it - to the web page of someone who REALLY knows about filesystems and sectors, and how to identify the file that a bad sector is part of.
And it works. I haven't needed it in a few years, but I have followed his directions, and identified the file on the bad sector.
mark
On 08/11/2017 02:32 PM, m.roth@5-cent.us wrote:
Robert Nichols wrote:
On 08/11/2017 12:16 PM, Chris Murphy wrote:
On Fri, Aug 11, 2017 at 7:53 AM, Robert Nichols rnicholsNOSPAM@comcast.net wrote:
On 08/10/2017 11:06 AM, Chris Murphy wrote:
On Thu, Aug 10, 2017, 6:48 AM Robert Moskowitz rgm@htt-consult.com wrote:
On 08/09/2017 10:46 AM, Chris Murphy wrote:
If it's a bad sector problem, you'd write to sector 17066160 and see if the drive complies or spits back a write error. It looks like a bad sector in that the same LBA is reported each time but I've only ever seen this with both a read error and a UNC error. So I'm not sure it's a bad sector.
<snip>
LVM certainly makes the procedure harder. Figuring out what filesystem block corresponds to that LBA is still possible, but you have to examine the LV layout in /etc/lvm/backup/ and learn more than you probably wanted to know about LVM.
I posted a link yesterday - let me know if you want me to repost it - to someone's web page who REALLY knows about filesystems and sectors, and how to identify the file that a bad sector is part of.
And it works. I haven't needed it in a few years, but I have followed his directions, and identified the file on the bad sector.
But, have you tried it when LVM is involved? That's an additional mapping layer for disk addresses that is not covered in the page you linked, which is just a partial copy of the smartmontools bad block HOWTO at https://www.smartmontools.org/browser/trunk/www/badblockhowto.xml . That smartmontools page does have a section that deals with LVM. I advise not looking at that on a full stomach.
Robert Moskowitz wrote:
I am building a new system using a Kingston 240GB SSD drive I pulled from my notebook (when I had to upgrade to a 500GB SSD drive). The CentOS install went fine and ran for a couple of days, then I got errors on the console. Here is an example:
<snip>
Eventually, I could not do anything on the system. Not even a 'reboot'. I had to do a cold power cycle to bring things back.
Is there anything to do about this or trash the drive and start anew?
Make sure the cables and power supply are ok. Try the drive in another machine that has a different controller to see if there is an incompatibility between the drive and the controller.
You could make a btrfs file system on the whole device: that should cause a trim operation to be performed on the whole device. Maybe that helps.
If the errors persist, replace the drive. I'd use Intel SSDs because they seem to have the fewest problems with broken firmware. Do not use SSDs with hardware RAID controllers unless the SSDs were designed for that application.
To be honest, I'd not try a btrfs volume on a notebook SSD. I did that on a couple of systems and it corrupted pretty quickly. I'd stick with xfs/ext4 if you manage to get the drive working again.
On Wed, Aug 9, 2017 at 1:48 PM, hw hw@gc-24.de wrote:
Robert Moskowitz wrote:
I am building a new system using a Kingston 240GB SSD drive I pulled from my notebook (when I had to upgrade to a 500GB SSD drive). The CentOS install went fine and ran for a couple of days, then I got errors on the console. Here is an example:
<snip>
Eventually, I could not do anything on the system. Not even a 'reboot'. I had to do a cold power cycle to bring things back.
Is there anything to do about this or trash the drive and start anew?
Make sure the cables and power supply are ok. Try the drive in another machine that has a different controller to see if there is an incompatibility between the drive and the controller.
You could make a btrfs file system on the whole device: that should say that a trim operation is performed for the whole device. Maybe that helps.
If the errors persist, replace the drive. I'd use Intel SSDs because they seem to have the fewest problems with broken firmware. Do not use SSDs with hardware RAID controllers unless the SSDs were designed for that application.
Mark Haney wrote:
To be honest, I'd not try a btrfs volume on a notebook SSD. I did that on a couple of systems and it corrupted pretty quickly. I'd stick with xfs/ext4 if you manage to get the drive working again.
That was merely to see if a trim operation on the whole device would bring some improvement.
I have the system on SSDs at home and data on spinning disks, so far no problems with btrfs. Do I need to worry now?
On Wed, Aug 9, 2017, 11:55 AM Mark Haney mark.haney@neonova.net wrote:
To be honest, I'd not try a btrfs volume on a notebook SSD. I did that on a couple of systems and it corrupted pretty quickly. I'd stick with xfs/ext4
if you manage to get the drive working again.
Sounds like a hardware problem. Btrfs is explicitly optimized for SSD; the maintainers worked for FusionIO for several years of its development. If the drive is silently corrupting data, Btrfs will pretty much immediately start complaining where other filesystems will continue on. Bad RAM can also result in scary warnings that you don't get with other filesystems. And I've been using it on numerous SSDs for years, and NVMe for a year, with zero problems.
On CentOS though, I'd get newer btrfs-progs RPM from Fedora, and use either an elrepo.org kernel, a Fedora kernel, or build my own latest long-term from kernel.org. There's just too much development that's happened since the tree found in RHEL/CentOS kernels.
Also FWIW Red Hat is deprecating Btrfs, in the RHEL 7.4 announcement. Support will be removed probably in RHEL 8. I have no idea how it'll affect CentOS kernels though. It will remain in Fedora kernels.
Anyway, blkdiscard can be used on an SSD, whole device or partition, to zero them out. And at least recent ext4 and XFS mkfs will do a blkdiscard, same as mkfs.btrfs.
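A sketch of that, with `/dev/sdX` as a placeholder for the real device. Note that `blkdiscard` throws away the entire contents of whatever you point it at:

```shell
DEV=/dev/sdX                    # placeholder -- substitute the real disk
if [ -b "$DEV" ]; then
    blkdiscard "$DEV"           # TRIM/discard the whole device (erases everything)
    # blkdiscard -o 0 -l 1048576 "$DEV"  # or just a range (byte offset/length)
    mkfs.xfs "$DEV"             # recent mkfs.xfs/mkfs.ext4/mkfs.btrfs also
                                # issue a discard of the target at mkfs time
else
    echo "no such block device: $DEV"
fi
```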
Chris Murphy
-- [image: photo] Mark Haney Network Engineer at NeoNova 919-460-3330 <(919)%20460-3330> (opt 1) • mark.haney@neonova.net www.neonova.net https://neonova.net/ https://www.facebook.com/NeoNovaNNS/ https://twitter.com/NeoNova_NNS http://www.linkedin.com/company/neonova-network-services _______________________________________________ CentOS mailing list CentOS@centos.org https://lists.centos.org/mailman/listinfo/centos
Chris Murphy wrote:
On Wed, Aug 9, 2017, 11:55 AM Mark Haney mark.haney@neonova.net wrote:
To be honest, I'd not try a btrfs volume on a notebook SSD. I did that on a couple of systems and it corrupted pretty quickly. I'd stick with xfs/ext4
if you manage to get the drive working again.
Sounds like a hardware problem. Btrfs is explicitly optimized for SSD, the maintainers worked for FusionIO for several years of its development. If the drive is silently corrupting data, Btrfs will pretty much immediately start complaining where other filesystems will continue. Bad RAM can also result in scary warnings where you don't with other filesytems. And I've been using it in numerous SSDs for years and NVMe for a year with zero problems.
That's one thing I've been wondering about: when using btrfs RAID, do you need to somehow monitor the disks to see if one has failed?
On CentOS though, I'd get newer btrfs-progs RPM from Fedora, and use either an elrepo.org kernel, a Fedora kernel, or build my own latest long-term from kernel.org. There's just too much development that's happened since the tree found in RHEL/CentOS kernels.
I can't go with a more recent kernel version before NVIDIA has updated their drivers to no longer need fence.h (or whatever it was).
And I thought stuff gets backported, especially things as important as file systems.
Also FWIW Red Hat is deprecating Btrfs, in the RHEL 7.4 announcement. Support will be removed probably in RHEL 8. I have no idea how it'll affect CentOS kernels though. It will remain in Fedora kernels.
That would suck badly, to the point at which I'd have to look for yet another distribution. The only one remaining is Arch.
What do they suggest as a replacement? The only other FS that comes close is ZFS, and removing btrfs altogether would be taking living in the past too many steps too far.
On 08/09/2017 01:48 PM, hw wrote:
Make sure the cables and power supply are OK. Try the drive in another machine with a different controller to see if there is an incompatibility between the drive and the controller.
You could make a btrfs file system on the whole device: that should cause a trim operation to be performed on the whole device. Maybe that helps.
This is a CentOS7-armv7hl install, which is done by dd'ing the provided image onto a drive, so I really can't alter the provided file systems much other than to resize them. What I have is:
Model: ATA KINGSTON SV300S3 (scsi)
Disk /dev/sda: 240GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags:

Number  Start   End     Size    Type     File system     Flags
 1      1049kB  1075MB  1074MB  primary  ext3
 2      1075MB  2149MB  1074MB  primary  linux-swap(v1)
 3      2149MB  240GB   238GB   primary  ext4
Robert Moskowitz wrote:
This is a Centos7-armv7hl install which is done by dd the provided image onto a drive, so really can't alter the provided file systems much other than to resize them. What I have is:
Perhaps there's some incompatibility on this architecture.
BTW, that the cables sit tight doesn't mean they are good.
I have yet to see an SSD read/write error that wasn't related to disk issues like a bad sector, but the controller might have an issue with the drive. To verify it you will need to burn some read/write IOPS on the drive, but if it's under warranty then it's better to verify it now than later.
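One way to burn those read/write IOPS and check data integrity at the same time is an fio job. This is only a sketch: it assumes `fio` is installed, and `/dev/sdX` is a placeholder for a scratch drive whose contents can be destroyed:

```shell
DEV=/dev/sdX                    # placeholder -- a scratch disk only
if [ -b "$DEV" ]; then
    # Random mixed read/write over the raw device with CRC verification;
    # this destroys all data on $DEV.
    fio --name=burnin --filename="$DEV" --rw=randrw --bs=4k --size=4G \
        --ioengine=libaio --iodepth=16 --direct=1 \
        --verify=crc32c --do_verify=1
else
    echo "no such block device: $DEV"
fi
```

If the DID_BAD_TARGET errors reappear under this load, you have a reproducer to take to the warranty claim.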
Eliezer
---- Eliezer Croitoru Linux System Administrator Mobile: +972-5-28704261 Email: eliezer@ngtech.co.il
-----Original Message-----
From: CentOS [mailto:centos-bounces@centos.org] On Behalf Of Robert Moskowitz
Sent: Wednesday, August 9, 2017 17:03
To: CentOS mailing list centos@centos.org
Subject: [CentOS] Errors on an SSD drive
What file system are you using? SSD drives have different characteristics that need to be accommodated (including a relatively slow write process, which becomes obvious as soon as the buffer is full), and never, never put a swap partition on one; the high activity will wear it out rather quickly. Might also check cables, often a problem particularly if they are older SATA cables being run at a possibly higher than rated speed. In any case, reformatting it might not be a bad idea, and you can always use the command line program badblocks to exercise and test it. Keep in mind the drive will invisibly remap any bad sectors if possible. If the reported size of the drive is smaller than it should be, the drive has run out of spare blocks and dying blocks are being removed from the storage pool with no replacements.
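A sketch of that badblocks run, with `/dev/sdX` as a placeholder for the real device. The `-n` mode preserves existing data but is slow; the commented `-w` mode erases the whole drive:

```shell
DEV=/dev/sdX                    # placeholder -- substitute the real disk
if [ -b "$DEV" ]; then
    badblocks -n -s -v "$DEV"   # non-destructive read-write test, show progress
    # badblocks -w -s -v "$DEV" # destructive write test: erases ALL data
else
    echo "no such block device: $DEV"
fi
```

Run it on the unmounted device, not on a mounted filesystem.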
-- Securely sent with Tutanota. Claim your encrypted mailbox today! https://tutanota.com
On Thu, 10 Aug 2017, mad.scientist.at.large@tutanota.com wrote:
what file system are you using? ssd drives have different characteristics that need to be accommodated (including a relatively slow write process which is obvious as soon as the buffer is full), and never, never put a swap partition on it, the high activity will wear it out rather quickly.
I know this is common doctrine, but is this still generally held true?
For a well configured desktop that rarely needs to swap, I struggle to see the load on the SSD as being significant, and yet obviously the performance of an SSD would make it ideal for swap.
might also check cables, often a problem particularly if they are older sata cables being run at a possibly higher than rated speed. in any case, reformatting it might not be a bad idea, and you can always use the command line program badblocks to exercise and test it.
Exercising an SSD?
smartctl will give you sensible information on what the drive thinks of itself, and will give you actual figures on wear levelling and such like.
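For example, something along these lines pulls the wear-related attributes. The attribute names vary by vendor, so the grep pattern is only a guess, and `/dev/sdX` is a placeholder:

```shell
DEV=/dev/sdX                    # placeholder -- substitute the real disk
if [ -b "$DEV" ]; then
    # Vendor attribute names differ (Wear_Leveling_Count,
    # Media_Wearout_Indicator, Percent_Lifetime_Remain, ...), so match loosely:
    smartctl -A "$DEV" | grep -iE 'wear|lifetime|realloc|spare'
    smartctl -x "$DEV"          # full dump, including the device error log
else
    echo "no such block device: $DEV"
fi
```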
keep in mind the drive will invisibly remap any bad sectors if possible. if the reported size of the drive is smaller than it should be the drive has run out of spare blocks and dying blocks are being removed from the storage place with no replacements.
Coo, I've never seen a disk actually shrink due to failed sectors. I don't think I've got an SSD into a worn state yet to see this.
jh
On Aug 10, 2017, at 2:07 AM, John Hodrien J.H.Hodrien@leeds.ac.uk wrote:
For a well configured desktop that rarely needs to swap, I struggle to see the load on the SSD as being significant, and yet obviously the performance of an SSD would make it ideal for swap.
I agree.
It’s a bad idea to do without swap even if you almost never use it, because today’s bloated apps often have many pages of virtual memory they rarely or never actually touch. You want those pages to get swapped out quickly so that the precious RAM can be used more productively; by the buffer cache, if nothing else.
I once used a web application server on a headless VPS that still had GUI libraries linked to its binary because one of the underlying technologies it uses was also used in a GUI app, and it was too difficult to tear all that GUI code out, even if it was never called. Because the VPS technology didn’t support swap, I directly paid the price for those megs of unused (and unusable!) libraries in my monthly VPS rental fees.
Coo, I've never seen a disk actually shrink due to failed sectors. I don't think I've got an SSD into a worn state yet to see this.
Me, neither. I’m pretty sure the spare sector pool’s size isn’t reported to the OS, and the drive isn’t allowed to dip into the sectors it does expose externally for spares.
When the spare pool is used up, the drive just starts failing in a way that even SMART can see.
On 8/10/2017 1:12 PM, Warren Young wrote:
It’s a bad idea to do without swap even if you almost never use it, because today’s bloated apps often have many pages of virtual memory they rarely or never actually touch. You want those pages to get swapped out quickly so that the precious RAM can be used more productively; by the buffer cache, if nothing else.
most modern virtual memory OSes don't swap out unused pages; instead, they page IN accessed pages directly from the executable file. the only things written to swap are 'dirty' pages that have been changed since loading.
On 10/08/17 21:17, John R Pierce wrote:
On 8/10/2017 1:12 PM, Warren Young wrote:
It’s a bad idea to do without swap even if you almost never use it, because today’s bloated apps often have many pages of virtual memory they rarely or never actually touch. You want those pages to get swapped out quickly so that the precious RAM can be used more productively; by the buffer cache, if nothing else.
most modern virtual memory OS's don't swap out unused pages, instead, they swap IN accessed pages directly from the executable file. only thing written to swap are 'dirty' pages that have been changed since loading.
Modern? They've been doing that since I did my VMS theory 30-odd years ago.
On Aug 10, 2017, at 2:17 PM, John R Pierce pierce@hogranch.com wrote:
On 8/10/2017 1:12 PM, Warren Young wrote:
You want those pages to get swapped out quickly so that the precious RAM can be used more productively; by the buffer cache, if nothing else.
most modern virtual memory OS's don't swap out unused pages, instead, they swap IN accessed pages directly from the executable file. only thing written to swap are 'dirty' pages that have been changed since loading.
Is that not a distinction without a difference in my case?
Let’s say I have a system with 256 MB of free user-space RAM, and I have a binary that happens to be nearly 256 MB on disk, between the main executable and all the libraries it uses.
Question: Can my program allocate any dynamic RAM?
The OS’s VMM is free to use addresses beyond 0-256 MB, but since we’ve said there is no swap space, everything swapped in must still be assigned a place in physical RAM *somewhere*.
Is there a meaningful distinction between:
Scenario 1: The application’s first few executable pages are loaded from disk, a few key libraries are loaded, then the application does a dynamic memory allocation, then somehow causes all the rest of the executable pages to be loaded, running the system out of RAM.
Scenario 2: The application is entirely loaded into RAM, nearly filling it, then the application attempts a large dynamic memory allocation, causing an OOM error.
Regardless of the answer to these questions, I can tell you that switching that web site to a more efficient web application stack allowed us to shrink the VPS from a 256 MB plan, under which it would occasionally crash and require a reboot, to a 64 MB plan, under which the site has been rock-solid. Same VPS provider, same web site content, same user-facing functionality.
If I’d had the ability to assign swap space, I probably could have gotten away with a 64 MB VPS plan with the inefficient web technology, too. They gave me plenty of disk space with that plan.
(And no, swapon /some-file is no solution here. The VPS technology simply didn’t allow swap space, even from a swap file on one of the system disks. It wasn’t simply an inability to add a swap partition.)
On 08/09/2017 10:44 PM, mad.scientist.at.large@tutanota.com wrote:
what file system are you using? ssd drives have different characteristics that need to be accomadated (including a relatively slow write process which is obvious as soon as the buffer is full), and never, never put a swap partition on it, the high activity will wear it out rather quickly. might also check cables, often a problem particularly if they are older sata cables being run at a possibly higher than rated speed.
When working with a Cubieboard SoC (or most of the other armv7 boards), you tend to have everything hanging out: http://medon.htt-consult.com/~rgm/cubieboard/cubietower-2.JPG
I have checked the cables and they are all tight.
in any case, reformating it might not be a bad idea, and you can always use the command line program badblocks to exercise and test it.
I will have to look into that.
keep in mind the drive will invisibly remap any bad sectors if possible. if the reported size of the drive is smaller than it should be the drive has run out of spare blocks and dying blocks are being removed from the storage place with no replacements.
Robert Moskowitz wrote:
On 08/09/2017 10:44 PM, mad.scientist.at.large@tutanota.com wrote:
in any case, reformating it might not be a bad idea, and you can always use the command line program badblocks to exercise and test it.
I will have to look into that.
Here's a thought: I've not done this, but could you use smartctl to check the drive?
mark
On 08/10/2017 10:31 AM, m.roth@5-cent.us wrote:
Here's a thought: I've not done this, but could you use smartctl to check the drive?
Other than the 17K of output from smartctl -x, what do you recommend?