I have a few disks that have offline uncorrectable sectors;
I found on this page how to identify the sectors and force a write on them to trigger the relocation of bad sectors on the disk:
http://smartmontools.sourceforge.net/BadBlockHowTo.txt
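For reference, the identification step in that document boils down to something like this (just a sketch; /dev/sda is only an example):

  smartctl -A /dev/sda           # look at Current_Pending_Sector / Offline_Uncorrectable
  smartctl -t long /dev/sda      # run the extended self-test
  smartctl -l selftest /dev/sda  # the self-test log shows the LBA of the first error

The reported LBA is then rewritten with dd to trigger the reallocation.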
My question is:
since I'm too lazy to follow the whole procedure, do you think that a forced rewrite of the full disk would work?
Eg. "dd if=/dev/sda pf=/dev/sda bs=512"
Should this be done at runlevel 1 or offline, or can I do it without too many worries, since I'm reading and rewriting the same data on the disk?
TIA and sorry for the OT
Lorenzo Quatrini
Lorenzo Quatrini wrote:
I have a few disks that have offline uncorrectable sectors;
Ideally it should be done using the manufacturer's tools, and really, any disk that has even one bad sector that the OS can see should not be relied upon; it should be considered a failed disk. Disks automatically keep spare sectors that the operating system cannot see and re-map bad sectors to them; if you're seeing bad sectors, that means the collection of spares has been exhausted. I've never seen a disk manufacturer not accept a disk that had bad sectors on it (that was still under warranty) for as long as I can remember.
nate
nate wrote:
Lorenzo Quatrini wrote:
I have a few disks that have offline uncorrectable sectors;
Ideally it should be done using the manufacturer's tools, and really, any disk that has even one bad sector that the OS can see should not be relied upon; it should be considered a failed disk. Disks automatically keep spare sectors that the operating system cannot see and re-map bad sectors to them; if you're seeing bad sectors, that means the collection of spares has been exhausted. I've never seen a disk manufacturer not accept a disk that had bad sectors on it (that was still under warranty) for as long as I can remember.
nate
From what I understand, "offline uncorrectable" means that the sector will be relocated the next time it is accessed for writing... so it is in a "waiting for relocation" status. I don't know of any other way to force this relocation other than actually writing over the sector (a simple read doesn't trigger the relocation)...
And yes, I know that a disk with bad blocks isn't reliable, but remember? I'm too lazy to send my home disks back to the manufacturer ;)
Lorenzo
Lorenzo Quatrini wrote:
From what I understand, "offline uncorrectable" means that the sector will be relocated the next time it is accessed for writing... so it is in a "waiting for relocation" status. I don't know of any other way to force this relocation other than actually writing over the sector (a simple read doesn't trigger the relocation)...
Not sure myself, but the manufacturer's testing tools have non-destructive ways of detecting and re-mapping bad sectors. Of course, a downside to the manufacturer's tools is that they often only support a limited number of disk controllers.
It's probably been since the IBM "Deathstar" 75GXP that I last recall having drives with bad sectors on them, but typically, at least at that time, when the OS encountered a bad sector it didn't handle it too gracefully; oftentimes I had to reboot the system. Perhaps the Linux kernel is more robust about those things these days (I had roughly 75% of my 75GXP drives fail - more than 30).
Interesting that the man page for e2fsck in RHEL 4 doesn't describe the -c option, but the man page for it in RHEL 3 does; not sure if that is significant (the RHEL 4 man page mentions the option, but gives no clear description of what it does). I haven't checked RHEL/CentOS 5.
from RHEL 3 manpage: -c This option causes e2fsck to run the badblocks(8) program to find any blocks which are bad on the filesystem, and then marks them as bad by adding them to the bad block inode. If this option is specified twice, then the bad block scan will be done using a non-destructive read-write test.
So if you haven't heard of it already, try e2fsck -c <device> ? I recall using this off and on about 10 years ago but found the manufacturer's tools to be more accurate.
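Something like this, on an unmounted filesystem (the partition name is only an example):

  e2fsck -c /dev/sda1       # read-only badblocks scan; hits get added to the bad block inode
  e2fsck -c -c /dev/sda1    # -c twice: slower, non-destructive read-write scan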
And yes, I know that a disk with bad blocks isn't reliable, but remember? I'm too lazy to send my home disks back to the manufacturer ;)
Ahh, OK, I see... just keep in mind that it's quite possible the bad sector count will continue to mount as time goes on.
good luck ..
nate
On Fri, Aug 22, 2008 at 9:26 AM, nate centos@linuxpowered.net wrote:
<snip>
There is a thread on this topic in the CentOS forum:
http://www.centos.org/modules/newbb/viewtopic.php?topic_id=15880&forum=3...
Akemi
On Fri, 2008-08-22 at 18:07 +0200, Lorenzo Quatrini wrote:
nate wrote:
<snip>
From what I understand, "offline uncorrectable" means that the sector will be relocated the next time it is accessed for writing... so it is in a "waiting for relocation" status.
If my memory is still good (I don't recall if it is or not! :-) you are correct.
I don't know of any other way to force this relocation other than actually writing over the sector (a simple read doesn't trigger the relocation)...
You can force this with dd, using the seek, skip and bs parameters to (re)write only the desired sectors. The "if=" and "of=" parameters would reference the physical device or partition, and "skip="/"seek=" would be the offset to the sector. Be very careful and have good backups. In fact, you could test by making an image of the partition and doing a test run on that.
It might be a lot easier to reference the sector from the start of the disk, as the reported sectors will be relative to that.
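For example (a sketch only; the device and LBA are hypothetical, and this assumes the sector is still readable):

  dd if=/dev/sda of=/tmp/sector.bin bs=512 count=1 skip=12345678   # save the suspect sector
  dd if=/tmp/sector.bin of=/dev/sda bs=512 count=1 seek=12345678   # write it back in place

If the read fails outright, writing zeros over just that sector (destroying its contents) is what usually triggers the reallocation:

  dd if=/dev/zero of=/dev/sda bs=512 count=1 seek=12345678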
And yes, I know that a disk with bad blocks isn't reliable, but remember? I'm too lazy to send my home disks back to the manufacturer ;)
If my other post is correct, it may still be reliable (or be getting old enough to become so).
Lorenzo
<snip sig stuff>
HTH
on 8-22-2008 9:07 AM Lorenzo Quatrini spake the following:
<snip>
From what I understand, "offline uncorrectable" means that the sector will be relocated the next time it is accessed for writing... so it is in a "waiting for relocation" status. I don't know of any other way to force this relocation other than actually writing over the sector (a simple read doesn't trigger the relocation)...
And yes, I know that a disk with bad blocks isn't reliable, but remember? I'm too lazy to send my home disks back to the manufacturer ;)
Then I hope you are not too lazy to do some proper backups! Sending a disk back to be replaced is a lot less work than recovering a failed array when the disk tanks. How much is your data worth? I know from experience that a 6-drive RAID 5 array can run near $10,000 US to recover.
On Fri, 2008-08-22 at 08:59 -0700, nate wrote:
Lorenzo Quatrini wrote:
I have a few disks that have offline uncorrectable sectors;
Ideally it should be done using the manufacturer's tools,
Second that!
and really, any disk that has even one bad sector that the OS can see should not be relied upon; it should be considered a failed disk. Disks automatically keep spare sectors that the operating system cannot see and re-map bad sectors to them; if you're seeing bad sectors, that means the collection of spares has been exhausted. I've never seen a disk manufacturer
?? Uncertain about "spares has been exhausted". I recently had one SATA drive that kept reporting a bad sector (which actually grew to three). Being inured against panic attacks by long exposure to panic-inducing situations, I decided to let it ride a bit (it was an empty re-used partition upon which I would mke2fs and temporarily mount and use) and see if the number continued to grow. To this end, I ran the SMART tools' extended tests several times over a period of a week and saw no new ones. This was reassuring, as traditionally, if failure is imminent, the number tends to grow quickly. A few appearances of bad sectors early in the drive's lifetime are not an unusual occurrence and are not a reason to trade in the drive (after all, in this case the manufacturer just runs the repair software on it and re-sells it). It *is* a reason for heightened caution and alertness, depending on your situation.
After deciding the drive was not in its death-throes, I downloaded the DOS utilities from the manufacturer web site and ran the repair utilities. No smart tools reports of bad sectors since then (about 2 months so far).
Now, I don't know (or care) if an alternate sector was assigned, just that the sector was flagged unusable. For my use (temporary use - no permanent or critical data) this is fine. Last several mke2fs runs have produced the same amount of usable blocks and i-nodes, so I don't see evidence that no spare was available.
I do expect that a few more sectors will be found as the drive ages until the manufacturing weak areas have all aged sufficiently to cause failures.
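(For anyone following along, the SMART extended self-test referred to above is typically run with smartctl along these lines; the device name is just an example:)

  smartctl -t long /dev/sda      # kick off the extended self-test
  smartctl -l selftest /dev/sda  # check the result once it completes
  smartctl -A /dev/sda           # watch Reallocated_Sector_Ct / Current_Pending_Sector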
not accept a disk that had bad sectors on it (that was still under warranty) for as long as I can remember.
If your application is critical and you still have warranty, the only cost is the inconvenience, delay and extra work to get it exchanged. You will likely receive a "reconditioned" drive, though. So for me, in my situation, downloading and using the manufacturer's repair software is better. The only bad part is that instead of using floppies, they now seem to want a CD/DVD to boot from. A minor inconvenience considering the alternatives.
nate
<snip sig stuff>
HTH
William L. Maltby wrote:
?? Uncertain about "spares has been exhausted".
I don't recall where I read it, and I suppose it may be misinformation, but it made sense at the time. The idea is that disks are not made to hold EXACTLY the number of blocks the specs call for. There are some extra blocks that the disk "hides" from the disk controller. The disk automatically re-maps bad sectors to these hidden blocks (making them visible again). By the time bad blocks start showing up at the OS level, these extra blocks are already used up, an indication that there are far more bad blocks on the disk than just the ones you can see at the OS level.
Now, I don't know (or care) if an alternate sector was assigned, just that the sector was flagged unusable. For my use (temporary use - no permanent or critical data) this is fine. Last several mke2fs runs have produced the same amount of usable blocks and i-nodes, so I don't see evidence that no spare was available.
Note that mke2fs doesn't write over the entire disk; I doubt it even scans the entire disk. I've used a technology called thin provisioning, where only data that is written to disk is actually allocated on disk (e.g. you can create a 1TB volume, and if you only write 1GB to it, it only uses 1GB, allowing you to oversubscribe the system and dynamically grow physical storage as needed). When allocating thinly provisioned volumes and formatting them with mke2fs, even on multi-hundred-gig systems only a few megs are written to disk (perhaps a hundred megs).
nate
On Fri, 2008-08-22 at 09:33 -0700, nate wrote:
William L. Maltby wrote:
?? Uncertain about "spares has been exhausted".
I don't recall where I read it, and I suppose it may be misinformation, but it made sense at the time. The idea is that disks are not made to hold EXACTLY the number of blocks the specs call for. There are some extra blocks that the disk "hides" from the disk controller. The disk automatically re-maps bad sectors to these hidden blocks (making them visible again). By
That is correct. Back in the old days, we had access to a "spares" cylinder and could manually maintain the alternate sectors table. We could wipe it, add sectors etc.
As technology progressed, this capability disappeared and the drive electronics and proms began taking care of it.
What I don't know (extreme lack of sufficient interest to find out so far) is whether the self-monitoring tools report a sector when a *read* results in either a hard or soft failure, and whether the drive tries to reassign it at that time. My local evidence seems to indicate that the report is made at read time but assignment of a spare is not made then, because the same three sectors kept reporting over and over.
After running the repair software, the messages stopped, indicating that the bad sectors were then marked unusable and alternate sectors had been assigned.
the time bad blocks start showing up at the OS level, these extra blocks are already used up, an indication that there are far more bad blocks on the disk than just the ones you can see at the OS level.
Correct.
Now, I don't know (or care) if an alternate sector was assigned, just that the sector was flagged unusable. For my use (temporary use - no permanent or critical data) this is fine. Last several mke2fs runs have produced the same amount of usable blocks and i-nodes, so I don't see evidence that no spare was available.
Note that mke2fs doesn't write over the entire disk; I doubt it even scans the entire disk.
Correct, unless the check is forced. I failed to note in my previous post that a *substantial* portion of the partition was written (which I knew included the questionable sectors through manual math and the nature of file system usage).
I've used a technology called thin provisioning, where only data that is written to disk is actually allocated on disk (e.g. you can create a 1TB volume, and if you only write 1GB to it, it only uses 1GB, allowing you to oversubscribe the system and dynamically grow physical storage as needed). When allocating thinly provisioned volumes and formatting them with mke2fs, even on multi-hundred-gig systems only a few megs are written to disk (perhaps a hundred megs).
Yep. Only a few copies of the superblock and the i-node tables are written by the file system make process. That's why it's important for file systems in critical applications to be created with the check forced. Folks should also keep in mind that the default check, read-only, is really not sufficient for critical situations. The full write/read check should be forced on *new* partitions/disks.
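A minimal illustration of forcing the check at creation time, assuming a brand-new, unmounted partition (the device name is only an example):

  mke2fs -c /dev/sdb1       # read-only bad block scan before creating the filesystem
  mke2fs -c -c /dev/sdb1    # -c twice: slower read-write scan (fine on a new, empty partition)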
nate
<snip sig stuff>
William L. Maltby wrote:
Yep. Only a few copies of the superblock and the i-node tables are written by the file system make process. That's why it's important for file systems in critical applications to be created with the check forced. Folks should also keep in mind that the default check, read-only, is really not sufficient for critical situations. The full write/read check should be forced on *new* partitions/disks.
So again my question is: can I use dd to "test" the disk? what about
dd if=/dev/sda of=/dev/sda bs=512
Is this safe on a fully running system? Does it have to be done at runlevel 1 or with a live CD? I think this is "better" than the manufacturer's way, as dd is always present and works with any brand.
Lorenzo
On Mon, Aug 25, 2008 at 10:43:01AM +0200, Lorenzo Quatrini wrote:
So again my question is: can I use dd to "test" the disk? what about
dd if=/dev/sda of=/dev/sda bs=512
Is this safe on a fully running system? Does it have to be done at runlevel 1 or with a live CD?
Do not do this on a mounted filesystem; you risk corruption. I'd be leery of this command, though.
A better way is to use the "badblocks" command: if you want to keep the data, use "badblocks -n"; if you don't care about the data, use "badblocks -w". Again, you can't do this on a mounted filesystem.
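Roughly (on an unmounted device; the name is only an example):

  badblocks -n -s /dev/sda1   # non-destructive read-write test, existing data is preserved
  badblocks -w -s /dev/sda1   # write-mode test, DESTROYS everything on the device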
On Mon, 2008-08-25 at 06:36 -0400, Stephen Harris wrote:
On Mon, Aug 25, 2008 at 10:43:01AM +0200, Lorenzo Quatrini wrote:
So again my question is: can I use dd to "test" the disk? what about
dd if=/dev/sda of=/dev/sda bs=512
Is this safe on a fully running system? Does it have to be done at runlevel 1 or with a live CD?
Do not do this on a mounted filesystem; you risk corruption. I'd be leery of this command, though.
Whoo-hoo! The question un-asked ...
I didn't even think of mentioning this to him in my other reply. I'm glad you jumped on that.
A better way is to use the "badblocks" command: if you want to keep the data, use "badblocks -n"; if you don't care about the data, use "badblocks -w". Again, you can't do this on a mounted filesystem.
This is *far* superior to the OP's thoughts of dd.
And I'll remind folks here, as mentioned in my other post, about "hard" and "soft" errors: "soft" errors are not seen by the OS.
"Badblocks" (which really should be invoked via mke2fs or e2fsck rather than manually) has useful, but limited, utility in ensuring reliability. And it does require some small storage space in the file system. And it does *not* assign alternate blocks (that is, it does not take advantage of the hardware alternate block capability). And it is not "predictive", thereby being useful only for keeping an FS usable *after* data has been (potentially) lost on an existing file system. It's best utility is at FS creation and check time. It also has use if you can un-mount the FS (ignoring the "force" capability provided) but cannot take the system down to run manufacturer-specific diagnostic and repair software.
On Mon, Aug 25, 2008 at 07:24:24AM -0400, William L. Maltby wrote:
"Badblocks" (which really should be invoked via mke2fs or e2fsck rather than manually) has useful, but limited, utility in ensuring reliability. And it does require some small storage space in the file system. And it does *not* assign alternate blocks (that is, it does not take advantage of the hardware alternate block capability). And it is not "predictive", thereby being useful only for keeping an FS usable *after* data has been (potentially) lost on an existing file system. It's best utility is at FS creation and check time. It also has use if you can un-mount the FS (ignoring the "force" capability provided) but cannot take the system down to run manufacturer-specific diagnostic and repair software.
It might be interesting to add a "catch 22" story.
I once added -c flags to /fsckoptions and "touch"ed /forcefsck. I had to take the disk to the lab and fix it on a bench system.
On Mon, 2008-08-25 at 12:03 -0700, Nifty Cluster Mitch wrote:
On Mon, Aug 25, 2008 at 07:24:24AM -0400, William L. Maltby wrote:
<snip>
(potentially) lost on an existing file system. Its best utility is at FS creation and check time. It also has use if you can unmount the FS (ignoring the "force" capability provided) but cannot take the system down to run manufacturer-specific diagnostic and repair software.
It might be interesting to add a "catch 22" story.
I once added -c flags to /fsckoptions and "touch"ed /forcefsck. I had to take the disk to the lab and fix it on a bench system.
YOIKS! Any explanation why such a reliable process would cause such a result? Was it a long time ago with a buggy e2fsck maybe? Did you mean to say you added the "-f" flag and the FS was mounted and active at the time? Is it just one of those "Mysteries of the Universe"? I hate those!
<snip>
On Mon, Aug 25, 2008 at 03:43:18PM -0400, William L. Maltby wrote:
On Mon, 2008-08-25 at 12:03 -0700, Nifty Cluster Mitch wrote:
On Mon, Aug 25, 2008 at 07:24:24AM -0400, William L. Maltby wrote:
<snip>
It might be interesting to add a "catch 22" story.
I once added -c flags to /fsckoptions and "touch"ed /forcefsck. I had to take the disk to the lab and fix it on a bench system.
YOIKS! Any explanation why such a reliable process would cause such a result? Was it a long time ago with a buggy e2fsck maybe? Did you mean to say you added the "-f" flag and the FS was mounted and active at the time? Is it just one of those "Mysteries of the Universe"? I hate those!
The removal of /forcefsck would never happen when badblocks was run. Something wonky, perhaps, because I did have a disk with defects.
Might be worth a retry next time I need to clean and reload a machine, but I do not know how to reproduce the disk hardware issue.
Gone are the days when disk controllers gave you the ability to 'expose' defects.
On Mon, 2008-08-25 at 15:36 -0700, Nifty Cluster Mitch wrote:
On Mon, Aug 25, 2008 at 03:43:18PM -0400, William L. Maltby wrote:
On Mon, 2008-08-25 at 12:03 -0700, Nifty Cluster Mitch wrote:
On Mon, Aug 25, 2008 at 07:24:24AM -0400, William L. Maltby wrote:
<snip>
The removal of /forcefsck would never happen when badblocks was run. Something wonky, perhaps, because I did have a disk with defects.
Might be worth a retry next time I need to clean and reload a machine, but I do not know how to reproduce the disk hardware issue.
Gone are the days when disk controllers gave you the ability to 'expose' defects.
I don't have an available "smart" drive here at home, but I do have some older stuff. I think we can "emulate" defects by defining a partition that runs a few sectors beyond the end of the HD, then running mke2fs with -c -c and a manually specified size that includes the phantom sectors.
When I get time (won't be RSN) I'll do both a mke2fs test and then an e2fsck test. What I don't know is if notification of "beyond media end" is sent by hardware and caught by drivers or if drivers just catch an error and a bad block (sector) is presumed, to be logged and avoided. ISTR (on SCSI anyway) that read past media end was handled. But, this ain't SCSI! 8-)
If someone has a setup that makes this a quick and easy test to run sooner than I'll be able to, that would be "peachy".
<snip>
On Mon, 2008-08-25 at 10:43 +0200, Lorenzo Quatrini wrote:
William L. Maltby wrote:
Yep. Only a few copies of the superblock and the i-node tables are written by the file system make process. That's why it's important for file systems in critical applications to be created with the check forced. Folks should also keep in mind that the default check, read-only, is really not sufficient for critical situations. The full write/read check should be forced on *new* partitions/disks.
First, a correction. I earlier mentioned "-C" as causing the read/write check for mke2fs. It is "-c -c". I must've been thinking of some other FS software.
So again my question is: can I use dd to "test" the disk? what about
dd if=/dev/sda of=/dev/sda bs=512
It ought to do what you think it would. But ...
Is this safe on a fully running system? Does it have to be done at runlevel 1 or with a live CD?
Safe on a fully running system? Probably. I suggest a test before you do it on an important system. I've never had the urge to do it the way you suggest. It can be done at runlevel 1 or from a live CD too. But ...
I think this is "better" than the manufacturer's way, as dd is always present and works with any brand.
s/better/convenient/ # IMO
Now for the "buts". I presume that there are still two basic types of media errors on HDs, "hard" and "soft". Hard errors are those that are not recoverable through the normal hardware crc check process (or whatever they use these days). Soft errors are errors that are recoverable via the normal hardware crc check process.
Hard errors are always reported to the OS, soft errors are not, IIRC. So you could have recovered media failures that do not get reported to the OS. IF these failures are early indicators of deteriorating media you will not be notified of them.
For this reason, hardware-specific diagnostic software is "better". Further, the "smart" capabilities are *really* hardware specific and will detect and report things that normal read/write activities, like dd, cannot.
As to running on a live system, you might not want to for several reasons. If you are using the system to do anything useful at the time, there will be a big hit on responsiveness. Unlike the real original UNIX, Linux still does not have preemptive scheduling (somebody please correct me if I missed this potentially earth-shattering advancement - last I heard, earliest was to be the 2.7 kernel, presuming no slippage).
Because dd is fast, it will consume all I/O capability, especially the way you propose running it. Further, you will be causing a *LARGE* number of system calls, further degrading system responsiveness. It could be so slow to respond that one might think the system is "frozen".
If you insist on doing this, I would suggest something like
nice -n <your priority here> dd if=/dev/xxxx of=/dev/xxxx bs=16384 &
"Man nice" for details. This helps a little bit. I've not tried to see how much responsiveness can be "recovered". A larger "bs=" will reduce system calls, but will increase buffer sizes and usage and increase I/O load. Even if you omit the trailing "&" to run in foreground, the responsiveness may be so slow that a <CTL>-<C> may appear to fail and make you think the system is "frozen"... for a little while.
The larger "bs=" would seem to negate what you want with the "bs=512". Not so. Since the detection of failures happens on the hardware, it will still detect failures and handle them as it normally would. The "bs=" is only a blocking factor. Your "512" only saves doing math to figure out what the "sector" really is. But it has a large cost. BTW, you don't really know what the sector size is these days. It may not be 512. Back in the old days, sector size was selectable via jumpers. Today I suspect the drives don't have sectors in the same way/size as they used to.
Closing (really, they are!) arguments: 1. Any OS-level test, rather than a hardware-specific one, will be less rigorous. This is "optimal" only if other factors trump reliability. Usually "convenience" and "portability" will not trump reliability for server or critical platforms.
2. The "smart" feature has capabilities of which you may not be aware. One of these is to run in such a way as to minimize performance impact on a live system. If you've run "makewhatis", then "man -k smart" or "apropos smart" will get you started on the reading you may want to do.
3. Hardware-specific diagnostics and repair utilities from the manufacturer (this includes the "smart" capability of the drives) will be more rigorous and reliable than general-purpose utilities.
4. The manufacturer utilities can "repair" media failures as they are detected. If you are taking the time to run diagnostics, why not fix failures at the same time? If you believe that the "dd" way can accomplish the same thing (through the alternate block assignment process), why not grab a drive with known bad sectors and run a test to see if it will be satisfactory to you?
Lorenzo
<snip sig stuff>
On Mon, Aug 25, 2008 at 10:43:01AM +0200, Lorenzo Quatrini wrote:
<snip>
So again my question is: can I use dd to "test" the disk? what about
dd if=/dev/sda of=/dev/sda bs=512
Is this safe on a fully running system? Does it have to be done at runlevel 1 or with a live CD? I think this is "better" than the manufacturer's way, as dd is always present and works with any brand.
It is not safe on a mounted filesystem or devices with mounted filesystems.
File system code on a partition will have no coherency interaction with the entire raw device.
See the -f flag in the "badblocks" man page: "-f  Normally, badblocks will refuse to do a read/write or a non-destructive test on a device which is mounted, since either can cause the system to potentially crash and/or damage the filesystem even if ....."
It is also not 100% clear to me that the kernel buffer code will not see a paired set of "dd" commands as a no-op and skip the write.
Vendor tools on an unmounted disk operate at a raw level and also have access to the vendor-specific embedded controller commands, bypassing buffering and directly interacting with error codes, retry counts and more.
In normal operation, the best opportunity to spare a sector or track is on a write. At that time the OS and disk both have known good data, so a read after write can detect the defect/error and take the necessary action without loss of data. Some disks have read heads that follow the write heads to this end. Other disks require an additional revolution.
When "mke2fs -c -c" is invoked, the second -c flag is important because the paired read/write can let the firmware on the disk map detected defects to spares. With a single "-c" flag, the Linux filesystem code can only assign the error blocks to non-files. A system admin who does a dd read of a problem disk may find that the OS hurls on the errors and takes the device offline; i.e. the command "dd if=/dev/sda of=/dev/sda bs=512" might not do what is expected, because the first read can take the device offline, negating the follow-up write intended to fix things.
The tool "hdparm: is rich in info -- some flags are dangerous.
Bottom line... use vendor tools.... Vendors like error reports from their tools for RMA processing and warranty...
BTW: smartd is a good thing. For me any disk that smartd had made noise about has failed... often with weeks or months of warning...
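For anyone who wants to set that up, a typical /etc/smartd.conf entry looks something like this (device, schedule and mail address are just examples):

  /dev/sda -a -m root -s (S/../.././02|L/../../6/03)

Here -a monitors all attributes, -m mails warnings to root, and -s schedules a short self-test daily at 02:00 plus a long one every Saturday at 03:00; then make sure the smartd service is enabled (chkconfig smartd on; service smartd start).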
Nifty Cluster Mitch wrote:
Bottom line... use vendor tools.... Vendors like error reports from their tools for RMA processing and warranty...
BTW: smartd is a good thing. For me any disk that smartd had made noise about has failed... often with weeks or months of warning...
So... ok, I see the point: I should monitor for SMART errors and then use vendor tools to fix things...
(BTW, the PC which triggered the thread reallocated the sector by itself: I guess the OS finally tried to write to the bad sector and the disk did all the magic relocation stuff.)
Also, I finally noticed that badblocks has a non-destructive read-write mode (the man page is outdated and doesn't mention it) which can be used routinely (say, once a month) to force a check of the whole disk.
Thanks to all for the explanation
Regards
Lorenzo Quatrini
On Tue, 2008-08-26 at 10:38 +0200, Lorenzo Quatrini wrote:
<snip>
Also, I finally noticed that badblocks has a non-destructive read-write mode (the man page is outdated and doesn't mention it) which can be used routinely (say, once a month) to force a check of the whole disk.
From "man badblocks":
-n  Use non-destructive read-write mode. By default only a non-destructive read-only test is done. This option must not be combined with the -w option, as they are mutually exclusive.
Note the phrase beginning with "By default only...". I'll admit it could be more clearly stated.
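So for the routine monthly check mentioned above, with the filesystem unmounted, something like either of these should do (the device name is only an example):

  badblocks -n -s /dev/sda1   # non-destructive read-write scan
  e2fsck -c -c /dev/sda1      # same scan via e2fsck, recording any hits in the bad block inode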
Thanks to all for the explanation
Regards
Lorenzo Quatrini
<snip>
William L. Maltby wrote:
From "man badblocks":
-n  Use non-destructive read-write mode. By default only a non-destructive read-only test is done. This option must not be combined with the -w option, as they are mutually exclusive.
Note the phrase beginning with "By default only...". I'll admit it could be more clearly stated.
The Italian translation of the man page is outdated... I guess I should stick with the original versions of the man pages, or at least remember to check them.
Lorenzo
On Tue, Aug 26, 2008 at 04:02:22PM +0200, Lorenzo Quatrini wrote:
William L. Maltby wrote:
From "man badblocks":
-n  Use non-destructive read-write mode. By default only a non-destructive read-only test is done. This option must not be combined with the -w option, as they are mutually exclusive.
Note the phrase beginning with "By default only...". I'll admit it could be more clearly stated.
The Italian translation of the man page is outdated... I guess I should stick with the original versions of the man pages, or at least remember to check them.
Consider filing a bug -- one goal for the user community is to turn the old phrase RTFM into "Read The Fine Manual"... in contrast to the historic profanity.
You can file it against either the English, the Italian translation or both.
As an alternative, you can post a difference file to a list like this for discussion and ask ONE person to help you file the bug.
Translations are commonly not done by the maintainer, so a bug report can be the best path. If you need help with the mechanics of filing a bug, ask...