Hi All.
Yesterday I was installing CentOS 5.5 on my web server, and it looks like the main hard drive has gone AWOL.
Fedora 12 put the file system into r/o mode.
The drive is a Hitachi, still under warranty.
There are bad sectors on it, and running the Hitachi DFT tool confirms this. Also, I cannot repair the bad sectors.
Would this be caused by a faulty I/O chip, or is it safe to say it's definitely the HDD at fault?
Kind Regards,
Keith Roberts
On Sun, 31 Oct 2010, William Warren wrote:
To: CentOS mailing list centos@centos.org From: William Warren hescominsoon@emmanuelcomputerconsulting.com Subject: Re: [CentOS] PATA Hard Drive woes
hdd at fault
OK - thanks for confirming that Bill.
I'll remove it and take it back for replacement.
Keith
On Sun, 31 Oct 2010, Keith Roberts wrote:
To: CentOS mailing list centos@centos.org From: Keith Roberts keith@karsites.net Subject: Re: [CentOS] PATA Hard Drive woes
There were about 79 Seek errors in the SMART logs of the HDD.
I moved the drive from the Primary Master cable to the Secondary Master cable and ran Hitachi's DFT tool to do a complete disk erase, which terminated with errors.
So to prepare the disk for returning under warranty, I used another HDD utility to clean the disk again, still on the Secondary Master cable.
I used vivard 0.4, from the HDD utils section of the Ultimate Boot CD (http://www.ultimatebootcd.com/index.html), to do a complete disk erase.
vivard did not show any errors when doing a full disk erase.
So I ran an Advanced r/w scan again with Hitachi DFT, and the result was OK.
Any ideas what's happening please?
Is this disk usable, or is it still in need of replacing?
Kind Regards,
Keith Roberts
On Wed, 3 Nov 2010, Todd Denniston wrote:
To: CentOS mailing list centos@centos.org From: Todd Denniston Todd.Denniston@tsb.cranrdte.navy.mil Subject: Re: [CentOS] PATA Hard Drive woes
Keith Roberts wrote, On 11/03/2010 10:32 AM:
On Sun, 31 Oct 2010, Keith Roberts wrote:
<SNIP>
There were about 79 Seek errors in the SMART logs of the HDD.
<SNIP>
vivard did not show any errors when doing a full disk erase.
So I ran an Advanced r/w scan again with Hitachi DFT, and the result was OK.
Any ideas what's happening please?
WFG: In writing it all, the seek motor knocked the dust out of its way? (What dust?) How about checking all the SMART attributes and seeing if others are elevated? http://en.wikipedia.org/wiki/S.M.A.R.T.#Known_ATA_S.M.A.R.T._attributes
Are you seeing any block "remap" activity? http://en.wikipedia.org/wiki/Hard_disk_drive#Error_handling
Is this disk usable, or is it still in need of replacing?
http://en.wikipedia.org/wiki/S.M.A.R.T.#Background You have gotten SMART errors from this drive already, so: You have to ask yourself, 'Do you feel lucky?', Well do y'a...
And the other question: If this drive up and dies shortly and I knew about the smart errors, will the data owner complain more or less to me about the drive death later or drive replacement hassle now?
Only YOU (and the data owner) know the risk trade-off levels you have to consider.
Thanks Todd for the reply.
There were no sectors remapped, which is odd as there were bad sectors originally on the drive. I ran MemTest86+ out of curiosity, and there are 5120 errors, some at 0.4 MB and 0.5 MB.
The BIOS has been playing up, not recognising the Primary Master drive. This is the channel the Hitachi disk was on when it developed the sector read errors.
Could a bad controller or bad RAM cause Hard Drive sector errors?
The drive is as good as uninstalled, so I may as well send it for replacement.
Regards,
Keith
NB: The box is down now, and I'll try and test and identify the bad memory module next.
On Wed, 3 Nov 2010, RedShift wrote:
To: CentOS mailing list centos@centos.org From: RedShift redshift@pandora.be Subject: Re: [CentOS] PATA Hard Drive woes
On 11/03/10 17:01, Keith Roberts wrote:
There were no sectors remapped, which is odd as there were bad sectors originally on the drive. I ran MemTest86+ out of curiosity, and there are 5120 errors, some at 0.4 MB and 0.5 MB.
You should fix that first.
Working on that one now ;)
The BIOS has been playing up, not recognising the Primary Master drive. This is the channel the Hitachi disk was on when it developed the sector read errors.
Could a bad controller or bad RAM cause Hard Drive sector errors?
Neither bad RAM nor a bad controller can physically damage a hard drive. A bad controller will not cause reallocated sectors. It can, however, cause UDMA CRC errors and other weird non-SMART-related behaviour.
The drive is as good as uninstalled, so I may as well send it for replacement.
Send the output of smartctl -a /dev/yourdisk, that'll give us more factual data than speculation.
Will do as soon as the memory checks are done, and the machine is up again.
Keith
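For anyone following along: once the smartctl output is available, the attribute table can be scanned mechanically rather than by eye. Below is a minimal Python sketch, assuming the usual ten-column attribute layout (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE). The function names and the choice of attributes to watch are illustrative assumptions, not part of smartmontools.

```python
import re

# Attribute IDs often watched for surface or cabling trouble. This list is
# an assumption for illustration; which attributes a drive reports, and how
# the raw value should be read (notably Seek_Error_Rate), varies by vendor.
WATCHED = {5: "Reallocated_Sector_Ct", 7: "Seek_Error_Rate",
           197: "Current_Pending_Sector", 198: "Offline_Uncorrectable",
           199: "UDMA_CRC_Error_Count"}

# One attribute row: ID, name, flag, value, worst, thresh, type, updated,
# when_failed, raw value (the raw value may contain spaces).
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(0x[0-9a-fA-F]+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(.*)$")


def parse_smart_attributes(text):
    """Return {attribute_id: fields} for each attribute row in the text."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header lines and other sections are skipped
        raw = m.group(10).strip()
        first = raw.split()[0] if raw else ""
        attrs[int(m.group(1))] = {
            "name": m.group(2),
            "value": int(m.group(4)),
            "worst": int(m.group(5)),
            "thresh": int(m.group(6)),
            "type": m.group(7),
            # Keep the leading integer when there is one, else the raw string
            "raw": int(first) if first.isdigit() else raw,
        }
    return attrs


def watched_nonzero(attrs):
    """List (id, name, raw) for watched attributes with a raw count above 0."""
    return [(i, a["name"], a["raw"]) for i, a in sorted(attrs.items())
            if i in WATCHED and isinstance(a["raw"], int) and a["raw"] > 0]
```

The text to feed in would be whatever `smartctl -A` (or `smartctl -a`) prints for the disk in question. A non-zero raw value is only a prompt to look closer, not a verdict on the drive.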
On Wed, 3 Nov 2010, Lamar Owen wrote:
To: CentOS mailing list centos@centos.org From: Lamar Owen lowen@pari.edu Subject: Re: [CentOS] PATA Hard Drive woes
On Wednesday, November 03, 2010 02:51:02 pm RedShift wrote:
On 11/03/10 17:01, Keith Roberts wrote:
Could a bad controller or bad RAM cause Hard Drive sector errors?
Neither bad RAM nor a bad controller can physically damage a hard drive. A bad controller will not cause reallocated sectors. It can, however, cause UDMA CRC errors and other weird non-SMART-related behaviour.
Might want to check the power supply as well. Bad/flaky power can indeed cause damage to the drive surface; been there, done that, have two Maxtor 250GB drives with scribbled servo data to prove it.
OK.
I'm running the server from an APC UPS Back-UPS 650, so there should not be any glitches in the power supply, should there?
Keith
On 11/03/10 3:13 PM, Keith Roberts wrote:
I'm running the server from an APC UPS Back-UPS 650, so there should not be any glitches in the power supply, should there?
That's a simple standby kind of UPS: it acts like a 'surge protector' when the AC is on, and only switches to the battery-powered inverter when the AC is completely off.
Well, I'm getting there slowly but surely.
This home-built server machine is using hard drive caddies.
I've taken my working backup drive from the caddy (secondary master), and replaced it with a small GB test drive.
The problem was originally with the drive connected to the onboard IDE primary channel being intermittently autodetected at boot time.
I have now swopped the IDE ribbon cables, so the cable that was connected to the primary IDE channel is now plugged into the secondary channel onboard IDE socket, and vice versa for the secondary ribbon cable.
Now when I reboot the machine, the problem of drives not being detected appears on the secondary channel, and the ATA drive and CD/DVD-ROM drive are detected OK on the primary channel.
I have also replaced the IDE ribbon cable for the channel that was originally connected as primary.
So it appears the onboard IDE controller is working OK, and the problem appears to be from the IDE ribbon cable, to one of the HDD caddies.
Any suggs please?
Kind Regards,
Keith Roberts
On 11/20/10 4:12 PM, Keith Roberts wrote:
So it appears the onboard IDE controller is working OK, and the problem appears to be from the IDE ribbon cable, to one of the HDD caddies.
Any suggs please?
Errr, if you have established that you have a bad cable, isn't the obvious solution to replace it? Be sure it is an 80-wire cable and connected correctly (they are usually keyed, but not always).
On 11/03/2010 03:13 PM, Keith Roberts wrote:
On Wed, 3 Nov 2010, Lamar Owen wrote:
Might want to check the power supply as well. Bad/flaky power can indeed cause damage to the drive surface; been there, done that, have two Maxtor 250GB drives with scribbled servo data to prove it.
OK.
I'm running the server from an APC UPS Back-UPS 650, so there should not be any glitches in the power supply, should there?
Lamar was probably talking about the machine's *own* power supply. The one inside the computer case. When they start to fail they can produce incorrect DC voltages and then you can get all kinds of weird failures.
At Wed, 3 Nov 2010 22:13:03 +0000 (GMT) CentOS mailing list centos@centos.org wrote:
OK.
I'm running the server from an APC UPS Back-UPS 650, so there should not be any glitches in the power supply, should there?
Unless the power supply itself is failing.
Keith
On 11/3/2010 8:32 AM, Keith Roberts wrote:
So to prepare the disk for returning under warranty, I used another HDD utility to clean the disk again
...
So I ran an Advanced r/w scan again with Hitachi DFT, and the result was OK.
A complete disk wipe brings bad sectors to the drive's attention, forcing it to remap them using spare sectors set aside for the purpose.
All drives can do this, and they do it without logging the change. You can't tell, from the outside, when or whether the drive has done this. All you can do is infer it, because a sector that once tested bad now tests good.
As to why this happened only during a format, not during the previous disk test, it's probably because the format zeroed the disk. That particular drive may have a policy to only remap sectors on write, so as to preserve the sector contents for potential recovery later. (See below for one way this can be done.)
It may be that your drive is now fine.
If you put it back into service, at minimum I would set up smartd, from the smartmontools package. Maybe run smartctl on it by hand daily or weekly, too. If you find that errors start happening again, there is something continually degrading the drive's integrity, so the automatic sector remapping will eventually run the drive out of spare sectors.
SpinRite (http://spinrite.com/) does nondestructive sector remapping. At level 4 and above, it reads each sector in and then writes it back out to the drive. Because remapping is silent, it's possible for it to appear to do nothing, yet improve data integrity by bringing dodgy sectors to the drive's attention.
If a sector can't be read without error, SpinRite forces the drive to ignore the CRC and return the data anyway, retrying many times, then making a statistical guess about the most likely contents of the sector. (Reading a bad sector won't necessarily give the same value each try.) Then on writing the reconstructed data back out, the drive automatically remaps the sector, repairing it.
You might want to combine the SMART monitoring with periodic SpinRite runs on the drive until you regain confidence in it.
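To make the "statistical guess" idea concrete, here is a small Python sketch of one possible approach: a per-byte majority vote across several attempts to read the same sector. This is an illustration only and an assumption on my part; SpinRite is closed source, and nothing here claims to be its actual method. The function name is made up for the example.

```python
from collections import Counter


def reconstruct_sector(reads):
    """Guess a sector's contents by per-byte majority vote over repeated reads.

    `reads` is a list of equal-length bytes objects, one per read attempt.
    """
    if not reads:
        raise ValueError("need at least one read")
    size = len(reads[0])
    if any(len(r) != size for r in reads):
        raise ValueError("all reads must have the same length")
    out = bytearray(size)
    for i in range(size):
        # Pick the byte value that showed up most often at offset i
        out[i] = Counter(r[i] for r in reads).most_common(1)[0][0]
    return bytes(out)
```

For example, `reconstruct_sector([b"hellp", b"hello", b"jello", b"hello"])` yields `b"hello"`, because each disagreeing byte is outvoted by the other three reads. Writing the reconstructed data back is the step that gives the drive its chance to remap the sector.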
Warren Young wrote:
On 11/3/2010 8:32 AM, Keith Roberts wrote:
So to prepare the disk for returning under warranty, I used another HDD utility to clean the disk again
...
So I ran an Advanced r/w scan again with Hitachi DFT, and the result was OK.
A complete disk wipe brings bad sectors to the drive's attention, forcing it to remap them using spare sectors set aside for the purpose.
<snip>
If you put it back into service, at minimum I would set up smartd, from the smartmontools package. Maybe run smartctl on it by hand daily or weekly, too. If you find that errors start happening again, there is something continually degrading the drive's integrity, so the automatic sector remapping will eventually run the drive out of spare sectors.
<snip>
Yeah, but I have problems with smartmon: for example, I've got a drive in one server that's got two bad sectors, which SMART reports. I've followed the instructions on how to make the log messages go away, and fsck -c... but on reboot, SMART seems to ignore what badblocks found, and the irritating messages are back.
mark
On 11/3/2010 11:27 AM, m.roth@5-cent.us wrote:
Yeah, but I have problems with smartmon:
More likely, problems with SMART. S.M.A.R.T. is D.U.M.B. :)
It's better than nothing, but sometimes not by a whole lot.
one server that's got two bad sectors, which SMART reports. I've followed the instructions on how to make the log messages go away, and fsck -c... but on reboot, SMART seems to ignore what badblocks found, and the irritating messages are back.
It may be that SpinRite could fix that by forcing a remap.
Another option -- which I didn't mention because it probably isn't an option for the original poster, but which may work with your servers -- is that some high-end RAID systems can do something like SpinRite at level 4+, as can ZFS. They call it resilvering. I don't think these systems do statistical reconstruction, but periodic read-then-rewrite can stave off the need to reconstruct.
Warren Young wrote:
On 11/3/2010 11:27 AM, m.roth@5-cent.us wrote:
Yeah, but I have problems with smartmon:
More likely, problems with SMART. S.M.A.R.T. is D.U.M.B. :)
It's better than nothing, but sometimes not by a whole lot.
one server that's got two bad sectors, which SMART reports. I've followed the instructions on how to make the log messages go away, and fsck -c... but on reboot, SMART seems to ignore what badblocks found, and the irritating messages are back.
It may be that SpinRite could fix that by forcing a remap.
Dunno if we have SpinRite around here.
Another option -- which I didn't mention because it probably isn't an option for the original poster, but which may work with your servers -- is that some high-end RAID systems can do something like SpinRite at level 4+, as can ZFS. They call it resilvering. I don't think these
No joy - it's a plain SATA drive, the root drive on a server we use for backups. ext3, and no, I'm not going to change filesystem types.... The real thing is why does SMART ignore the results of badblocks (for those who aren't sure, that's invoked when you do fsck -c), and for that matter, why the drive (Seagate ST3170811AS) doesn't automagically relocate those blocks.
mark
m.roth@5-cent.us wrote:
No joy - it's a plain SATA drive, the root drive on a server we use for backups. ext3, and no, I'm not going to change filesystem types.... The real thing is why does SMART ignore the results of badblocks (for those who aren't sure, that's invoked when you do fsck -c), and for that matter, why the drive (Seagate ST3170811AS) doesn't automagically relocate those blocks.
AFAIK SMART doesn't know or care about filesystems; it's at a lower level than that. fsck -c is a read-only scan: bad blocks are then added to the bad block inode (which SMART knows nothing about), and this might not be enough for the disk to hide the blocks (which should satisfy SMART). Maybe try fsck -cc for a non-destructive read-write test.
Warren Young wrote:
On 11/3/2010 12:22 PM, Nicolas Thierry-Mieg wrote:
Maybe try fsck -cc for a non-destructive read-write test.
Good call. That's resilvering.
Hmmm... maybe I'll try that first thing in the morning. I don't have to worry about users, since, as I said, it's an online backup server. With luck, it'll be done before I have to leave; if not, I'd have to kill it.
mark
On 11/3/2010 1:04 PM, m.roth@5-cent.us wrote:
No joy - it's a plain SATA drive, the root drive on a server we use for backups. ext3, and no, I'm not going to change filesystem types.... The real thing is why does SMART ignore the results of badblocks (for those who aren't sure, that's invoked when you do fsck -c), and for that matter, why the drive (Seagate ST3170811AS) doesn't automagically relocate those blocks.
I think the point of SMART is to be aware of the physical conditions regardless of the logical remapping. At some point you run out of places to relocate.
On Wednesday, November 03, 2010 02:25:17 pm Les Mikesell wrote:
I think the point of SMART is to be aware of the physical conditions regardless of the logical remapping. At some point you run out of places to relocate.
I had a 1.5TB SATA drive pop up an error in Fedora 13 the other day; SMART had detected a large number of bad (but remapped) sectors (53, to be exact), and Fedora 13 at least will warn you in GNOME when that is the case. Hopefully RHEL6 will include some of the tools that F13 has in palimpsest, as the ability to run the short and long self tests in an easy fashion is downright cool.
I ran the manufacturer's utility, but it gave the drive a clean bill of health, and thus I couldn't get an RMA, even though the drive is in warranty.
On 11/03/10 19:04, m.roth@5-cent.us wrote:
No joy - it's a plain SATA drive, the root drive on a server we use for backups. ext3, and no, I'm not going to change filesystem types.... The real thing is why does SMART ignore the results of badblocks (for those who aren't sure, that's invoked when you do fsck -c), and for that matter, why the drive (Seagate ST3170811AS) doesn't automagically relocate those blocks.
Auto relocation happens ONLY when writing to foobar sectors. The drive WILL NOT relocate sectors that you are reading from because it cannot trust the content.
SMART reports the number of sectors that have been reallocated. That means, the drive was writing to a sector, found out it was unreliable and decided to remap that sector. It is not abnormal for drives to develop _a few_ bad sectors over the years.
RedShift wrote:
On 11/03/10 19:04, m.roth@5-cent.us wrote:
<snip>
No joy - it's a plain SATA drive, the root drive on a server we use for backups. ext3, and no, I'm not going to change filesystem types.... The real thing is why does SMART ignore the results of badblocks (for those who aren't sure, that's invoked when you do fsck -c), and for that matter, why the drive (Seagate ST3170811AS) doesn't automagically relocate those blocks.
Auto relocation happens ONLY when writing to foobar sectors. The drive WILL NOT relocate sectors that you are reading from because it cannot trust the content.
SMART reports the number of sectors that have been reallocated. That means, the drive was writing to a sector, found out it was unreliable and decided to remap that sector. It is not abnormal for drives to develop _a few_ bad sectors over the years.
Agreed. And these two bad sectors developed many months ago, and the number is not increasing, so I'm not really worried about them; all I want is to make the irritating messages in the logfiles go, and stay, away.
mark
On 11/03/10 19:57, m.roth@5-cent.us wrote:
SMART reports the number of sectors that have been reallocated. That means, the drive was writing to a sector, found out it was unreliable and decided to remap that sector. It is not abnormal for drives to develop _a few_ bad sectors over the years.
Agreed. And these two bad sectors developed many months ago, and the number is not increasing, so I'm not really worried about them; all I want is to make the irritating messages in the logfiles go, and stay, away.
mark
smartd is supposed to do that. The number of reallocated sectors is a prefailure SMART attribute. If it goes up in a short time your disk is failing. You can use the -I option in smartd.conf to ignore certain attributes. See man smartd.
RedShift wrote:
On 11/03/10 19:57, m.roth@5-cent.us wrote:
SMART reports the number of sectors that have been reallocated. That means, the drive was writing to a sector, found out it was unreliable and decided to remap that sector. It is not abnormal for drives to develop _a few_ bad sectors over the years.
Agreed. And these two bad sectors developed many months ago, and the number is not increasing, so I'm not really worried about them; all I want is to make the irritating messages in the logfiles go, and stay, away.
smartd is supposed to do that. The number of reallocated sectors is a prefailure SMART attribute. If it goes up in a short time your disk is failing. You can use the -I option in smartd.conf to ignore certain attributes. See man smartd.
But I *want* it to tell me if more appear; I just want it to stop telling me about those two, persistently (through reboots).
mark
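The behaviour asked for here (stay quiet about known bad sectors, speak up only when the count rises, and remember that across reboots) comes down to keeping a baseline on disk. The Python sketch below illustrates that idea under stated assumptions: it is not how smartd works internally, and the function name, state-file format and arguments are all hypothetical. Whether smartd can be configured to do the same is something to check in man smartd.conf.

```python
import json
import os


def check_counter(state_path, device, attribute_id, current_raw):
    """Warn only when a raw SMART count rises above the stored baseline.

    Returns a warning string on an increase, otherwise None. The baseline
    lives in a JSON file so that it survives reboots.
    """
    state = {}
    if os.path.exists(state_path):
        with open(state_path) as fh:
            state = json.load(fh)
    key = f"{device}:{attribute_id}"
    baseline = state.get(key)
    message = None
    if baseline is None:
        # First sighting: record silently so known bad sectors stay quiet
        state[key] = current_raw
    elif current_raw > baseline:
        message = (f"{device}: attribute {attribute_id} rose from "
                   f"{baseline} to {current_raw}")
        state[key] = current_raw
    with open(state_path, "w") as fh:
        json.dump(state, fh)
    return message
```

Called from a daily cron job with the current reallocated-sector count, this would say nothing about the two existing bad sectors but would report a third.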
On Wed, 3 Nov 2010, RedShift wrote:
To: CentOS mailing list centos@centos.org From: RedShift redshift@pandora.be Subject: Re: [CentOS] was, PATA Hard Drive woes, is "SMART"
Auto relocation happens ONLY when writing to foobar sectors. The drive WILL NOT relocate sectors that you are reading from because it cannot trust the content.
SMART reports the number of sectors that have been reallocated. That means, the drive was writing to a sector, found out it was unreliable and decided to remap that sector. It is not abnormal for drives to develop _a few_ bad sectors over the years.
Is there _any_ way to tell how many reserved sectors have been used for remapping, and how many are still available? It would give some idea of how long to carry on using a disk before it runs out of spare sectors.
Regards,
Keith
Re: Original post. I've removed the 3 memory modules, and am now checking each DDR2 DIMM in turn, (and the DIMM slots as well) for any errors.
On Wed, 3 Nov 2010, Warren Young wrote:
To: CentOS mailing list centos@centos.org From: Warren Young warren@etr-usa.com Subject: Re: [CentOS] PATA Hard Drive woes
<snip>
Thanks Warren. I've read good reports about SpinRite.
I might shell out some dosh for a copy if it can non-destructively repair bad sectors. I heard it's worth running just to keep your HDDs in shape.
Regards,
Keith
On 11/3/2010 4:18 PM, Keith Roberts wrote:
I might shell out some dosh for a copy if it can non-destructively repair bad sectors.
Try fsck -cc first (or badblocks -n). These do part of what SR does already, so if they work, that's all you need. Step up only when you need something that tries harder. :)
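For readers wondering what a non-destructive read-write test involves, the general idea can be sketched in a few lines of Python: save each block, write a test pattern, read it back, then restore the saved data. This is a simplified illustration run against an ordinary file-like object, not the badblocks implementation; the function name, block size and patterns are assumptions chosen for the example.

```python
import io


def nondestructive_rw_test(dev, block_size=512, patterns=(0xAA, 0x55)):
    """Read-write test that restores the original data afterwards.

    `dev` is a seekable binary file-like object opened for read and write.
    Returns the indexes of blocks where a test pattern did not read back.
    """
    bad = []
    dev.seek(0, io.SEEK_END)
    total = dev.tell() // block_size
    for n in range(total):
        offset = n * block_size
        dev.seek(offset)
        original = dev.read(block_size)  # save the current contents
        ok = True
        for p in patterns:
            test = bytes([p]) * block_size
            dev.seek(offset)
            dev.write(test)
            dev.flush()
            dev.seek(offset)
            if dev.read(block_size) != test:
                ok = False
                break
        # Always put the saved contents back, even if verification failed
        dev.seek(offset)
        dev.write(original)
        dev.flush()
        if not ok:
            bad.append(n)
    return bad
```

It is the write step that matters for this thread, since writing is what gives the drive the opportunity to remap a weak sector. A real tool also has to bypass the operating system's cache so that the read-back actually comes from the disk, which this sketch does not attempt, so it should not be pointed at a real device.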