smartd and smartctl

m.roth＠5-cent.us

17 Feb 2012

7:52 p.m.

A few weeks ago, one of my servers started complaining, via smartd, that one drive had one unreadable sector. I umounted it, and ran an fsck -c, then remounted it. Error didn't go away. Now, what's really annoying is that I've gotten back to it today, and it's reporting the problem, as it has for weeks now, every half an hour.

However, when I run

smartctl -q errorsonly -H -l selftest -l error /dev/sdb

it gives me *nothing*. Anyone understand why I get two different results?

mark "and I am waiting for the smartctl -t long /dev/sdb to complete"

Mike Burger

17 Feb 2012

8:16 p.m.

The SMART system works at the hardware level, reading diagnostic information from the SMART circuitry on the hard drives themselves. Modern drives will often try to move the data from bad sectors on the platters to good sectors, and then mark the bad sectors so that they won't be used later.

Running fsck only works at the logical filesystem layer. The fsck tool has no hooks to deal with the physical layer.

-- Mike Burger http://www.bubbanfriends.org Visit the Dog Pound II BBS telnet://dogpound2.citadel.org http://dogpound2.citadel.org https://dogpound2.citadel.org To be notified of updates to the web site, visit: https://www.bubbanfriends.org/mailman/listinfo/site-update or send a blank email to: site-update-subscribe@bubbanfriends.org

m.roth＠5-cent.us

8:25 p.m.

Mike Burger wrote:

> The SMART system works at the hardware level, reading diagnostic information from the SMART circuitry on the hard drives themselves. Modern drives will often try to move the data from bad sectors on the platters to good sectors, and then mark the bad sectors so that they won't be used later.
>
> Running fsck only works at the logical filesystem layer. The fsck tool has no hooks to deal with the physical layer.

Ok, but my thinking was, first, that after the fsck, the system wouldn't try to write to the bad sector, thus not provoking smart. The more annoying thing is that I don't understand why smartctl doesn't give the same info as smartd. When I do a -a, it does tell me that one sector's pending, but not that there's any error.

mark

Mike VanHorn

8:34 p.m.

FWIW, on some of my workstations, when I have gotten the "sector pending" messages, I have been able to take the drive out and run the manufacturer's diagnostics on it (in my case, Seatools), and that fixed some things and I haven't had any issues since.

--- Mike VanHorn Senior Computer Systems Administrator College of Engineering and Computer Science Wright State University 265 Russ Engineering Center 937-775-5157 michael.vanhorn@wright.edu http://www.engineering.wright.edu/~mvanhorn/

m.roth＠5-cent.us

10:04 p.m.

Mike VanHorn wrote:

> FWIW, on some of my workstations, when I have gotten the "sector pending" messages, I have been able to take the drive out and run the manufacturer's diagnostics on it (in my case, Seatools), and that fixed some things and I haven't had any issues since.

Well, since the server has users on it, I can't really do that, or wipe the disk.... I'm not really worried - it's stayed at 1 sector. If that starts growing, then I'll worry, and get ready to replace the disk. Right now, it's just an annoyance, as I said, that it shows up on email logs from our loghost twice every hour.

And I'm still waiting for anyone to explain to me what I'm doing using smartctl that results in it *not* telling me there's an error, or where the error is. In fact, the last long test I started, early this afternoon, seems to be done, and with smartctl -a, I see

    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Extended offline    Completed without error  00%        2536             -
    # 2  Short offline       Completed without error  00%        2529             -

So I'm befuddled why it won't tell me anything about this pending error.

mark

Yves Bellefeuille

11:40 p.m.

On Friday 17 February 2012, m.roth@5-cent.us wrote:

> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Extended offline    Completed without error  00%        2536             -
> # 2  Short offline       Completed without error  00%        2529             -
>
> So I'm befuddled why it won't tell me anything about this pending error.

I've had a hard drive with *many* bad sectors, and still smartctl reported "Completed without error". I believe this means that there are still spare sectors available.

I suggest "badblocks -n", as I said in another message. This may also detect more bad sectors. What does "5 Reallocated_Sector_Ct" say?

-- Yves Bellefeuille yan@storm.ca "La Esperanta Civito ne rifuzas anticipe la kunlaboron de erarintoj, se ili konscias pri sia eraro." -- Heroldo Komunikas, n-ro 473.
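One way to answer the Reallocated_Sector_Ct question is to look at the attribute table (`smartctl -A`) rather than the self-test and error logs. That table is where a pending-sector count normally appears, which would fit `-a` showing the pending sector while `-l selftest -l error` prints nothing. The helper below is a minimal sketch: the name `sector_attrs` is made up for illustration, and it assumes the usual ATA attribute layout in which the raw value is the tenth column.

```shell
# Hypothetical helper: filter `smartctl -A` output down to the three
# sector-health attributes (IDs 5, 197, 198) and print NAME=RAW_VALUE.
# Assumes the standard 10-column ATA attribute table.
sector_attrs() {
    awk '$1 == 5 || $1 == 197 || $1 == 198 { print $2 "=" $10 }'
}

# Usage (as root, against the drive from this thread):
#   smartctl -A /dev/sdb | sector_attrs
```

A non-zero Current_Pending_Sector alongside a zero Reallocated_Sector_Ct would suggest the drive has flagged the sector but not yet remapped it.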

Andrzej Szymański

8:36 p.m.

On 2012-02-17 21:25, m.roth@5-cent.us wrote:

> Ok, but my thinking was, first, that after the fsck, the system wouldn't try to write to the bad sector, thus not provoking smart. The more annoying thing is that I don't understand why smartctl doesn't give the same info as smartd. When I do a -a, it does tell me that one sector's pending, but not that there's any error.

Actually, smartd is reporting THIS pending sector, and it probably won't stop until the sector is reallocated, which will happen on the next write to it.

As the location and contents of this sector are quite hard to find, the simplest, but the most troublesome way of solving the problem is moving all data away from this disk, writing the whole surface with zeros (dd) and moving the data back.

However, I would carefully monitor the number of reallocated sectors on this disk. If it grows steadily, you had better move your valuable data elsewhere.

Andrzej
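If the immediate goal is only to quiet the half-hourly reports while the count stays at 1, the smartd.conf man page in newer smartmontools releases documents a `+` suffix for the `-C` and `-U` directives, which limits reports to cases where the count has increased. The fragment below is a sketch under assumptions: it presumes the installed smartd supports the suffix (check `man smartd.conf` first) and that the drive is monitored with `-a`. The entry is illustrative, not the poster's actual configuration.

```
# /etc/smartd.conf (hypothetical entry for the drive in this thread)
# -a       default monitoring set, which already includes -C 197 -U 198
# -C 197+  report Current_Pending_Sector only when the count increases
# -U 198+  same for Offline_Uncorrectable
/dev/sdb -a -C 197+ -U 198+
```

smartd has to reread its configuration (a restart, or SIGHUP) before the change takes effect. This hides the known sector from the reports; it does not reallocate it.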

Yves Bellefeuille

11:27 p.m.

On Friday 17 February 2012, Andrzej Szymański szymans@agh.edu.pl wrote:

> As the location and contents of this sector are quite hard to find, the simplest, but the most troublesome way of solving the problem is moving all data away from this disk, writing the whole surface with zeros (dd) and moving the data back.

badblocks -n would also work, I imagine.

-- Yves Bellefeuille yan@storm.ca

discuss@lists.centos.org
