On Thu, March 9, 2017 09:46, John Hodrien wrote:
On Thu, 9 Mar 2017, James B. Byrne wrote:
This indicated that a bad sector on the underlying disk system might be the source of the problem. The guests were all shut down, a /forcefsck file was created on the host system, and the host system was remotely restarted.
fsck's not good at finding disk errors, it finds filesystem errors.
If not fsck then what?
If it was a real disk issue, you'd expect matching errors in the host logs.
Yes, there are:
Mar 9 09:14:13 vhost03 kernel: end_request: I/O error, dev sda, sector 1236929063
Mar 9 09:14:30 vhost03 kernel: end_request: I/O error, dev sda, sector 1236929063
Mar 9 09:14:48 vhost03 kernel: end_request: I/O error, dev sda, sector 1236929063
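To see at a glance whether errors like these keep hitting the same sector or are spreading, the log can be reduced to one line per device and sector. This is only a rough sketch: the function name io_error_summary is made up for illustration, and it assumes the kernel messages have the exact "end_request: I/O error, dev X, sector N" shape shown above, one message per line.

```shell
# Hypothetical helper: reads syslog-style text on stdin and prints
# "device sector count" for every distinct failing sector.
# Assumes the message format quoted in this thread.
io_error_summary() {
    sed -n 's/.*end_request: I\/O error, dev \([^,]*\), sector \([0-9]*\).*/\1 \2/p' |
        sort |
        uniq -c |
        awk '{ print $2, $3, $1 }'
}

# Possible usage on the host (the log path is the one mentioned in this thread):
#   io_error_summary < /var/log/messages
```

A single line of output means one sector is being retried; several distinct sectors would suggest the damage is growing.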
I am running an extended SMART test on the drive at the moment. I suspect that the drive is probably at its EOL for practical purposes. So likely we will be looking at an equipment upgrade given the age of the rest of the equipment.
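For reference, an extended self-test of this kind is commonly driven with smartctl from smartmontools; the thread does not say which tool was used, so that is an assumption. The sketch below lists the usual commands as comments and adds a small filter, with the made-up name smart_sector_counts, that pulls the sector-related counters out of the attribute table. The attribute names are the ones many ATA drives report; other drives may name them differently or omit them.

```shell
# Assumed workflow (needs root; /dev/sda is the device named in the log):
#   smartctl -t long /dev/sda        # start the extended self-test
#   smartctl -l selftest /dev/sda    # read the self-test log afterwards
#   smartctl -A /dev/sda | smart_sector_counts

# Hypothetical helper: reads "smartctl -A" output on stdin and prints the
# attribute name and raw value (10th column) for the sector-health counters.
smart_sector_counts() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ { print $2, $10 }'
}
```

Non-zero pending or reallocated counts that keep rising between runs would support the suspicion that the drive is near end of life.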
In the meantime what steps, if any, should I take to remediate this problem?
/var/log/messages:Mar 9 08:34:48 vhost03 kernel: EXT4-fs (dm-6): warning: maximal mount count reached, running e2fsck is recommended
Unmount it and run fsck on it, and that message would go away. But I'd not worry about that one.
jh
On Mar 10, 2017, at 6:32 AM, James B. Byrne byrnejb@harte-lyne.ca wrote:
On Thu, March 9, 2017 09:46, John Hodrien wrote:
fsck's not good at finding disk errors, it finds filesystem errors.
If not fsck then what?
badblocks(8).
On Fri, March 10, 2017 9:52 am, Warren Young wrote:
On Mar 10, 2017, at 6:32 AM, James B. Byrne byrnejb@harte-lyne.ca wrote:
On Thu, March 9, 2017 09:46, John Hodrien wrote:
fsck's not good at finding disk errors, it finds filesystem errors.
If not fsck then what?
badblocks(8).
And I definitely will unmount relevant filesystem(s) before using badblocks...
CentOS mailing list CentOS@centos.org https://lists.centos.org/mailman/listinfo/centos
++++++++++++++++++++++++++++++++++++++++
Valeri Galtsev
Sr System Administrator
Department of Astronomy and Astrophysics
Kavli Institute for Cosmological Physics
University of Chicago
Phone: 773-702-4247
++++++++++++++++++++++++++++++++++++++++
On Mar 10, 2017, at 9:28 AM, Valeri Galtsev galtsev@kicp.uchicago.edu wrote:
On Fri, March 10, 2017 9:52 am, Warren Young wrote:
On Mar 10, 2017, at 6:32 AM, James B. Byrne byrnejb@harte-lyne.ca wrote:
On Thu, March 9, 2017 09:46, John Hodrien wrote:
fsck's not good at finding disk errors, it finds filesystem errors.
If not fsck then what?
badblocks(8).
And I definitely will unmount relevant filesystem(s) before using badblocks…
You don’t necessarily have to. The default mode of badblocks is a non-invasive read-only test which is safe to run on a mounted filesystem.
That said, a read-only badblocks pass can give a false “no errors” report in cases where a non-destructive read-then-write pass (-n) will show errors.
Alternatively, a read-only pass may show an error that a read-then-write pass will silently bury by forcing the drive to relocate the bad sector.
In extreme cases, you could potentially fix a problem with a read-random-random-write pass (-n -t random -t random) because that will statistically flip all the bits at least twice, which may rub the drive’s nose in a bad sector, forcing a reallocation where a normal read-then-write pass (-n alone) may not.
Hard drives are weird. It is only through the grace of ECC and such that they approximate deterministic behavior as well as they do.
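The three kinds of pass described above can be written out as command lines. The sketch below only prints them, so nothing touches the disk until someone runs one deliberately; the function name badblocks_cmd and its mode names are invented for illustration. The -s and -v flags (progress and verbose output) are added for convenience and are not part of the advice above. Note that badblocks normally refuses the -n modes on a mounted device, so only the read-only pass applies to a mounted filesystem.

```shell
# Hypothetical helper: prints the badblocks command line for a given mode.
#   readonly        default read-only scan
#   nondestructive  read-then-write scan (-n)
#   random2         -n with two random test patterns
badblocks_cmd() {
    mode=$1
    dev=$2
    case $mode in
        readonly)       echo "badblocks -sv $dev" ;;
        nondestructive) echo "badblocks -nsv $dev" ;;
        random2)        echo "badblocks -nsv -t random -t random $dev" ;;
        *)
            echo "usage: badblocks_cmd readonly|nondestructive|random2 DEVICE" >&2
            return 1
            ;;
    esac
}

# Example: show the read-only scan for the disk named in the log.
#   badblocks_cmd readonly /dev/sda
```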
I get up around 0630, u can come anytime after that. I want to hit the range that morning but if I KNEW when you are arriving, I could plan around that...
Talk about missing the email I wanted to reply to. Disregard...
James B. Byrne wrote:
On Thu, March 9, 2017 09:46, John Hodrien wrote:
On Thu, 9 Mar 2017, James B. Byrne wrote:
This indicated that a bad sector on the underlying disk system might be the source of the problem. The guests were all shut down, a /forcefsck file was created on the host system, and the host system was remotely restarted.
fsck's not good at finding disk errors, it finds filesystem errors.
If not fsck then what?
Run fsck with -c, which makes it run badblocks. Or you can run badblocks directly.
If it was a real disk issue, you'd expect matching errors in the host logs.
Yes, there are:
Mar 9 09:14:13 vhost03 kernel: end_request: I/O error, dev sda, sector 1236929063
Mar 9 09:14:30 vhost03 kernel: end_request: I/O error, dev sda, sector 1236929063
Mar 9 09:14:48 vhost03 kernel: end_request: I/O error, dev sda, sector 1236929063
Looks like only one sector's bad. Running badblocks through fsck -c should, I think, mark that sector as bad, so the filesystem doesn't try to read or write there. I've got a user whose workstation has been running with a bad sector for over a year. However, if it becomes two, or four, or 64 sectors, it's replacement time, asap.
<snip>
mark