My recently installed CentOS 4.3 system (which is the kickstart server for other systems -- see other thread) is currently hung. I can ping it, but can't do much of anything else. When I switch to one of the virtual consoles, I can briefly see the "login:" prompt, but then that window immediately fills up with the following message:
EXT3-fs error (device dm-0) in start_transaction: Journal has aborted
This is displayed over and over again. I used the default (automatic) partitioning scheme, so the root file system is an ext3 partition on a logical volume that spans most of the disk (everything except for the swap partition and /boot). Anyone else seen this? Any ideas as to what would cause this? I'll be rebooting the system shortly...
Alfred
On Thu, Jun 01, 2006 at 02:51:42PM -0400, Alfred von Campe wrote:
My recently installed CentOS 4.3 system (which is the kickstart server for other systems -- see other thread) is currently hung. I can ping it, but can't do much of anything else. When I switch to one of the virtual consoles, I can briefly see the "login:" prompt, but then that window immediately fills up with the following message:
EXT3-fs error (device dm-0) in start_transaction: Journal has aborted
This is displayed over and over again. I used the default (automatic) partitioning scheme, so the root file system is an ext3 partition on a logical volume that spans most of the disk (everything except for the swap partition and /boot). Anyone else seen this? Any ideas as to what would cause this? I'll be rebooting the system shortly...
Is it a SATA disk ? I was reading a thread on WHT a few days ago about a server that kept getting its filesystem corrupted in that way, and was only solved (for good) after replacing the SATA cable.
-- Rodrigo Barbosa rodrigob@suespammers.org "Quid quid Latine dictum sit, altum viditur" "Be excellent to each other ..." - Bill & Ted (Wyld Stallyns)
On Jun 1, 2006, at 15:02, Rodrigo Barbosa wrote:
Is it a SATA disk ? I was reading a thread on WHT a few days ago about a server that kept getting its filesystem corrupted in that way, and was only solved (for good) after replacing the SATA cable.
Yes, it is a SATA disk. Kudos to whoever figured out that it was the cable; that must have been a tough one to debug. I'll replace the cable to see if it fixes the problem, although it is difficult to reproduce (it takes many days for it to occur).
Alfred
On Thu, 2006-06-01 at 16:05 -0400, Alfred von Campe wrote:
On Jun 1, 2006, at 15:02, Rodrigo Barbosa wrote:
Is it a SATA disk ? I was reading a thread on WHT a few days ago about a server that kept getting its filesystem corrupted in that way, and was only solved (for good) after replacing the SATA cable.
Yes, it is a SATA disk. Kudos to whoever figured out that it was the cable; that must have been a tough one to debug. I'll replace the cable to see if it fixes the problem, although it is difficult to reproduce (it takes many days for it to occur).
If you are not using the dmraid package ... you can remove that too ... it sometimes causes issues on SATA drives.
On Jun 1, 2006, at 17:37, Johnny Hughes wrote:
If you are not using the dmraid package ... you can remove that too ... it sometimes causes issues on SATA drives.
The dmraid RPM is installed. How do I know if it's being used or not? Even if I'm not using RAID (SW or HW), can this package cause a problem? On a possibly related note, the smartd service is the only one that fails to start properly. Could this be an indication of a problem?
Alfred
On Thu, 2006-06-01 at 20:06 -0400, Alfred von Campe wrote:
On Jun 1, 2006, at 17:37, Johnny Hughes wrote:
If you are not using the dmraid package ... you can remove that too ... it sometimes causes issues on SATA drives.
The dmraid RPM is installed. How do I know if it's being used or not? Even if I'm not using RAID (SW or HW), can this package cause a problem? On a possibly related note, the smartd service is the only one that fails to start properly. Could this be an indication of a problem?
Smartd doesn't work on many SATA drives ... the libata was written for pata, and it sometimes causes issues.
If you are not using RAID, you can remove the dmraid package. Yes, sometimes it can cause issues even if you are not using RAID.
On Jun 2, 2006, at 6:57, Johnny Hughes wrote:
Smartd doesn't work on many SATA drives ... the libata was written for pata, and it sometimes causes issues.
I disabled smartd.
If you are not using RAID, you can remove the dmraid package. Yes, sometimes it can cause issues even if you are not using RAID.
I removed the dmraid RPM and I replaced the SATA cable as suggested in another reply, but the problem still persists. Every 2-3 days I get the error in the subject line, and I have to reboot the system. I suspect a HW problem, but I've replaced the disk already, and I don't know what else to try. There is nothing in /var/log/messages that looks suspicious (actually, there is literally nothing in /var/log/messages from Tuesday night until the reboot this morning). That makes sense: if the file system is getting corrupted, it can't write errors to a log file. Any ideas what I can do next to debug this problem?
Alfred
On 08/06/06, Alfred von Campe alfred@110.net wrote:
I removed the dmraid RPM and I replaced the SATA cable as suggested in another reply, but the problem still persists. Every 2-3 days I get the error in the subject line, and I have to reboot the system. I suspect a HW problem, but I've replaced the disk already, and I don't know what else to try. There is nothing in /var/log/messages that looks suspicious (actually, there is literally nothing in /var/log/messages from Tuesday night until the reboot this morning). That makes sense: if the file system is getting corrupted, it can't write errors to a log file. Any ideas what I can do next to debug this problem?
Would setting up a netdump server provide any more information?
Will.
On Thu, 2006-06-08 at 07:58 -0400, Alfred von Campe wrote:
On Jun 2, 2006, at 6:57, Johnny Hughes wrote:
Smartd doesn't work on many SATA drives ... the libata was written for pata, and it sometimes causes issues.
I disabled smartd.
If you are not using RAID, you can remove the dmraid package. Yes, sometimes it can cause issues even if you are not using RAID.
I removed the dmraid RPM and I replaced the SATA cable as suggested in another reply, but the problem still persists. Every 2-3 days I get the error in the subject line, and I have to reboot the system. I suspect a HW problem, but I've replaced the disk already, and I don't know what else to try. There is nothing in /var/log/messages that looks suspicious (actually, there is literally nothing in /var/log/messages from Tuesday night until the reboot this morning). That makes sense: if the file system is getting corrupted, it can't write errors to a log file. Any ideas what I can do next to debug this problem?
Alfred
If this hasn't worked for a long period of time and then just failed (in other words, this is a new install and has never worked properly) then I would suspect driver related issues.
I would suggest the following:
1. Make sure you have the latest system BIOS available from the motherboard manufacturer. If you have a controller for the SATA drives that is not on the motherboard, make sure it has the latest BIOS offered by the manufacturer.
2. Make sure you have the latest bios for the hard drive(s) in question if there are bios updates provided from the hard drive manufacturer (that is the case with some SATA hard drives).
3. Look in the BIOS for settings that concern the drives (either in the motherboard or a separate controller) and ensure you understand what each one does and that they are set appropriately for Linux operations.
4. See if the controller manufacturer or the motherboard manufacturer provide Linux Drivers for the SATA controllers that might be newer than the ones in the Linux kernel.
On Thu, 2006-06-08 at 07:37 -0500, Johnny Hughes wrote:
On Thu, 2006-06-08 at 07:58 -0400, Alfred von Campe wrote:
On Jun 2, 2006, at 6:57, Johnny Hughes wrote:
<snip>
I would suggest the following:
- Make sure you have the latest system BIOS available from the motherboard manufacturer. If you have a controller for the SATA drives that is not on the motherboard, make sure it has the latest BIOS offered by the manufacturer.
- Make sure you have the latest BIOS for the hard drive(s) in question if there are BIOS updates provided from the hard drive manufacturer (that is the case with some SATA hard drives).
- Look in the BIOS for settings that concern the drives (either in the motherboard or a separate controller) and ensure you understand what each one does and that they are set appropriately for Linux operations.
- See if the controller manufacturer or the motherboard manufacturer provide Linux drivers for the SATA controllers that might be newer than the ones in the Linux kernel.
<snip sig stuff>
If I may also suggest:
5. Check with the support orgs/sites of all component manufacturers (possibly even the distributor you purchased from) to see if they have knowledge of issues and resolutions. I've noticed over the years that often a problem with a particular combination of hardware/software is encountered/reported/solved between the end-customer and a support organization earlier than elsewhere. Even if not, the *good* support organizations will strive to help because they know that it is likely that others will encounter the problem.
HTH
William L. Maltby wrote:
On Thu, 2006-06-08 at 07:37 -0500, Johnny Hughes wrote:
On Thu, 2006-06-08 at 07:58 -0400, Alfred von Campe wrote:
On Jun 2, 2006, at 6:57, Johnny Hughes wrote:
<snip>
I would suggest the following:
- Make sure you have the latest system BIOS available from the motherboard manufacturer. If you have a controller for the SATA drives that is not on the motherboard, make sure it has the latest BIOS offered by the manufacturer.
- Make sure you have the latest BIOS for the hard drive(s) in question if there are BIOS updates provided from the hard drive manufacturer (that is the case with some SATA hard drives).
- Look in the BIOS for settings that concern the drives (either in the motherboard or a separate controller) and ensure you understand what each one does and that they are set appropriately for Linux operations.
- See if the controller manufacturer or the motherboard manufacturer provide Linux drivers for the SATA controllers that might be newer than the ones in the Linux kernel.
<snip sig stuff>
If I may also suggest:
- Check with the support orgs/sites of all component manufacturers (possibly even the distributor you purchased from) to see if they have knowledge of issues and resolutions. I've noticed over the years that often a problem with a particular combination of hardware/software is encountered/reported/solved between the end-customer and a support organization earlier than elsewhere. Even if not, the *good* support organizations will strive to help because they know that it is likely that others will encounter the problem.
This is not directly related but is the box stable in Windows? Hardware issues like the one below will show up in any OS.
http://www.3ware.com/kb/article.aspx?id=10964
Another thing, the SATA controller does not happen to be a Silicon Image chip does it? I heard that the driver was supposed to be fixed but there were still some odd reports last I followed it. Them chips and certain hard drives don't like each other but I cannot remember which.
Thanks to all for your suggestions. I was away from the office all day today, and will only be there briefly tomorrow before going on vacation for a week. So I won't get a chance to try out all your suggestions until I get back a week from Monday. I am sure that the system will hang in my absence, but there is not much I can do about that (other than shutting it down).
Alfred
It's time to resurrect this thread from way back in June. The problem in the subject line has reared its ugly head again, but this time with a twist that makes it much worse. A little refresher on what was happening back then. Every so often the root file system would be remounted read-only, with the error in the subject line appearing over and over again on the console.
Lately, this has been happening every 10-14 days, and I would have to reboot my system. Since the root file system was not writable, no error messages were logged in /var/log/messages. So I configured syslog to write messages to another system as well, and this time I have captured some errors (see below). BTW, this is a SATA drive.
What makes it much worse this time, is that the system won't boot! When I try to boot now I get the following error over and over again:
ata1: translated ATA stat/err 0x51/40 to SCSI SK/ASC/ASCQ 0x3/11/04
HELP! Is there anything I can do to recover this system?
Alfred
Here are the first 50 lines from /var/log/messages (including the first occurrence of the error in the subject line)
Aug 1 18:57:04 balboa01 kernel: ata1: command 0x35 timeout, stat 0xb7 host_stat 0x21
Aug 1 18:57:04 balboa01 kernel: ata1: translated ATA stat/err 0xb7/00 to SCSI SK/ASC/ASCQ 0xb/47/00
Aug 1 18:57:04 balboa01 kernel: ata1: status=0xb7 { Busy }
Aug 1 18:57:04 balboa01 kernel: SCSI error : <0 0 0 0> return code = 0x8000002
Aug 1 18:57:04 balboa01 kernel: Current sda: sense key Aborted Command
Aug 1 18:57:04 balboa01 kernel: Additional sense: Scsi parity error
Aug 1 18:57:04 balboa01 kernel: end_request: I/O error, dev sda, sector 224365
Aug 1 18:57:04 balboa01 kernel: ATA: abnormal status 0xB7 on port 0x1F7
Aug 1 18:57:04 balboa01 last message repeated 2 times
Aug 1 18:57:04 balboa01 kernel: ata1: command 0x35 timeout, stat 0xb7 host_stat 0x21
Aug 1 18:57:04 balboa01 kernel: ata1: translated ATA stat/err 0xb7/00 to SCSI SK/ASC/ASCQ 0xb/47/00
Aug 1 18:57:04 balboa01 kernel: ata1: status=0xb7 { Busy }
Aug 1 18:57:04 balboa01 kernel: SCSI error : <0 0 0 0> return code = 0x8000002
Aug 1 18:57:04 balboa01 kernel: Current sda: sense key Aborted Command
Aug 1 18:57:04 balboa01 kernel: Additional sense: Scsi parity error
Aug 1 18:57:04 balboa01 kernel: end_request: I/O error, dev sda, sector 233795925
Aug 1 18:57:04 balboa01 kernel: Buffer I/O error on device dm-0, logical block 29198337
Aug 1 18:57:04 balboa01 kernel: lost page write due to I/O error on dm-0
Aug 1 18:57:04 balboa01 kernel: ATA: abnormal status 0xB7 on port 0x1F7
Aug 1 18:57:04 balboa01 last message repeated 2 times
Aug 1 18:57:04 balboa01 kernel: ata1: command 0x35 timeout, stat 0xb7 host_stat 0x21
Aug 1 18:57:04 balboa01 kernel: ata1: translated ATA stat/err 0xb7/00 to SCSI SK/ASC/ASCQ 0xb/47/00
Aug 1 18:57:04 balboa01 kernel: ata1: status=0xb7 { Busy }
Aug 1 18:57:04 balboa01 kernel: SCSI error : <0 0 0 0> return code = 0x8000002
Aug 1 18:57:04 balboa01 kernel: Current sda: sense key Aborted Command
Aug 1 18:57:04 balboa01 kernel: Additional sense: Scsi parity error
Aug 1 18:57:04 balboa01 kernel: end_request: I/O error, dev sda, sector 224373
Aug 1 18:57:04 balboa01 kernel: Buffer I/O error on device dm-0, logical block 1893
Aug 1 18:57:04 balboa01 kernel: lost page write due to I/O error on dm-0
Aug 1 18:57:04 balboa01 kernel: ATA: abnormal status 0xB7 on port 0x1F7
Aug 1 18:57:04 balboa01 last message repeated 2 times
Aug 1 18:57:04 balboa01 kernel: Aborting journal on device dm-0.
Aug 1 18:57:04 balboa01 kernel: ata1: command 0x35 timeout, stat 0xb7 host_stat 0x21
Aug 1 18:57:04 balboa01 kernel: ata1: translated ATA stat/err 0xb7/00 to SCSI SK/ASC/ASCQ 0xb/47/00
Aug 1 18:57:04 balboa01 kernel: ata1: status=0xb7 { Busy }
Aug 1 18:57:04 balboa01 kernel: SCSI error : <0 0 0 0> return code = 0x8000002
Aug 1 18:57:04 balboa01 kernel: Current sda: sense key Aborted Command
Aug 1 18:57:04 balboa01 kernel: Additional sense: Scsi parity error
Aug 1 18:57:04 balboa01 kernel: end_request: I/O error, dev sda, sector 172585309
Aug 1 18:57:04 balboa01 kernel: Buffer I/O error on device dm-0, logical block 21547010
Aug 1 18:57:04 balboa01 kernel: lost page write due to I/O error on dm-0
Aug 1 18:57:04 balboa01 kernel: ATA: abnormal status 0xB7 on port 0x1F7
Aug 1 18:57:04 balboa01 last message repeated 2 times
Aug 1 18:57:04 balboa01 kernel: ext3_abort called.
Aug 1 18:57:04 balboa01 kernel: EXT3-fs error (device dm-0): ext3_journal_start_sb: Detected aborted journal
Aug 1 18:57:04 balboa01 kernel: Remounting filesystem read-only
Aug 1 18:57:04 balboa01 kernel: ata1: command 0x35 timeout, stat 0xb7 host_stat 0x21
Aug 1 18:57:04 balboa01 kernel: EXT3-fs error (device dm-0) in start_transaction: Journal has aborted
Aug 1 18:57:34 balboa01 kernel: ata1: command 0x35 timeout, stat 0xb7 host_stat 0x21
Aug 1 18:57:34 balboa01 kernel: ata1: translated ATA stat/err 0xb7/00 to SCSI SK/ASC/ASCQ 0xb/47/00
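For anyone reading the excerpt above, a quick way to see which sectors failed is to pull the "end_request: I/O error" entries out of the captured log. The following is only a minimal sketch: the function name failed_sectors and the shortened sample string are made up for illustration, and the pattern is written against the message format shown in this log, not any guaranteed kernel format.

```python
import re

# Matches block-layer error entries as they appear in the excerpt, e.g.
# "end_request: I/O error, dev sda, sector 224365"
IO_ERROR = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")

def failed_sectors(log_text):
    """Map each device name to a sorted list of distinct failing sectors."""
    found = {}
    for dev, sector in IO_ERROR.findall(log_text):
        found.setdefault(dev, set()).add(int(sector))
    return {dev: sorted(sectors) for dev, sectors in found.items()}

# Three entries copied from the log above (shortened for illustration)
sample = (
    "Aug 1 18:57:04 balboa01 kernel: end_request: I/O error, dev sda, sector 224365 "
    "Aug 1 18:57:04 balboa01 kernel: end_request: I/O error, dev sda, sector 233795925 "
    "Aug 1 18:57:04 balboa01 kernel: end_request: I/O error, dev sda, sector 224373"
)
print(failed_sectors(sample))
```

In this log the failing sectors are spread widely across the disk and every one follows a timeout of command 0x35 (WRITE DMA EXT in the ATA command set). That is at least compatible with a controller, link, or driver problem rather than one damaged region, although it does not prove anything by itself.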
Alfred von Campe wrote:
It's time to resurrect this thread from way back in June. The problem in the subject line has reared its ugly head again, but this time with a twist that makes it much worse. A little refresher on what was happening back then. Every so often the root file system would be remounted read-only, with the error in the subject line appearing over and over again on the console.
Lately, this has been happening every 10-14 days, and I would have to reboot my system. Since the root file system was not writable, no error messages were logged in /var/log/messages. So I configured syslog to write messages to another system as well, and this time I have captured some errors (see below). BTW, this is a SATA drive.
What makes it much worse this time, is that the system won't boot! When I try to boot now I get the following error over and over again:
ata1: translated ATA stat/err 0x51/40 to SCSI SK/ASC/ASCQ 0x3/11/04
HELP! Is there anything I can do to recover this system?
Alfred
_______________________________________________
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos
Maybe the disk is dying? Did you run smartd (it requires -d ata for SATA disks; this option needs to be put in smartd.conf)?
The error messages could also indicate bad cables.
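To help judge what the logged stat/err pairs (0x51/40 at boot, 0xb7/00 at run time) are saying, the register bits can be decoded by hand. This is a rough sketch only: the bit names follow the classic ATA status and error register layout, some names differ between revisions of the ATA specification, and the helper names decode_status and decode_error are invented for this example.

```python
# Classic ATA status register bits (names vary slightly between spec revisions)
STATUS_BITS = {
    0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x10: "DSC",
    0x08: "DRQ", 0x04: "CORR", 0x02: "IDX", 0x01: "ERR",
}
# Classic ATA error register bits
ERROR_BITS = {
    0x80: "ICRC", 0x40: "UNC", 0x20: "MC", 0x10: "IDNF",
    0x08: "MCR", 0x04: "ABRT", 0x02: "NM", 0x01: "AMNF",
}

def decode(value, names):
    """List the names of all bits set in value, highest bit first."""
    return [name for bit, name in names.items() if value & bit]

def decode_status(stat):
    # While BSY is set the remaining status bits are not meaningful,
    # so report only BSY in that case.
    if stat & 0x80:
        return ["BSY"]
    return decode(stat, STATUS_BITS)

def decode_error(err):
    return decode(err, ERROR_BITS)

print(decode_status(0x51), decode_error(0x40))  # boot-time message
print(decode_status(0xB7), decode_error(0x00))  # run-time timeouts
```

Read this way, the boot-time 0x51/40 is a completed command that ended with an uncorrectable data error (UNC), which matches the "medium error" sense code the kernel translated it to. The run-time 0xb7 only says the device was still busy when the command timed out, which is why those entries alone cannot separate a bad disk from a bad cable or controller.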
I would boot from the CentOS 4.3 Live-CD, and take a look at the disk with smartctl. If the disk is indeed dying, I'd try to save its contents to a fresh disk, using ddrescue. Unfortunately there are 2 programs with this name (http://www.garloff.de/kurt/linux/ddrescue/ and http://www.gnu.org/software/ddrescue/ddrescue.html); I have had very good results with the latter - don't know if it's on the LiveCD (if not, it should be!).
If the disk shows no SMART errors you could use e2fsck.
HTH,
Kay
Maybe the disk is dying? Did you run smartd (it requires -d ata for SATA disks; this option needs to be put in smartd.conf)?
It's a brand new disk (well, less than three months old), and it pretty much did it from the get-go (as did the previous disk). I've replaced the SATA cable and updated the system BIOS (it's a Lenovo ThinkCentre M51, and I am using the on-board SATA controller) as previously suggested. I was hoping that the errors logged with syslog would help uncover the root cause. It just so happens that the first time after I configure syslog to log to another system, the disk becomes unbootable.
The error messages could also indicate bad cables.
I already replaced the cable, and this is an intermittent error.
I would boot from the CentOS 4.3 Live-CD, and take a look at the disk with smartctl. If the disk is indeed dying, I'd try to save its contents to a fresh disk, using ddrescue. Unfortunately there are 2 programs with this name (http://www.garloff.de/kurt/linux/ddrescue/ and http://www.gnu.org/software/ddrescue/ddrescue.html); I have had very good results with the latter - don't know if it's on the LiveCD (if not, it should be!).
Great idea, booting from the CD as I type this. I had tried booting the install CD in rescue mode, but that resulted in a kernel panic when it tried to mount the disk. Let's hope I have more luck with the LiveCD.
Alfred
On Wed, 2 Aug 2006, Alfred von Campe wrote:
Maybe the disk is dying? Did you run smartd (it requires -d ata for SATA disks; this option needs to be put in smartd.conf)?
It's a brand new disk (well, less than three months old), and it pretty much did it from the get-go (as did the previous disk). I've replaced the SATA cable and updated the system BIOS (it's a Lenovo ThinkCentre M51, and I am using the on-board SATA controller) as previously suggested. I was hoping that the errors logged with syslog would help uncover the root cause. It just so happens that the first time after I configure syslog to log to another system, the disk becomes unbootable.
The error messages could also indicate bad cables.
I already replaced the cable, and this is an intermittent error.
I would boot from the CentOS 4.3 Live-CD, and take a look at the disk with smartctl. If the disk is indeed dying, I'd try to save its contents to a fresh disk, using ddrescue. Unfortunately there are 2 programs with this name (http://www.garloff.de/kurt/linux/ddrescue/ and http://www.gnu.org/software/ddrescue/ddrescue.html); I have had very good results with the latter - don't know if it's on the LiveCD (if not, it should be!).
Great idea, booting from the CD as I type this. I had tried booting the install CD in rescue mode, but that resulted in a kernel panic when it tried to mount the disk. Let's hope I have more luck with the LiveCD.
I would also recommend running a long SMART self-test on the drive. If you capture the SMART attributes before and after the test, it is actually pretty easy to locate the source of the problem (e.g., host controller vs. disk vs. bad sector) by comparing the SMART attributes that were captured. If you want additional details, check out the following article:
http://prefetch.net/articles/diskdrives.smart.html
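The before/after comparison can be done by eye, but a few lines of script make it harder to miss a counter that moved. The sketch below assumes the two captures are saved text in the usual ten-column attribute table layout printed by smartctl (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE). The function names and the two sample captures are hypothetical and were not taken from any drive in this thread.

```python
def parse_attributes(text):
    """Return {attribute_name: raw_value_string} from an attribute table."""
    attrs = {}
    for line in text.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns
        if len(parts) >= 10 and parts[0].isdigit():
            attrs[parts[1]] = parts[9]
    return attrs

def changed(before_text, after_text):
    """Return {name: (raw_before, raw_after)} for attributes whose raw value moved."""
    before = parse_attributes(before_text)
    after = parse_attributes(after_text)
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    }

# Made-up captures, for illustration only
BEFORE = """
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       3
"""
AFTER = """
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       17
"""
print(changed(BEFORE, AFTER))
```

As a rough rule of thumb, a growing CRC error counter tends to implicate the cable or controller, while growing reallocated or pending sector counts tend to implicate the disk surface. Attribute names and meanings are vendor-specific, so treat any such reading as a hint.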
Thanks, - Ryan -- UNIX Administrator http://prefetch.net
On Wed, 2006-08-02 at 11:12 -0400, Matty wrote:
On Wed, 2 Aug 2006, Alfred von Campe wrote:
Maybe the disk is dying? Did you run smartd (it requires -d ata for SATA disks; this option needs to be put in smartd.conf)?
It's a brand new disk (well, less than three months old), and it pretty much
<snip>
I would also recommend running a long SMART self-test on the drive. If you capture the SMART attributes before and after the test, it is actually pretty easy to locate the source of the problem (e.g., host controller vs. disk vs. bad sector) by comparing the SMART attributes that were captured. If you want additional details, check out the following article:
Hmmm. Maybe something I've seen here is related? Regardless, it raises a Q for me.
I have a couple of commodity IDE ultra drives that have S.M.A.R.T. technology. Different sizes, different manufacturers. S.M.A.R.T. is disabled in my BIOS. Both drives fail to boot after a power-off period. But if I wait for a while after power-on (5 - 10 minutes?), either boots fine. And both have no problem with warm boots.
Symptoms vary from a "crc error" after the "booting..." message from grub(?) to things just freezing.
Well, spin up on them shows
FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
0x0007 093   092   021    Pre-fail Always  -           1866
0x0027 252   252   063    Pre-fail Always  -           1457
Now, the Q is: what do the numbers mean? Seconds? Milliseconds? If it's seconds, the "RAW_VALUE" may explain why all is OK after things have been powered up long enough. If it's ms, I can only think it is running a self-test. I would have to go read those articles more closely to see what I can determine.
Anyway, I thought the OP might have a similar thing affecting him. Delay in spin-up after sleeping or some other smart-related setting.
On Wed, 2 Aug 2006, William L. Maltby wrote:
FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
0x0007 093   092   021    Pre-fail Always  -           1866
0x0027 252   252   063    Pre-fail Always  -           1457
Now, the Q is: what do the numbers mean? Seconds? Milliseconds? If it's seconds, the "RAW_VALUE" may explain why all is OK after things have been powered up long enough. If it's ms, I can only think it is running a self-test. I would have to go read those articles more closely to see what I can determine.
Hi William,
The attribute names (as listed in the "ATTRIBUTE_NAME" column) are described here:
http://en.wikipedia.org/wiki/Self-Monitoring%2C_Analysis%2C_and_Reporting_Te...
The VALUE column contains a normalized value for the attribute, WORST contains the drive's lifetime minimum (or maximum) value, THRESH contains the drive manufacturer's failure threshold, and RAW_VALUE contains a 6-byte field that stores the attribute's raw value. I see that several sectors are marked as unreadable in the logfiles you posted. What do you see in the column "Reallocated_Sector_Ct"? If the SMART attributes check out, the disk is most likely fine (you can manually run a long self-test to be sure), and I would start looking at the SATA controller (most likely culprit) and the SATA device driver (you could easily integrate some debugging data into the driver to assist with debugging the problem).
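The relationship between VALUE, WORST and THRESH can be checked mechanically against the two rows William posted. This is only a sketch: it assumes the eight-column layout exactly as he pasted it (without the ID# and ATTRIBUTE_NAME columns), uses the usual convention that an attribute counts as failed when its normalized value is at or below the threshold, and the function name margins is made up for the example.

```python
# The two rows exactly as posted earlier in the thread
ROWS = """0x0007 093 092 021 Pre-fail Always - 1866
0x0027 252 252 063 Pre-fail Always - 1457"""

def margins(rows):
    """For each row, report whether it is failing and how far WORST sits above THRESH."""
    report = []
    for line in rows.splitlines():
        # Columns: FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        flag, value, worst, thresh, *_middle, raw = line.split()
        report.append({
            "flag": flag,
            "failing": int(value) <= int(thresh),
            "worst_margin": int(worst) - int(thresh),
            "raw": int(raw),
        })
    return report

for row in margins(ROWS):
    print(row)
```

Both rows sit well above their thresholds, so neither attribute is flagged as failing. The unit of RAW_VALUE is vendor-specific and cannot be derived from the normalized columns, so this check does not answer the seconds-versus-milliseconds question.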
Hope this helps, - Ryan -- UNIX Administrator http://prefetch.net
On Wed, 2006-08-02 at 12:41 -0400, Matty wrote:
On Wed, 2 Aug 2006, William L. Maltby wrote:
FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
0x0007 093   092   021    Pre-fail Always  -           1866
0x0027 252   252   063    Pre-fail Always  -           1457
Now, the Q is: what do the numbers mean? Seconds? Milliseconds? If it's seconds, the "RAW_VALUE" may explain why all is OK after things have been powered up long enough. If it's ms, I can only think it is running a self-test. I would have to go read those articles more closely to see what I can determine.
Hi William,
The attribute names (as listed in the "ATTRIBUTE_NAME" column) are described here:
http://en.wikipedia.org/wiki/Self-Monitoring%2C_Analysis%2C_and_Reporting_Te...
I went there. It didn't tell me the unit of measure though.
The VALUE column contains a normalized value for the attribute, WORST contains the drive's lifetime minimum (or maximum) value, THRESH contains the drive manufacturer's failure threshold, and RAW_VALUE contains a 6-byte field that stores the attribute's raw value. I see that several sectors are marked as unreadable in the logfiles you posted.
OOPS! That's the other guy. I only thought that maybe some of his problems might be related to what you started with, S.M.A.R.T., and I had some symptoms that might also be related to S.M.A.R.T. and might be hitting him too.
So I posted my symptoms just to tickle a thought.
What do you see in the column "Reallocated_Sector_Ct"?
Regardless, mine are all zero. My big drive does have about 1400+ ECC corrections, but that's on a 100GB drive. So I'm cool with that.
If the SMART attributes check out, <snip>
Hope this helps,
- Ryan
<snip sig stuff>
On Aug 2, 2006, at 11:12, Matty wrote:
I would also recommend running a long SMART self-test on the drive. If you capture the SMART attributes before and after the test, it is actually pretty easy to locate the source of the problem (e.g., host controller vs. disk vs. bad sector) by comparing the SMART attributes that were captured. If you want additional details, check out the following article:
Thanks for the URL. I'd love to run a SMART analysis, but apparently smartctl doesn't support SATA drives. At least the version that comes with CentOS doesn't (I haven't tried to rebuild it from sources - yet)!
I tried accessing the drive from the CentOS LiveCD, and wasn't successful. There are two partitions on the drive, one for /boot, and the rest of the disk is managed by LVM. I was able to mount the /boot partition, but I couldn't read the grub directory due to apparent corruption (that's a bad sign right there). But maybe I can recover some of the data from the rest of the disk. How do I mount the logical volumes managed by LVM from the command line? I haven't had a chance to google for this/read the man page, so if someone has a quick synopsis handy, I would really appreciate it.
Alfred
On Wed, Aug 02, 2006 at 02:41:18PM -0400, Alfred von Campe wrote:
Thanks for the URL. I'd love to run a SMART analysis, but apparently smartctl doesn't support SATA drives. At least the version that comes with CentOS doesn't (I haven't tried to rebuild it from sources - yet)!
I believe you actually need a newer libata kernel module. So, kinda out of luck there.
Alfred von Campe wrote:
On Aug 2, 2006, at 11:12, Matty wrote:
I would also recommend running a long SMART self-test on the drive. If you capture the SMART attributes before and after the test, it is actually pretty easy to locate the source of the problem (e.g., host controller vs. disk vs. bad sector) by comparing the SMART attributes that were captured. If you want additional details, check out the following article:
Thanks for the URL. I'd love to run a SMART analysis, but apparently smartctl doesn't support SATA drives. At least the version that comes with CentOS doesn't (I haven't tried to rebuild it from sources - yet)!
I tried accessing the drive from the CentOS LiveCD, and wasn't successful. There are two partitions on the drive, one for /boot, and the rest of the disk is managed by LVM. I was able to mount the /boot partition, but I couldn't read the grub directory due to apparent corruption (that's a bad sign right there). But maybe I can recover some of the data from the rest of the disk. How do I mount the logical volumes managed by LVM from the command line? I haven't had a chance to google for this/read the man page, so if someone has a quick synopsis handy, I would really appreciate it.
Alfred _______________________________________________ CentOS mailing list CentOS@centos.org http://lists.centos.org/mailman/listinfo/centos
Alfred,
the CentOS smartctl _does_ work (for me at least) with SATA disks if you use the "-d ata" option. So please try
    smartctl -d ata -a /dev/sda
You can run a long test with
    smartctl -d ata -t long /dev/sda
You'd better not try to e2fsck a dying disk ... try to copy it with ddrescue, and run e2fsck on the copy, and mount the copy afterwards. An external USB disk is very handy for that purpose.
Kay
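A rough sketch of the copy-first approach described above, assuming GNU ddrescue (not the separate dd_rescue tool). The function only prints the commands so they can be reviewed before anything touches the failing disk; the function name, the device, the image and log paths, and the mount point are all hypothetical placeholders.

```shell
# rescue_copy_plan SRC IMG LOG [MNT]
# Prints (does not run) a copy-then-check sequence for a failing source device.
# Assumes GNU ddrescue and that SRC holds a single ext2/ext3 filesystem.
rescue_copy_plan() {
    src=$1
    img=$2
    log=$3
    mnt=${4:-/mnt/rescue}
    cat <<EOF
ddrescue -n $src $img $log
ddrescue -r3 $src $img $log
e2fsck -f $img
mount -o loop,ro $img $mnt
EOF
}
```

The first ddrescue pass grabs the easily readable areas, the second retries the bad areas, and e2fsck and mount then only ever see the image. With LVM in the picture, SRC would presumably be the activated logical volume rather than the raw partition.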
On Aug 2, 2006, at 14:49, Kay Diederichs wrote:
the CentOS smartctl _does_ work (for me at least) with SATA disks if you use the "-d ata" option. So please try
    smartctl -d ata -a /dev/sda
You can run a long test with
    smartctl -d ata -t long /dev/sda
Yes, cool, this works. Of course by now I have another drive in this system and already rebuilt CentOS on it. But I'll boot back in the CentOS LiveCD later on and probe the old drive.
Thanks, Alfred
the CentOS smartctl _does_ work (for me at least) with SATA disks if you use the "-d ata" option. So please try smartctl -d ata -a /dev/sda
I was able to run this command and the output is attached. Does anything there point to a disk HW failure? I still think that I'm just suffering from file system corruption.
You can run a long test with smartctl -d ata -t long /dev/sda
I just started this now, and it told me to wait for 70 minutes for the test to complete. How do I check its results?
BTW, I was able to mount the LVM partition and copy some critical files (see my next email). After I did that I first tried to run the long test and got a kernel panic. But that was after getting a bunch of errors when accessing the drive.
Alfred
On Wed, 2 Aug 2006, Alfred von Campe wrote:
the CentOS smartctl _does_ work (for me at least) with SATA disks if you use the "-d ata" option. So please try smartctl -d ata -a /dev/sda
I was able to run this command and the output is attached. Does anything there point to a disk HW failure? I still think that I'm just suffering from file system corruption.
You can run a long test with smartctl -d ata -t long /dev/sda
I just started this now, and it told me to wait for 70 minutes for the test to complete. How do I check its results?
You can use the smartctl "-l" option to view the results:
$ smartctl -l selftest /dev/hda
- Ryan -- UNIX Administrator http://prefetch.net
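The before/after comparison suggested earlier in the thread could be sketched roughly as follows. It assumes the attribute table has been saved to two text files and that the raw value is the 10th whitespace-separated field of each attribute row; the function name and file names are made up for illustration.

```shell
# smart_attr_changes BEFORE_FILE AFTER_FILE
# Prints every SMART attribute whose raw value differs between two saved
# attribute tables (e.g. output of "smartctl -d ata -A /dev/sda").
smart_attr_changes() {
    awk '
        # Attribute rows start with a numeric ID; skip headers and blank lines.
        $1 !~ /^[0-9]+$/ { next }
        # First file: remember the raw value per attribute name.
        NR == FNR { before[$2] = $10; next }
        # Second file: report attributes whose raw value changed.
        ($2 in before) && before[$2] != $10 {
            printf "%s: %s -> %s\n", $2, before[$2], $10
        }
    ' "$1" "$2"
}
```

One possible sequence on the SATA disk discussed here:

    smartctl -d ata -A /dev/sda > before.txt
    smartctl -d ata -t long /dev/sda
    (wait for the test to finish, then check it with: smartctl -d ata -l selftest /dev/sda)
    smartctl -d ata -A /dev/sda > after.txt
    smart_attr_changes before.txt after.txt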
On Wed, 2 Aug 2006 at 6:02pm, Alfred von Campe wrote
the CentOS smartctl _does_ work (for me at least) with SATA disks if you use the "-d ata" option. So please try smartctl -d ata -a /dev/sda
I was able to run this command and the output is attached. Does anything there point to a disk HW failure? I still think that I'm just suffering from file system corruption.
You shouldn't see *any* (IMO) ATA errors in the SMART Error log of a healthy disk -- yours has 273. Also, you have 1 Reallocated Sector, 16 Pending reallocations, and 10 "Offline Uncorrectable" errors. It really looks to me like the disk is bad.
Download Maxtor's disk assessment tool and run it on the disk. It'll likely tell you it's bad and give you a code you can give to Maxtor to get your warranty replacement.
Alfred von Campe wrote:
the CentOS smartctl _does_ work (for me at least) with SATA disks if you use the "-d ata" option. So please try smartctl -d ata -a /dev/sda
I was able to run this command and the output is attached. Does anything there point to a disk HW failure? I still think that I'm just suffering from file system corruption.
You can run a long test with smartctl -d ata -t long /dev/sda
I just started this now, and it told me to wait for 70 minutes for the test to complete. How do I check its results?
BTW, I was able to mount the LVM partition and copy some critical files (see my next email). After I did that I first tried to run the long test and got a kernel panic. But that was after getting a bunch of errors when accessing the drive.
Alfred
This disk is bad, as can be seen from
  5 Reallocated_Sector_Ct   0x0033   253   253   063    Pre-fail  Always       -       1
197 Current_Pending_Sector  0x0008   252   252   000    Old_age   Offline      -       16
198 Offline_Uncorrectable   0x0008   243   243   000    Old_age   Offline      -       10
These errors (particularly those indicated by the last 2 lines) are definitely responsible for the messages in the logs, the filesystem corruption, and data loss.
There are Maxtor tools (at their webpage) to re-initialize the disk, and if you're lucky, the disk will be good for the next 10 years. But the errors might as well show up again soon. So I'd return the disk if the data on the disk are important to you.
smartctl -a -d ata /dev/sda gives you all the info, including the info on the progress of the selftest.
HTH,
Kay
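As a small illustration of the check described above, the following sketch picks out the three attributes named in this message (IDs 5, 197 and 198) and reports any with a non-zero raw count. It assumes "smartctl -A"-style rows on stdin with the raw value in the 10th field, as in the rows quoted above; the function name is made up.

```shell
# smart_flag_bad_sectors
# Reads SMART attribute rows on stdin, prints reallocated/pending/uncorrectable
# counters that are non-zero, and returns 1 if any were found (0 otherwise).
smart_flag_bad_sectors() {
    awk '
        $1 == 5 || $1 == 197 || $1 == 198 {
            if ($10 + 0 > 0) {
                printf "%s raw=%d\n", $2, $10
                bad = 1
            }
        }
        END { exit (bad ? 1 : 0) }
    '
}
```

Usage would be along the lines of

    smartctl -d ata -A /dev/sda | smart_flag_bad_sectors

A clean result here does not prove the disk is healthy; it only covers these three counters.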
On Aug 3, 2006, at 2:14, Kay Diederichs wrote:
This disk is bad, as can be seen from
[snip]
OK, that's actually good news. The only reason I was doubting it was the disk was that this is the second disk that had problems, although the problems with the first disk may have been something completely different (but that was also with a different Linux distribution).
I will send this disk back to Maxtor after running the Maxtor utility.
Thanks for all your help. It was a great learning experience and I was able to recover all the critical data. The system is back up and running with a spare drive while I wait for the replacement to arrive.
Alfred
On Wed, 2006-08-02 at 14:41 -0400, Alfred von Campe wrote:
On Aug 2, 2006, at 11:12, Matty wrote:
<snip>
I tried accessing the drive from the CentOS LiveCD, and wasn't successful. There are two partitions on the drive, one for /boot, and the rest of the disk is managed by LVM. I was able to mount the /boot partition, but I couldn't read the grub directory due to apparent corruption (that's a bad sign right there). But maybe I can recover some of the data from the rest of the disk. How do I mount the logical volumes managed by LVM from the command line?
I'm new, but I think if you add it to a system, some changed things will appear on boot. Do a
    pvscan --verbose
    vgdisplay --verbose
    lvdisplay --verbose xxxx   # for each volgroup
to get a layout. I *suspect* the LVM routines have everything set up as far as the physical aspects go. But the normal VolGroup00/LogVol00 is already used on your machine. I'm not sure how LVM will place the new volume into the system. If you're lucky, it will just find the physical volume and make new /dev/mapper entries. That would be good. I've not had to do this carrying a disk to another machine.
Depending on what it shows, we might be able to proceed easily.
Post results here.
If it's beyond me, I'm sure some of the more experienced folks will be able to help.
<snip> Alfred <snip sig stuff>
HTH
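For the question of mounting the logical volumes from a rescue environment such as the LiveCD, a minimal sketch of the usual sequence follows. It only prints the commands; the function name and mount point are hypothetical, and the defaults VolGroup00/LogVol00 are simply the names mentioned above, not something verified on this disk.

```shell
# lvm_rescue_plan [VG] [LV] [MNT]
# Prints (does not run) the steps to activate a volume group and mount one
# logical volume read-only.
lvm_rescue_plan() {
    vg=${1:-VolGroup00}
    lv=${2:-LogVol00}
    mnt=${3:-/mnt/rescue}
    cat <<EOF
vgscan
vgchange -ay $vg
lvscan
mkdir -p $mnt
mount -o ro /dev/$vg/$lv $mnt
EOF
}
```

If the disk is attached to a machine that already has a volume group of the same name, the names would clash and one group would probably have to be renamed first (vgrename accepts a VG UUID in place of the old name).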
On Aug 2, 2006, at 15:11, William L. Maltby wrote:
I'm new, but I think if you add it to a system, some changed things will appear on boot. Do a
    pvscan --verbose
    vgdisplay --verbose
    lvdisplay --verbose xxxx   # for each volgroup
to get a layout. I *suspect* the LVM routines have everything set up as far as the physical aspects go. But the normal VolGroup00/LogVol00 is already used on your machine. I'm not sure how LVM will place the new volume into the system. If you're lucky, it will just find the physical volume and make new /dev/mapper entries. That would be good. I've not had to do this carrying a disk to another machine.
I was able to mount the partition I needed, and accessing some subdirectories was no problem at all. But others, including the top level directory, would generate lots of errors. I was able to copy (using scp) some critical files I needed. I'm still not convinced that the disk suffered a HW failure (I'm running a long SMART test now). How can I determine if indeed the disk failed, as opposed to the file system just getting corrupted? I've salvaged everything I need off of this disk, so running fsck on it is not a problem. I would like to determine the root cause, though.
Alfred
Alfred von Campe spake the following on 8/2/2006 3:06 PM:
On Aug 2, 2006, at 15:11, William L. Maltby wrote:
I'm new, but I think if you add it to a system, some changed things will appear on boot. Do a
    pvscan --verbose
    vgdisplay --verbose
    lvdisplay --verbose xxxx   # for each volgroup
to get a layout. I *suspect* the LVM routines have everything set up as far as the physical aspects go. But the normal VolGroup00/LogVol00 is already used on your machine. I'm not sure how LVM will place the new volume into the system. If you're lucky, it will just find the physical volume and make new /dev/mapper entries. That would be good. I've not had to do this carrying a disk to another machine.
I was able to mount the partition I needed, and accessing some subdirectories was no problem at all. But others, including the top level directory, would generate lots of errors. I was able to copy (using scp) some critical files I needed. I'm still not convinced that the disk suffered a HW failure (I'm running a long SMART test now). How can I determine if indeed the disk failed, as opposed to the file system just getting corrupted? I've salvaged everything I need off of this disk, so running fsck on it is not a problem. I would like to determine the root cause, though.
Alfred
If the filesystem is not needed, you could make a new filesystem on it with the check for bad blocks enabled. I think it is mke2fs -cc /dev/sda (or whatever the device is) for a deep read/write test.
On Jun 8, 2006, at 8:37, Johnny Hughes wrote:
If this hasn't worked for a long period of time and then just failed (in other words, this is a new install and has never worked properly) then I would suspect driver related issues.
This is a brand new PC (IBM/Lenovo ThinkCentre M51) with 4GB of memory (but the Intel chipset can only address 3GB, so that's all that's available to the OS), a new 160 GB SATA drive, and a fresh install.
I would suggest the following:
- Make sure you have the latest system BIOS available from the motherboard manufacturer. If you have a controller for the SATA drives that is not on the motherboard, make sure it has the latest BIOS offered by the manufacturer.
Since the PC is brand new (well, 2 months old), I assume it has the latest and greatest BIOS, but I will double check that.
- Make sure you have the latest BIOS for the hard drive(s) in question if there are BIOS updates provided from the hard drive manufacturer (that is the case with some SATA hard drives).
The drive was purchased last week, so again I think it's the latest and greatest. But how do I check this?
- Look in the BIOS for settings that concern the drives (either in the motherboard or a separate controller) and ensure you understand what each one does and that they are set appropriately for Linux operations.
I will do that when I'm back at the PC (I'm in training all day today).
- See if the controller manufacturer or the motherboard manufacturer provide Linux drivers for the SATA controllers that might be newer than the ones in the Linux kernel.
Good ideas. I will definitely have to look into this. Does anyone know if Lenovo is providing Linux drivers at this point?
Alfred
On Thu, Jun 08, 2006 at 08:48:48AM -0400, Alfred von Campe wrote:
Does anyone know if Lenovo is providing Linux drivers at this point?
Some bad news for you:
http://www.crn.com/sections/infrastructure/infrastructure.jhtml?articleId=18...
-- Rodrigo Barbosa "Quid quid Latine dictum sit, altum viditur" "Be excellent to each other ..." - Bill & Ted (Wyld Stallyns)
Rodrigo Barbosa wrote:
On Thu, Jun 08, 2006 at 08:48:48AM -0400, Alfred von Campe wrote:
Does anyone know if Lenovo is providing Linux drivers at this point?
Some bad news for you:
http://www.crn.com/sections/infrastructure/infrastructure.jhtml?articleId=18...
Rodrigo Barbosa "Quid quid Latine dictum sit, altum viditur" "Be excellent to each other ..." - Bill & Ted (Wyld Stallyns)
and for the good news read :
http://www.crn.com/sections/breakingnews/dailyarchives.jhtml?articleId=18870...
On Thu, Jun 08, 2006 at 03:08:34PM +0200, Kay Diederichs wrote:
Rodrigo Barbosa wrote:
On Thu, Jun 08, 2006 at 08:48:48AM -0400, Alfred von Campe wrote:
Does anyone know if Lenovo is providing Linux drivers at this point?
Some bad news for you:
http://www.crn.com/sections/infrastructure/infrastructure.jhtml?articleId=18...
and for the good news read :
http://www.crn.com/sections/breakingnews/dailyarchives.jhtml?articleId=18870...
Hey, tkx! I missed that one.
-- Rodrigo Barbosa "Quid quid Latine dictum sit, altum viditur" "Be excellent to each other ..." - Bill & Ted (Wyld Stallyns)
On 01/06/06, Rodrigo Barbosa rodrigob@suespammers.org wrote:
On Thu, Jun 01, 2006 at 02:51:42PM -0400, Alfred von Campe wrote:
My recently installed CentOS 4.3 system (which is the kickstart server for other systems -- see other thread) is currently hung. I can ping it, but can't do much of anything else. When I switch to one of the virtual consoles, I can briefly see the "login:" prompt, but then that window immediately fills up with the following message:
EXT3-fs error (device dm-0) in start_transaction: Journal has aborted
Is it a SATA disk ? I was reading a thread on WHT a few days ago about a server that kept getting its filesystem corrupted in that way, and was only solved (for good) after replacing the SATA cable.
Coincidentally, I've just had a Dell PowerEdge 2650 with SCSI disks start blurting these exact same errors to the console today.
I've given it a kick and it's come back OK; if it starts playing up again, I'll have a look at the disk seating (all SCA) etc.
Will.