Hi all.
I have a Dell SC440 running CentOS 4.4. It has two 500GB disks in a RAID1 array using Linux software RAID (md1 is / and md0 is /boot). Recently the root file system was remounted read-only for some reason. The logs don't show anything unusual; presumably the file system was already read-only before anything could be logged. Running dmesg showed this error repeated many times:
EXT3-fs error (device md1) in start_transaction: Journal has aborted
Before rebooting the system I checked the raid array status with cat /proc/mdstat and mdadm. No errors were shown. When rebooting the system I let fsck do a check of the file systems. It reported several errors and I let it fix them. The thing that's bothering me is I can't find any reason for the EXT3 error. Nothing seems to be wrong with the array or the hardware.
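For reference, a sketch of how I read the array status; the /proc/mdstat excerpt below is a made-up example of a healthy two-disk mirror, not output from my box:

```shell
# Hypothetical /proc/mdstat excerpt; on the live box you would read
# /proc/mdstat and run `mdadm --detail /dev/md1` directly.
mdstat_sample='md1 : active raid1 sdb2[1] sda2[0]
      488279488 blocks [2/2] [UU]'

# [UU] means both mirror members are up; [U_] or [_U] means one has
# been kicked out of the array.
if printf '%s\n' "$mdstat_sample" | grep -q '\[U_\]\|\[_U\]'; then
  status="degraded"
else
  status="healthy"
fi
echo "$status"
```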
I also noticed a few worrying messages while booting up and shutting down. On boot-up mdadm complains that super-minor should only be declared once, and then there's a message that "File descriptor 21 is open". On shutdown there are messages that md0 and md1 are in immediate safe mode and that md1 is still in use.
Is my system totally hosed or can I ignore these warnings?
Thanks, Rasmus
On Mar 29, 2007, at 6:36, Rasmus Back wrote:
I have a Dell SC440 running CentOS 4.4. It has two 500GB disks in a RAID1 array using Linux software RAID (md1 is / and md0 is /boot). Recently the root file system was remounted read-only for some reason. The logs don't show anything unusual; presumably the file system was already read-only before anything could be logged. Running dmesg showed this error repeated many times:
EXT3-fs error (device md1) in start_transaction: Journal has aborted
I had the exact same error 9 months or so ago (look for a similarly titled thread in the archives). It was a disk going bad. Get all the data you need off it now and replace the disk ASAP. It may run for a few days or weeks before it gets remounted read-only, but eventually you will lose some data.
Alfred
On 3/29/07, Alfred von Campe alfred@110.net wrote:
On Mar 29, 2007, at 6:36, Rasmus Back wrote:
I have a Dell SC440 running CentOS 4.4. It has two 500GB disks in a RAID1 array using Linux software RAID (md1 is / and md0 is /boot). Recently the root file system was remounted read-only for some reason. The logs don't show anything unusual; presumably the file system was already read-only before anything could be logged. Running dmesg showed this error repeated many times:
EXT3-fs error (device md1) in start_transaction: Journal has aborted
I had the exact same error 9 months or so ago (look for a similarly titled thread in the archives). It was a disk going bad. Get all the data you need off it now and replace the disk ASAP. It may run for a few days or weeks before it gets remounted read-only, but eventually you will lose some data.
Hi Alfred.
Thanks for the pointer! The SMART logs for my drives don't show any errors, but I'll start a long self-test just to be sure. Although if it is a failing hard drive, the RAID driver should kick it out of the array. Your system was a laptop with just one drive, right?
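For anyone following along, the commands are roughly these; the self-test log line below is a hypothetical example of a clean result, not real output from my drives:

```shell
# Start a long self-test, then read its log once it finishes
# (device path is an example):
#   smartctl -t long /dev/sda
#   smartctl -l selftest /dev/sda
# A hypothetical line from the self-test log, checked the way one
# would check it by eye:
log_line='# 1  Extended offline    Completed without error       00%      1437         -'
case "$log_line" in
  *'Completed without error'*) verdict="passed" ;;
  *)                           verdict="suspect" ;;
esac
echo "$verdict"
```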
Rasmus Back wrote:
On 3/29/07, Alfred von Campe alfred@110.net wrote:
On Mar 29, 2007, at 6:36, Rasmus Back wrote:
I have a Dell SC440 running CentOS 4.4. It has two 500GB disks in a RAID1 array using Linux software RAID (md1 is / and md0 is /boot). Recently the root file system was remounted read-only for some reason. The logs don't show anything unusual; presumably the file system was already read-only before anything could be logged. Running dmesg showed this error repeated many times:
EXT3-fs error (device md1) in start_transaction: Journal has aborted
I had the exact same error 9 months or so ago (look for a similarly titled thread in the archives). It was a disk going bad. Get all the data you need off it now and replace the disk ASAP. It may run for a few days or weeks before it gets remounted read-only, but eventually you will lose some data.
Hi Alfred.
Thanks for the pointer! The SMART logs for my drives don't show any errors, but I'll start a long self-test just to be sure. Although if it is a failing hard drive, the RAID driver should kick it out of the array. Your system was a laptop with just one drive, right?
_______________________________________________
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos
There is a known bug in the mpt SCSI driver which causes exactly that behaviour. We got bitten by it running VMware ESX virtual machines with CentOS 4.4 and RHEL 4.4 in them. ESX uses the mpt driver by default, so as far as my understanding goes, even if your box does not use the RAID you could still get the error. It is explained in the links below.
Here are some useful links:
http://www.tuxyturvy.com/blog/index.php?/archives/31-VMware-ESX-and-ext3-jou...
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=197158
http://www.vmware.com/community/thread.jspa?threadID=58121
I have 10 or so heavily used RHEL 4.4 boxes, and at least one box would do this at least once a week. I applied the patch and have not seen the problem again.
Hope this helps.
Brian.
On 3/29/07, Centos-admin redhat@mckerrs.net wrote:
Rasmus Back wrote:
On 3/29/07, Alfred von Campe alfred@110.net wrote:
On Mar 29, 2007, at 6:36, Rasmus Back wrote:
I have a Dell SC440 running CentOS 4.4. It has two 500GB disks in a RAID1 array using Linux software RAID (md1 is / and md0 is /boot). Recently the root file system was remounted read-only for some reason. The logs don't show anything unusual; presumably the file system was already read-only before anything could be logged. Running dmesg showed this error repeated many times:
EXT3-fs error (device md1) in start_transaction: Journal has aborted
I had the exact same error 9 months or so ago (look for a similarly titled thread in the archives). It was a disk going bad. Get all the data you need off it now and replace the disk ASAP. It may run for a few days or weeks before it gets remounted read-only, but eventually you will lose some data.
Hi Alfred.
Thanks for the pointer! The SMART logs for my drives don't show any errors, but I'll start a long self-test just to be sure. Although if it is a failing hard drive, the RAID driver should kick it out of the array. Your system was a laptop with just one drive, right?
There is a known bug in the mpt SCSI driver which causes exactly that behaviour. We got bitten by it running VMware ESX virtual machines with CentOS 4.4 and RHEL 4.4 in them. ESX uses the mpt driver by default, so as far as my understanding goes, even if your box does not use the RAID you could still get the error. It is explained in the links below.
Here are some useful links:
http://www.tuxyturvy.com/blog/index.php?/archives/31-VMware-ESX-and-ext3-jou...
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=197158
http://www.vmware.com/community/thread.jspa?threadID=58121
I have 10 or so heavily used RHEL 4.4 boxes, and at least one box would do this at least once a week. I applied the patch and have not seen the problem again.
Hi Brian,
Thanks a million for the links; my system does use the mpt driver (at least according to lspci and lsmod). This would at least give an explanation for the failure. Do you know if the problem is fixed in RHEL 5? The Red Hat bugzilla said that something was changed in the mpt driver in 2.6.14, but it wasn't clear whether those changes solved the problem. I might upgrade to CentOS 5 when it's available anyway.
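For anyone else checking their box, this is roughly how I looked for the driver; the lsmod excerpt below is a made-up example, not output from my system:

```shell
# Hypothetical lsmod excerpt; on a live box pipe `lsmod` itself.
lsmod_sample='mptscsih               40324  2
mptbase                89185  2 mptscsih'

# Any module name starting with "mpt" means the mpt driver stack
# is in use.
if printf '%s\n' "$lsmod_sample" | grep -q '^mpt'; then
  mpt_loaded="yes"
else
  mpt_loaded="no"
fi
echo "mpt driver loaded: $mpt_loaded"
```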
Rasmus
On Mar 29, 2007, at 7:22, Rasmus Back wrote:
Thanks for the pointer! The SMART logs for my drives don't show any errors, but I'll start a long self-test just to be sure. Although if it is a failing hard drive, the RAID driver should kick it out of the array. Your system was a laptop with just one drive, right?
It was actually a single SATA drive on a desktop system, but yes, it was not in a RAID configuration. Are you running the self-test on both drives? It would be interesting to see the results of that...
Alfred
On 3/29/07, Alfred von Campe alfred@110.net wrote:
On Mar 29, 2007, at 7:22, Rasmus Back wrote:
Thanks for the pointer! The SMART logs for my drives don't show any errors, but I'll start a long self-test just to be sure. Although if it is a failing hard drive, the RAID driver should kick it out of the array. Your system was a laptop with just one drive, right?
It was actually a single SATA drive on a desktop system, but yes, it was not in a RAID configuration. Are you running the self-test on both drives? It would be interesting to see the results of that...
Well, this is interesting: smartctl -H /dev/sda returns "SMART Health Status: OK", but when I try to start a test with smartctl -t long /dev/sda I immediately get "Extended Background Self Test Failed". Apparently my drives don't work with smartctl, since smartctl -l error /dev/sda returns:
Error Counter logging not supported
Error Events logging not supported
I checked the BIOS, and SMART reporting is turned on, although from the description of the option it sounds like it only controls whether failures are reported during BIOS boot-up.
Rasmus
Rasmus Back wrote:
On 3/29/07, Alfred von Campe alfred@110.net wrote:
On Mar 29, 2007, at 7:22, Rasmus Back wrote:
Thanks for the pointer! The SMART logs for my drives don't show any errors, but I'll start a long self-test just to be sure. Although if it is a failing hard drive, the RAID driver should kick it out of the array. Your system was a laptop with just one drive, right?
It was actually a single SATA drive on a desktop system, but yes, it was not in a RAID configuration. Are you running the selftest on both drives? It would be interesting to see the results of that...
Well, this is interesting: smartctl -H /dev/sda returns "SMART Health Status: OK", but when I try to start a test with smartctl -t long /dev/sda I immediately get "Extended Background Self Test Failed". Apparently my drives don't work with smartctl, since smartctl -l error /dev/sda returns:
Error Counter logging not supported
Error Events logging not supported
I checked the BIOS, and SMART reporting is turned on, although from the description of the option it sounds like it only controls whether failures are reported during BIOS boot-up.
Rasmus
Rasmus, I don't know what motherboard and kernel you are using, but I found that with the stock CentOS 4.4 kernel-utils package (which includes smartctl), smartctl did not work with my SATA II drive. So I downloaded, built, and installed smartmontools-5.37, which did work. I am also running kernel 2.6.20.1 -- not sure if that is a factor. Plus my SATA driver module is sata_nv (Nvidia).
--peter gross
Peter Gross wrote:
Rasmus Back wrote:
Rasmus, I don't know what motherboard and kernel you are using, but I found that with the stock CentOS 4.4 kernel-utils package (which includes smartctl), smartctl did not work with my SATA II drive. So I downloaded, built, and installed smartmontools-5.37, which did work. I am also running kernel 2.6.20.1 -- not sure if that is a factor. Plus my SATA driver module is sata_nv (Nvidia).
--peter gross
You can also use the '-d ata' switch to smartctl to get it to work with SATA devices. To wit:
[root@jaybird ~]# smartctl -H /dev/sda
smartctl version 5.33 [i686-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
Request Sense failed, [Input/output error]
Now the same command with the '-d ata' switch:
[root@jaybird ~]# smartctl -H -d ata /dev/sda
smartctl version 5.33 [i686-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
I picked this up from a thread a while back on this list.
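Once smartctl works with '-d ata', the same switch can go into /etc/smartd.conf for ongoing monitoring. A sketch (the device names, schedule, and mail address are just examples to adapt):

```
# /etc/smartd.conf sketch: monitor both drives (-a), enable automatic
# offline testing (-o) and attribute autosave (-S), run a long
# self-test every Saturday at 3am (-s), and mail root on trouble (-m).
/dev/sda -d ata -a -o on -S on -s L/../../6/03 -m root
/dev/sdb -d ata -a -o on -S on -s L/../../6/03 -m root
```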
Hope that helps!
Hi-
Rasmus Back:
I have a Dell SC440 running CentOS 4.4. It has two 500GB disks in a RAID1 array using Linux software RAID (md1 is / and md0 is /boot). Recently the root file system was remounted read-only for some reason.
I had the very same behavior on a RAID1 set on an Areca controller (but no software RAID or VMware). Scary.
I believe if you have WD drives, it could also be a hardware issue... look:
http://www.theinquirer.net/default.aspx?article=37188
http://support.wdc.com/download/index.asp?cxml=n&pid=15&swid=57
Affects 500GB drives. In that particular situation, dmesg said:
EXT3-fs error gone raid volume device dm-0 error reading directory
Then the / volume remounted itself read-only while also reporting 100% usage.
On 3/29/07, centos44 centos44@hastek.net wrote:
Hi-
Rasmus Back:
I have a Dell SC440 running CentOS 4.4. It has two 500GB disks in a RAID1 array using Linux software RAID (md1 is / and md0 is /boot). Recently the root file system was remounted read-only for some reason.
I had the very same behavior on a RAID1 set on an Areca controller (but no software RAID or VMware). Scary.
I believe if you have WD drives, it could also be a hardware issue... look:
http://www.theinquirer.net/default.aspx?article=37188
http://support.wdc.com/download/index.asp?cxml=n&pid=15&swid=57
Affects 500GB drives. In that particular situation, dmesg said:
EXT3-fs error gone raid volume device dm-0 error reading directory
Then the / volume remounted itself read-only while also reporting 100% usage.
I think I might be safe from the WD drive problem; my drives appear to be Hitachi Deskstars (model HDS725050KLA360). But isn't the Areca driver different from the mpt one? So it would seem other drivers have the same problem as well.
Rasmus