I recently set up a new system to run BackupPC on CentOS 5, with the archive stored on a RAID1 of 750 gig SATA drives created with 3 members, one of them specified as "missing". Once a week I add the 3rd partition, let it sync, then remove it. I've had a similar system working for a long time using a firewire drive as the 3rd member, so I don't think the RAID setup is the cause of the problem. I may have had problems with the drive power connectors initially, but I think that is fixed now, and I can't see any hardware errors being logged (the system/log files are on different drives).
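For what it's worth, the array was created roughly like this (the device names here are only illustrative, not the exact partitions on this box):

    # 3-way RAID1 with the third slot left empty until the offsite disk is attached
    mdadm --create /dev/md3 --level=1 --raid-devices=3 /dev/sda3 /dev/sdb3 missing
    mkfs.ext3 /dev/md3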
About once a week, I get an error like this, and the partition switches to read-only.
---
Feb 24 04:48:20 linbackup1 kernel: EXT3-fs error (device md3): htree_dirblock_to_tree: bad entry in directory #869973: directory entry across blocks - offset=0, inode=3915132787, rec_len=42464, name_len=11
Feb 24 04:48:20 linbackup1 kernel: Aborting journal on device md3.
Feb 24 04:48:20 linbackup1 kernel: ext3_abort called.
Feb 24 04:48:20 linbackup1 kernel: EXT3-fs error (device md3): ext3_journal_start_sb: Detected aborted journal
Feb 24 04:48:20 linbackup1 kernel: Remounting filesystem read-only
Feb 24 04:48:33 linbackup1 kernel: EXT3-fs error (device md3): htree_dirblock_to_tree: bad entry in directory #4212181: rec_len % 4 != 0 - offset=0, inode=4054525677, rec_len=1183, name_len=121
----
'fsck -y' seems to fix it up, but it keeps happening. Is this likely to be leftover cruft from the hardware issues, or are there problems in the ext3/RAID1/SATA drivers? The way BackupPC stores data, with millions of hardlinks in the archive, it isn't really practical to copy it off, reformat, and start over.
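The recovery each time is just a matter of unmounting and letting fsck loose on the md device; /backup below is only a stand-in for the real mount point:

    umount /backup
    fsck -y /dev/md3        # or e2fsck -f -y /dev/md3 to force a full check
    mount /backup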
On Mon, 2008-02-25 at 14:04 -0600, Les Mikesell wrote:
I recently set up a new system to run BackupPC on CentOS 5, with the archive stored on a RAID1 of 750 gig SATA drives created with 3 members, one of them specified as "missing". Once a week I add the 3rd partition, let it sync, then remove it. I've had a similar system working for a long time using a firewire drive as the 3rd member, so I don't think the RAID setup is the cause of the problem. I may have had problems with the drive power connectors initially, but I think that is fixed now, and I can't see any hardware errors being logged (the system/log files are on different drives).
About once a week, I get an error like this, and the partition switches to read-only.
Feb 24 04:48:20 linbackup1 kernel: EXT3-fs error (device md3): htree_dirblock_to_tree: bad entry in directory #869973: directory entry across blocks - offset=0, inode=3915132787, rec_len=42464, name_len=11
Feb 24 04:48:20 linbackup1 kernel: Aborting journal on device md3.
Feb 24 04:48:20 linbackup1 kernel: ext3_abort called.
Feb 24 04:48:20 linbackup1 kernel: EXT3-fs error (device md3): ext3_journal_start_sb: Detected aborted journal
Feb 24 04:48:20 linbackup1 kernel: Remounting filesystem read-only
Feb 24 04:48:33 linbackup1 kernel: EXT3-fs error (device md3): htree_dirblock_to_tree: bad entry in directory #4212181: rec_len % 4 != 0 - offset=0, inode=4054525677, rec_len=1183, name_len=121
'fsck -y' seems to fix it up, but it keeps happening. Is this likely to be leftover cruft from the hardware issues, or are there problems in the ext3/RAID1/SATA drivers? The way BackupPC stores data, with millions of hardlinks in the archive, it isn't really practical to copy it off, reformat, and start over.
If you use cpio, it can handle the hard links intelligently, IIRC. That may make this more feasible. Plus, you can pass options like -depth to the find command feeding cpio so that even directories end up with good dates.
You can also suppress atime updates, making it both faster and non-intrusive.
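A rough sketch of what I mean, if I remember the options right (the paths are only placeholders; GNU cpio's pass-through mode should recreate the hard links it sees instead of copying each name separately, and -a resets the access times):

    cd /var/lib/backuppc                                       # source pool, path is just an example
    find . -depth -print0 | cpio --null -pdm -a /mnt/newpool   # destination is just an example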
HTH
William L. Maltby wrote:
'fsck -y' seems to fix it up, but it keeps happening. Is this likely to be leftover cruft from the hardware issues, or are there problems in the ext3/RAID1/SATA drivers? The way BackupPC stores data, with millions of hardlinks in the archive, it isn't really practical to copy it off, reformat, and start over.
If you use cpio, it can handle the hard links intelligently, IIRC. That may make this more feasible. Plus, you can pass options like -depth to the find command feeding cpio so that even directories end up with good dates.
Handling them intelligently and in a reasonable amount of time are 2 different things. The last time I tried to copy a backuppc archive much smaller than this I gave up after 3 days - and I've tried most of the possible file-oriented ways to do it, including cpio. Normally I raid-mirror to another drive and remove it for offsite copies, but if there are filesystem errors that fsck won't fix, they are going to be propagated in an image copy.
On Mon, 2008-02-25 at 18:11 -0600, Les Mikesell wrote:
William L. Maltby wrote:
<snip>
If you use cpio, it can handle the hard links intelligently, IIRC. That may make this more feasible. Plus, you can pass options like -depth to the find command feeding cpio so that even directories end up with good dates.
Handling them intelligently and in a reasonable amount of time are 2 different things. The last time I tried to copy a backuppc archive much smaller than this I gave up after 3 days - and I've tried most of the possible file-oriented ways to do it, including cpio.
Do you remember if you used the --link or -l parameter? That's the one that says to hard-link when possible rather than copy. It should prevent multiple copies of the same file when multiple hard links reference it, and it should be faster than copying if there are lots of hard links.
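Something like this is what I had in mind (paths are placeholders; note that -l can only make links when the source and destination are on the same filesystem, otherwise it falls back to copying):

    find . -depth -print0 | cpio --null -pdlm /some/other/dir   # -l = hard-link instead of copy when possible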
<snip>
Les Mikesell lesmikesell@gmail.com writes:
'fsck -y' seems to fix it up, but it keeps happening. Is this likely to be leftover cruft from the hardware issues, or are there problems in the ext3/RAID1/SATA drivers? The way BackupPC stores data, with millions of hardlinks in the archive, it isn't really practical to copy it off, reformat, and start over.
Maybe a memory problem:
http://thread.gmane.org/gmane.comp.file-systems.ext3.user/3457/focus=3459
Nicolas KOWALSKI wrote:
Les Mikesell lesmikesell@gmail.com writes:
'fsck -y' seems to fix it up, but it keeps happening. Is this likely to be leftover cruft from the hardware issues, or are there problems in the ext3/RAID1/SATA drivers? The way BackupPC stores data, with millions of hardlinks in the archive, it isn't really practical to copy it off, reformat, and start over.
Maybe a memory problem:
http://thread.gmane.org/gmane.comp.file-systems.ext3.user/3457/focus=3459
Back to this problem again. I did a new mkfs.ext3 and ran more than a week before hitting this again:
Mar 14 04:12:29 linbackup1 kernel: md3: rw=0, want=14439505280, limit=1465143808
Mar 14 04:12:29 linbackup1 kernel: EXT3-fs error (device md3): ext3_readdir: directory #34079247 contains a hole at offset 0
Mar 14 04:12:29 linbackup1 kernel: Aborting journal on device md3.
Mar 14 04:12:29 linbackup1 kernel: md3: rw=0, want=5260961472, limit=1465143808
Mar 14 04:12:29 linbackup1 kernel: EXT3-fs error (device md3): ext3_readdir: directory #34079247 contains a hole at offset 4096
I don't see any hardware related errors, and the rest of the filesystems all seem fine, although this is the one that is busy.
Can this be related to being on a 3-member RAID1 that normally runs with one device missing? I've run a different one that way for a couple of years on earlier kernels.
Will it hurt anything to mount the underlying partition of one of the drives directly for a while instead of using the md device?
Les Mikesell wrote:
Nicolas KOWALSKI wrote:
Les Mikesell lesmikesell@gmail.com writes:
'fsck -y' seems to fix it up, but it keeps happening. Is this likely to be leftover cruft from the hardware issues, or are there problems in the ext3/RAID1/SATA drivers? The way BackupPC stores data, with millions of hardlinks in the archive, it isn't really practical to copy it off, reformat, and start over.
Maybe a memory problem:
http://thread.gmane.org/gmane.comp.file-systems.ext3.user/3457/focus=3459
Back to this problem again. I did a new mkfs.ext3 and ran more than a week before hitting this again:
Mar 14 04:12:29 linbackup1 kernel: md3: rw=0, want=14439505280, limit=1465143808
Mar 14 04:12:29 linbackup1 kernel: EXT3-fs error (device md3): ext3_readdir: directory #34079247 contains a hole at offset 0
Mar 14 04:12:29 linbackup1 kernel: Aborting journal on device md3.
Mar 14 04:12:29 linbackup1 kernel: md3: rw=0, want=5260961472, limit=1465143808
Mar 14 04:12:29 linbackup1 kernel: EXT3-fs error (device md3): ext3_readdir: directory #34079247 contains a hole at offset 4096
I don't see any hardware related errors, and the rest of the filesystems all seem fine, although this is the one that is busy.
Is your memory ECC? If not, then a memory problem can fly under the radar.
Can this be related to being on a 3-member RAID1 that normally runs with one device missing? I've run a different one that way for a couple of years on earlier kernels.
I haven't seen any other dm-raid problems, and dm-raid is quite mature at this point. I won't say it isn't possible. Can you try running with just 2 drives for a while after this fsck and see if it happens again?
Will it hurt anything to mount the underlying partition of one of the drives directly for a while instead of using the md device?
I don't know. It depends on how dm-raid keeps its bitmap and metadata. If it's at the end, then it should work; if it's at the beginning, then you'd have to offset the mount (carefully).
You will need to be very careful when messing with the partition table to change its type, and if you recreate the RAID1 again with existing data on it (I don't have a procedure for that).
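You could check where the metadata sits with something like this (the device name is only an example):

    mdadm --examine /dev/sdb3    # shows the superblock version

If I remember right, the CentOS 5 default is the 0.90 format, which lives at the end of the device, so the filesystem itself starts at offset 0; the 1.1/1.2 formats sit near the start.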
-Ross
Ross S. W. Walker wrote:
Back to this problem again. I did a new mkfs.ext3 and ran more than a week before hitting this again:
Mar 14 04:12:29 linbackup1 kernel: md3: rw=0, want=14439505280, limit=1465143808
Mar 14 04:12:29 linbackup1 kernel: EXT3-fs error (device md3): ext3_readdir: directory #34079247 contains a hole at offset 0
Mar 14 04:12:29 linbackup1 kernel: Aborting journal on device md3.
Mar 14 04:12:29 linbackup1 kernel: md3: rw=0, want=5260961472, limit=1465143808
Mar 14 04:12:29 linbackup1 kernel: EXT3-fs error (device md3): ext3_readdir: directory #34079247 contains a hole at offset 4096
I don't see any hardware related errors, and the rest of the filesystems all seem fine, although this is the one that is busy.
Is your memory ECC? If not, then a memory problem can fly under the radar.
dmidecode says single-bit ECC
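That came from something along the lines of:

    dmidecode -t memory | grep -i 'error correction'    # reports: Error Correction Type: Single-bit ECC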
Can this be related to being on a 3-member RAID1 that normally runs with one device missing? I've run a different one that way for a couple of years on earlier kernels.
I haven't seen any other dm-raid problems, and dm-raid is quite mature at this point. I won't say it isn't possible. Can you try running with just 2 drives for a while after this fsck and see if it happens again?
I normally run with only 2. I add the 3rd once a week long enough to sync, then unmount the partition long enough to fail and remove the 3rd, then rotate it offsite. The times it has had problems, there have only been 2 active partitions.
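The rotation is essentially (device names and mount point are examples, not the real ones):

    mdadm /dev/md3 --add /dev/sdc3     # attach the offsite disk's partition; it resyncs as the 3rd mirror
    cat /proc/mdstat                   # wait for the resync to finish
    umount /backup                     # quiesce the filesystem briefly
    mdadm /dev/md3 --fail /dev/sdc3
    mdadm /dev/md3 --remove /dev/sdc3
    mount /backup                      # the removed disk then goes offsite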
Will it hurt anything to mount the underlying partition of one of the drives directly for a while instead of using the md device?
I don't know. It depends on how dm-raid keeps its bitmap and metadata. If it's at the end, then it should work; if it's at the beginning, then you'd have to offset the mount (carefully).
You will need to be very careful when messing with the partition table to change its type, and if you recreate the RAID1 again with existing data on it (I don't have a procedure for that).
I can mount the underlying partition without changing its type and it appears to work. I do that regularly to test the offsite copy, but I have always later wiped it with a new sync from the live set, so I don't know if mounting it directly does any harm to its later use as an md device.
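The test is just a plain read-only mount of the member partition, something like:

    mount -o ro /dev/sdc3 /mnt/offsite-test    # device and mount point are examples
    ls /mnt/offsite-test                       # spot-check the archive
    umount /mnt/offsite-test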
Les Mikesell wrote:
Back to this problem again. I did a new mkfs.ext3 and ran more than a week before hitting this again:
Mar 14 04:12:29 linbackup1 kernel: md3: rw=0, want=14439505280, limit=1465143808
Mar 14 04:12:29 linbackup1 kernel: EXT3-fs error (device md3): ext3_readdir: directory #34079247 contains a hole at offset 0
Mar 14 04:12:29 linbackup1 kernel: Aborting journal on device md3.
Mar 14 04:12:29 linbackup1 kernel: md3: rw=0, want=5260961472, limit=1465143808
Mar 14 04:12:29 linbackup1 kernel: EXT3-fs error (device md3): ext3_readdir: directory #34079247 contains a hole at offset 4096
I don't see any hardware related errors, and the rest of the filesystems all seem fine, although this is the one that is busy.
Is your memory ECC? If not, then a memory problem can fly under the radar.
dmidecode says single-bit ECC
Just to clear up this old thread, the problem did turn out to be memory, but it took most of a day's run of memtest86 to find it, and even then it only reported soft errors. After replacing the RAM, everything has been fine for several weeks.
Les Mikesell lesmikesell@gmail.com writes:
Can this be related to being on a 3-member RAID1 that normally runs with one device missing? I've run a different one that way for a couple of years on earlier kernels.
Well, I also found this one:
http://thread.gmane.org/gmane.linux.raid/6455/focus=6908
Is your machine a "cheap" one?
Nicolas KOWALSKI wrote:
Can this be related to being on a 3-member RAID1 that normally runs with one device missing? I've run a different one that way for a couple of years on earlier kernels.
Well, I also found this one:
http://thread.gmane.org/gmane.linux.raid/6455/focus=6908
Is your machine a "cheap" one?
The motherboard is a few years old but was decent originally. The SATA drives and controller are recent, and if it were their fault I'd expect driver-level errors, not just filesystem issues. Hmmm, the disks are in a removable-tray carrier that is also somewhat old. I can shift them to a newer trayless cage to see if that makes a difference, but when problems only happen once a week it is hard to tell when, or if, they are fixed.