Hello all,
I have a CentOS 5.7 machine hosting a 16 TB XFS partition used to house backups. The backups are run via rsync/rsnapshot and are large in terms of the number of files: over 10 million each.
Now the machine is not particularly powerful: it is a 64-bit machine with a dual-core CPU and 3 GB of RAM. So perhaps this is a factor in why I am having the following problem: once in a while that XFS partition starts generating multiple I/O errors, files that had content become 0 bytes, directories disappear, etc. A reboot fixes it every time, however. So far I've looked at the logs but could not find a cause or precipitating event.
Hence the question: has anyone experienced anything along those lines? What could be the cause of this?
Thanks.
Boris.
On Sun, Jan 22, 2012 at 9:06 AM, Boris Epstein borepstein@gmail.com wrote:
Correction to the above: the XFS partition is 26 TB, not 16 TB (not that it should matter in the context of this particular situation).
Also, here's something else I have discovered. Apparently there is potential intermittent RAID disk trouble. At least I found the following in the system log:
Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x0026): Drive ECC error reported:port=4, unit=0.
Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x002D): Source drive error occurred:port=4, unit=0.
Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x0004): Rebuild failed:unit=0.
Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: INFO (0x04:0x003B): Rebuild paused:unit=0.
...
Jan 22 09:55:23 nrims-bs kernel: 3w-9xxx: scsi6: AEN: WARNING (0x04:0x000F): SMART threshold exceeded:port=9.
Jan 22 09:55:23 nrims-bs kernel: 3w-9xxx: scsi6: AEN: WARNING (0x04:0x000F): SMART threshold exceeded:port=9.
Jan 22 09:56:17 nrims-bs kernel: 3w-9xxx: scsi6: AEN: INFO (0x04:0x000B): Rebuild started:unit=0.
Even if a disk is misbehaving in a RAID6, that should not be causing I/O errors. Plus, why does it never happen straight after a reboot, and why is it always fixed by a reboot?
Be that as it may, I am still puzzled.
Boris.
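For what it's worth, the drive the controller is flagging can also be queried in place through smartmontools' 3ware passthrough - a minimal sketch, assuming smartmontools is installed and the 3w-9xxx driver exposes the usual /dev/twa0 device node (the number after "3ware," is the port= value from the AEN messages):

smartctl -i -d 3ware,4 /dev/twa0   # identify the drive on port 4 (the one reporting ECC errors)
smartctl -a -d 3ware,4 /dev/twa0   # full SMART attributes and the drive's own error log
smartctl -H -d 3ware,9 /dev/twa0   # overall health of the port-9 drive that tripped the SMART threshold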
On 2012-01-22, Boris Epstein borepstein@gmail.com wrote:
Also, here's something else I have discovered. Apparently there is potential intermittent RAID disk trouble. At least I found the following in the system log:
Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x0026): Drive ECC error reported:port=4, unit=0.
Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x002D): Source drive error occurred:port=4, unit=0.
Which 3ware controller is this? I have had lots of problems with the 3ware 9550SX controller and WD-EA[RD]S drives in a similar configuration. (Yes, I know all about the EARS drives, but they work mostly fine with the 3ware 9650 controller, so I suspect some weird interaction between the cheap drives and the old not-so-great controller. I also suspect an intermittently failing port, which I'll be testing more later this week.)
Jan 22 09:55:23 nrims-bs kernel: 3w-9xxx: scsi6: AEN: WARNING (0x04:0x000F): SMART threshold exceeded:port=9.
Jan 22 09:55:23 nrims-bs kernel: 3w-9xxx: scsi6: AEN: WARNING (0x04:0x000F): SMART threshold exceeded:port=9.
Jan 22 09:56:17 nrims-bs kernel: 3w-9xxx: scsi6: AEN: INFO (0x04:0x000B): Rebuild started:unit=0.
What does your RAID look like? Are you using the 3ware's RAID6 (in which case it's not a 9550) or mdraid? Are the 3ware errors in the logs across a large number of ports or just a few? Have you used the drive tester for your drives to verify that they're still good? On all my other systems, when the controller has reported a failure, and I've run it through the tester, it's reported a failure. (Often when my 9550 reports a failure the drive passes all tests.)
If you happen to have real RAID drive models, you may also try contacting LSI support. They will steadfastly refuse to help if you have desktop-edition drives, but can be at least somewhat helpful if you have enterprise drives.
--keith
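A quick way to see whether the AEN errors cluster on one or two ports or are spread across many, assuming they land in /var/log/messages:

grep '3w-9xxx.*AEN' /var/log/messages | grep -o 'port=[0-9]*' | sort | uniq -c | sort -rn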
On Sun, Jan 22, 2012 at 1:34 PM, Keith Keller <kkeller@wombat.san-francisco.ca.us> wrote:
Keith, thanks!
The RAID is on the controller level. Yes, I believe the controller is a 3Ware 9xxx series - I don't recall the details right now.
What are you referring to as "drive tester"?
Boris.
On 2012-01-22, Boris Epstein borepstein@gmail.com wrote:
The RAID is on the controller level. Yes, I believe the controller is a 3Ware 9xxx series - I don't recall the details right now.
The details are important in this context--the 9550 is the problematic one (at least for me, though I've seen others with similar issues). But if it's a hardware RAID6, it's a later controller, as the 9550 doesn't support RAID6. I have had some issues with the WD-EARS drives with 96xx controllers, but much less frequently.
What are you referring to as "drive tester"?
Some drive vendors distribute their own bootable CD image, with which you can run tests specific to their drives; these can return proper error codes to help determine whether there is actually a problem with the drive. Seagate used to require that you give them the diagnostic code their tester returned before they would accept a drive for an RMA; I don't think they do that any more, but they still distribute their tester. It's a good way to get another indicator of a problem: if both the controller and the drive tester report an error, it's very likely that you have a bad drive; if the tester says the drive is fine, and does so for several drives the controller reports as failed, you can suspect something upstream of the drives (controller, port, or cabling) as the problem. (This is how I came to suspect the 9550: it would say my drives had failed, but the WD tester repeatedly said they were fine.)
The latest version of UBCD has the latest versions of these various testers; I recall WD, Seagate, and Hitachi testers, and I'm pretty sure there are others.
--keith
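If pulling drives to run a vendor boot CD is inconvenient, the drives' own built-in extended self-test can be started through the same 3ware passthrough while the array stays online - a sketch, again assuming /dev/twa0 and the port numbers from the earlier log excerpts:

smartctl -t long -d 3ware,4 /dev/twa0       # start the extended (long) self-test on the port-4 drive
smartctl -l selftest -d 3ware,4 /dev/twa0   # read back the result after the time smartctl estimates

This is weaker evidence than the vendor tester described above, but it exercises the drive's own test routines without downtime.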
Correction to the above: the XFS partition is 26 TB, not 16 TB (not that it should matter in the context of this particular situation).
Yes, it does matter:
Read this:
[CentOS] 32-bit kernel+XFS+16.xTB filesystem = potential disaster
http://lists.centos.org/pipermail/centos/2011-April/109142.html
On Sun, Jan 22, 2012 at 2:27 PM, Miguel Medalha miguelmedalha@sapo.pt wrote:
Miguel,
Thanks, but based on the uname output:
uname -a
Linux nrims-bs 2.6.18-274.12.1.el5xen #1 SMP Tue Nov 29 14:18:21 EST 2011 x86_64 x86_64 x86_64 GNU/Linux
this is clearly a 64-bit OS so the 32-bit limitations ought not to apply.
Boris.
uname -a
Linux nrims-bs 2.6.18-274.12.1.el5xen #1 SMP Tue Nov 29 14:18:21 EST 2011 x86_64 x86_64 x86_64 GNU/Linux
this is clearly a 64-bit OS so the 32-bit limitations ought not to apply.
Ok! Since you didn't inform us in your initial post, I thought I should ask you in order to eliminate that possible cause.
Nevertheless, it seems to me that you should have more than 3 GB of RAM on a 64-bit system... Since the word width is 64 bits in this case, 3 GB corresponds roughly to 1.5 GB on a 32-bit system... If you have a 64-bit system, you should give it room to work properly.
... and the fact that a reboot seems to fix the problem could also point in that direction.
On Sun, Jan 22, 2012 at 2:37 PM, Miguel Medalha miguelmedalha@sapo.pt wrote:
That is entirely possible. It does seem to me that some sort of resource accumulation is indeed occurring on the system - and I hope there is a way to stop it, because filesystem I/O should be a self-balancing process.
Boris.
On Sun, Jan 22, 2012 at 2:35 PM, Miguel Medalha miguelmedalha@sapo.pt wrote:
Don't worry, you asked exactly the right question - but, unfortunately, it is not a 32-bit OS that's the culprit here, so the situation is more involved than that.
You are right - it would indeed be desirable to have more than 3 GB of RAM on that system. However, it is not obvious to me why having that little RAM should cause I/O failures. That it would make the machine slow is to be expected - especially given that I had to jack the swap up to some 40 GB. But I do not see why I should get outright failures due solely to not having more RAM.
Boris.
If I were you, I would be monitoring the system's memory usage. Maybe some software component has a memory leak which keeps worsening until a reboot cleans it up. Also, I wouldn't rule out the possibility of a physical memory problem. Can you test it?
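A minimal way to capture that over time - a sketch, assuming a log file under /var/tmp is acceptable; it records overall memory plus the largest RSS consumers every five minutes:

while true; do
    date
    free -m
    ps aux --sort=-rss | head -12   # top memory consumers by resident size
    echo
    sleep 300
done >> /var/tmp/memwatch.log &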
On Sun, Jan 22, 2012 at 2:43 PM, Miguel Medalha miguelmedalha@sapo.pt wrote:
Miguel, thanks!
All that you are saying makes perfect sense. I have tried monitoring the system to see if any memory hogs emerge and found no obvious culprits thus far. I.e., there are processes running that consume large amounts of RAM, but none that seem to keep growing over time. Or at least I have failed to locate such processes thus far.
As for testing the RAM - it is always a good test when in doubt. Too bad you have to stop your machine in order to do it and for that reason I haven't done it yet. Though this is on the short list of things to try.
Boris.
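One thing worth checking along those lines: with rsnapshot runs walking 10+ million files, memory can end up consumed not by any process but by kernel slab caches (XFS inodes, dentries), which top/ps will not attribute to anything. A quick look, assuming slabtop (from procps) is available:

grep Slab /proc/meminfo               # total kernel slab memory
slabtop -o | head -15                 # largest slab caches; look for xfs_inode and the dentry cache
echo 2 > /proc/sys/vm/drop_caches     # (as root) drop clean dentries/inodes; if the errors stop, slab pressure is a likely suspect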
On Jan 22, 2012, at 10:00 AM, Boris Epstein borepstein@gmail.com wrote:
Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x0026): Drive ECC error reported:port=4, unit=0.
Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x002D): Source drive error occurred:port=4, unit=0.
Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x0004): Rebuild failed:unit=0.
Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: INFO (0x04:0x003B): Rebuild paused:unit=0.
From 3ware's site:
004h Rebuild failed
The 3ware RAID controller was unable to complete a rebuild operation. This error can be caused by drive errors on either the source or the destination of the rebuild. However, due to ATA drives' ability to reallocate sectors on write errors, the rebuild failure is most likely caused by the source drive of the rebuild detecting some sort of read error. The default operation of the 3ware RAID controller is to abort a rebuild if an error is encountered. If it is desired to continue on error, you can set the Continue on Source Error During Rebuild policy for the unit on the Controller Settings page in 3DM.
026h Drive ECC error reported
This AEN may be sent when a drive returns the ECC error response to a 3ware RAID controller command. The AEN may or may not be associated with a host command. Internal operations such as Background Media Scan post this AEN whenever drive ECC errors are detected.
Drive ECC errors are an indication of a problem with grown defects on a particular drive. For redundant arrays, this typically means that dynamic sector repair would be invoked (see AEN 023h). For non-redundant arrays (JBOD, RAID 0 and degraded arrays), drive ECC errors result in the 3ware RAID controller returning failed status to the associated host command.
Sounds awfully like a hardware error on one of the drives. Replace the failed drive and try rebuilding.
-Ross
On Jan 22, 2012, at 4:41 PM, Ross Walker rswwalker@gmail.com wrote:
This error code does not bode well.
02Dh Source drive error occurred
If an error is encountered during a rebuild operation, this AEN is generated if the error was on a source drive of the rebuild. Knowing if the error occurred on the source or the destination of the rebuild is useful for troubleshooting.
It's possible the whole RAID6 is corrupt.
-Ross
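To see where the unit actually stands now (OK, degraded, still rebuilding) and to replay the controller's stored alarm history, tw_cli can be used - assuming the 3ware CLI is installed; /c6 below is a guess chosen to match scsi6 in the logs, and "tw_cli show" will list the real controller numbers:

tw_cli show              # list controllers and their numbers
tw_cli /c6 show          # unit status (OK / DEGRADED / REBUILDING) and per-port drive status
tw_cli /c6 show alarms   # the controller's stored AEN history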
Now the machine is not particularly powerful: it is a 64-bit machine with a dual-core CPU and 3 GB of RAM. So perhaps this is a factor in why I am having the following problem: once in a while that XFS partition starts generating multiple I/O errors, files that had content become 0 bytes, directories disappear, etc. A reboot fixes it every time, however. So far I've looked at the logs but could not find a cause or precipitating event.
Is the CentOS you are running a 64 bit one?
The reason I am asking this is because the use of XFS under a 32 bit OS is NOT recommended. If you search this list's archives you will find some discussion about this subject.
I have a CentOS 5.7 machine hosting a 16 TB XFS partition used to house backups. The backups are run via rsync/rsnapshot and are large in terms of the number of files: over 10 million each.
Now the machine is not particularly powerful: it is a 64-bit machine with a dual-core CPU and 3 GB of RAM. So perhaps this is a factor in why I am having the following problem: once in a while that XFS partition starts generating multiple I/O errors, files that had content become 0 bytes, directories disappear, etc. A reboot fixes it every time, however. So far I've looked at the logs but could not find a cause or precipitating event.
Hence the question: has anyone experienced anything along those lines? What could be the cause of this?
In every situation like this that I have seen, it was hardware that never had adequate memory provisioned.
Another consideration is that you almost certainly won't be able to run a repair on that fs with so little RAM.
Finally, it would be interesting to know how you architected the storage hardware: hardware RAID, BBU, drive cache status, barrier status, etc.
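Most of that can be gathered without touching the array - a sketch, assuming tw_cli is installed and using /backup as a placeholder mount point (the controller number may differ from /c6):

tw_cli /c6/u0 show all    # unit type (RAID6), status, and write-cache policy
tw_cli /c6/bbu show all   # battery backup unit status, if one is fitted
grep xfs /proc/mounts     # mount options; 'nobarrier' would show up here
xfs_info /backup          # filesystem geometry (agcount, sunit/swidth, etc.)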
On Sun, Jan 22, 2012 at 2:56 PM, Joseph L. Casale <jcasale@activenetwerx.com> wrote:
Joseph,
If I remember correctly I pretty much went with the defaults when I created this XFS on top of a 16-drive RAID6 configuration.
Now as far as memory goes - I think for the purpose of an XFS repair, RAM and swap ought to be interchangeable, and I've got plenty of swap on this system. I also host a 5 TB XFS in a file there; I ran xfs_repair on it and it completed in no more than 5 minutes. That is roughly 20% of the size of the larger XFS.
I should try to collect the info you mentioned, though - that was a good thought; some clue might well be contained in there.
Thanks for your input.
Boris.
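On the repair question, xfs_repair has a no-modify mode that walks the filesystem and reports problems without writing anything, which gives a feel for how long (and roughly how much memory) a real repair would need - a sketch with a placeholder device name and mount point:

umount /backup                 # the filesystem must be unmounted first
xfs_repair -n /dev/sdb1        # no-modify mode: check only; exits non-zero if repairs would be needed

A full repair of a 26 TB filesystem with 10+ million inodes can need several GB of memory; swap will let it finish, but expect it to take far longer than the 5 TB test case suggests.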