On Sun, Jan 22, 2012 at 9:06 AM, Boris Epstein <borepstein at gmail.com> wrote:

> Hello all,
>
> I have a CentOS 5.7 machine hosting a 16 TB XFS partition used to house
> backups. The backups are run via rsync/rsnapshot and are large in terms of
> the number of files: over 10 million each.
>
> Now the machine is not particularly powerful: it is 64-bit machine, dual
> core CPU, 3 GB RAM. So perhaps this is a factor in why I am having the
> following problem: once in awhile that XFS partition starts generating
> multiple I/O errors, files that had content become 0 byte, directories
> disappear, etc. Every time a reboot fixes that, however. So far I've looked
> at logs but could not find a cause of precipitating event.
>
> Hence the question: has anyone experienced anything along those lines?
> What could be the cause of this?
>
> Thanks.
>
> Boris.
>

Correction to the above: the XFS partition is 26 TB, not 16 TB (not that it
should matter in the context of this particular situation).

Also, here's something else I have discovered. Apparently there is potential
intermittent RAID disk trouble. At least, I found the following in the system
log:

Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x0026): Drive ECC error reported:port=4, unit=0.
Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x002D): Source drive error occurred:port=4, unit=0.
Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x0004): Rebuild failed:unit=0.
Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: INFO (0x04:0x003B): Rebuild paused:unit=0.
...
Jan 22 09:55:23 nrims-bs kernel: 3w-9xxx: scsi6: AEN: WARNING (0x04:0x000F): SMART threshold exceeded:port=9.
Jan 22 09:55:23 nrims-bs kernel: 3w-9xxx: scsi6: AEN: WARNING (0x04:0x000F): SMART threshold exceeded:port=9.
Jan 22 09:56:17 nrims-bs kernel: 3w-9xxx: scsi6: AEN: INFO (0x04:0x000B): Rebuild started:unit=0.

Even if a disk is misbehaving in a RAID6, that should not be causing I/O
errors. Plus, why does the problem never appear straight after a reboot, and
why is it always fixed by a reboot?

Be that as it may, I am still puzzled.

Boris.