[CentOS] weird XFS problem

Sun Jan 22 15:00:07 UTC 2012

On Sun, Jan 22, 2012 at 9:06 AM, Boris Epstein <borepstein at gmail.com> wrote:

> Hello all,
>
> I have a CentOS 5.7 machine hosting a 16 TB XFS partition used to house
> backups. The backups are run via rsync/rsnapshot and are large in terms of
> the number of files: over 10 million each.
>
> Now the machine is not particularly powerful: it is a 64-bit machine, dual
> core CPU, 3 GB RAM. So perhaps this is a factor in why I am having the
> following problem: once in a while that XFS partition starts generating
> multiple I/O errors, files that had content become 0 bytes, directories
> disappear, etc. Every time a reboot fixes that, however. So far I've looked
> at logs but could not find a cause or precipitating event.
>
> Hence the question: has anyone experienced anything along those lines?
> What could be the cause of this?
>
> Thanks.
>
> Boris.
>

Correction to the above: the XFS partition is 26 TB, not 16 TB (not that it
should matter in the context of this particular situation).

Also, here's something else I have discovered. Apparently there is a
potential intermittent RAID disk problem. At least I found the following in
the system log:

Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x0026): Drive ECC error reported:port=4, unit=0.
Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x002D): Source drive error occurred:port=4, unit=0.
Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x0004): Rebuild failed:unit=0.
Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: INFO (0x04:0x003B): Rebuild paused:unit=0.

...

Jan 22 09:55:23 nrims-bs kernel: 3w-9xxx: scsi6: AEN: WARNING (0x04:0x000F): SMART threshold exceeded:port=9.
Jan 22 09:55:23 nrims-bs kernel: 3w-9xxx: scsi6: AEN: WARNING (0x04:0x000F): SMART threshold exceeded:port=9.
Jan 22 09:56:17 nrims-bs kernel: 3w-9xxx: scsi6: AEN: INFO (0x04:0x000B): Rebuild started:unit=0.
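
To see at a glance which ports the controller keeps complaining about, AEN
lines like the ones above could be tallied with a short script. This is only
a rough sketch: the regular expression is inferred from this excerpt rather
than from any 3ware documentation, the names parse_aen and summarize are made
up for illustration, and it assumes each message sits on a single line (that
is, with the mail wrapping undone).

```python
import re
from collections import Counter

# Pattern inferred from the log excerpt in this message only (an assumption,
# not an official description of the 3w-9xxx driver's message format).
AEN_RE = re.compile(
    r"3w-9xxx: scsi(?P<host>\d+): AEN: (?P<severity>[A-Z]+)\s+"
    r"\((?P<code>0x[0-9A-Fa-f]+:0x[0-9A-Fa-f]+)\):\s*"
    r"(?P<message>[^:]+):(?P<details>.*)$"
)


def parse_aen(line):
    """Return a dict describing one AEN line, or None if it does not match."""
    match = AEN_RE.search(line)
    if match is None:
        return None
    event = match.groupdict()
    # The trailing details look like "port=4, unit=0." in the excerpt.
    for key, value in re.findall(r"(port|unit)=(\d+)", event.pop("details")):
        event[key] = int(value)
    return event


def summarize(lines):
    """Count events per (severity, message, port); port is None if absent."""
    counts = Counter()
    for line in lines:
        event = parse_aen(line)
        if event is not None:
            key = (event["severity"], event["message"], event.get("port"))
            counts[key] += 1
    return counts
```

Passing the lines of the system log to summarize() would then give a count
per severity, message and port, which should make it easier to tell whether
the errors cluster on one drive (port 4 here) or are spread across several.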

Even if a disk is misbehaving in a RAID6, that should not be causing I/O
errors. Plus, why does the problem never show up straight after a reboot,
and why is it always fixed by a reboot?

Be that as it may, I am still puzzled.

Boris.