[CentOS] corruption of in-memory data detected (xfs)

----- Original Message -----
| "James A. Peltier" <jpeltier at sfu.ca> writes:
| 
| > | I am having an issue with an XFS filesystem shutting down under high
| > | load with very many small files. Basically, I have around 3.5 - 4
| > | million files on this filesystem. New files are being written to the
| > | FS all the time, until I get to 9-11 mln small files (35k on average).
| > | 
| > | at some point I get the following in dmesg:
| > | 
| > | [2870477.695512] Filesystem "sda5": XFS internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c. Caller 0xffffffff8826bb7d
| > | [2870477.695558]
| > | [2870477.695559] Call Trace:
| > | [2870477.695611]  [<ffffffff88262c28>] :xfs:xfs_trans_cancel+0x5b/0xfe
| > | [2870477.695643]  [<ffffffff8826bb7d>] :xfs:xfs_mkdir+0x57c/0x5d7
| > | [2870477.695673]  [<ffffffff8822f3f8>] :xfs:xfs_attr_get+0xbf/0xd2
| > | [2870477.695707]  [<ffffffff88273326>] :xfs:xfs_vn_mknod+0x1e1/0x3bb
| > | [2870477.695726]  [<ffffffff80264929>] _spin_lock_irqsave+0x9/0x14
| > | [2870477.695736]  [<ffffffff802230e6>] __up_read+0x19/0x7f
| > | [2870477.695764]  [<ffffffff8824f8f4>] :xfs:xfs_iunlock+0x57/0x79
| > | [2870477.695776]  [<ffffffff80264929>] _spin_lock_irqsave+0x9/0x14
| > | [2870477.695784]  [<ffffffff802230e6>] __up_read+0x19/0x7f
| > | [2870477.695791]  [<ffffffff80209f4c>] __d_lookup+0xb0/0xff
| > | [2870477.695803]  [<ffffffff8020cd4a>] _atomic_dec_and_lock+0x39/0x57
| > | [2870477.695814]  [<ffffffff8022d6db>] mntput_no_expire+0x19/0x89
| > | [2870477.695829]  [<ffffffff80264929>] _spin_lock_irqsave+0x9/0x14
| > | [2870477.695837]  [<ffffffff802230e6>] __up_read+0x19/0x7f
| > | [2870477.695861]  [<ffffffff8824f8f4>] :xfs:xfs_iunlock+0x57/0x79
| > | [2870477.695887]  [<ffffffff882680af>] :xfs:xfs_access+0x3d/0x46
| > | [2870477.695899]  [<ffffffff80264929>] _spin_lock_irqsave+0x9/0x14
| > | [2870477.695923]  [<ffffffff802df4a3>] vfs_mkdir+0xe3/0x152
| > | [2870477.695933]  [<ffffffff802dfa79>] sys_mkdirat+0xa3/0xe4
| > | [2870477.695953]  [<ffffffff80260295>] tracesys+0x47/0xb6
| > | [2870477.695963]  [<ffffffff802602f9>] tracesys+0xab/0xb6
| > | [2870477.695977]
| > | [2870477.695985] xfs_force_shutdown(sda5,0x8) called from line 1139 of file fs/xfs/xfs_trans.c.  Return address = 0xffffffff88262c46
| > | [2870477.696452] Filesystem "sda5": Corruption of in-memory data detected.  Shutting down filesystem: sda5
| > | [2870477.696464] Please umount the filesystem, and rectify the problem(s)
| > | 
| > | # ls -l /store
| > | ls: /store: Input/output error
| > | ?--------- 0 root root 0 Jan  1  1970 /store
| > | 
| > | The filesystem is ~1T in size:
| > | # df -hT /store
| > | Filesystem    Type    Size  Used Avail Use% Mounted on
| > | /dev/sda5      xfs    910G  142G  769G  16% /store
| > | 
| > | 
| > | Using CentOS 5.9 with kernel 2.6.18-348.el5xen
| > | 
| > | 
| > | The filesystem is in a virtual machine (Xen) and on top of LVM.
| > | 
| > | Filesystem was created using mkfs.xfs defaults with
| > | xfsprogs-2.9.4-1.el5.centos (that's the one that comes with CentOS
| > | 5.x by default.)
| > | 
| > | These are the defaults with which the filesystem was created:
| > | # xfs_info /store
| > | meta-data=/dev/sda5              isize=256    agcount=32, agsize=7454720 blks
| > |          =                       sectsz=512   attr=0
| > | data     =                       bsize=4096   blocks=238551040, imaxpct=25
| > |          =                       sunit=0      swidth=0 blks, unwritten=1
| > | naming   =version 2              bsize=4096
| > | log      =internal               bsize=4096   blocks=32768, version=1
| > |          =                       sectsz=512   sunit=0 blks, lazy-count=0
| > | realtime =none                   extsz=4096   blocks=0, rtextents=0
| > | 
| > | The problem is reproducible and I don't think it's hardware related.
| > | The problem was reproduced on multiple servers of the same type. So,
| > | I doubt it's a memory issue or something like that.
| > | 
| > | Is that a known issue? If it is then what's the fix? I went through
| > | the kernel updates for CentOS 5.10 (newer kernel), but didn't see
| > | any xfs related fixes since CentOS 5.9
| > | 
| > | Any help will be greatly appreciated...
| > | 
| > | 
| > | -- "If we really understand the problem, the answer will come out of
| > | it, because the answer is not separate from the problem." -
| > | Krishnamurti
| >
| > Sorry, further to this, most bugs related to XFS are related to kernel
| > bugs. I can see that you're running an older kernel and just because
| > you don't see the bugs listed in the errata doesn't mean the bugs
| > haven't been found as part of the backport process
| 
| So, you suggest I try my luck with the newer kernel from CentOS 5.10?
| 
| What's the proper way to open a bug for this against CentOS 5 / RHEL 5?

The recommendation is to always run the latest kernel before filing a bug.  Looking at the stack trace, the system appears to be doing a lot of locking and IRQ handling alongside the XFS and VFS calls.  You may be looking too narrowly for something XFS-specific when the cause could be SCSI/FC related or VFS related.  There have been seven CentOS 5 kernel updates since the kernel you are currently running, covering many facets of file systems, drivers and subsystems.
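
To get a rough picture of which parts of the kernel show up in a trace like the one quoted above, something along these lines could be used. This is only a sketch: the function name tally_frames and the regular expression are my own, it assumes raw (unwrapped) dmesg output in the format shown in this thread, and the sample is just four frames copied from the quoted trace.

```python
import re
from collections import Counter

# Matches one frame such as "[<ffffffff8826bb7d>] :xfs:xfs_mkdir+0x57c/0x5d7".
# The optional ":module:" prefix is how the trace in this thread marks
# symbols that live in a loadable module (assumption: same format elsewhere).
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?::(\w+):)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)

def tally_frames(trace_text):
    """Count call-trace frames per owner: a module name, or 'kernel' for core symbols."""
    counts = Counter()
    for module, _symbol in FRAME_RE.findall(trace_text):
        counts[module or "kernel"] += 1
    return counts

# Four frames copied from the trace quoted in this thread.
sample = """
[<ffffffff88262c28>] :xfs:xfs_trans_cancel+0x5b/0xfe
[<ffffffff8826bb7d>] :xfs:xfs_mkdir+0x57c/0x5d7
[<ffffffff80264929>] _spin_lock_irqsave+0x9/0x14
[<ffffffff802df4a3>] vfs_mkdir+0xe3/0x152
"""
print(tally_frames(sample))
```

A count like this says nothing about where the fault actually is; it only shows how much of the call path sits outside the xfs module.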

That said, one possible mitigation is to mount the filesystem with the noatime option, which may delay the problem.
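
If you try this, it is worth confirming that the option actually took effect. A minimal sketch, assuming the usual /proc/mounts layout (device, mount point, type, options, dump, pass); the helper name mount_options and the sample line are hypothetical, not output from the affected system:

```python
def mount_options(mounts_text, mount_point):
    """Return the option list for mount_point from /proc/mounts-style text, or None."""
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass.
        if len(fields) >= 4 and fields[1] == mount_point:
            options = fields[3].split(",")  # the last matching entry wins
    return options

# Hypothetical entry for illustration only.
sample = "/dev/sda5 /store xfs rw,noatime 0 0\n"
opts = mount_options(sample, "/store")
print(opts is not None and "noatime" in opts)
```

On a live system the text would come from reading /proc/mounts instead of the sample string. The option itself is normally set in the options column of /etc/fstab, or on a mounted filesystem with "mount -o remount,noatime /store".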

-- 
James A. Peltier
Manager, IT Services - Research Computing Group
Simon Fraser University - Burnaby Campus
Phone   : 778-782-6573
Fax     : 778-782-3045
E-Mail  : jpeltier at sfu.ca
Website : http://www.sfu.ca/itservices

To be original seek your inspiration from unexpected sources.