"James A. Peltier" jpeltier@sfu.ca writes:
| I am having an issue with an XFS filesystem shutting down under high
| load with very many small files. Basically, I have around 3.5 - 4
| million files on this filesystem. New files are being written to the
| FS all the time, until I get to 9-11 mln small files (35k on
| average).
|
| at some point I get the following in dmesg:
|
| [2870477.695512] Filesystem "sda5": XFS internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c. Caller 0xffffffff8826bb7d
| [2870477.695558]
| [2870477.695559] Call Trace:
| [2870477.695611] [<ffffffff88262c28>] :xfs:xfs_trans_cancel+0x5b/0xfe
| [2870477.695643] [<ffffffff8826bb7d>] :xfs:xfs_mkdir+0x57c/0x5d7
| [2870477.695673] [<ffffffff8822f3f8>] :xfs:xfs_attr_get+0xbf/0xd2
| [2870477.695707] [<ffffffff88273326>] :xfs:xfs_vn_mknod+0x1e1/0x3bb
| [2870477.695726] [<ffffffff80264929>] _spin_lock_irqsave+0x9/0x14
| [2870477.695736] [<ffffffff802230e6>] __up_read+0x19/0x7f
| [2870477.695764] [<ffffffff8824f8f4>] :xfs:xfs_iunlock+0x57/0x79
| [2870477.695776] [<ffffffff80264929>] _spin_lock_irqsave+0x9/0x14
| [2870477.695784] [<ffffffff802230e6>] __up_read+0x19/0x7f
| [2870477.695791] [<ffffffff80209f4c>] __d_lookup+0xb0/0xff
| [2870477.695803] [<ffffffff8020cd4a>] _atomic_dec_and_lock+0x39/0x57
| [2870477.695814] [<ffffffff8022d6db>] mntput_no_expire+0x19/0x89
| [2870477.695829] [<ffffffff80264929>] _spin_lock_irqsave+0x9/0x14
| [2870477.695837] [<ffffffff802230e6>] __up_read+0x19/0x7f
| [2870477.695861] [<ffffffff8824f8f4>] :xfs:xfs_iunlock+0x57/0x79
| [2870477.695887] [<ffffffff882680af>] :xfs:xfs_access+0x3d/0x46
| [2870477.695899] [<ffffffff80264929>] _spin_lock_irqsave+0x9/0x14
| [2870477.695923] [<ffffffff802df4a3>] vfs_mkdir+0xe3/0x152
| [2870477.695933] [<ffffffff802dfa79>] sys_mkdirat+0xa3/0xe4
| [2870477.695953] [<ffffffff80260295>] tracesys+0x47/0xb6
| [2870477.695963] [<ffffffff802602f9>] tracesys+0xab/0xb6
| [2870477.695977]
| [2870477.695985] xfs_force_shutdown(sda5,0x8) called from line 1139 of file fs/xfs/xfs_trans.c. Return address = 0xffffffff88262c46
| [2870477.696452] Filesystem "sda5": Corruption of in-memory data detected. Shutting down filesystem: sda5
| [2870477.696464] Please umount the filesystem, and rectify the problem(s)
|
| # ls -l /store
| ls: /store: Input/output error
| ?--------- 0 root root 0 Jan  1  1970 /store
|
| Filesystems is ~1T in size
| # df -hT /store
| Filesystem    Type    Size  Used Avail Use% Mounted on
| /dev/sda5      xfs    910G  142G  769G  16% /store
|
| Using CentOS 5.9 with kernel 2.6.18-348.el5xen
|
| The filesystem is in a virtual machine (Xen) and on top of LVM.
|
| Filesystem was created using mkfs.xfs defaults with
| xfsprogs-2.9.4-1.el5.centos (that's the one that comes with CentOS
| 5.x by default.)
|
| These are the defaults with which the filesystem was created:
| # xfs_info /store
| meta-data=/dev/sda5              isize=256    agcount=32, agsize=7454720 blks
|          =                       sectsz=512   attr=0
| data     =                       bsize=4096   blocks=238551040, imaxpct=25
|          =                       sunit=0      swidth=0 blks, unwritten=1
| naming   =version 2              bsize=4096
| log      =internal               bsize=4096   blocks=32768, version=1
|          =                       sectsz=512   sunit=0 blks, lazy-count=0
| realtime =none                   extsz=4096   blocks=0, rtextents=0
|
| The problem is reproducible and I don't think it's hardware related.
| The problem was reproduced on multiple servers of the same type. So,
| I doubt it's a memory issue or something like that.
|
| Is that a known issue? If it is then what's the fix? I went through
| the kernel updates for CentOS 5.10 (newer kernel), but didn't see
| any xfs related fixes since CentOS 5.9
|
| Any help will be greatly appreciated...
|
| --
| "If we really understand the problem, the answer will come out of
| it, because the answer is not separate from the problem." -
| Krishnamurti

Sorry, further to this: most XFS problems turn out to be kernel bugs. I can see that you're running an older kernel, and just because you don't see XFS fixes listed in the errata doesn't mean none were picked up as part of the backport process.
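
If it helps, one way to look past the errata summaries is to grep the changelog of the kernel package itself. This is only a rough sketch: I'm assuming the Xen kernel is packaged as kernel-xen (inferred from the "el5xen" suffix in your report), and the helper name xfs_changelog_lines is just something I made up for the example.

```shell
#!/bin/sh
# Filter a package changelog (read from stdin) down to lines mentioning XFS.
# The function name is hypothetical, chosen only for this example.
xfs_changelog_lines() {
    grep -i 'xfs'
}

# Only query rpm where it exists (i.e. on the CentOS/RHEL box itself).
# "kernel-xen" is an assumption based on the el5xen kernel in the report.
if command -v rpm >/dev/null 2>&1; then
    uname -r                                   # kernel currently running
    rpm -q --changelog kernel-xen | xfs_changelog_lines | head -n 20
fi
```

Running the same thing after updating to the 5.10 kernel and comparing the two outputs should show whether any XFS changes went in between the two builds, even if the advisory text doesn't mention them.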
So, you suggest I try my luck with the newer kernel from CentOS 5.10?
What's the proper way to open a bug for this against CentOS 5 / RHEL 5?