Hello all,
Came across my first real "problem" in all the years I've been using Linux today and I'm stumped for an answer.
I've recently built a back-up server on top of Centos 5.3 using a 3Ware 9650SE 4 port SATA card and 4 x 1.5TB drives in a RAID 5 array. I went for the 3Ware card as it has driver support already so any kernel updates won't require me to rebuild the drivers each time. My current kernel is 2.6.18-128.2.1.el5 x86_x64.
My problems started a few weeks ago with some message issues with the 3Ware card causing it to stop accessing the mounted RAID array. The only way to resolve it was a cold boot of the machine. Then today I log-in to discover the RAIDed partition has been auto un-mounted by XFS itself.
So at this point I'm not sure where the problem lies. Is it the card that is faulty/problematic or is it something with the XFS file system?
I'll take the XFS problem first as it's the most recent.
When I logged in today I found the following in my server logs:
Sep 7 21:08:19 backup kernel: XFS internal error XFS_WANT_CORRUPTED_GOTO at line 1561 of file /home/buildsvn/rpmbuild/BUILD/xfs-kmod-0.4/_kmod_build_/xfs_alloc.c. Caller 0xffffffff882e356b Sep 7 21:08:19 backup kernel: Sep 7 21:08:19 backup kernel: Call Trace: Sep 7 21:08:19 backup kernel: [<ffffffff882e1c7e>] :xfs:xfs_free_ag_extent+0x19f/0x67f Sep 7 21:08:19 backup kernel: [<ffffffff882e356b>] :xfs:xfs_free_extent+0xa9/0xc9 Sep 7 21:08:19 backup kernel: [<ffffffff882f02ad>] :xfs:xfs_bmap_finish+0xf0/0x169 Sep 7 21:08:19 backup kernel: [<ffffffff8830de46>] :xfs:xfs_itruncate_finish+0x172/0x2b3 Sep 7 21:08:19 backup kernel: [<ffffffff88326f22>] :xfs:xfs_inactive+0x22e/0x821 Sep 7 21:08:19 backup kernel: [<ffffffff8832dd66>] :xfs:xfs_validate_fields+0x24/0x4b Sep 7 21:08:19 backup kernel: [<ffffffff8832e361>] :xfs:xfs_vn_unlink+0x3d/0x4f Sep 7 21:08:19 backup kernel: [<ffffffff80063bb7>] mutex_lock+0xd/0x1d Sep 7 21:08:19 backup kernel: [<ffffffff883308fc>] :xfs:xfs_fs_clear_inode+0xa5/0xec Sep 7 21:08:19 backup kernel: [<ffffffff80022861>] clear_inode+0xd2/0x123 Sep 7 21:08:19 backup kernel: [<ffffffff8002f19a>] generic_delete_inode+0xde/0x143 Sep 7 21:08:19 backup kernel: [<ffffffff8003c1f6>] do_unlinkat+0xd5/0x141 Sep 7 21:08:19 backup kernel: [<ffffffff8005d229>] tracesys+0x71/0xe0 Sep 7 21:08:19 backup kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0 Sep 7 21:08:19 backup kernel: Sep 7 21:08:19 backup kernel: xfs_force_shutdown(sdb1,0x8) called from line 4267 of file /home/buildsvn/rpmbuild/BUILD/xfs-kmod-0.4/_kmod_build_/xfs_bmap.c. Return address = 0xffffffff882f02ea Sep 7 21:08:19 backup kernel: Filesystem "sdb1": Corruption of in-memory data detected. Shutting down filesystem: sdb1 Sep 7 21:08:19 backup kernel: Please umount the filesystem, and rectify the problem(s)
Subsequent recovery was as follows:
Sep 8 10:59:40 backup kernel: xfs_force_shutdown(sdb1,0x1) called from line 424 of file /home/buildsvn/rpmbuild/BUILD/xfs-kmod-0.4/_kmod_build_/xfs_rw.c. Return address = 0xffffffff883277f1 Sep 8 10:59:40 backup kernel: xfs_force_shutdown(sdb1,0x1) called from line 424 of file /home/buildsvn/rpmbuild/BUILD/xfs-kmod-0.4/_kmod_build_/xfs_rw.c. Return address = 0xffffffff883277f1 Sep 8 11:00:13 backup kernel: XFS mounting filesystem sdb1 Sep 8 11:00:13 backup kernel: Starting XFS recovery on filesystem: sdb1 (logdev: internal) Sep 8 11:00:14 backup kernel: XFS internal error XFS_WANT_CORRUPTED_GOTO at line 1561 of file /home/buildsvn/rpmbuild/BUILD/xfs-kmod-0.4/_kmod_build_/xfs_alloc.c. Caller 0xffffffff882e356b Sep 8 11:00:14 backup kernel: Sep 8 11:00:14 backup kernel: Call Trace: Sep 8 11:00:14 backup kernel: [<ffffffff882e1c7e>] :xfs:xfs_free_ag_extent+0x19f/0x67f Sep 8 11:00:14 backup kernel: [<ffffffff882e356b>] :xfs:xfs_free_extent+0xa9/0xc9 Sep 8 11:00:14 backup kernel: [<ffffffff883180e5>] :xfs:xlog_recover_finish+0x15a/0x244 Sep 8 11:00:14 backup kernel: [<ffffffff8831bb00>] :xfs:xfs_mountfs+0xa24/0xc30 Sep 8 11:00:14 backup kernel: [<ffffffff8000c394>] _atomic_dec_and_lock+0x39/0x57 Sep 8 11:00:14 backup kernel: [<ffffffff88321748>] :xfs:xfs_mount+0x762/0x83b Sep 8 11:00:14 backup kernel: [<ffffffff88330c79>] :xfs:xfs_fs_fill_super+0x0/0x1e3 Sep 8 11:00:14 backup kernel: [<ffffffff88330cf7>] :xfs:xfs_fs_fill_super+0x7e/0x1e3 Sep 8 11:00:14 backup kernel: [<ffffffff8006465d>] __down_write_nested+0x12/0x92 Sep 8 11:00:14 backup kernel: [<ffffffff800e72c1>] get_filesystem+0x12/0x3b Sep 8 11:00:14 backup kernel: [<ffffffff800de9b5>] sget+0x365/0x377 Sep 8 11:00:14 backup kernel: [<ffffffff800de301>] set_bdev_super+0x0/0xf Sep 8 11:00:14 backup kernel: [<ffffffff800de310>] test_bdev_super+0x0/0xd Sep 8 11:00:14 backup kernel: [<ffffffff800df2c4>] get_sb_bdev+0x10a/0x164 Sep 8 11:00:14 backup kernel: [<ffffffff800dec61>] vfs_kern_mount+0x93/0x11a Sep 8 11:00:14 backup kernel: [<ffffffff800ded2a>] do_kern_mount+0x36/0x4d Sep 8 11:00:14 backup kernel: [<ffffffff800e8bfb>] do_mount+0x6a7/0x717 Sep 8 11:00:14 backup kernel: [<ffffffff8002c7ef>] mntput_no_expire+0x19/0x89 Sep 8 11:00:14 backup kernel: [<ffffffff8000e87a>] link_path_walk+0xd3/0xe5 Sep 8 11:00:14 backup kernel: [<ffffffff80066bd2>] do_page_fault+0x4fe/0x830 Sep 8 11:00:14 backup kernel: [<ffffffff800076c0>] find_get_page+0x21/0x50 Sep 8 11:00:14 backup kernel: [<ffffffff80013445>] filemap_nopage+0x188/0x322 Sep 8 11:00:14 backup kernel: [<ffffffff80008b8f>] __handle_mm_fault+0x51d/0xe5c Sep 8 11:00:14 backup kernel: [<ffffffff8002331e>] __user_walk_fd+0x41/0x4c Sep 8 11:00:14 backup kernel: [<ffffffff800c877d>] zone_statistics+0x3e/0x6d Sep 8 11:00:14 backup kernel: [<ffffffff8000f10a>] __alloc_pages+0x65/0x2ce Sep 8 11:00:14 backup kernel: [<ffffffff800e0e13>] sys_readlinkat+0x98/0xa9 Sep 8 11:00:14 backup kernel: [<ffffffff8004bf75>] sys_mount+0x8a/0xcd Sep 8 11:00:14 backup kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0 Sep 8 11:00:14 backup kernel: Sep 8 11:00:14 backup kernel: Ending XFS recovery on filesystem: sdb1 (logdev: internal) Sep 8 11:03:45 backup kernel: XFS mounting filesystem sdb1
The 3Ware card issue occured about a month ago but hasn't returned since:
Aug 9 05:56:32 backup kernel: sd 4:0:0:0: WARNING: (0x06:0x002C): Command (0x8a) timed out, resetting card. Aug 9 05:56:35 backup kernel: 3w-9xxx: scsi4: WARNING: (0x06:0x0037): Character ioctl (0x108) timed out, resetting card. Aug 9 05:57:37 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x001F): Microcontroller not ready during reset sequence. Aug 9 05:58:19 backup kernel: 3w-9xxx: scsi4: AEN: INFO (0x04:0x005E): Cache synchronization completed:unit=0. Aug 9 05:58:49 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0010): Microcontroller Error: clearing. Aug 9 05:59:49 backup kernel: 3w-9xxx: scsi4: WARNING: (0x06:0x0037): Character ioctl (0x108) timed out, resetting card. Aug 9 05:59:49 backup kernel: sd 4:0:0:0: WARNING: (0x06:0x002C): Command (0x28) timed out, resetting card. Aug 9 06:00:19 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0036): Response queue (large) empty failed during reset sequence. Aug 9 06:01:19 backup last message repeated 2 times Aug 9 06:01:49 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0036): Response queue (large) empty failed during reset sequence. Aug 9 06:01:49 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x002B): Controller reset failed during scsi host reset. Aug 9 06:01:49 backup kernel: sd 4:0:0:0: scsi: Device offlined - not ready after error recovery Aug 9 06:01:49 backup last message repeated 6 times Aug 9 06:01:49 backup kernel: sd 4:0:0:0: rejecting I/O to offline device Aug 9 06:01:49 backup last message repeated 7 times Aug 9 06:01:49 backup kernel: sd 4:0:0:0: SCSI error: return code = 0x00010000 Aug 9 06:01:49 backup kernel: end_request: I/O error, dev sdb, sector 4394574486 Aug 9 06:01:49 backup kernel: sd 4:0:0:0: rejecting I/O to offline device Aug 9 06:01:49 backup last message repeated 2 times Aug 9 06:01:49 backup kernel: I/O error in filesystem ("sdb1") meta-data dev sdb1 block 0x105efe274 ("xlog_iodone") error 5 buf count 30208 Aug 9 06:01:49 backup kernel: xfs_force_shutdown(sdb1,0x2) called from line 956 of file /home/buildsvn/rpmbuild/BUILD/xfs-kmod-0.4/_kmod_build_/xfs_log.c. Return address = 0xffffffff88313780 Aug 9 06:01:49 backup kernel: Filesystem "sdb1": Log I/O Error Detected. Shutting down filesystem: sdb1 Aug 9 06:01:49 backup kernel: Please umount the filesystem, and rectify the problem(s) Aug 9 06:01:49 backup kernel: sd 4:0:0:0: rejecting I/O to offline device Aug 9 06:01:49 backup kernel: I/O error in filesystem ("sdb1") meta-data dev sdb1 block 0x1446a8 ("xfs_trans_read_buf") error 5 buf count 4096 Aug 9 06:01:49 backup kernel: I/O error in filesystem ("sdb1") meta-data dev sdb1 block 0x1c16a8 ("xfs_trans_read_buf") error 5 buf count 4096 Aug 9 06:02:04 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0010): Microcontroller Error: clearing. Aug 9 06:03:04 backup kernel: 3w-9xxx: scsi4: WARNING: (0x06:0x0037): Character ioctl (0x108) timed out, resetting card. Aug 9 06:03:34 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0036): Response queue (large) empty failed during reset sequence. Aug 9 06:04:04 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0036): Response queue (large) empty failed during reset sequence. Aug 9 06:04:19 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0010): Microcontroller Error: clearing. Aug 9 06:05:19 backup kernel: 3w-9xxx: scsi4: WARNING: (0x06:0x0037): Character ioctl (0x108) timed out, resetting card. Aug 9 06:05:49 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0036): Response queue (large) empty failed during reset sequence. Aug 9 06:06:19 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0036): Response queue (large) empty failed during reset sequence. Aug 9 06:06:34 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0010): Microcontroller Error: clearing. Aug 9 06:07:34 backup kernel: 3w-9xxx: scsi4: WARNING: (0x06:0x0037): Character ioctl (0x108) timed out, resetting card. Aug 9 06:08:04 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0036): Response queue (large) empty failed during reset sequence. Aug 9 06:08:34 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0036): Response queue (large) empty failed during reset sequence. Aug 9 06:08:49 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0010): Microcontroller Error: clearing.
repeat last three lines ad infinitum.
Can anyone shed any light as to what the issue might be? Is it XFS or is it the card? I went with XFS as all the research I did indicated that it gave the best performance for the millions of small files in the backups we run and doesn't have any many issues with hardlinks as EXT3.
- J
_________________________________________________________________ Share your memories online with anyone you want. http://www.microsoft.com/middleeast/windows/windowslive/products/photos-shar...
. . wrote:
Can anyone shed any light as to what the issue might be? Is it XFS or is it the card? I went with XFS as all the research I did indicated that it gave the best performance for the millions of small files in the backups we run and doesn't have any many issues with hardlinks as EXT3.
My guess is that a hardware problem (no idea what) back in August caused some corruption in the file system.
Did you run xfs_repair after the problem in August?
Have you run xfs_repair after the most recent XFS shutdown?
James Pearson