[CentOS] 3Ware 9650SE and XFS problems under Centos 5.3

Tue Sep 8 09:36:48 UTC 2009
. . <svengelska at hotmail.com>

Hello all,

Came across my first real "problem" in all the years I've been using Linux today and I'm stumped for an answer.

I've recently built a back-up server on top of Centos 5.3 using a 3Ware 9650SE 4 port SATA card and 4 x 1.5TB drives in a RAID 5 array. I went for the 3Ware card as it has driver support already so any kernel updates won't require me to rebuild the drivers each time. My current kernel is  2.6.18-128.2.1.el5 x86_x64.

My problems started a few weeks ago with some message issues with the 3Ware card causing it to stop accessing the mounted RAID array. The only way to resolve it was a cold boot of the machine. Then today I log-in to discover the RAIDed partition has been auto un-mounted by XFS itself.

So at this point I'm not sure where the problem lies. Is it the card that is faulty/problematic or is it something with the XFS file system?

I'll take the XFS problem first as it's the most recent.

When I logged in today I found the following in my server logs:

Sep  7 21:08:19 backup kernel: XFS internal error XFS_WANT_CORRUPTED_GOTO at line 1561 of file /home/buildsvn/rpmbuild/BUILD/xfs-kmod-0.4/_kmod_build_/xfs_alloc.c.  Caller 0xffffffff882e356b
Sep  7 21:08:19 backup kernel:
Sep  7 21:08:19 backup kernel: Call Trace:
Sep  7 21:08:19 backup kernel:  [<ffffffff882e1c7e>] :xfs:xfs_free_ag_extent+0x19f/0x67f
Sep  7 21:08:19 backup kernel:  [<ffffffff882e356b>] :xfs:xfs_free_extent+0xa9/0xc9
Sep  7 21:08:19 backup kernel:  [<ffffffff882f02ad>] :xfs:xfs_bmap_finish+0xf0/0x169
Sep  7 21:08:19 backup kernel:  [<ffffffff8830de46>] :xfs:xfs_itruncate_finish+0x172/0x2b3
Sep  7 21:08:19 backup kernel:  [<ffffffff88326f22>] :xfs:xfs_inactive+0x22e/0x821
Sep  7 21:08:19 backup kernel:  [<ffffffff8832dd66>] :xfs:xfs_validate_fields+0x24/0x4b
Sep  7 21:08:19 backup kernel:  [<ffffffff8832e361>] :xfs:xfs_vn_unlink+0x3d/0x4f
Sep  7 21:08:19 backup kernel:  [<ffffffff80063bb7>] mutex_lock+0xd/0x1d
Sep  7 21:08:19 backup kernel:  [<ffffffff883308fc>] :xfs:xfs_fs_clear_inode+0xa5/0xec
Sep  7 21:08:19 backup kernel:  [<ffffffff80022861>] clear_inode+0xd2/0x123
Sep  7 21:08:19 backup kernel:  [<ffffffff8002f19a>] generic_delete_inode+0xde/0x143
Sep  7 21:08:19 backup kernel:  [<ffffffff8003c1f6>] do_unlinkat+0xd5/0x141
Sep  7 21:08:19 backup kernel:  [<ffffffff8005d229>] tracesys+0x71/0xe0
Sep  7 21:08:19 backup kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Sep  7 21:08:19 backup kernel:
Sep  7 21:08:19 backup kernel: xfs_force_shutdown(sdb1,0x8) called from line 4267 of file /home/buildsvn/rpmbuild/BUILD/xfs-kmod-0.4/_kmod_build_/xfs_bmap.c.  Return address = 0xffffffff882f02ea
Sep  7 21:08:19 backup kernel: Filesystem "sdb1": Corruption of in-memory data detected.  Shutting down filesystem: sdb1
Sep  7 21:08:19 backup kernel: Please umount the filesystem, and rectify the problem(s)

Subsequent recovery was as follows:

Sep  8 10:59:40 backup kernel: xfs_force_shutdown(sdb1,0x1) called from line 424 of file /home/buildsvn/rpmbuild/BUILD/xfs-kmod-0.4/_kmod_build_/xfs_rw.c.  Return address = 0xffffffff883277f1
Sep  8 10:59:40 backup kernel: xfs_force_shutdown(sdb1,0x1) called from line 424 of file /home/buildsvn/rpmbuild/BUILD/xfs-kmod-0.4/_kmod_build_/xfs_rw.c.  Return address = 0xffffffff883277f1
Sep  8 11:00:13 backup kernel: XFS mounting filesystem sdb1
Sep  8 11:00:13 backup kernel: Starting XFS recovery on filesystem: sdb1 (logdev: internal)
Sep  8 11:00:14 backup kernel: XFS internal error XFS_WANT_CORRUPTED_GOTO at line 1561 of file /home/buildsvn/rpmbuild/BUILD/xfs-kmod-0.4/_kmod_build_/xfs_alloc.c.  Caller 0xffffffff882e356b
Sep  8 11:00:14 backup kernel:
Sep  8 11:00:14 backup kernel: Call Trace:
Sep  8 11:00:14 backup kernel:  [<ffffffff882e1c7e>] :xfs:xfs_free_ag_extent+0x19f/0x67f
Sep  8 11:00:14 backup kernel:  [<ffffffff882e356b>] :xfs:xfs_free_extent+0xa9/0xc9
Sep  8 11:00:14 backup kernel:  [<ffffffff883180e5>] :xfs:xlog_recover_finish+0x15a/0x244
Sep  8 11:00:14 backup kernel:  [<ffffffff8831bb00>] :xfs:xfs_mountfs+0xa24/0xc30
Sep  8 11:00:14 backup kernel:  [<ffffffff8000c394>] _atomic_dec_and_lock+0x39/0x57
Sep  8 11:00:14 backup kernel:  [<ffffffff88321748>] :xfs:xfs_mount+0x762/0x83b
Sep  8 11:00:14 backup kernel:  [<ffffffff88330c79>] :xfs:xfs_fs_fill_super+0x0/0x1e3
Sep  8 11:00:14 backup kernel:  [<ffffffff88330cf7>] :xfs:xfs_fs_fill_super+0x7e/0x1e3
Sep  8 11:00:14 backup kernel:  [<ffffffff8006465d>] __down_write_nested+0x12/0x92
Sep  8 11:00:14 backup kernel:  [<ffffffff800e72c1>] get_filesystem+0x12/0x3b
Sep  8 11:00:14 backup kernel:  [<ffffffff800de9b5>] sget+0x365/0x377
Sep  8 11:00:14 backup kernel:  [<ffffffff800de301>] set_bdev_super+0x0/0xf
Sep  8 11:00:14 backup kernel:  [<ffffffff800de310>] test_bdev_super+0x0/0xd
Sep  8 11:00:14 backup kernel:  [<ffffffff800df2c4>] get_sb_bdev+0x10a/0x164
Sep  8 11:00:14 backup kernel:  [<ffffffff800dec61>] vfs_kern_mount+0x93/0x11a
Sep  8 11:00:14 backup kernel:  [<ffffffff800ded2a>] do_kern_mount+0x36/0x4d
Sep  8 11:00:14 backup kernel:  [<ffffffff800e8bfb>] do_mount+0x6a7/0x717
Sep  8 11:00:14 backup kernel:  [<ffffffff8002c7ef>] mntput_no_expire+0x19/0x89
Sep  8 11:00:14 backup kernel:  [<ffffffff8000e87a>] link_path_walk+0xd3/0xe5
Sep  8 11:00:14 backup kernel:  [<ffffffff80066bd2>] do_page_fault+0x4fe/0x830
Sep  8 11:00:14 backup kernel:  [<ffffffff800076c0>] find_get_page+0x21/0x50
Sep  8 11:00:14 backup kernel:  [<ffffffff80013445>] filemap_nopage+0x188/0x322
Sep  8 11:00:14 backup kernel:  [<ffffffff80008b8f>] __handle_mm_fault+0x51d/0xe5c
Sep  8 11:00:14 backup kernel:  [<ffffffff8002331e>] __user_walk_fd+0x41/0x4c
Sep  8 11:00:14 backup kernel:  [<ffffffff800c877d>] zone_statistics+0x3e/0x6d
Sep  8 11:00:14 backup kernel:  [<ffffffff8000f10a>] __alloc_pages+0x65/0x2ce
Sep  8 11:00:14 backup kernel:  [<ffffffff800e0e13>] sys_readlinkat+0x98/0xa9
Sep  8 11:00:14 backup kernel:  [<ffffffff8004bf75>] sys_mount+0x8a/0xcd
Sep  8 11:00:14 backup kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Sep  8 11:00:14 backup kernel:
Sep  8 11:00:14 backup kernel: Ending XFS recovery on filesystem: sdb1 (logdev: internal)
Sep  8 11:03:45 backup kernel: XFS mounting filesystem sdb1

The 3Ware card issue occured about a month ago but hasn't returned since:

Aug  9 05:56:32 backup kernel: sd 4:0:0:0: WARNING: (0x06:0x002C): Command (0x8a) timed out, resetting card.
Aug  9 05:56:35 backup kernel: 3w-9xxx: scsi4: WARNING: (0x06:0x0037): Character ioctl (0x108) timed out, resetting card.
Aug  9 05:57:37 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x001F): Microcontroller not ready during reset sequence.
Aug  9 05:58:19 backup kernel: 3w-9xxx: scsi4: AEN: INFO (0x04:0x005E): Cache synchronization completed:unit=0.
Aug  9 05:58:49 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0010): Microcontroller Error: clearing.
Aug  9 05:59:49 backup kernel: 3w-9xxx: scsi4: WARNING: (0x06:0x0037): Character ioctl (0x108) timed out, resetting card.
Aug  9 05:59:49 backup kernel: sd 4:0:0:0: WARNING: (0x06:0x002C): Command (0x28) timed out, resetting card.
Aug  9 06:00:19 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0036): Response queue (large) empty failed during reset sequence.
Aug  9 06:01:19 backup last message repeated 2 times
Aug  9 06:01:49 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0036): Response queue (large) empty failed during reset sequence.
Aug  9 06:01:49 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x002B): Controller reset failed during scsi host reset.
Aug  9 06:01:49 backup kernel: sd 4:0:0:0: scsi: Device offlined - not ready after error recovery
Aug  9 06:01:49 backup last message repeated 6 times
Aug  9 06:01:49 backup kernel: sd 4:0:0:0: rejecting I/O to offline device
Aug  9 06:01:49 backup last message repeated 7 times
Aug  9 06:01:49 backup kernel: sd 4:0:0:0: SCSI error: return code = 0x00010000
Aug  9 06:01:49 backup kernel: end_request: I/O error, dev sdb, sector 4394574486
Aug  9 06:01:49 backup kernel: sd 4:0:0:0: rejecting I/O to offline device
Aug  9 06:01:49 backup last message repeated 2 times
Aug  9 06:01:49 backup kernel: I/O error in filesystem ("sdb1") meta-data dev sdb1 block 0x105efe274       ("xlog_iodone") error 5 buf count 30208
Aug  9 06:01:49 backup kernel: xfs_force_shutdown(sdb1,0x2) called from line 956 of file /home/buildsvn/rpmbuild/BUILD/xfs-kmod-0.4/_kmod_build_/xfs_log.c.  Return address = 0xffffffff88313780
Aug  9 06:01:49 backup kernel: Filesystem "sdb1": Log I/O Error Detected.  Shutting down filesystem: sdb1
Aug  9 06:01:49 backup kernel: Please umount the filesystem, and rectify the problem(s)
Aug  9 06:01:49 backup kernel: sd 4:0:0:0: rejecting I/O to offline device
Aug  9 06:01:49 backup kernel: I/O error in filesystem ("sdb1") meta-data dev sdb1 block 0x1446a8       ("xfs_trans_read_buf") error 5 buf count 4096
Aug  9 06:01:49 backup kernel: I/O error in filesystem ("sdb1") meta-data dev sdb1 block 0x1c16a8       ("xfs_trans_read_buf") error 5 buf count 4096
Aug  9 06:02:04 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0010): Microcontroller Error: clearing.
Aug  9 06:03:04 backup kernel: 3w-9xxx: scsi4: WARNING: (0x06:0x0037): Character ioctl (0x108) timed out, resetting card.
Aug  9 06:03:34 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0036): Response queue (large) empty failed during reset sequence.
Aug  9 06:04:04 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0036): Response queue (large) empty failed during reset sequence.
Aug  9 06:04:19 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0010): Microcontroller Error: clearing.
Aug  9 06:05:19 backup kernel: 3w-9xxx: scsi4: WARNING: (0x06:0x0037): Character ioctl (0x108) timed out, resetting card.
Aug  9 06:05:49 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0036): Response queue (large) empty failed during reset sequence.
Aug  9 06:06:19 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0036): Response queue (large) empty failed during reset sequence.
Aug  9 06:06:34 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0010): Microcontroller Error: clearing.
Aug  9 06:07:34 backup kernel: 3w-9xxx: scsi4: WARNING: (0x06:0x0037): Character ioctl (0x108) timed out, resetting card.
Aug  9 06:08:04 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0036): Response queue (large) empty failed during reset sequence.
Aug  9 06:08:34 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0036): Response queue (large) empty failed during reset sequence.
Aug  9 06:08:49 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0010): Microcontroller Error: clearing.

repeat last three lines ad infinitum.

Can anyone shed any light as to what the issue might be? Is it XFS or is it the card? I went with XFS as all the research I did indicated that it gave the best performance for the millions of small files in the backups we run and doesn't have any many issues with hardlinks as EXT3.

- J

_________________________________________________________________
Share your memories online with anyone you want.
http://www.microsoft.com/middleeast/windows/windowslive/products/photos-share.aspx?tab=1
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.centos.org/pipermail/centos/attachments/20090908/498f307e/attachment-0003.html>