[CentOS] 3ware disk failure -> hang

Fri Jan 6 19:47:15 UTC 2006
Adam Gibson <agibson at ptm.com>

Joshua Baker-LePain wrote:
> I've got an i386 server running centos 4.2 with 3 3ware controllers in 
> it -- an 8006-2 for the system disks and 2 7500-8s.  On the 7500s, I'm 
> running an all software RAID50.  This morning I came in to find the 
> system hung.  Turns out a disk went overnight on one of the 7500s, and 
> rather than a graceful failover I got this:
> 
> Jan  6 01:03:58 $SERVER kernel: 3w-xxxx: scsi2: Command failed: status = 
> 0xc7,flags = 0x40, unit #3.
> Jan  6 01:04:02 $SERVER kernel: 3w-xxxx: scsi2: AEN: ERROR: Drive error: 
> Port #3.
> Jan  6 01:04:10 $SERVER 3w-xxxx[2781]: ERROR: Drive error encountered on 
> port 3 on controller ID:2. Check cables and drives for media errors. (0xa)
> Jan  6 01:04:10 $SERVER kernel: Debug: sleeping function called from 
> invalid context at include/asm/uaccess.h:556
> Jan  6 01:04:10 $SERVER kernel: in_atomic():0[expected: 0], 
> irqs_disabled():1
> Jan  6 01:04:10 $SERVER kernel:  [<c011fbe9>] __might_sleep+0x7d/0x88
> Jan  6 01:04:10 $SERVER kernel:  [<f885f056>] tw_ioctl+0x478/0xb07 
> [3w_xxxx]
> Jan  6 01:04:10 $SERVER kernel:  [<c011fec9>] 
> autoremove_wake_function+0x0/0x2d
> Jan  6 01:04:10 $SERVER kernel:  [<f883f905>] scsi_done+0x0/0x16 [scsi_mod]
> Jan  6 01:04:10 $SERVER kernel:  [<f8860529>] tw_scsi_queue+0x163/0x1f1 
> [3w_xxxx]
> Jan  6 01:04:10 $SERVER kernel:  [<f883f748>] 
> scsi_dispatch_cmd+0x1e9/0x24f [scsi_mod]
> Jan  6 01:04:10 $SERVER kernel:  [<f884417e>] 
> scsi_request_fn+0x297/0x30d [scsi_mod]
> Jan  6 01:04:10 $SERVER kernel:  [<c0221f68>] 
> __generic_unplug_device+0x2b/0x2d
> Jan  6 01:04:10 $SERVER kernel:  [<c0221f7f>] 
> generic_unplug_device+0x15/0x21
> Jan  6 01:04:10 $SERVER kernel:  [<c022288d>] blk_execute_rq+0x88/0xb0
> Jan  6 01:04:10 $SERVER kernel:  [<c022096e>] elv_set_request+0xa/0x17
> Jan  6 01:04:10 $SERVER kernel:  [<c022251f>] get_request+0x1de/0x1e8
> Jan  6 01:04:10 $SERVER kernel:  [<c01a97c3>] task_has_capability+0x4a/0x52
> Jan  6 01:04:10 $SERVER kernel:  [<c0225cd5>] sg_scsi_ioctl+0x2bf/0x3c1
> Jan  6 01:04:10 $SERVER kernel:  [<c02261aa>] scsi_cmd_ioctl+0x3d3/0x475
> Jan  6 01:04:10 $SERVER kernel:  [<c014d41b>] handle_mm_fault+0xbd/0x175
> Jan  6 01:04:10 $SERVER kernel:  [<c011ad67>] do_page_fault+0x1ae/0x5c6
> Jan  6 01:04:10 $SERVER kernel:  [<c014e4a6>] vma_adjust+0x286/0x2d6
> Jan  6 01:04:10 $SERVER kernel:  [<f88228ea>] sd_ioctl+0xb3/0xd4 [sd_mod]
> Jan  6 01:04:10 $SERVER kernel:  [<c02246e8>] blkdev_ioctl+0x328/0x334
> 
> The 3ware boards are running the latest firmware and I'm using the stock 
> driver in the kernel.  I *did* have to upgrade mdadm to 1.12.0, as the 
> stock version doesn't support stacked arrays.  3dmd is running.
> 
> Any ideas as to what I can do to prevent this in the future?  Having the 
> system hang every time a disk dies is, well, less than optimal.

I have a similar problem with using the hardware raid(mirror).  Every 
time 3dmd started a scheduled verify at midnight... anywhere from 0 to 
26 minutes later the kernel would crash.  This happened every night at 
12.  I finally disabled the verify task in 3dmd and the crashes stopped. 
  I now just use smartd to do extended tests which do not show any 
problems with the disks.  The crash dump and log indicates that port0 is 
bad though.

I have the crash dumps and it is reproducible if I enable verify 
again...  Anyone know of a way to get to the bottom of the crash and 
find a fix?  I keep getting the feeling of "See... you should have 
bought RHEL to get support!".   Too expensive for my use of this system 
though.

3w-xxxx: scsi0: Command failed: status = 0xc4, flags = 0x3b, unit #0.
3w-xxxx: scsi0: AEN: INFO: Initialization started: Unit #0.
3w-xxxx: scsi0: AEN: INFO: Initialization started: Unit #0.
3w-xxxx: scsi0: AEN: INFO: Initialization complete: Unit #0.
3w-xxxx: scsi0: AEN: INFO: Verify started: Unit #0.
3w-xxxx: scsi0: AEN: INFO: Verify started: Unit #0.
3w-xxxx: scsi0: AEN: INFO: Verify complete: Unit #0.
3w-xxxx: scsi0: AEN: INFO: Verify started: Unit #0.
3w-xxxx: scsi0: AEN: INFO: Verify complete: Unit #0.
3w-xxxx: scsi0: AEN: INFO: Verify started: Unit #0.
3w-xxxx: scsi0: AEN: INFO: Verify started: Unit #0.
3w-xxxx: scsi0: AEN: ERROR: Verify failed: Port #0.
3w-xxxx: scsi0: AEN: INFO: Initialization started: Unit #0.
Debug: sleeping function called from invalid context at 
include/asm/uaccess.h:556
in_atomic():0[expected: 0], irqs_disabled():1
  [<c011df50>] __might_sleep+0x7d/0x89
  [<d0840550>] tw_ioctl+0x466/0xd80 [3w_xxxx]
  [<d0858dfc>] scsi_done+0x0/0x16 [scsi_mod]
  [<d08422a0>] tw_scsi_queue+0x222/0x312 [3w_xxxx]
  [<d0858bee>] scsi_dispatch_cmd+0x2f6/0x3ad [scsi_mod]
  [<d085eab2>] scsi_request_fn+0x385/0x5b4 [scsi_mod]
  [<c024ee5a>] __generic_unplug_device+0x2b/0x2d
  [<c024eed7>] generic_unplug_device+0x7b/0xe0
  [<c024fbd6>] blk_execute_rq+0xbb/0xe3
  [<c011e57b>] autoremove_wake_function+0x0/0x2d
  [<c02573c5>] cfq_set_request+0x33/0x6b
  [<c0257392>] cfq_set_request+0x0/0x6b
  [<c024db99>] elv_set_request+0xa/0x17
  [<c024f77b>] get_request+0x395/0x39f
  [<c0253181>] sg_scsi_ioctl+0x2bf/0x3c1
  [<c0253656>] scsi_cmd_ioctl+0x3d3/0x475
  [<c0157a48>] do_no_page+0x55/0x3bf
  [<c018103f>] dput+0x33/0x423
  [<c0157f92>] handle_mm_fault+0xd5/0x1fd
  [<c011a8ed>] do_page_fault+0x1ac/0x4dc
  [<d081a8f4>] sd_ioctl+0xb6/0xd7 [sd_mod]
  [<c0251d86>] blkdev_ioctl+0x32b/0x337
  [<c0171c2e>] block_ioctl+0x11/0x13
  [<c017c0f1>] sys_ioctl+0x297/0x336
  [<c030f8cb>] syscall_call+0x7/0xb
  [<c030007b>] xfrm_policy_gc_kill+0x39/0x68
Debug: sleeping function called from invalid context at 
include/asm/uaccess.h:531
in_atomic():0[expected: 0], irqs_disabled():1
  [<c011df50>] __might_sleep+0x7d/0x89
  [<d0840b85>] tw_ioctl+0xa9b/0xd80 [3w_xxxx]
  [<d0858dfc>] scsi_done+0x0/0x16 [scsi_mod]
  [<d08422a0>] tw_scsi_queue+0x222/0x312 [3w_xxxx]
  [<d0858bee>] scsi_dispatch_cmd+0x2f6/0x3ad [scsi_mod]
  [<d085eab2>] scsi_request_fn+0x385/0x5b4 [scsi_mod]
  [<c024ee5a>] __generic_unplug_device+0x2b/0x2d
  [<c024eed7>] generic_unplug_device+0x7b/0xe0
  [<c024fbd6>] blk_execute_rq+0xbb/0xe3
  [<c011e57b>] autoremove_wake_function+0x0/0x2d
  [<c02573c5>] cfq_set_request+0x33/0x6b
  [<c0257392>] cfq_set_request+0x0/0x6b
  [<c024db99>] elv_set_request+0xa/0x17
  [<c024f77b>] get_request+0x395/0x39f
  [<c0253181>] sg_scsi_ioctl+0x2bf/0x3c1
  [<c0253656>] scsi_cmd_ioctl+0x3d3/0x475
  [<c0157a48>] do_no_page+0x55/0x3bf
  [<c018103f>] dput+0x33/0x423
  [<c0157f92>] handle_mm_fault+0xd5/0x1fd
  [<c011a8ed>] do_page_fault+0x1ac/0x4dc
  [<d081a8f4>] sd_ioctl+0xb6/0xd7 [sd_mod]
  [<c0251d86>] blkdev_ioctl+0x32b/0x337
  [<c0171c2e>] block_ioctl+0x11/0x13
  [<c017c0f1>] sys_ioctl+0x297/0x336
  [<c030f8cb>] syscall_call+0x7/0xb
  [<c030007b>] xfrm_policy_gc_kill+0x39/0x68
Kernel panic - not syncing: drivers/scsi/scsi_lib.c:1245: 
spin_lock(drivers/scsi/hosts.c:cf950034) already locked by 
drivers/scsi/3w-xxxx.c/1950