[CentOS] 3ware disk failure -> hang
Adam Gibson
agibson at ptm.com
Fri Jan 6 19:47:15 UTC 2006
Joshua Baker-LePain wrote:
> I've got an i386 server running centos 4.2 with 3 3ware controllers in
> it -- an 8006-2 for the system disks and 2 7500-8s. On the 7500s, I'm
> running an all software RAID50. This morning I came in to find the
> system hung. Turns out a disk went overnight on one of the 7500s, and
> rather than a graceful failover I got this:
>
> Jan 6 01:03:58 $SERVER kernel: 3w-xxxx: scsi2: Command failed: status =
> 0xc7,flags = 0x40, unit #3.
> Jan 6 01:04:02 $SERVER kernel: 3w-xxxx: scsi2: AEN: ERROR: Drive error:
> Port #3.
> Jan 6 01:04:10 $SERVER 3w-xxxx[2781]: ERROR: Drive error encountered on
> port 3 on controller ID:2. Check cables and drives for media errors. (0xa)
> Jan 6 01:04:10 $SERVER kernel: Debug: sleeping function called from
> invalid context at include/asm/uaccess.h:556
> Jan 6 01:04:10 $SERVER kernel: in_atomic():0[expected: 0],
> irqs_disabled():1
> Jan 6 01:04:10 $SERVER kernel: [<c011fbe9>] __might_sleep+0x7d/0x88
> Jan 6 01:04:10 $SERVER kernel: [<f885f056>] tw_ioctl+0x478/0xb07
> [3w_xxxx]
> Jan 6 01:04:10 $SERVER kernel: [<c011fec9>]
> autoremove_wake_function+0x0/0x2d
> Jan 6 01:04:10 $SERVER kernel: [<f883f905>] scsi_done+0x0/0x16 [scsi_mod]
> Jan 6 01:04:10 $SERVER kernel: [<f8860529>] tw_scsi_queue+0x163/0x1f1
> [3w_xxxx]
> Jan 6 01:04:10 $SERVER kernel: [<f883f748>]
> scsi_dispatch_cmd+0x1e9/0x24f [scsi_mod]
> Jan 6 01:04:10 $SERVER kernel: [<f884417e>]
> scsi_request_fn+0x297/0x30d [scsi_mod]
> Jan 6 01:04:10 $SERVER kernel: [<c0221f68>]
> __generic_unplug_device+0x2b/0x2d
> Jan 6 01:04:10 $SERVER kernel: [<c0221f7f>]
> generic_unplug_device+0x15/0x21
> Jan 6 01:04:10 $SERVER kernel: [<c022288d>] blk_execute_rq+0x88/0xb0
> Jan 6 01:04:10 $SERVER kernel: [<c022096e>] elv_set_request+0xa/0x17
> Jan 6 01:04:10 $SERVER kernel: [<c022251f>] get_request+0x1de/0x1e8
> Jan 6 01:04:10 $SERVER kernel: [<c01a97c3>] task_has_capability+0x4a/0x52
> Jan 6 01:04:10 $SERVER kernel: [<c0225cd5>] sg_scsi_ioctl+0x2bf/0x3c1
> Jan 6 01:04:10 $SERVER kernel: [<c02261aa>] scsi_cmd_ioctl+0x3d3/0x475
> Jan 6 01:04:10 $SERVER kernel: [<c014d41b>] handle_mm_fault+0xbd/0x175
> Jan 6 01:04:10 $SERVER kernel: [<c011ad67>] do_page_fault+0x1ae/0x5c6
> Jan 6 01:04:10 $SERVER kernel: [<c014e4a6>] vma_adjust+0x286/0x2d6
> Jan 6 01:04:10 $SERVER kernel: [<f88228ea>] sd_ioctl+0xb3/0xd4 [sd_mod]
> Jan 6 01:04:10 $SERVER kernel: [<c02246e8>] blkdev_ioctl+0x328/0x334
>
> The 3ware boards are running the latest firmware and I'm using the stock
> driver in the kernel. I *did* have to upgrade mdadm to 1.12.0, as the
> stock version doesn't support stacked arrays. 3dmd is running.
>
> Any ideas as to what I can do to prevent this in the future? Having the
> system hang every time a disk dies is, well, less than optimal.
I have a similar problem with using the hardware raid(mirror). Every
time 3dmd started a scheduled verify at midnight... anywhere from 0 to
26 minutes later the kernel would crash. This happened every night at
12. I finally disabled the verify task in 3dmd and the crashes stopped.
I now just use smartd to do extended tests which do not show any
problems with the disks. The crash dump and log indicates that port0 is
bad though.
I have the crash dumps and it is reproducible if I enable verify
again... Anyone know of a way to get to the bottom of the crash and
find a fix? I keep getting the feeling of "See... you should have
bought RHEL to get support!". Too expensive for my use of this system
though.
3w-xxxx: scsi0: Command failed: status = 0xc4, flags = 0x3b, unit #0.
3w-xxxx: scsi0: AEN: INFO: Initialization started: Unit #0.
3w-xxxx: scsi0: AEN: INFO: Initialization started: Unit #0.
3w-xxxx: scsi0: AEN: INFO: Initialization complete: Unit #0.
3w-xxxx: scsi0: AEN: INFO: Verify started: Unit #0.
3w-xxxx: scsi0: AEN: INFO: Verify started: Unit #0.
3w-xxxx: scsi0: AEN: INFO: Verify complete: Unit #0.
3w-xxxx: scsi0: AEN: INFO: Verify started: Unit #0.
3w-xxxx: scsi0: AEN: INFO: Verify complete: Unit #0.
3w-xxxx: scsi0: AEN: INFO: Verify started: Unit #0.
3w-xxxx: scsi0: AEN: INFO: Verify started: Unit #0.
3w-xxxx: scsi0: AEN: ERROR: Verify failed: Port #0.
3w-xxxx: scsi0: AEN: INFO: Initialization started: Unit #0.
Debug: sleeping function called from invalid context at
include/asm/uaccess.h:556
in_atomic():0[expected: 0], irqs_disabled():1
[<c011df50>] __might_sleep+0x7d/0x89
[<d0840550>] tw_ioctl+0x466/0xd80 [3w_xxxx]
[<d0858dfc>] scsi_done+0x0/0x16 [scsi_mod]
[<d08422a0>] tw_scsi_queue+0x222/0x312 [3w_xxxx]
[<d0858bee>] scsi_dispatch_cmd+0x2f6/0x3ad [scsi_mod]
[<d085eab2>] scsi_request_fn+0x385/0x5b4 [scsi_mod]
[<c024ee5a>] __generic_unplug_device+0x2b/0x2d
[<c024eed7>] generic_unplug_device+0x7b/0xe0
[<c024fbd6>] blk_execute_rq+0xbb/0xe3
[<c011e57b>] autoremove_wake_function+0x0/0x2d
[<c02573c5>] cfq_set_request+0x33/0x6b
[<c0257392>] cfq_set_request+0x0/0x6b
[<c024db99>] elv_set_request+0xa/0x17
[<c024f77b>] get_request+0x395/0x39f
[<c0253181>] sg_scsi_ioctl+0x2bf/0x3c1
[<c0253656>] scsi_cmd_ioctl+0x3d3/0x475
[<c0157a48>] do_no_page+0x55/0x3bf
[<c018103f>] dput+0x33/0x423
[<c0157f92>] handle_mm_fault+0xd5/0x1fd
[<c011a8ed>] do_page_fault+0x1ac/0x4dc
[<d081a8f4>] sd_ioctl+0xb6/0xd7 [sd_mod]
[<c0251d86>] blkdev_ioctl+0x32b/0x337
[<c0171c2e>] block_ioctl+0x11/0x13
[<c017c0f1>] sys_ioctl+0x297/0x336
[<c030f8cb>] syscall_call+0x7/0xb
[<c030007b>] xfrm_policy_gc_kill+0x39/0x68
Debug: sleeping function called from invalid context at
include/asm/uaccess.h:531
in_atomic():0[expected: 0], irqs_disabled():1
[<c011df50>] __might_sleep+0x7d/0x89
[<d0840b85>] tw_ioctl+0xa9b/0xd80 [3w_xxxx]
[<d0858dfc>] scsi_done+0x0/0x16 [scsi_mod]
[<d08422a0>] tw_scsi_queue+0x222/0x312 [3w_xxxx]
[<d0858bee>] scsi_dispatch_cmd+0x2f6/0x3ad [scsi_mod]
[<d085eab2>] scsi_request_fn+0x385/0x5b4 [scsi_mod]
[<c024ee5a>] __generic_unplug_device+0x2b/0x2d
[<c024eed7>] generic_unplug_device+0x7b/0xe0
[<c024fbd6>] blk_execute_rq+0xbb/0xe3
[<c011e57b>] autoremove_wake_function+0x0/0x2d
[<c02573c5>] cfq_set_request+0x33/0x6b
[<c0257392>] cfq_set_request+0x0/0x6b
[<c024db99>] elv_set_request+0xa/0x17
[<c024f77b>] get_request+0x395/0x39f
[<c0253181>] sg_scsi_ioctl+0x2bf/0x3c1
[<c0253656>] scsi_cmd_ioctl+0x3d3/0x475
[<c0157a48>] do_no_page+0x55/0x3bf
[<c018103f>] dput+0x33/0x423
[<c0157f92>] handle_mm_fault+0xd5/0x1fd
[<c011a8ed>] do_page_fault+0x1ac/0x4dc
[<d081a8f4>] sd_ioctl+0xb6/0xd7 [sd_mod]
[<c0251d86>] blkdev_ioctl+0x32b/0x337
[<c0171c2e>] block_ioctl+0x11/0x13
[<c017c0f1>] sys_ioctl+0x297/0x336
[<c030f8cb>] syscall_call+0x7/0xb
[<c030007b>] xfrm_policy_gc_kill+0x39/0x68
Kernel panic - not syncing: drivers/scsi/scsi_lib.c:1245:
spin_lock(drivers/scsi/hosts.c:cf950034) already locked by
drivers/scsi/3w-xxxx.c/1950
More information about the CentOS
mailing list