(please CC on replies, I'm not subscribed to linux-ide@)
Hi,
sata_mv (driving two 8-port Supermicro AOC-SAT2-MV8 PCI-X adapters) in 2.6.18-1.2747.el5 (RHEL/CentOS 5 beta kernel) didn't respond too well to one of the attached disks experiencing what seems to be a head crash.
Apr 4 13:52:20 duality kernel: ata5: Entering mv_eng_timeout Apr 4 13:52:20 duality kernel: mmio_base f8980000 ap f7b442dc qc f7b44cf8 scsi_cmnd e719ee00 &cmnd e719ee38 Apr 4 13:52:30 duality kernel: ata5: no sense translation for status: 0x40 Apr 4 13:52:30 duality kernel: ata5: translated ATA stat/err 0x40/00 to SCSI SK/ASC/ASCQ 0xb/00/00 Apr 4 13:52:30 duality kernel: ata5: status=0x40 { DriveReady } Apr 4 13:52:30 duality kernel: sd 4:0:0:0: SCSI error: return code = 0x08000002 Apr 4 13:52:30 duality kernel: sde: Current: sense key: Aborted Command Apr 4 13:52:30 duality kernel: Additional sense: No additional sense information Apr 4 13:52:30 duality kernel: end_request: I/O error, dev sde, sector 684432191
At this point, the machine got into an endless loop where it would completely freeze for a couple of seconds every minute or so (busy wait in kernel space?), during which time it wouldn't respond to keyboard input, ping packets or any other input.
Every time when it unfroze after being frozen for a couple of seconds, it would spit out a similar mv_eng_timeout message as above, and it would reply to the ping packets sent to it while it was frozen (i.e. a sudden burst of ping reply packets with ping times of 1ms, 1001ms, 2001ms, 3001ms, 4001ms, 5001ms etc), which makes me think it was just spinning in kernelspace somewhere.
Apr 4 13:56:02 duality kernel: BUG: soft lockup detected on CPU#3! Apr 4 13:56:02 duality kernel: [<c04051ba>] dump_trace+0x69/0x1af Apr 4 13:56:02 duality kernel: [<c0405318>] show_trace_log_lvl+0x18/0x2c Apr 4 13:56:02 duality kernel: [<c04058cc>] show_trace+0xf/0x11 Apr 4 13:56:02 duality kernel: [<c04059c9>] dump_stack+0x15/0x17 Apr 4 13:56:02 duality kernel: [<c044d4da>] softlockup_tick+0xa6/0xb4 Apr 4 13:56:02 duality kernel: [<c042e32a>] update_process_times+0x39/0x5c Apr 4 13:56:02 duality kernel: [<c04188d4>] smp_apic_timer_interrupt+0x5c/0x64 Apr 4 13:56:02 duality kernel: [<c0404a8b>] apic_timer_interrupt+0x1f/0x24 Apr 4 13:56:02 duality kernel: DWARF2 unwinder stuck at apic_timer_interrupt+0x1f/0x24 Apr 4 13:56:02 duality kernel: Leftover inexact backtrace: Apr 4 13:56:02 duality kernel: [<c0610468>] _spin_unlock_irqrestore+0x8/0x9 Apr 4 13:56:02 duality kernel: [<f88cea77>] mv_eng_timeout+0xac/0x105 [sata_mv] Apr 4 13:56:02 duality kernel: [<f888728d>] scsi_error_handler+0x0/0x9c7 [scsi_mod] Apr 4 13:56:02 duality kernel: [<f88f8fdc>] ata_scsi_error+0x3c6/0x4be [libata] Apr 4 13:56:02 duality kernel: [<f8884217>] __scsi_iterate_devices+0x50/0x58 [scsi_mod] Apr 4 13:56:02 duality kernel: [<f888728d>] scsi_error_handler+0x0/0x9c7 [scsi_mod] Apr 4 13:56:02 duality kernel: [<f888732c>] scsi_error_handler+0x9f/0x9c7 [scsi_mod] Apr 4 13:56:02 duality kernel: [<c041e4f6>] complete+0x2b/0x3d Apr 4 13:56:02 duality kernel: [<f888728d>] scsi_error_handler+0x0/0x9c7 [scsi_mod] Apr 4 13:56:02 duality kernel: [<c0436620>] kthread+0xc0/0xec Apr 4 13:56:02 duality kernel: [<c0436560>] kthread+0x0/0xec Apr 4 13:56:02 duality kernel: [<c0404d63>] kernel_thread_helper+0x7/0x10 Apr 4 13:56:02 duality kernel: =======================
All I/O to the RAID array that this disk was a member of (12 disk 6TB software RAID6 array) froze completely. It did mark the broken disk failed, but didn't recover from the failure. I had to reboot the box with the power switch, as '/sbin/reboot -f' via ssh would also get stuck in D state.
Any other info I can give? I still have the crashed disk in case anyone wants me to do some tests with it..
thanks, Lennert
On Wed, Apr 04, 2007 at 03:25:33PM +0200, Lennert Buytenhek wrote:
sata_mv (driving two 8-port Supermicro AOC-SAT2-MV8 PCI-X adapters) in 2.6.18-1.2747.el5 (RHEL/CentOS 5 beta kernel) didn't respond too well to one of the attached disks experiencing what seems to be a head crash.
Current sata_mv driver has no EH (Error Reporting) code, so this is expected to hang your machine if HW problem occur, see my bugreport for more details:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=233211
[a bunch of lines deleted]