El lun, 23-10-2006 a las 16:28 +0400, Kirill Korotaev escribió:
J.J.Garcia,
in july 2006 you reported hit of bio.c:99 bug on CentOs kernel: http://lists.centos.org/pipermail/centos/2006-July/067539.html
Recently, we also hit the same bug, so maybe you can provide some details on:
- whether it is doesn't work for you or was fixed.
- how it was triggered
- whether you know the fast way to reproduce
Any help would be appreciated.
Thanks, Kirill
Hi Kirill,
Since then no more news from this subject, sorry about it. You can follow the full story of my testing at CentOS Bug Tracker with 1419 id, http://bugs.centos.org/view.php?id=1419
Actually i'm running the stock kernel 42.0.3 in affected host with a lil more stability appreciated than previous kernels released but not really stable in certain situations (heavy i/o load on external usb 2.0 disk).
[root@fattybox ~]# cat /etc/grub.conf | grep vmlinuz-`uname -r` kernel /vmlinuz-2.6.9-42.0.3.EL ro root=/dev/VolGroup00/LogVol00 ACPI=off vga=0x307 selinux=0
[root@fattybox ~]# dmesg | grep ACPI ACPI: Unable to locate RSDP Kernel command line: ro root=/dev/VolGroup00/LogVol00 ACPI=off vga=0x307 selinux=0 ACPI: Subsystem revision 20040816 ACPI: Interpreter disabled.
What's sure, the system was running rock solid with 4.3 release and 22.0.2 kernel. From that point, any update to kernel or pkgs to 4.4 (34.x.x to 42.x.x) i get the same issue.
The call trace (taken from 1419 id. at bug tracker and re-pasted here) is almost allways the same, and it happends normally when lot of i/o work is performed on an USB external disk, locally or via NFS/exported.
Call Trace: [<c016deb4>] bio_put+0x27/0x28 [<c016d698>] end_bio_bh_io_sync+0x33/0x37 [<c016ea55>] bio_endio+0x4f/0x54 [<c0251ce5>] __end_that_request_first+0xea/0x1ab [<d841cdd1>] scsi_end_request+0x1b/0x174 [scsi_mod] [<d841d268>] scsi_io_completion+0x20b/0x417 [scsi_mod] [<d841800c>] scsi_finish_command+0xad/0xb1 [scsi_mod] [<d8417f31>] scsi_softirq+0xba/0xc2 [scsi_mod] [<c0126a9d>] __do_softirq+0x35/0x79 [<c010934c>] do_softirq+0x46/0x4d
Don't know for you, but im using on that host an PCI/USB 2.0 card, im not having issues/tested yet same situation for hosts with native USB 2.0.
[root@fattybox ~]# lspci 00:00.0 Host bridge: VIA Technologies, Inc. VT8601 [Apollo ProMedia] (rev 05) 00:01.0 PCI bridge: VIA Technologies, Inc. VT8601 [Apollo ProMedia AGP] 00:07.0 ISA bridge: VIA Technologies, Inc. VT82C686 [Apollo Super South] (rev 40) 00:07.1 IDE interface: VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06) 00:07.4 Bridge: VIA Technologies, Inc. VT82C686 [Apollo Super ACPI] (rev 40) 00:07.5 Multimedia audio controller: VIA Technologies, Inc. VT82C686 AC97 Audio Controller (rev 50) 00:09.0 USB Controller: NEC Corporation USB (rev 43) 00:09.1 USB Controller: NEC Corporation USB (rev 43) 00:09.2 USB Controller: NEC Corporation USB 2.0 (rev 04) 00:0d.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ (rev 10) 01:00.0 VGA compatible controller: Trident Microsystems CyberBlade/i1 (rev 6a)
For stock kernel 42.0.3 i had 2 panics, one of them related to fs/bio.c:
[root@spoolbox crash]# grep -R 2.6.9-42.0.3.EL 192.168.0.6-2006* 192.168.0.6-2006-10-10-05:39/log:EFLAGS: 00010082 (2.6.9-42.0.3.EL) 192.168.0.6-2006-10-12-10:21/log:EFLAGS: 00010206 (2.6.9-42.0.3.EL)
[root@spoolbox crash]# tail -n 38 192.168.0.6-2006-10-12-10:21/log ------------[ cut here ]------------ kernel BUG at fs/bio.c:99! invalid operand: 0000 [#1] Modules linked in: vfat fat imm eeprom i2c_viapro nfsd exportfs nfs_acl lp netconsole netdump autofs4 via686a i2c_sensor i2c_isa i2c_dev i2c_core rfcomm l2cap lockd sunrpc ip_nat_ftp ip_conntrack_ftp ipt_MASQUERADE iptable_nat ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables dm_multipath hci_usb bluetooth sd_mod usb_storage scsi_mod ohci_hcd ehci_hcd parport_pc parport snd_via82xx snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi snd_seq_device snd soundcore 8139too mii dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod CPU: 0 EIP: 0060:[<c0171399>] Not tainted VLI EFLAGS: 00010206 (2.6.9-42.0.3.EL) EIP is at bio_destructor+0x19/0x3a eax: 00000008 ebx: c2356b00 ecx: cddf1f50 edx: c036d880 esi: 00001000 edi: c0170d59 ebp: 00000000 esp: c03fff24 ds: 007b es: 007b ss: 0068 Process usb-storage (pid: 2030, threadinfo=c03ff000 task=d5c9e130) Stack: c2356b00 c01715a8 c0170d8c c2356b00 c0172149 00001000 c2356b00 00000000 c03fff6c c0258da9 c8c4386c 00000000 00000000 00005000 00002000 c8c4386c d56de720 d57c6340 00000001 d840add1 00000001 d5509560 c8c4386c d57c6340 Call Trace: [<c01715a8>] bio_put+0x27/0x28 [<c0170d8c>] end_bio_bh_io_sync+0x33/0x37 [<c0172149>] bio_endio+0x4f/0x54 [<c0258da9>] __end_that_request_first+0xea/0x1ab [<d840add1>] scsi_end_request+0x1b/0x174 [scsi_mod] [<d840b268>] scsi_io_completion+0x20b/0x417 [scsi_mod] [<d840600c>] scsi_finish_command+0xad/0xb1 [scsi_mod] [<d8405f31>] scsi_softirq+0xba/0xc2 [scsi_mod] [<c0129a8d>] __do_softirq+0x35/0x79 [<c0109446>] do_softirq+0x46/0x4d ======================= [<c01089f7>] do_IRQ+0x2b3/0x2bf [<c031983c>] common_interrupt+0x18/0x20 [<c0316019>] __down_interruptible+0x13f/0x24a [<c0120049>] default_wake_function+0x0/0xc [<c0316137>] __down_failed_interruptible+0x7/0xc [<d83f4b13>] .text.lock.usb+0xf/0x78 [usb_storage] [<c0318d7e>] ret_from_fork+0x6/0x14 [<d83f3d24>] usb_stor_control_thread+0x0/0x417 [usb_storage] [<d83f3d24>] usb_stor_control_thread+0x0/0x417 [usb_storage] [<c01041dd>] kernel_thread_helper+0x5/0xb Code: 4d 1a 00 e9 cc c9 ff ff e8 cd 4d 1a 00 e9 26 ca ff ff 53 89 c3 8b 40 10 c1 e8 1c 89 c2 c1 e2 04 81 c2 00 d8 36 c0 83 f8 05 7e 08 <0f> 0b 63 00 f0 d8 32 c0 8b 43 30 8b 52 0c e8 90 ce fd ff 89 d8
[root@spoolbox crash]# tail -n 31 192.168.0.6-2006-10-10-05:39/log kernel BUG at kernel/panic.c:75! invalid operand: 0000 [#1] Modules linked in: nls_utf8 ppp_deflate zlib_deflate ppp_async crc_ccitt ppp_generic slhc vfat fat imm eeprom i2c_viapro nfsd exportfs nfs_acl lp netconsole netdump autofs4 via686a i2c_sensor i2c_isa i2c_dev i2c_core rfcomm l2cap lockd sunrpc ip_nat_ftp ip_conntrack_ftp ipt_MASQUERADE iptable_nat ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables dm_multipath hci_usb bluetooth sd_mod usb_storage scsi_mod ohci_hcd ehci_hcd parport_pc parport snd_via82xx snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi snd_seq_device snd soundcore 8139too mii dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod CPU: 0 EIP: 0060:[<c0123db2>] Not tainted VLI EFLAGS: 00010082 (2.6.9-42.0.3.EL) EIP is at panic+0x47/0x142 eax: 00000065 ebx: c2356b00 ecx: c032a703 edx: d75fbe40 esi: d75fbe8c edi: d75fbe94 ebp: 00000000 esp: d75fbe48 ds: 007b es: 007b ss: 0068 Process kswapd0 (pid: 41, threadinfo=d75fb000 task=d75fc050) Stack: c2356b00 c014b87e c0322158 c032c3b3 00000295 c2356b10 00000000 d75fbe8c c036e120 c0155ba7 d75fbe94 c2356a20 00000000 c015602f 0000000e ffffffff c2356b00 00000000 00000000 c1290400 c01557c6 c036c9b4 00000286 00000000 Call Trace: [<c014b87e>] find_get_pages+0xb6/0xf4 [<c0155ba7>] pagevec_lookup+0x17/0x1d [<c015602f>] invalidate_mapping_pages+0xb2/0xc5 [<c01557c6>] __pagevec_release+0x15/0x1d [<c016eceb>] remove_inode_buffers+0x12/0x112 [<c01899a8>] prune_icache+0x1a7/0x354 [<c0189b69>] shrink_icache_memory+0x14/0x2b [<c015630b>] shrink_slab+0xf7/0x14c [<c015794b>] balance_pgdat+0x1b3/0x2cb [<c0157b1c>] kswapd+0xb9/0xbb [<c0121853>] autoremove_wake_function+0x0/0x2d [<c0318d7e>] ret_from_fork+0x6/0x14 [<c0121853>] autoremove_wake_function+0x0/0x2d [<c0157a63>] kswapd+0x0/0xbb [<c01041dd>] kernel_thread_helper+0x5/0xb Code: 40 c0 e8 3b 5d 0c 00 68 60 7f 40 c0 68 03 a7 32 c0 e8 64 0b 00 00 83 c4 0c 83 3d 1c a8 42 c0 00 75 09 83 3d 18 a8 42 c0 00 74 08 <0f> 0b 4b 00 26 a7 32 c0 31 c0 e8 53 97 ff ff 31 d2 b9 60 7f 40
Finally, to reproduce iirc i simply have to perform two r/w commands concurrently, i.e an 'dd' command writing to that external usb disk and at the same time a second r/w operation operation to that disk, i.e. via nfs or even on console, after hours/secs the condition is triggered.
I know usb external disks are used with scsi emulation from the point of view of kernel and they are not a really good approach for massive data storing (rather using scsi units mainly/directly) but i have almost a TB on that disks...
By the way i use the workaround of not writing concurrently to that disk for a long time or for long chunks... heh, no more clues unless i rescue the old 4.3 and 22.0.2 raw hd/image for that old host serving as NAS.
Don't know if all of this will be helping, anyway...
Good luck!
Jose.