[CentOS-virt] LVM->DRBD->Xen Kernel panic after DRBD connection broken

Mon Dec 22 12:49:26 UTC 2008
Maros TIMKO <timko at pobox.sk>

Hi all!

We are testing a Xen virtualisation platform on the CentOS distribution with DRBD 8.2.6. The primary node panics and reboots just seconds after we unplug the dedicated DRBD (crossover) link. The failure occurs every time we pull the cable while the DRBD devices are primary and Xen VMs are running. I thought an upgrade or downgrade could solve it, but 8.0.13, 8.0.14, 8.2.7 and 8.3 all behave exactly the same way. So the failure seems to be related less to DRBD than to Xen or the xenified kernel. I have not found a user list for the CentOS kernel, so I am trying this one.
I would like to ask whether anyone has had the same experience, or has any hints on how to solve this issue.

Our setup uses a PV -> LVM -> DRBD -> Xen hierarchy.
Do you think we could solve it by changing it to PV -> DRBD -> LVM -> Xen?

Hardware: Dell PowerEdge 1950 with 2 Broadcom bnx2 NICs
OS: CentOS 5.2: Linux 2.6.18-92.1.18.el5xen #1 SMP Wed Nov 12 09:48:10 EST 2008 x86_64 x86_64 x86_64 GNU/Linux

The console output using DRBD 8.3:
(XEN) Freed 100kB init memory.
kernel direct mapping tables up to f32be000 @ 1646000-2584000
PCI: BIOS Bug: MCFG area at e0000000 is not E820-reserved
PCI: Not using MMCONFIG.
Bridge firewalling registered
virbr0: Dropping NETIF_F_UFO since no NETIF_F_HW_CSUM feature.
xenbr0: Dropping NETIF_F_UFO since no NETIF_F_HW_CSUM feature.
audit(1229950558.868:3): dev=vif0.0 prom=256 old_prom=0 auid=4294967295 ses=4294967295
audit(1229950561.088:4): dev=peth0 prom=256 old_prom=0 auid=4294967295 ses=4294967295
audit(1229950585.285:5): dev=vif1.0 prom=256 old_prom=0 auid=4294967295 ses=4294967295
kernel direct mapping tables up to 20800000 @ d7b000-f87000
blkback: ring-ref 8, event-channel 6, protocol 1 (x86_64-abi)
blkback: ring-ref 9, event-channel 7, protocol 1 (x86_64-abi)
blkback: ring-ref 10, event-channel 8, protocol 1 (x86_64-abi)
blkback: ring-ref 11, event-channel 9, protocol 1 (x86_64-abi)
drbd3: PingAck did not arrive in time.
drbd3: short read expecting header on sock: r=-512
drbd2: PingAck did not arrive in time.
drbd2: short read expecting header on sock: r=-512
drbd0: PingAck did not arrive in time.
drbd0: short read expecting header on sock: r=-512
drbd1: PingAck did not arrive in time.
drbd1: short read expecting header on sock: r=-512
drbd3: helper command: /sbin/drbdadm fence-peer minor-3 exit code 5 (0x500)
drbd2: helper command: /sbin/drbdadm fence-peer minor-2 exit code 5 (0x500)
drbd0: helper command: /sbin/drbdadm fence-peer minor-0 exit code 5 (0x500)
drbd1: helper command: /sbin/drbdadm fence-peer minor-1 exit code 5 (0x500)
Unable to handle kernel paging request at ffff8800eabba000 RIP:
[<ffffffff802124c2>] csum_partial+0x219/0x4bc
PGD 1646067 PUD 1c4a067 PMD 1da0067 PTE 0
Oops: 0000 [1] SMP
last sysfs file: /module/drbd/parameters/cn_idx
Modules linked in: xt_physdev netloop netbk blktap blkbk bridge drbd(U) ipv6 xfrm_nalgo crypto_api ipt_REJECT xt_state xt_tcpudp iptable_filter ipt_MASQUERADE iptable_nat ip_nat ip_conntrack nfnetlink ip_tables x_tables dm_multipath video sbs backlight i2c_ec i2c_core button battery asus_acpi ac parport_pc lp parport ide_cd e1000e bnx2 shpchp cdrom i5000_edac edac_mc serio_raw sg pcspkr dm_snapshot dm_zero dm_mirror dm_mod ata_piix libata megaraid_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 0, comm: swapper Tainted: G      2.6.18-92.1.18.el5xen #1
RIP: e030:[<ffffffff802124c2>]  [<ffffffff802124c2>] csum_partial+0x219/0x4bc
RSP: e02b:ffff880009df3b78  EFLAGS: 00010202
RAX: 0000000000000006 RBX: 0000000000000000 RCX: ffff8800eabba040
RDX: 0000000000000000 RSI: 0000000000000588 RDI: ffff8800eabba000
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000015
R10: 0000000000000016 R11: 00000000000000b1 R12: 0000000000000054
R13: 0000000000000054 R14: ffff8800e9280670 R15: 00000000ce876505
FS:  00002b4db957e340(0000) GS:ffffffff805af280(0000) knlGS:0000000000000000
CS:  e033 DS: 002b ES: 002b
Process swapper (pid: 0, threadinfo ffff880001454000, task ffff8800016260c0)
Stack:  0000000000000588  0000000000000588  ffffffff8023d16d  2ea7df79eaa7c080
0000000000000020  00000040ef6760cc  ffff8800000005dc  0000000000000001
ffff8800e9280670  ffff8800ef6760cc
Call Trace:
<IRQ>  [<ffffffff8023d16d>] skb_checksum+0x123/0x271
[<ffffffff8040a3d9>] skb_checksum_help+0x71/0xd0
[<ffffffff8831233e>] :iptable_nat:ip_nat_fn+0x56/0x1c3
[<ffffffff882ee50d>] :ip_conntrack:ip_conntrack_in+0x374/0x46a
[<ffffffff883126cf>] :iptable_nat:ip_nat_local_fn+0x32/0xb7
[<ffffffff802351ae>] nf_iterate+0x41/0x7d
[<ffffffff80428040>] dst_output+0x0/0xe
[<ffffffff802588e4>] nf_hook_slow+0x58/0xbc
[<ffffffff80428040>] dst_output+0x0/0xe
[<ffffffff80235662>] ip_queue_xmit+0x431/0x4a1
[<ffffffff80222990>] tcp_transmit_skb+0x64a/0x682
[<ffffffff804320f4>] tcp_retransmit_skb+0x53d/0x638
[<ffffffff8043362a>] tcp_write_timer+0x0/0x699
[<ffffffff80433aa2>] tcp_write_timer+0x478/0x699
[<ffffffff80292b1e>] run_timer_softirq+0x13f/0x1c6
[<ffffffff802127c7>] __do_softirq+0x62/0xde
[<ffffffff80260da0>] call_softirq+0x1c/0x27c
[<ffffffff8026dcd2>] do_softirq+0x31/0x98
[<ffffffff8026db4d>] do_IRQ+0xec/0xf5
[<ffffffff803a0a98>] evtchn_do_upcall+0x86/0xe0
[<ffffffff802608d2>] do_hypervisor_callback+0x1e/0x2c
<EOI>  [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
[<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
[<ffffffff8026f139>] raw_safe_halt+0x84/0xa8
[<ffffffff8026c683>] xen_idle+0x38/0x4a
[<ffffffff8024aa45>] cpu_idle+0x97/0xba

Code: 4c 03 07 4c 13 47 08 4c 13 47 10 4c 13 47 18 4c 13 47 20 4c
RIP  [<ffffffff802124c2>] csum_partial+0x219/0x4bc
RSP <ffff880009df3b78>
CR2: ffff8800eabba000
<0>Kernel panic - not syncing: Fatal exception
(XEN) Domain 0 crashed: rebooting machine in 5 seconds.
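
To make the call trace above easier to read, here is a minimal Python sketch that pulls the module and symbol names out of trace lines. It is not part of our setup; the names FRAME_RE, parse_frames and SAMPLE are made up for illustration, and SAMPLE is just a few lines copied from the trace. Read bottom to top, the frames suggest that the fault happens while a TCP retransmit passes through the iptables NAT hook and its checksum is recomputed, though that is only my reading of the output.

```python
import re

# Matches one stack frame such as
#   [<ffffffff8831233e>] :iptable_nat:ip_nat_fn+0x56/0x1c3
# and captures the optional module name and the symbol name.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?::(\w+):)?(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def parse_frames(text):
    """Return (module, symbol) pairs in the order they appear.

    module is None for symbols that are not prefixed with a module name.
    """
    return [(m.group(1), m.group(2)) for m in FRAME_RE.finditer(text)]


# A few frames copied from the call trace in this post.
SAMPLE = """\
<IRQ>  [<ffffffff8023d16d>] skb_checksum+0x123/0x271
[<ffffffff8040a3d9>] skb_checksum_help+0x71/0xd0
[<ffffffff8831233e>] :iptable_nat:ip_nat_fn+0x56/0x1c3
[<ffffffff882ee50d>] :ip_conntrack:ip_conntrack_in+0x374/0x46a
[<ffffffff80235662>] ip_queue_xmit+0x431/0x4a1
[<ffffffff804320f4>] tcp_retransmit_skb+0x53d/0x638
"""

if __name__ == "__main__":
    for module, symbol in parse_frames(SAMPLE):
        # Frames without a module prefix are printed as "kernel".
        print(f"{module or 'kernel'}: {symbol}")
```

The same function can be fed the whole console log; lines that do not look like stack frames are simply skipped.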