Hi all!
We are testing a Xen virtualisation setup on CentOS with DRBD 8.2.6. The primary node panics and reboots just seconds after we unplug the dedicated DRBD (crossover) connection. It happens every time we pull the cable while the DRBD devices are primary and Xen VMs are running. I thought an upgrade or downgrade might solve it, but 8.0.13, 8.0.14, 8.2.7 and 8.3 all behave exactly the same way. So the failure seems to be related less to DRBD and more to Xen or the xenified kernel. I have not found a user list for the CentOS kernel, so I am trying this one. Has anyone seen the same thing, or does anyone have hints on how to solve it?
Our setup uses a PV -> LVM -> DRBD -> Xen hierarchy. Do you think it would help if we changed it to PV -> DRBD -> LVM -> Xen?

Dell PowerEdge 1950 with 2 Broadcom bnx2 NICs
CentOS 5.2: Linux 2.6.18-92.1.18.el5xen #1 SMP Wed Nov 12 09:48:10 EST 2008 x86_64 x86_64 x86_64 GNU/Linux

The console output using DRBD 8.3:

(XEN) Freed 100kB init memory.
kernel direct mapping tables up to f32be000 @ 1646000-2584000
PCI: BIOS Bug: MCFG area at e0000000 is not E820-reserved
PCI: Not using MMCONFIG.
Bridge firewalling registered
virbr0: Dropping NETIF_F_UFO since no NETIF_F_HW_CSUM feature.
xenbr0: Dropping NETIF_F_UFO since no NETIF_F_HW_CSUM feature.
audit(1229950558.868:3): dev=vif0.0 prom=256 old_prom=0 auid=4294967295 ses=4294967295
audit(1229950561.088:4): dev=peth0 prom=256 old_prom=0 auid=4294967295 ses=4294967295
audit(1229950585.285:5): dev=vif1.0 prom=256 old_prom=0 auid=4294967295 ses=4294967295
kernel direct mapping tables up to 20800000 @ d7b000-f87000
blkback: ring-ref 8, event-channel 6, protocol 1 (x86_64-abi)
blkback: ring-ref 9, event-channel 7, protocol 1 (x86_64-abi)
blkback: ring-ref 10, event-channel 8, protocol 1 (x86_64-abi)
blkback: ring-ref 11, event-channel 9, protocol 1 (x86_64-abi)
drbd3: PingAck did not arrive in time.
drbd3: short read expecting header on sock: r=-512
drbd2: PingAck did not arrive in time.
drbd2: short read expecting header on sock: r=-512
drbd0: PingAck did not arrive in time.
drbd0: short read expecting header on sock: r=-512
drbd1: PingAck did not arrive in time.
drbd1: short read expecting header on sock: r=-512
drbd3: helper command: /sbin/drbdadm fence-peer minor-3 exit code 5 (0x500)
drbd2: helper command: /sbin/drbdadm fence-peer minor-2 exit code 5 (0x500)
drbd0: helper command: /sbin/drbdadm fence-peer minor-0 exit code 5 (0x500)
drbd1: helper command: /sbin/drbdadm fence-peer minor-1 exit code 5 (0x500)
Unable to handle kernel paging request at ffff8800eabba000 RIP:
[<ffffffff802124c2>] csum_partial+0x219/0x4bc
PGD 1646067 PUD 1c4a067 PMD 1da0067 PTE 0
Oops: 0000 [1] SMP
last sysfs file: /module/drbd/parameters/cn_idx
CPU 5
Modules linked in: xt_physdev netloop netbk blktap blkbk bridge drbd(U) ipv6 xfrm_nalgo crypto_api ipt_REJECT xt_state xt_tcpudp iptable_filter ipt_MASQUERADE iptable_nat ip_nat ip_conntrack nfnetlink ip_tables x_tables dm_multipath video sbs backlight i2c_ec i2c_core button battery asus_acpi ac parport_pc lp parport ide_cd e1000e bnx2 shpchp cdrom i5000_edac edac_mc serio_raw sg pcspkr dm_snapshot dm_zero dm_mirror dm_mod ata_piix libata megaraid_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 0, comm: swapper Tainted: G 2.6.18-92.1.18.el5xen #1
RIP: e030:[<ffffffff802124c2>] [<ffffffff802124c2>] csum_partial+0x219/0x4bc
RSP: e02b:ffff880009df3b78 EFLAGS: 00010202
RAX: 0000000000000006 RBX: 0000000000000000 RCX: ffff8800eabba040
RDX: 0000000000000000 RSI: 0000000000000588 RDI: ffff8800eabba000
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000015
R10: 0000000000000016 R11: 00000000000000b1 R12: 0000000000000054
R13: 0000000000000054 R14: ffff8800e9280670 R15: 00000000ce876505
FS: 00002b4db957e340(0000) GS:ffffffff805af280(0000) knlGS:0000000000000000
CS: e033 DS: 002b ES: 002b
Process swapper (pid: 0, threadinfo ffff880001454000, task ffff8800016260c0)
Stack: 0000000000000588 0000000000000588 ffffffff8023d16d 2ea7df79eaa7c080
0000000000000020 00000040ef6760cc ffff8800000005dc 0000000000000001
ffff8800e9280670 ffff8800ef6760cc
Call Trace:
<IRQ> [<ffffffff8023d16d>] skb_checksum+0x123/0x271
[<ffffffff8040a3d9>] skb_checksum_help+0x71/0xd0
[<ffffffff8831233e>] :iptable_nat:ip_nat_fn+0x56/0x1c3
[<ffffffff882ee50d>] :ip_conntrack:ip_conntrack_in+0x374/0x46a
[<ffffffff883126cf>] :iptable_nat:ip_nat_local_fn+0x32/0xb7
[<ffffffff802351ae>] nf_iterate+0x41/0x7d
[<ffffffff80428040>] dst_output+0x0/0xe
[<ffffffff802588e4>] nf_hook_slow+0x58/0xbc
[<ffffffff80428040>] dst_output+0x0/0xe
[<ffffffff80235662>] ip_queue_xmit+0x431/0x4a1
[<ffffffff80222990>] tcp_transmit_skb+0x64a/0x682
[<ffffffff804320f4>] tcp_retransmit_skb+0x53d/0x638
[<ffffffff8043362a>] tcp_write_timer+0x0/0x699
[<ffffffff80433aa2>] tcp_write_timer+0x478/0x699
[<ffffffff80292b1e>] run_timer_softirq+0x13f/0x1c6
[<ffffffff802127c7>] __do_softirq+0x62/0xde
[<ffffffff80260da0>] call_softirq+0x1c/0x27c
[<ffffffff8026dcd2>] do_softirq+0x31/0x98
[<ffffffff8026db4d>] do_IRQ+0xec/0xf5
[<ffffffff803a0a98>] evtchn_do_upcall+0x86/0xe0
[<ffffffff802608d2>] do_hypervisor_callback+0x1e/0x2c
<EOI> [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
[<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
[<ffffffff8026f139>] raw_safe_halt+0x84/0xa8
[<ffffffff8026c683>] xen_idle+0x38/0x4a
[<ffffffff8024aa45>] cpu_idle+0x97/0xba
Code: 4c 03 07 4c 13 47 08 4c 13 47 10 4c 13 47 18 4c 13 47 20 4c
RIP [<ffffffff802124c2>] csum_partial+0x219/0x4bc
RSP <ffff880009df3b78>
CR2: ffff8800eabba000
<0>Kernel panic - not syncing: Fatal exception
(XEN) Domain 0 crashed: rebooting machine in 5 seconds.
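
P.S. To make the two storage layouts I asked about a bit more concrete, here is a rough sketch of what each one means in terms of commands and config. Every name and size in it (/dev/sdb1, vg0, vg_drbd, vm1, 10G, xvda) is a placeholder chosen for illustration, not our real configuration, and the drbd.conf and Xen lines are shown only as comments.

```sh
# Current layout: PV -> LVM -> DRBD -> Xen
# (hypothetical names: /dev/sdb1, vg0, vm1)
pvcreate /dev/sdb1
vgcreate vg0 /dev/sdb1
lvcreate -L 10G -n vm1 vg0          # one logical volume per VM disk
# drbd.conf, one resource per VM disk:
#   device /dev/drbd0;  disk /dev/vg0/vm1;
# Xen domU config:
#   disk = [ 'phy:/dev/drbd0,xvda,w' ]

# Alternative layout: PV -> DRBD -> LVM -> Xen
# drbd.conf, a single resource on the raw partition:
#   device /dev/drbd0;  disk /dev/sdb1;
pvcreate /dev/drbd0                 # the DRBD device itself becomes the LVM physical volume
vgcreate vg_drbd /dev/drbd0
lvcreate -L 10G -n vm1 vg_drbd
# Xen domU config:
#   disk = [ 'phy:/dev/vg_drbd/vm1,xvda,w' ]
```

In the first layout every VM disk is its own DRBD device (which is why the log shows drbd0 to drbd3), while in the second a single DRBD device would carry all the logical volumes.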