[CentOS] CentOS 6.4 tcp_fastretrans_alert causes panic

Mon Nov 28 07:29:04 UTC 2016
Zhang Qiang <dotslash.lu at gmail.com>

Hi all,

Our kernel is 2.6.32-358.14.1.x86_64. Recently dozens of machines running it
panicked. They had been fine for a long time and the problem emerged all of
a sudden, so I'm not sure whether an upgrade caused it. Here's what I got
from the backtrace:

PID: 8136   TASK: ffff8803341aead0  CPU: 2   COMMAND: ""
 #0 [ffff880028283610] panic at ffffffff815286b8
 #1 [ffff880028283690] oops_end at ffffffff8152c8a2
 #2 [ffff8800282836c0] no_context at ffffffff81046c1b
 #3 [ffff880028283710] __bad_area_nosemaphore at ffffffff81046ea5
 #4 [ffff880028283760] bad_area_nosemaphore at ffffffff81046f73
 #5 [ffff880028283770] __do_page_fault at ffffffff810476d1
 #6 [ffff880028283890] do_page_fault at ffffffff8152e7be
 #7 [ffff8800282838c0] page_fault at ffffffff8152bb75
    [exception RIP: tcp_fastretrans_alert+2754]
    RIP: ffffffff814aed62  RSP: ffff880028283970  RFLAGS: 00010246
    RAX: 0000000000000002  RBX: ffff88003d22c940  RCX: 0000000000000002
    RDX: 0000000000000000  RSI: 0000000000000003  RDI: 0000000000000000
    RBP: ffff8800282839b0   R8: 000000018033a9ac   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000d03  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
 #8 [ffff8800282839b8] tcp_ack at ffffffff814afb2c
 #9 [ffff880028283a88] tcp_rcv_state_process at ffffffff814b1128
#10 [ffff880028283b18] tcp_v4_do_rcv at ffffffff814b94f0
#11 [ffff880028283bb8] tcp_v4_rcv at ffffffff814baf9a
#12 [ffff880028283c48] ip_local_deliver_finish at ffffffff8149648d
#13 [ffff880028283c78] ip_local_deliver at ffffffff81496718
#14 [ffff880028283ca8] ip_rcv_finish at ffffffff81495bbd
#15 [ffff880028283ce8] ip_rcv at ffffffff81496155
#16 [ffff880028283d28] __netif_receive_skb at ffffffff8145db5b
#17 [ffff880028283d88] netif_receive_skb at ffffffff814621b8
#18 [ffff880028283dc8] virtnet_poll at ffffffffa0130565 [virtio_net]
#19 [ffff880028283e68] net_rx_action at ffffffff81463193
#20 [ffff880028283ec8] __do_softirq at ffffffff81078c71
#21 [ffff880028283f38] call_softirq at ffffffff8100c1cc
#22 [ffff880028283f50] do_softirq at ffffffff8100de05
#23 [ffff880028283f70] irq_exit at ffffffff81078a55
#24 [ffff880028283f80] do_IRQ at ffffffff81532365
--- <IRQ stack> ---
#25 [ffff88001e851f58] ret_from_intr at ffffffff8100b9d3
    RIP: 00007fa080e1a538  RSP: 00007fa0781ec960  RFLAGS: 00000206
    RAX: 0000000000000001  RBX: 00007fa0781ec9a0  RCX: 000000000001ef8c
    RDX: 0000000000001000  RSI: 0000000000000006  RDI: 00007fa07c093df8
    RBP: ffffffff8100b9ce   R8: 0000000000000006   R9: 0000000004000001
    R10: 0000000000000001  R11: 0000000000000246  R12: 0000000000000000
    R13: 00007fa0710a18f0  R14: 0000000000000120  R15: 0000000000001000
    ORIG_RAX: ffffffffffffff8e  CS: 0033  SS: 002b

Disassembling tcp_fastretrans_alert+2754 gives:

0xffffffff814aed62 <tcp_fastretrans_alert+2754>:        sub    0x58(%rdi),%r8d
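
Reading that instruction against the register dump: the memory operand is
0x58(%rdi), and RDI is 0 in the exception frame, so the load targets address
0x58. That looks like a NULL pointer dereference at offset 0x58 of some
structure; which structure and which field is not established here. A
minimal Python sketch of the arithmetic follows. The helper names are
hypothetical, and only the register values are taken from the dump above.

```python
import re

# Register values copied from the exception frame in the backtrace above.
REGISTER_DUMP = """
RIP: ffffffff814aed62  RSP: ffff880028283970  RFLAGS: 00010246
RAX: 0000000000000002  RBX: ffff88003d22c940  RCX: 0000000000000002
RDX: 0000000000000000  RSI: 0000000000000003  RDI: 0000000000000000
RBP: ffff8800282839b0   R8: 000000018033a9ac   R9: 0000000000000000
"""


def parse_registers(dump):
    """Return {lowercase register name: integer value} from 'NAME: hex' pairs."""
    pairs = re.findall(r"(\w+):\s+([0-9a-f]+)", dump)
    return {name.lower(): int(value, 16) for name, value in pairs}


def effective_address(operand, regs):
    """Resolve an AT&T-syntax 'disp(%base)' memory operand to an address.

    Hypothetical helper: handles only the simple base+displacement form,
    not indexed or scaled operands.
    """
    match = re.fullmatch(r"(-?0x[0-9a-f]+)?\(%(\w+)\)", operand)
    if match is None:
        raise ValueError("unsupported operand: " + operand)
    displacement = int(match.group(1) or "0", 16)
    return regs[match.group(2)] + displacement


regs = parse_registers(REGISTER_DUMP)
fault_address = effective_address("0x58(%rdi)", regs)
print(hex(fault_address))  # prints 0x58
```

An address this close to zero is not mapped in kernel space, which is
consistent with the page_fault and bad_area_nosemaphore frames in the trace.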

I know this kernel is a bit old, but these machines are in a production
environment, so I can't just upgrade them all to test whether the old
version is the problem. I'd appreciate advice on how to debug this, or a
pointer to a relevant bug report.
Thanks.