[CentOS] vmcore on 5.4

Fri Apr 23 14:50:53 UTC 2010
My LinuxHAList <mylinuxhalist at gmail.com>

Information: 5.4 kernel (2.6.18-164.el5).

I have a vmcore (from kdump). If the developers are interested, let me know
where to upload the vmcore file.

I used the crash command to do a backtrace.

I have managed to get machines running later 5.4 kernels and 5.5 to panic
the same way.
Machines with either Broadcom or Intel NICs panic the same way.

This is an NFS client whose NFS server was restarted several times. The
mount is NFSv3 with the options defaults,noatime.
The client was busy writing to NFS-mounted space while the NFS server was
restarting.
So far, when I mount with the udp option, I have not managed to panic the
machines.
The bad news is that NFSv4 is strictly TCP, if I were to go down that route.
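
To confirm which transport each NFS mount actually uses when comparing the
tcp and udp cases, the mount options can be read from /proc/mounts. The
following is a minimal Python sketch, not part of the original test setup.
The function name nfs_transports is hypothetical, and it assumes the kernel
reports the transport as a proto= option (or a bare tcp/udp flag) in the
options field.

```python
def nfs_transports(mounts_text):
    """Return {mount_point: transport} for NFS entries in /proc/mounts-style text.

    Each line is expected to look like:
        device mount_point fstype options dump pass
    The transport is "unknown" when no proto=/tcp/udp option is present.
    """
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Skip malformed lines and non-NFS filesystems.
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        proto = "unknown"
        for opt in fields[3].split(","):
            if opt.startswith("proto="):
                proto = opt.split("=", 1)[1]
            elif opt in ("tcp", "udp"):
                proto = opt
        result[fields[1]] = proto
    return result
```

On the client, this would be called as nfs_transports(open("/proc/mounts").read()).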

From the backtrace, it seems the crash is TCP-related: the Oops is in
pskb_copy, reached from the TCP retransmit timer path. I'll be trying a
couple of Linux TCP settings changes.
It's possible that the issue is with TCP in general (not NFS).
I would like to enlist the community's help in further understanding this
and in finding potential workarounds for this TCP issue.
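
Before changing any TCP settings, it may help to record the current values
so that each change can be tied to a test run. The following is a minimal
Python sketch under the assumption that the tunables are exposed as tcp_*
files under /proc/sys/net/ipv4. The function names are hypothetical, and the
sketch does not suggest which settings to change.

```python
import os


def snapshot_tcp_sysctls(directory="/proc/sys/net/ipv4"):
    """Return {name: value} for every readable tcp_* file in the directory."""
    settings = {}
    for name in sorted(os.listdir(directory)):
        if not name.startswith("tcp_"):
            continue
        path = os.path.join(directory, name)
        try:
            with open(path) as f:
                settings[name] = f.read().strip()
        except OSError:
            # Some entries may be unreadable without root; skip them.
            continue
    return settings


def diff_snapshots(before, after):
    """Return {name: (old, new)} for settings whose value differs."""
    changed = {}
    for name in sorted(set(before) | set(after)):
        if before.get(name) != after.get(name):
            changed[name] = (before.get(name), after.get(name))
    return changed
```

Taking one snapshot before a change and one after, then passing both to
diff_snapshots, gives a record of exactly what was altered for a given run.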

crash> sys
      KERNEL: vmlinux
    DUMPFILE: vmcore
        CPUS: 4
        DATE: Tue Apr 20 15:04:09 2010
      UPTIME: 18:55:25
LOAD AVERAGE: 0.13, 0.09, 0.03
       TASKS: 340
     RELEASE: 2.6.18-164.el5
     VERSION: #1 SMP Thu Sep 3 03:28:30 EDT 2009
     MACHINE: x86_64  (2660 Mhz)
      MEMORY: 23.6 GB
       PANIC: "Oops: 0000 [1] SMP " (check log for details)
crash> bt -a
PID: 0      TASK: ffffffff802ffae0  CPU: 0   COMMAND: "swapper"
 #0 [ffffffff8043ef20] crash_nmi_callback at ffffffff8007a3bf
 #1 [ffffffff8043ef40] do_nmi at ffffffff8006585a
 #2 [ffffffff8043ef50] nmi at ffffffff80064ebf
    [exception RIP: acpi_processor_idle+579]
    RIP: ffffffff8019765e  RSP: ffffffff803f1f48  RFLAGS: 00000093
    RAX: 000000000073111a  RBX: 000000000073111a  RCX: 0000000000000808
    RDX: 0000000000000815  RSI: 0000000000000003  RDI: 0000000000000000
    RBP: ffff81063e480100   R8: ffffffff803f0000   R9: ffffffff804b5e2c
    R10: 0000000000000046  R11: 0000000000000046  R12: 0000000000000000
    R13: ffff81063e480000  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <exception stack> ---
 #3 [ffffffff803f1f48] acpi_processor_idle at ffffffff8019765e
 #4 [ffffffff803f1f90] cpu_idle at ffffffff8004939e
PID: 0      TASK: ffff810115f11100  CPU: 1   COMMAND: "swapper"
 #0 [ffff810115f38f20] crash_nmi_callback at ffffffff8007a3bf
 #1 [ffff810115f38f40] do_nmi at ffffffff8006585a
 #2 [ffff810115f38f50] nmi at ffffffff80064ebf
    [exception RIP: acpi_processor_idle+579]
    RIP: ffffffff8019765e  RSP: ffff810115f2fea8  RFLAGS: 00000093
    RAX: 0000000000731145  RBX: 0000000000731145  RCX: 0000000000000808
    RDX: 0000000000000815  RSI: 0000000000000003  RDI: 0000000000000000
    RBP: ffff81063f173900   R8: ffff810115f2e000   R9: ffffffff804b5e2c
    R10: 0000000000000046  R11: 0000000000000046  R12: 00000000000000ff
    R13: ffff81063f173800  R14: 0000000000000100  R15: ffffffff803ea280
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <exception stack> ---
 #3 [ffff810115f2fea8] acpi_processor_idle at ffffffff8019765e
 #4 [ffff810115f2fef0] cpu_idle at ffffffff8004939e
PID: 0      TASK: ffff810115f20080  CPU: 2   COMMAND: "swapper"
 #0 [ffff810115f6bbc0] crash_kexec at ffffffff800ac5b9
 #1 [ffff810115f6bc80] __die at ffffffff80065127
 #2 [ffff810115f6bcc0] do_page_fault at ffffffff80066da7
 #3 [ffff810115f6bdb0] error_exit at ffffffff8005dde9
    [exception RIP: pskb_copy+307]
    RIP: ffffffff8022486b  RSP: ffff810115f6be60  RFLAGS: 00010282
    RAX: ffff81062cd5f540  RBX: ffff81062cac3980  RCX: ffff81046fb1e550
    RDX: 0000000000000000  RSI: ffff81062cd5f550  RDI: 0000000000000004
    RBP: ffff810466f54a80   R8: 00000000081f02b4   R9: 0000000000000000
    R10: ffff81062cac3980  R11: 00000000000000c8  R12: 0000000000000220
    R13: ffff810466f54a80  R14: 0000000000000002  R15: ffffffff803ea2a0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #4 [ffff810115f6be78] tcp_transmit_skb at ffffffff800217b7
 #5 [ffff810115f6bec8] tcp_retransmit_skb at ffffffff80250ccd
 #6 [ffff810115f6bf08] tcp_write_timer at ffffffff80252652
 #7 [ffff810115f6bf28] run_timer_softirq at ffffffff800968be
 #8 [ffff810115f6bf58] __do_softirq at ffffffff8001235a
 #9 [ffff810115f6bf88] call_softirq at ffffffff8005e2fc
#10 [ffff810115f6bfa0] do_softirq at ffffffff8006cb14
#11 [ffff810115f6bfb0] apic_timer_interrupt at ffffffff8005dc8e
--- <IRQ stack> ---
#12 [ffff810115f67df8] apic_timer_interrupt at ffffffff8005dc8e
    [exception RIP: acpi_processor_idle+628]
    RIP: ffffffff8019768f  RSP: ffff810115f67ea8  RFLAGS: 00000282
    RAX: ffff810115f67fd8  RBX: ffff81063f173100  RCX: 0000000080184973
    RDX: ffff81063f173000  RSI: 0000000000000082  RDI: ffffffff804b5e2c
    RBP: ffff810115f67ee8   R8: ffff810115f66000   R9: ffff810115f67ecc
    R10: 0000000000000046  R11: ffff810115f67ee8  R12: ffff81063f6e1180
    R13: 0000000010008040  R14: ffff81063f6e1180  R15: ffff81063f6e1180
    ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018
#13 [ffff810115f67ea0] acpi_processor_idle at ffffffff80197685
#14 [ffff810115f67ef0] cpu_idle at ffffffff8004939e
PID: 0      TASK: ffff810115f94100  CPU: 3   COMMAND: "swapper"
 #0 [ffff810115fbbf20] crash_nmi_callback at ffffffff8007a3bf
 #1 [ffff810115fbbf40] do_nmi at ffffffff8006585a
 #2 [ffff810115fbbf50] nmi at ffffffff80064ebf
    [exception RIP: acpi_processor_idle+579]
    RIP: ffffffff8019765e  RSP: ffff810115fb9ea8  RFLAGS: 00000097
    RAX: 0000000000731169  RBX: 0000000000731169  RCX: 0000000000000808
    RDX: 0000000000000815  RSI: 0000000000000003  RDI: 0000000000000000
    RBP: ffff81063f174900   R8: ffff810115fb8000   R9: ffff810115f942f0
    R10: 0000000000000046  R11: 0000000000000046  R12: 00000000000000ff
    R13: ffff81063f174800  R14: 0000000000000300  R15: ffffffff803ea2c0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <exception stack> ---
 #3 [ffff810115fb9ea8] acpi_processor_idle at ffffffff8019765e
 #4 [ffff810115fb9ef0] cpu_idle at ffffffff8004939e
crash> quit