[CentOS] x86_64 reproducible server PANIC with latest kernel

Mon Jul 31 21:28:21 UTC 2006
Kay Diederichs <kay.diederichs at uni-konstanz.de>

Our Tyan VX50 server, which is otherwise stable, can be crashed by a 
specific user who starts a Java application.
Relevant facts are:
x86_64 CentOS 4.3 fully updated (but otherwise default config)
8 Dual-Core Opteron 870 (2 GHz), giving 16 cores
32 GB memory (running at DDR333)

The panic is reproducible within a couple of minutes.
I tried the 2.6.9-42 largesmp kernel from J. Baron's website because it 
has some fixes, but the result is the same as with 2.6.9-34.0.2-largesmp.
The panic is not triggered by memory overcommitment: when the crash 
occurs, there is still 8 GB of free memory and the swap space is untouched.
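
The memory claim above can be checked with something like the following
minimal Python sketch. It is an illustration only, not the procedure used
here: the function names are hypothetical, and it assumes the usual MemFree,
SwapTotal and SwapFree fields of /proc/meminfo, reported in kB.

```python
def read_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: integer} dict.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are
    plain counts and are stored as-is.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value"
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def memory_summary(info):
    """Return free memory in GB and the amount of swap in use in kB."""
    return {
        "free_gb": info["MemFree"] / 1024 ** 2,
        "swap_used_kb": info["SwapTotal"] - info["SwapFree"],
    }


if __name__ == "__main__":
    # Assumption: run on the affected Linux host while the job is active.
    with open("/proc/meminfo") as fh:
        print(memory_summary(read_meminfo(fh.read())))
```

Note that MemFree excludes page cache and buffers, so it understates the
memory that could actually be reclaimed.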

This is the start of the kernel output, which I was able to capture via 
netconsole:

NMI Watchdog detected LOCKUP, CPU=6, registers:
CPU 6
Modules linked in: vmnet(U) vmmon(U) netconsole netdump ppdev nfs nfsd 
exportfs lockd nfs_acl md5 ipv6 parport_pc lp parport autofs4 w83627hf 
i2c_sensor i2c_isa i2c_dev i2c_core sunrpc ipt_REJECT ipt_state 
ip_conntrack iptable_filter ip_tables dm_mirror dm_mod button battery ac 
ohci_hcd ehci_hcd tg3 floppy ext3 jbd raid1 raid0 sata_nv libata sd_mod 
scsi_mod
Pid: 16269, comm: java Tainted: PF     2.6.9-42.ELlargesmp
RIP: 0010:[<ffffffff8030a9aa>] <ffffffff8030a9aa>{.text.lock.spinlock+46}
RSP: 0000:000001032d1a1e08  EFLAGS: 00000086
RAX: 000001032c2e5030 RBX: 00000107fc53dac8 RCX: 0000000000000000
RDX: 000001032d1a1f58 RSI: 000001032d1a1ef8 RDI: 00000107fc53dac8
RBP: 000001032d1a1e78 R08: 0000000000000007 R09: 0000002a95905ac8
R10: 0000002a95905ad6 R11: 0000000000000000 R12: 0000000000000000
R13: 000001032d1a1ef8 R14: 000001032d1a1f58 R15: 000001032c2e5708
FS:  0000000042a7b960(005b) GS:ffffffff804f8000(0000) knlGS:00000000f7e886c0
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000002a95ed2f88 CR3: 00000006fc7a2000 CR4: 00000000000006e0
Process java (pid: 16269, threadinfo 000001032d1a0000, task 
000001032c2e5030)
Stack: 000001032d1a1f58 ffffffff80142d80 0000000042a78610 000001032d1a1f58
        0000000042a78610 0000000000000000 0000002be7877da0 0000000000000042
        000001032d1a1ef8 ffffffff8010f6fb
Call Trace:<ffffffff80142d80>{get_signal_to_deliver+64} 
<ffffffff8010f6fb>{do_signal+131}
        <ffffffff801100b4>{sys_rt_sigreturn+619} 
<ffffffff80110129>{sys_rt_sigreturn+736}
        <ffffffff80110902>{retint_signal+62}

Code: 83 3b 00 7e f9 e9 91 fd ff ff f3 90 83 3b 00 7e f9 e9 cf fd
Kernel panic - not syncing: nmi watchdog
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at panic:75
invalid operand: 0000 [1] SMP
CPU 6
Modules linked in: vmnet(U) vmmon(U) netconsole netdump ppdev nfs nfsd 
exportfs lockd nfs_acl md5 ipv6 parport_pc lp parport autofs4 w83627hf 
i2c_sensor i2c_isa i2c_dev i2c_core sunrpc ipt_REJECT ipt_state 
ip_conntrack iptable_filter ip_tables dm_mirror dm_mod button battery ac 
ohci_hcd ehci_hcd tg3 floppy ext3 jbd raid1 raid0 sata_nv libata sd_mod 
scsi_mod
Pid: 16269, comm: java Tainted: PF     2.6.9-42.ELlargesmp
RIP: 0010:[<ffffffff801376ae>] <ffffffff801376ae>{panic+211}
RSP: 0000:00000102fc7a6da8  EFLAGS: 00010086
RAX: 000000000000002c RBX: ffffffff8031d55d RCX: 0000000000000046
RDX: 0000000000029c01 RSI: 0000000000000046 RDI: ffffffff803e3700
RBP: 00000102fc7a6f58 R08: 0000000000000004 R09: ffffffff8031d55d
R10: 0000000000000000 R11: 0000000000000029 R12: 0000000000000000
R13: 000001032d1a1ef8 R14: 000001032d1a1f58 R15: 000001032c2e5708
FS:  0000000042a7b960(005b) GS:ffffffff804f8000(0000) knlGS:00000000f7e886c0
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000002a95ed2f88 CR3: 00000006fc7a2000 CR4: 00000000000006e0
Process java (pid: 16269, threadinfo 000001032d1a0000, task 
000001032c2e5030)
Stack: 0000003000000008 00000102fc7a6e88 00000102fc7a6dc8 0000000000000013
        0000000000000000 0000000000000046 0000000000029bd5 0000000000000046
        0000000000000004 ffffffff8031fad8
Call Trace:<ffffffff80111860>{show_stack+241} 
<ffffffff8011198a>{show_registers+277}
        <ffffffff80111c91>{die_nmi+130} 
<ffffffff8011d055>{nmi_watchdog_tick+210}
        <ffffffff8011255e>{default_do_nmi+112} 
<ffffffff8011d10b>{do_nmi+115}
        <ffffffff80111173>{paranoid_exit+0} 
<ffffffff8030a9aa>{.text.lock.spinlock+46}
        <ffffffff8013b98a>{it_real_fn+0} <ffffffff8013b98a>{it_real_fn+0}


Code: 0f 0b dd da 31 80 ff ff ff ff 4b 00 31 ff e8 83 c1 fe ff e8
RIP <ffffffff801376ae>{panic+211} RSP <00000102fc7a6da8>

Modules linked in: vmnet(U) vmmon(U) netconsole netdump ppdev nfs nfsd 
exportfs lockd nfs_acl md5 ipv6 parport_pc lp parport autofs4 w83627hf 
i2c_sensor i2c_isa i2c_dev i2c_core sunrpc ipt_REJECT ipt_state 
ip_conntrack iptable_filter ip_tables dm_mirror dm_mod button battery ac 
ohci_hcd ehci_hcd tg3 floppy ext3 jbd raid1 raid0 sata_nv libata sd_mod 
scsi_mod
Pid: 16269, comm: java Tainted: PF     2.6.9-42.ELlargesmp
RIP: 0010:[<ffffffff801376ae>] <ffffffff801376ae>{panic+211}
RSP: 0000:00000102fc7a6da8  EFLAGS: 00010086
RAX: 000000000000002c RBX: ffffffff8031d55d RCX: 0000000000000046
RDX: 0000000000029c01 RSI: 0000000000000046 RDI: ffffffff803e3700
RBP: 00000102fc7a6f58 R08: 0000000000000004 R09: ffffffff8031d55d
R10: 0000000000000000 R11: 0000000000000029 R12: 0000000000000000
R13: 000001032d1a1ef8 R14: 000001032d1a1f58 R15: 000001032c2e5708
FS:  0000000042a7b960(005b) GS:ffffffff804f8000(0000) knlGS:00000000f7e886c0
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000002a95ed2f88 CR3: 00000006fc7a2000 CR4: 00000000000006e0

Call Trace:<ffffffff8013769a>{panic+191} <ffffffff80111860>{show_stack+241}
        <ffffffff8011198a>{show_registers+277} 
<ffffffff80111c91>{die_nmi+130}
        <ffffffff8011d055>{nmi_watchdog_tick+210} 
<ffffffff8011255e>{default_do_nmi+112}
        <ffffffff8011d10b>{do_nmi+115} <ffffffff80111173>{paranoid_exit+0}
        <ffffffff8030a9aa>{.text.lock.spinlock+46} 
<ffffffff8013b98a>{it_real_fn+0}
        <ffffffff8013b98a>{it_real_fn+0}

                                                        sibling
   task                 PC          pid father child younger older
init          S 000000010fc7345e     0     1      0     2 
(NOTLB)
00000101000a1d78 0000000000000002 00000101000a1e58 ffffffff00000075
        0000000000000000 00000000000a1e58 000001060006aa80 0000000e00000246
        0000010037ecb7f0 0000000000001b5d
Call Trace:<ffffffff8013fa18>{__mod_timer+293} 
<ffffffff8030a1ef>{schedule_timeout+367}
        <ffffffff80140442>{process_timeout+0} 
<ffffffff8018add7>{do_select+939}
        <ffffffff8018a971>{__pollwait+0} <ffffffff8018b156>{sys_select+820}
        <ffffffff8011026a>{system_call+126}
migration/0   S 0000010008002a20     0     2      1             3 
(L-TLB)
00000102fc72dec8 0000000000000046 0000010100069760 0000000000000002
        00000001fc72dec8 0000000000000000 0000000000000012 0000000000000001
        000001010001f7f0 00000000000002b7
Call Trace:<ffffffff80133b66>{__wake_up_common+67} 
<ffffffff80134c85>{migration_thread+323}
        <ffffffff80134b42>{migration_thread+0} 
<ffffffff8014b22a>{kthread+199}
        <ffffffff80110f47>{child_rip+8} <ffffffff8014b163>{kthread+0}
        <ffffffff80110f3f>{child_rip+0}
ksoftirqd/0   S 0000000000000000     0     3      1             4     2 
(L-TLB)
00000104fc73ff08 0000000000000046 0000010400000000 0000000000000246
        00000104fc73e000 0000000000000246 0000010008004cc0 0000000000000000
        000001010001f030 000000000000028b
Call Trace:<ffffffff8013c9f4>{ksoftirqd+0} <ffffffff8013ca30>{ksoftirqd+60}
        <ffffffff8014b22a>{kthread+199} <ffffffff80110f47>{child_rip+8}
        <ffffffff8014b163>{kthread+0} <ffffffff80110f3f>{child_rip+0}


The full output is at 
http://strucbio.biologie.uni-konstanz.de/~kay/crash.log .
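
Because the trace lines get wrapped in mail, a small script can help when
reading the log. The following is a rough Python sketch, assuming the
<address>{symbol+offset} format seen in the output above, with the offset
taken to be decimal; the function name extract_frames is made up for
illustration.

```python
import re
import sys

# Matches entries such as <ffffffff80142d80>{get_signal_to_deliver+64}.
# Assumption: 16 hex digits for the address and a decimal offset.
FRAME_RE = re.compile(r"<([0-9a-f]{16})>\{([^}+]+)\+(\d+)\}")


def extract_frames(log_text):
    """Return (address, symbol, offset) tuples in order of appearance."""
    return [
        (int(addr, 16), symbol, int(offset))
        for addr, symbol, offset in FRAME_RE.findall(log_text)
    ]


if __name__ == "__main__":
    # Usage: python frames.py < crash.log
    for addr, symbol, offset in extract_frames(sys.stdin.read()):
        print(f"{addr:016x}  {symbol}+{offset}")
```

The pattern matches each frame individually, so it does not matter where the
mailer broke the lines. The bracketed address on the RIP line is skipped
because it is not followed by a {symbol} part.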

Can anybody tell me what's wrong?

thanks,

Kay
-- 
Kay Diederichs              http://strucbio.biologie.uni-konstanz.de
email: Kay.Diederichs at uni-konstanz.de  Tel +49 7531 88 4049 Fax 3183
Fachbereich Biologie, Universität Konstanz, Box M647, D-78457 Konstanz