Our Tyan VX50 server, which is otherwise stable, can be crashed by a specific user who starts a Java application. Relevant facts are:
- x86_64 CentOS 4.3, fully updated (but otherwise default config)
- 8 Dual-Core Opteron 870 (2 GHz), giving 16 cores
- 32 GB memory (running at DDR333)
The panic is reproducible within a couple of minutes. I tried the 2.6.9-42 largesmp kernel from J. Baron's website because it has some fixes, but the result is the same as with 2.6.9-34.0.2-largesmp. The panic is not triggered by memory overcommitment: when the crash occurs, there is still 8 GB of free memory and the swap space is untouched.
This is the start of the dmesg output which I could capture from a "netconsole":
NMI Watchdog detected LOCKUP, CPU=6, registers: CPU 6 Modules linked in: vmnet(U) vmmon(U) netconsole netdump ppdev nfs nfsd exportfs lockd nfs_acl md5 ipv6 parport_pc lp parport autofs4 w83627hf i2c_sensor i2c_isa i2c_dev i2c_core sunrpc ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables dm_mirror dm_mod button battery ac ohci_hcd ehci_hcd tg3 floppy ext3 jbd raid1 raid0 sata_nv libata sd_mod scsi_mod Pid: 16269, comm: java Tainted: PF 2.6.9-42.ELlargesmp RIP: 0010:[<ffffffff8030a9aa>] <ffffffff8030a9aa>{.text.lock.spinlock+46} RSP: 0000:000001032d1a1e08 EFLAGS: 00000086 RAX: 000001032c2e5030 RBX: 00000107fc53dac8 RCX: 0000000000000000 RDX: 000001032d1a1f58 RSI: 000001032d1a1ef8 RDI: 00000107fc53dac8 RBP: 000001032d1a1e78 R08: 0000000000000007 R09: 0000002a95905ac8 R10: 0000002a95905ad6 R11: 0000000000000000 R12: 0000000000000000 R13: 000001032d1a1ef8 R14: 000001032d1a1f58 R15: 000001032c2e5708 FS: 0000000042a7b960(005b) GS:ffffffff804f8000(0000) knlGS:00000000f7e886c0 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000002a95ed2f88 CR3: 00000006fc7a2000 CR4: 00000000000006e0 Process java (pid: 16269, threadinfo 000001032d1a0000, task 000001032c2e5030) Stack: 000001032d1a1f58 ffffffff80142d80 0000000042a78610 000001032d1a1f58 0000000042a78610 0000000000000000 0000002be7877da0 0000000000000042 000001032d1a1ef8 ffffffff8010f6fb Call Trace:<ffffffff80142d80>{get_signal_to_deliver+64} <ffffffff8010f6fb>{do_signal+131} <ffffffff801100b4>{sys_rt_sigreturn+619} <ffffffff80110129>{sys_rt_sigreturn+736} <ffffffff80110902>{retint_signal+62}
Code: 83 3b 00 7e f9 e9 91 fd ff ff f3 90 83 3b 00 7e f9 e9 cf fd Kernel panic - not syncing: nmi watchdog ----------- [cut here ] --------- [please bite here ] --------- Kernel BUG at panic:75 invalid operand: 0000 [1] SMP CPU 6 Modules linked in: vmnet(U) vmmon(U) netconsole netdump ppdev nfs nfsd exportfs lockd nfs_acl md5 ipv6 parport_pc lp parport autofs4 w83627hf i2c_sensor i2c_isa i2c_dev i2c_core sunrpc ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables dm_mirror dm_mod button battery ac ohci_hcd ehci_hcd tg3 floppy ext3 jbd raid1 raid0 sata_nv libata sd_mod scsi_mod Pid: 16269, comm: java Tainted: PF 2.6.9-42.ELlargesmp RIP: 0010:[<ffffffff801376ae>] <ffffffff801376ae>{panic+211} RSP: 0000:00000102fc7a6da8 EFLAGS: 00010086 RAX: 000000000000002c RBX: ffffffff8031d55d RCX: 0000000000000046 RDX: 0000000000029c01 RSI: 0000000000000046 RDI: ffffffff803e3700 RBP: 00000102fc7a6f58 R08: 0000000000000004 R09: ffffffff8031d55d R10: 0000000000000000 R11: 0000000000000029 R12: 0000000000000000 R13: 000001032d1a1ef8 R14: 000001032d1a1f58 R15: 000001032c2e5708 FS: 0000000042a7b960(005b) GS:ffffffff804f8000(0000) knlGS:00000000f7e886c0 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000002a95ed2f88 CR3: 00000006fc7a2000 CR4: 00000000000006e0 Process java (pid: 16269, threadinfo 000001032d1a0000, task 000001032c2e5030) Stack: 0000003000000008 00000102fc7a6e88 00000102fc7a6dc8 0000000000000013 0000000000000000 0000000000000046 0000000000029bd5 0000000000000046 0000000000000004 ffffffff8031fad8 Call Trace:<ffffffff80111860>{show_stack+241} <ffffffff8011198a>{show_registers+277} <ffffffff80111c91>{die_nmi+130} <ffffffff8011d055>{nmi_watchdog_tick+210} <ffffffff8011255e>{default_do_nmi+112} <ffffffff8011d10b>{do_nmi+115} <ffffffff80111173>{paranoid_exit+0} <ffffffff8030a9aa>{.text.lock.spinlock+46} <ffffffff8013b98a>{it_real_fn+0} <ffffffff8013b98a>{it_real_fn+0}
Code: 0f 0b dd da 31 80 ff ff ff ff 4b 00 31 ff e8 83 c1 fe ff e8 RIP <ffffffff801376ae>{panic+211} RSP <00000102fc7a6da8>
Modules linked in: vmnet(U) vmmon(U) netconsole netdump ppdev nfs nfsd exportfs lockd nfs_acl md5 ipv6 parport_pc lp parport autofs4 w83627hf i2c_sensor i2c_isa i2c_dev i2c_core sunrpc ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables dm_mirror dm_mod button battery ac ohci_hcd ehci_hcd tg3 floppy ext3 jbd raid1 raid0 sata_nv libata sd_mod scsi_mod Pid: 16269, comm: java Tainted: PF 2.6.9-42.ELlargesmp RIP: 0010:[<ffffffff801376ae>] <ffffffff801376ae>{panic+211} RSP: 0000:00000102fc7a6da8 EFLAGS: 00010086 RAX: 000000000000002c RBX: ffffffff8031d55d RCX: 0000000000000046 RDX: 0000000000029c01 RSI: 0000000000000046 RDI: ffffffff803e3700 RBP: 00000102fc7a6f58 R08: 0000000000000004 R09: ffffffff8031d55d R10: 0000000000000000 R11: 0000000000000029 R12: 0000000000000000 R13: 000001032d1a1ef8 R14: 000001032d1a1f58 R15: 000001032c2e5708 FS: 0000000042a7b960(005b) GS:ffffffff804f8000(0000) knlGS:00000000f7e886c0 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000002a95ed2f88 CR3: 00000006fc7a2000 CR4: 00000000000006e0
Call Trace:<ffffffff8013769a>{panic+191} <ffffffff80111860>{show_stack+241} <ffffffff8011198a>{show_registers+277} <ffffffff80111c91>{die_nmi+130} <ffffffff8011d055>{nmi_watchdog_tick+210} <ffffffff8011255e>{default_do_nmi+112} <ffffffff8011d10b>{do_nmi+115} <ffffffff80111173>{paranoid_exit+0} <ffffffff8030a9aa>{.text.lock.spinlock+46} <ffffffff8013b98a>{it_real_fn+0} <ffffffff8013b98a>{it_real_fn+0}
sibling task PC pid father child younger older init S 000000010fc7345e 0 1 0 2 (NOTLB) 00000101000a1d78 0000000000000002 00000101000a1e58 ffffffff00000075 0000000000000000 00000000000a1e58 000001060006aa80 0000000e00000246 0000010037ecb7f0 0000000000001b5d Call Trace:<ffffffff8013fa18>{__mod_timer+293} <ffffffff8030a1ef>{schedule_timeout+367} <ffffffff80140442>{process_timeout+0} <ffffffff8018add7>{do_select+939} <ffffffff8018a971>{__pollwait+0} <ffffffff8018b156>{sys_select+820} <ffffffff8011026a>{system_call+126} migration/0 S 0000010008002a20 0 2 1 3 (L-TLB) 00000102fc72dec8 0000000000000046 0000010100069760 0000000000000002 00000001fc72dec8 0000000000000000 0000000000000012 0000000000000001 000001010001f7f0 00000000000002b7 Call Trace:<ffffffff80133b66>{__wake_up_common+67} <ffffffff80134c85>{migration_thread+323} <ffffffff80134b42>{migration_thread+0} <ffffffff8014b22a>{kthread+199} <ffffffff80110f47>{child_rip+8} <ffffffff8014b163>{kthread+0} <ffffffff80110f3f>{child_rip+0} ksoftirqd/0 S 0000000000000000 0 3 1 4 2 (L-TLB) 00000104fc73ff08 0000000000000046 0000010400000000 0000000000000246 00000104fc73e000 0000000000000246 0000010008004cc0 0000000000000000 000001010001f030 000000000000028b Call Trace:<ffffffff8013c9f4>{ksoftirqd+0} <ffffffff8013ca30>{ksoftirqd+60} <ffffffff8014b22a>{kthread+199} <ffffffff80110f47>{child_rip+8} <ffffffff8014b163>{kthread+0} <ffffffff80110f3f>{child_rip+0}
The full output is at http://strucbio.biologie.uni-konstanz.de/~kay/crash.log .
Can anybody tell me what's wrong?
thanks,
Kay
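For readers who want to compare the call traces in the log above, the following is a minimal sketch (not part of the original post) that pulls the `<address>{symbol+offset}` entries out of one line of oops text. The function name `extract_frames` is made up for illustration, and the pattern assumes the 2.6.9-style format shown here, with a 16-digit hex address and a decimal offset.

```python
import re

# One frame looks like: <ffffffff80142d80>{get_signal_to_deliver+64}
# Assumption: 16 hex digits for the address, decimal digits for the offset.
FRAME = re.compile(r"<([0-9a-f]{16})>\{([^}+]+)\+(\d+)\}")


def extract_frames(text):
    """Return (symbol, offset, address) tuples in the order they appear."""
    return [
        (symbol, int(offset), int(address, 16))
        for address, symbol, offset in FRAME.findall(text)
    ]


if __name__ == "__main__":
    # Fragment copied from the first call trace in the log above.
    sample = (
        "Call Trace:<ffffffff80142d80>{get_signal_to_deliver+64} "
        "<ffffffff8010f6fb>{do_signal+131}"
    )
    for symbol, offset, address in extract_frames(sample):
        print(f"{address:016x}  {symbol}+{offset}")
```

Running this over the whole captured log gives one frame list per trace, which makes it easier to see that the lockup happens in `.text.lock.spinlock` while `get_signal_to_deliver` is on the stack.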
On Mon, Jul 31, 2006 at 11:28:21PM +0200, Kay Diederichs wrote:
Modules linked in: vmnet(U) vmmon(U) netconsole netdump ppdev nfs nfsd
^^^^^, ^^^^^ [...]
Pid: 16269, comm: java Tainted: PF 2.6.9-42.ELlargesmp
Your kernel is tainted by binary-only kernel modules. Can you reproduce this without these modules loaded?
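As an aside for readers, the letters after "Tainted:" in the oops line encode why the kernel considers itself tainted. The sketch below (not part of the original thread) decodes them. The function name `decode_taint` is hypothetical, and the table deliberately covers only the three letters relevant here; any other letter is reported as unknown rather than guessed.

```python
import re

# Only the flags relevant to this thread; other letters exist but are
# intentionally left out of this sketch.
TAINT_FLAGS = {
    "G": "only GPL-compatible modules loaded",
    "P": "a proprietary (non-GPL) module was loaded",
    "F": "a module was force-loaded",
}


def decode_taint(line):
    """Return (flag, meaning) pairs for the flags after 'Tainted:'."""
    match = re.search(r"Tainted:\s*(\S+)", line)
    if match is None:
        # Covers lines that say "Not tainted" or carry no taint field.
        return []
    return [(flag, TAINT_FLAGS.get(flag, "unknown flag")) for flag in match.group(1)]


if __name__ == "__main__":
    oops_line = "Pid: 16269, comm: java Tainted: PF 2.6.9-42.ELlargesmp"
    for flag, meaning in decode_taint(oops_line):
        print(f"{flag}: {meaning}")
```

For the line quoted above this reports both `P` and `F`, which is why the first question is whether the crash also occurs without the vmnet and vmmon modules loaded.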
Matthew Miller wrote:
On Mon, Jul 31, 2006 at 11:28:21PM +0200, Kay Diederichs wrote:
Modules linked in: vmnet(U) vmmon(U) netconsole netdump ppdev nfs nfsd
^^^^^, ^^^^^
[...]
Pid: 16269, comm: java Tainted: PF 2.6.9-42.ELlargesmp
Your kernel is tainted by binary-only kernel modules. Can you reproduce this without these modules loaded?
VMware?
Feizhou wrote:
Matthew Miller wrote:
On Mon, Jul 31, 2006 at 11:28:21PM +0200, Kay Diederichs wrote:
Modules linked in: vmnet(U) vmmon(U) netconsole netdump ppdev nfs nfsd
^^^^^, ^^^^^
[...]
Pid: 16269, comm: java Tainted: PF 2.6.9-42.ELlargesmp
Your kernel is tainted by binary-only kernel modules. Can you reproduce this without these modules loaded?
VMware? _______________________________________________ CentOS mailing list CentOS@centos.org http://lists.centos.org/mailman/listinfo/centos
I forgot to mention in my first posting that the problem has nothing to do with the kernel being tainted (due to VMware Server).
I installed VMware Server _after_ encountering the crashes, to give the Java user a "sandbox" which s/he could crash without taking down the real server. Only then did I discover that VMware Server has a memory limit of 3600 MB (I guess because it's not a 64-bit application).
I see exactly the same crash with a non-tainted kernel.
I take the opportunity to add that I just ran memtest86+ for 20 hours, without any indication of a hardware (memory) problem.
Kay
This is now Bugzilla entry 200885; please see https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=200885
Kay