Hi all,
We are running a new CentOS 4 server, and it has kernel panicked on us 4 times in the last month. After the first kernel panic we hooked up a serial console to the server and captured the output in order to have a record of what happens. I've included the error messages from the last time it locked up, but they don't really mean much to me. Does anybody have any idea what might be causing this server to lock up?
Server description:
- Dell PE1750
- dual 2.8GHz Xeon (with Hyper Threading on)
- 2GB DDR RAM
- Perc4-DI onboard RAID using 3 SCSI drives in RAID-5 configuration
- ext3 file system
- kernel-smp-2.6.9-5.0.3.EL
- mysql - from distribution
- 2 postfix instances rebuilt with MySQL support
- amavisd-new
- clamav
- spamassassin
- rbldnsd
- bind
Here's the captured output from a serial console connected to the server at time of fault.
Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
f8872da8
*pde = 35562001
Oops: 0000 [#1]
SMP
Modules linked in: md5 ipv6 autofs4 sunrpc dm_mod button battery ac ohci_hcd tg3 floppy sg ext3 jbd megaraid_mbox megaraid_mm sd_mod scsi_mod
CPU:    1
EIP:    0060:[<f8872da8>]    Not tainted VLI
EFLAGS: 00010246   (2.6.9-5.0.3.ELsmp)
EIP is at __journal_file_buffer+0x1b/0x221 [jbd]
eax: 00000000   ebx: d2fff26c   ecx: 00000008   edx: c2327680
esi: c2327680   edi: 00000008   ebp: 00000000   esp: f7533dd4
ds: 007b   es: 007b   ss: 0068
Process kjournald (pid: 210, threadinfo=f7533000 task=f75825b0)
Stack: 00000000 00000000 f148fad8 f7f66200 d2fff26c c2327680 f887351b 00000286
       00000000 00000000 00000000 00000000 00000000 d2517e6c f7f66200 caa4c50c
       00001f18 00000000 f75825b0 c011e8d2 f7533e44 f7533e44 f750c054 f8836f24
Call Trace:
 [<f887351b>] journal_commit_transaction+0x310/0xfb1 [jbd]
 [<c011e8d2>] autoremove_wake_function+0x0/0x2d
 [<f8836f24>] megaraid_isr+0x1ad/0x1bf [megaraid_mbox]
 [<c011e8d2>] autoremove_wake_function+0x0/0x2d
 [<c011bcd5>] finish_task_switch+0x30/0x66
 [<c02c4363>] schedule+0x833/0x869
 [<c0127e62>] del_timer_sync+0x7a/0x9c
 [<f8875e6d>] kjournald+0xc7/0x215 [jbd]
 [<c011e8d2>] autoremove_wake_function+0x0/0x2d
 [<c011e8d2>] autoremove_wake_function+0x0/0x2d
 [<c011bd1d>] schedule_tail+0x12/0x55
 [<f8875da0>] commit_timeout+0x0/0x5 [jbd]
 [<f8875da6>] kjournald+0x0/0x215 [jbd]
 [<c01041f1>] kernel_thread_helper+0x5/0xb
Code: 14 ba 01 00 00 00 83 c4 10 89 d0 5b 5e 5f 5d c3 55 31 ed 57 89 cf 56 89 d6 53 53 53 89 c3 c7 44 24 04 00 00 00 00 8b 00 89 04 24 <8b> 00 a9 00 00 08 00 75 29 68 d4 85 87 f8 68 9b 07 00 00 68 55
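For anyone who wants a quick overview of a capture like this, here is a minimal sketch that pulls out the faulting function and the modules named in the call trace. The function name summarize_oops and the two regular expressions are my own illustration, not part of any kernel tool, and the sample text is abridged from the output above.

```python
import re

# Abridged from the serial console capture above; only the parsed fields are kept.
OOPS = """\
Unable to handle kernel NULL pointer dereference at virtual address 00000000
EIP is at __journal_file_buffer+0x1b/0x221 [jbd]
Process kjournald (pid: 210, threadinfo=f7533000 task=f75825b0)
Call Trace:
 [<f887351b>] journal_commit_transaction+0x310/0xfb1 [jbd]
 [<c011e8d2>] autoremove_wake_function+0x0/0x2d
 [<f8836f24>] megaraid_isr+0x1ad/0x1bf [megaraid_mbox]
 [<f8875e6d>] kjournald+0xc7/0x215 [jbd]
"""

# "EIP is at symbol+0xoff/0xsize [module]"; the module part is optional.
EIP_RE = re.compile(r"EIP is at (\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)(?: \[(\w+)\])?")
# "[<address>] symbol+0xoff/0xsize [module]"; the module part is optional.
FRAME_RE = re.compile(
    r"\[<([0-9a-f]{8})>\] (\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)(?: \[(\w+)\])?"
)

def summarize_oops(text):
    """Return the faulting symbol/module and the (symbol, module) call-trace frames."""
    eip = EIP_RE.search(text)
    start = text.find("Call Trace:")
    frames = FRAME_RE.findall(text[start:]) if start != -1 else []
    return {
        "fault_function": eip.group(1) if eip else None,
        "fault_module": eip.group(4) if eip else None,
        # Frames without a [module] suffix belong to the core kernel image.
        "frames": [(name, module or "kernel") for _a, name, _o, _s, module in frames],
    }

summary = summarize_oops(OOPS)
print(summary["fault_function"], summary["fault_module"])
for name, module in summary["frames"]:
    print(f"  {name} [{module}]")
```

The regular expressions do not depend on line breaks, so the same function should also work on a capture where the whole oops arrived as one long line.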
Bob Pierce wrote:
Unable to handle kernel NULL pointer dereference at virtual address 00000000 printing eip: f8872da8 *pde = 35562001 Oops: 0000 [#1] SMP
No expert here, but just had this same type of error on a workstation.
It wouldn't even boot anymore; it panicked on startup. I personally had never seen this error before.
I pulled the RAM modules, cleaned the contacts and reseated them. It has not happened again. Soooooo, I'd test/change out the memory.
Just a thought.
Bob Pierce wrote:
[snip]
CentOS mailing list CentOS@centos.org http://lists.centos.org/mailman/listinfo/centos
It looks to me as though there is a problem with the RAID. I'm not too familiar with LSI's OEM cards for Dell (I'm guessing it's LSI, since the trace mentions megaraid; I'm too lazy to google it), but I'm guessing the Perc4-DI is host RAID? I would look into it, and seriously consider getting a hardware RAID card if it is. I've had nothing but problems with onboard host RAID myself; I gave up on it and switched to LVM's software RAID, which actually performs much better. I've even seen benchmarks saying the same thing. We are still switching to hardware RAID, though, because restoring is much easier.
On Tue, April 12, 2005 3:08 pm, Bob Pierce said:
[snip]
No idea what is causing this (it looks like a filesystem process to me), but we have a new kernel (that will be included in CentOS-4.1). It is kernel-2.6.9-6.37.EL.src.rpm.
I would be glad to give you the new i686-smp kernel to see if it solves your problem.
Are these EM64T Xeons or i686 (32-bit) Xeons? See http://www.intel.com/products/processor/xeon/index.htm (looking at the Dell site, I think they are 32-bit).
(If I am wrong and it is the EM64T Xeons, you should have installed the x86_64 distro instead of the i386 one)
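One way to answer the 32-bit versus EM64T question from the running system is to look for the lm (long mode) flag in /proc/cpuinfo, which is how Linux marks a 64-bit capable x86 CPU. The sketch below is only an illustration: the helper name supports_64bit is mine, and both sample flag lines are hypothetical, abridged examples rather than output from the server in this thread.

```python
def supports_64bit(cpuinfo_text):
    """Return True if any 'flags' line lists the 'lm' (long mode) flag."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags" and "lm" in value.split():
            return True
    return False

# Hypothetical, abridged samples for illustration only.
SAMPLE_32BIT = (
    "model name : Intel(R) Xeon(TM) CPU 2.80GHz\n"
    "flags      : fpu vme de pse tsc msr pae mce cx8 sse sse2 ht tm\n"
)
SAMPLE_64BIT = (
    "model name : Intel(R) Xeon(TM) CPU 2.80GHz\n"
    "flags      : fpu vme de pse tsc msr pae mce cx8 sse sse2 ht tm lm\n"
)

print(supports_64bit(SAMPLE_32BIT))
print(supports_64bit(SAMPLE_64BIT))
```

On the server itself the input would be the real file, for example supports_64bit(open("/proc/cpuinfo").read()).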
Also recommend the latest SCSI Controller BIOS: http://support.dell.com/support/downloads/format.aspx?c=us&cs=04&l=e...
and Server BIOS: http://support.dell.com/support/downloads/format.aspx?c=us&cs=04&l=e...
--- Johnny Hughes mailing-lists@hughesjr.com wrote:
[snip]
-- Johnny Hughes http://www.HughesJR.com/
Wow, this looks and sounds like the same problem I was having with my box, except that mine would lock up once a day. I guess I am going to have to wait until 4.1 before I think about upgrading to CentOS 4.
Steven
"On the side of the software box, in the 'System Requirements' section, it said 'Requires Windows or better'. So I installed Linux."
Perhaps I'm just blessed that I don't patronize Dell, but I'm running 4.0 on a few systems and haven't experienced any kernel panics.
My advice, like Johnny's, is to update the BIOS and the firmware on all crucial controllers. I have to say that I haven't had the best results from LSI SCSI cards in the past, so you may want to focus your troubleshooting there.
Dell support for Linux has been non-existent or useless for me, which is why I don't buy anything from them anymore.
Best regards,
C
Have a closer look at jbd :-)
On 4/12/05, Bob Pierce pierceb@westmancom.com wrote:
[snip]
This is not necessarily a problem with your hardware but could be a bona fide bug in the megaraid device driver. The call stack ends with:
 [<f887351b>] journal_commit_transaction+0x310/0xfb1 [jbd]
 [<c011e8d2>] autoremove_wake_function+0x0/0x2d
 [<f8836f24>] megaraid_isr+0x1ad/0x1bf [megaraid_mbox]
megaraid_isr is likely the megaraid Interrupt Service Routine. The NULL pointer dereference occurred in __journal_file_buffer(), called from journal_commit_transaction(). It is likely that megaraid_isr did not initialize some pointer properly and the journal code eventually stepped on it. It could be just a simple bug, or a race condition, or, and this is my guess, the driver does not properly support whatever megaraid chipset is on your system. All that said, I have not looked at the code, so this is a hypothesis based on the panic output.
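The Code: line of the oops marks the faulting instruction with angle brackets, and in this capture it starts with the bytes 8b 00. As a rough sketch of how to read that, the snippet below decodes just this one instruction form (opcode 0x8b with a plain register-indirect ModRM byte) and looks up the base register in the values the oops printed. The helper names are mine, 32-bit operand size is assumed, and it only tells us which register held the NULL pointer, not why it was NULL.

```python
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

# Tail of the "Code:" line from the oops; <..> marks the faulting byte.
CODE_LINE = "8b 00 89 04 24 <8b> 00 a9 00 00 08 00 75 29"

# Two of the register values printed in the same oops.
registers = {"eax": 0x00000000, "ebx": 0xD2FFF26C}

def faulting_bytes(code_line):
    """Return the bytes from the <..> marker onward as integers."""
    tokens = code_line.split()
    for i, tok in enumerate(tokens):
        if tok.startswith("<") and tok.endswith(">"):
            return [int(t.strip("<>"), 16) for t in tokens[i:]]
    raise ValueError("no <..> marker found")

def decode_mov_load(code):
    """Decode 'mov (%base), %dest' for opcode 0x8b with ModRM mod=00 only."""
    opcode, modrm = code[0], code[1]
    if opcode != 0x8B:
        raise ValueError("this sketch only handles opcode 0x8b")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0 or rm in (4, 5):
        raise ValueError("SIB and displacement forms are not handled here")
    return REG32[reg], REG32[rm]

dest, base = decode_mov_load(faulting_bytes(CODE_LINE))
print(f"mov ({base}), {dest} reads address {registers[base]:#010x}")
```

The decoded load goes through eax, which the oops shows as 00000000, matching the reported fault address.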
If I were you, I would load up RHEL 4 (not CentOS), reproduce the bug and create a bug report at bugzilla.redhat.com. The reason I suggest loading up RHEL 4 is that the kernel symbol locations may not be exactly the same between the CentOS kernel and the RHEL 4 kernel, which would hinder the support you get from Red Hat engineers.
Cheers...james
On Wed, 2005-04-13 at 07:47 -0400, James Olin Oden wrote:
This is not necessarily a problem with your hardware but could be a bona fide bug in the megaraid device driver.
Looking at the changelog for the new kernel (from 2.6.9-5.0.3.EL up to 2.6.9-6.37.EL), there are several megaraid and/or SCSI device driver changes, so this may be fixed with the new kernel.
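For anyone who wants to repeat that check, here is a small sketch that filters changelog text for driver-related lines. The function name matching_changelog_lines and the keyword list are my own choices, and the sample changelog is placeholder text made up for illustration, not the real kernel changelog.

```python
def matching_changelog_lines(changelog_text, keywords=("megaraid", "scsi")):
    """Return the changelog lines that mention any keyword, case-insensitively."""
    hits = []
    for line in changelog_text.splitlines():
        lowered = line.lower()
        if any(keyword in lowered for keyword in keywords):
            hits.append(line.strip())
    return hits

# Placeholder text for illustration only; NOT taken from the real changelog.
SAMPLE_CHANGELOG = """\
* placeholder entry header
- placeholder: megaraid driver change
- placeholder: unrelated networking change
- placeholder: SCSI midlayer change
"""

for line in matching_changelog_lines(SAMPLE_CHANGELOG):
    print(line)
```

On an installed system the input could come from rpm -q --changelog kernel-smp, for example captured with subprocess.run(["rpm", "-q", "--changelog", "kernel-smp"], capture_output=True, text=True).stdout.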
Is the SRPM of this new kernel, 2.6.9-6.37.EL, available from the RH site? Or does CentOS plan to release a beta of version 4.1 too?
Thank you very much
hansjörg