Re: [CentOS] Kernel Panic on HP/Compaq ProLiant G7

24 Mar 2011

      Dave:
on 16:03 Thu 24 Mar, Windsor Dave L. (AdP/TEF7.1) (Dave.Windsor@us.bosch.com) wrote:
...
Hello Everyone,
I recently installed CentOS 5.5 x86_64 on a brand new ProLiant DL380
G7.  I have identical OS software running rock-solid on two other
DL380 ProLiant servers, but they are G6 models, not G7.  On the G7,
the installation went perfectly and the machine ran great for about 2
weeks, when it just seemed to "stop".  The system stopped responding
on the network, and there was no video on the console (or remote
console via iLO).  It would not reboot or cold boot through iLO; I
actually had to hold the power button to turn it off and then press it
again to power up.
This happened several times within a few days of each other.  Each
time, there was no evidence in any logs of a problem - the system just
seemed to stop or lock up.   We did have a CPU problem light appear on
the front, so HP came in and replaced the one 4-core CPU.  Since then,
it has run as long as two weeks, but still crashes randomly.  After
the last reboot, I left the console in text mode on vt1, and when it
crashed again this morning this was displayed on the screen:
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff8100dc435cf0  CR3: 000000008a6ca000 CR4: 00000000000006e0
Process smbd (pid: 18970, threadinfo ffff81001529e000, task ffff81011f5347a0)
Stack:  ffff81011e4e71c0 0000000000000000 ffff8100cf12a015 ffffffff80009c41
 ffff81011e4e71c0 0000000100000000 000000030027ea9d ffff8100cf12a011
 ffff81011e4e71c0 ffff81010d9cf300 ffff81011e4e71c0 ffff8101044099c0
Call Trace:
 [<ffffffff80009c41>] __link_path_walk+0x3a6/0xf5b
 [<ffffffff8000ea4b>] link_path_walk+0x42/0xb2
 [<ffffffff8000cd72>] do_path_lookup+0x275/0x2f1
 [<ffffffff80012851>] getname+0x15b/0x1c2
 [<ffffffff800239d1>] __user_walk_fd+0x37/0x4c
 [<ffffffff80028905>] vfs_stat_fd+0x1b/0x4a
 [<ffffffff80039fa2>] fcntl_setlk+0x243/0x273
 [<ffffffff80023703>] sys_newstat+0x19/0x31
 [<ffffffff8005d229>] tracesys+0x71/0xe0
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Code: 00 00 00 00 00 00 00 00 70 4d 4f 9d 00 81 ff ff 98 e4 4b dc
RIP  [<ffff8100dc435cf0>]
 RSP <ffff81001529fd18>
CR2: ffff8100dc435cf0
 <0>Kernel panic - not syncing: Fatal exception
This suggests that something happened in a Samba process.
Correct.
If this is regularly happening in Samba, that would point to a problem
with your samba config (either on that host, something remotely stuffing
bad packets at you, or likely in that case, both, as bad data shouldn't
crash the host).
If this is happening in different programs over time, then the problem
is likely /not/ software, but hardware/firmware.
The LKML may be able to help you on your panic; please read their bug
posting guidelines /BEFORE/ posting.
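A rough way to apply the same-program-or-different-programs test, if you
manage to capture more than one panic: pull the "Process ..." line out of
each dump and count the names.  This is only a sketch.  The regex assumes
the oops layout shown in the panic above, and the function names
(panic_process, tally_processes, verdict) are made up for illustration.

```python
import re
from collections import Counter

# Matches the "Process <name> (pid: <n>, ..." line of an oops in the layout
# quoted above. Other kernel versions may format this line differently.
PROCESS_RE = re.compile(r"^Process\s+(\S+)\s+\(pid:\s*(\d+)", re.MULTILINE)


def panic_process(dump):
    """Return (process_name, pid) from one panic dump, or None if absent."""
    m = PROCESS_RE.search(dump)
    if m is None:
        return None
    return m.group(1), int(m.group(2))


def tally_processes(dumps):
    """Count how often each process name appears across several dumps."""
    found = (panic_process(d) for d in dumps)
    return Counter(p[0] for p in found if p is not None)


def verdict(counts):
    """Heuristic only: one repeated name hints at software, several at hardware."""
    if sum(counts.values()) < 2:
        return "not enough panics to say"
    if len(counts) == 1:
        return "same process each time: look at that software first"
    return "different processes: suspect hardware/firmware"
```

Feed tally_processes() a list of dump strings (one per crash) and pass the
result to verdict().  A single panic, as here, is not enough to conclude
anything either way.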
...
I have the Samba3x packages installed since we are beginning to
introduce Win7 clients into our environment.
What happens if you take the Win7 clients away?
...
Googling "Kernel panic - not syncing: Fatal exception" and "CentOS"
That is the generic kernel panic message.  It's going to be
spectacularly unspecific.
...
produced many hits, but nothing that seemed to exactly match my
problem.  Since this is the only G7 server I have here right now, I
can't reproduce the problem on another machine.  The G6s I have
running the identical version of CentOS have no problems.
I am trying to determine if this is pointing to a hardware or software
issue.  Some of the Google results suggested using a Centosplus kernel;
is this a good idea?

Dell have had numerous issues with recent server editions; it's possible
HP have as well:
- If you haven't, configure the netconsole kernel module for
   kernel-enabled network logging of panics.
- Call HP and find out what the latest recommended BIOS and firmware
   upgrades for your system are.  C-STATE has been a particular issue
   with Dell, and it's been disabled entirely in recent BIOS versions.
   I see below you've updated BIOS.
- Scan logs for other messages, particularly panics and/or ECC issues.
- If you can stand the downtime, run memtest86+ at least overnight on
   your RAM.  A reboot indicates a failed test.
- Otherwise: try running with half your RAM swapped.
- Check/reseat all DIMMs, sockets, and cables.  Some folks caution
   against this on the basis of connector wear, but if you've got a
   problem, this may help resolve it, and I've seen boxes shipped with
   components poorly or even un-cabled.
- Does a similarly equipped system exhibit the same problems?
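For the log-scanning step, something along these lines may save some
eyeballing.  It is a minimal sketch: the pattern list is my own guess at
strings that commonly accompany panics, machine checks, and ECC/EDAC
reports, not an exhaustive or authoritative set, and scan_lines/scan_file
are hypothetical helper names.

```python
import re

# Assumed patterns: adjust to whatever your own logs actually contain.
PATTERNS = [
    re.compile(r"kernel panic", re.IGNORECASE),
    re.compile(r"machine check", re.IGNORECASE),
    re.compile(r"\bOops\b", re.IGNORECASE),
    re.compile(r"\bEDAC\b"),
    re.compile(r"\bECC\b"),
]


def scan_lines(lines):
    """Return (line_number, text) for every line matching any pattern."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        if any(p.search(line) for p in PATTERNS):
            hits.append((lineno, line.rstrip("\n")))
    return hits


def scan_file(path):
    """Scan one log file, tolerating stray non-UTF-8 bytes."""
    with open(path, "r", errors="replace") as fh:
        return scan_lines(fh)
```

Calling scan_file("/var/log/messages") (and the rotated copies) and
printing the results gives a short list to read by hand.  An empty result
proves nothing, since a hard lockup often never reaches the disk; that is
what netconsole is for.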
...
The server is a HP DL380 G7 Server with 4 GB RAM (1 DIMM 1333 MHz),
one 4-core CPU (2133 MHz), 4 built-in Broadcom "NetXtreme II BCM5709
II Gigabit Ethernet" NICs, and a P410 Smart Array Controller.  The
P410 and the system BIOS have both been updated to the latest levels
to see if that fixes the crashes, with no change.
Ugh.  Broadcom's gotten better but I prefer Intel NICs.  Can't speak to
the others.  And OK, you've updated BIOS.
-- 
Dr. Ed Morbius, Chief Scientist /            |
  Robot Wrangler / Staff Psychologist        | When you seek unlimited power
Krell Power Systems Unlimited                |                  Go to Krell!
