On 3/24/2011 4:38 PM, Dr. Ed Morbius wrote:
Dave:
on 16:03 Thu 24 Mar, Windsor Dave L. (AdP/TEF7.1) (Dave.Windsor@us.bosch.com) wrote:
Hello Everyone,
Code: 00 00 00 00 00 00 00 00 70 4d 4f 9d 00 81 ff ff 98 e4 4b dc RIP [<ffff8100dc435cf0>] RSP<ffff81001529fd18> CR2: ffff8100dc435cf0 <0>Kernel panic - not syncing: Fatal exception
This suggests that something happened in a Samba process.
Correct.
If this is regularly happening in Samba, that would point to a problem either with your Samba config on that host or with something remotely stuffing bad packets at you (or, likely in that case, both, as bad data shouldn't crash the host).
I can have a network analyst monitor the ports for unusual bursts of traffic, although that might not catch small amounts of strange data.
If this is happening in different programs over time, then the problem is likely /not/ software, but hardware/firmware.
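One rough way to check that is to tally which process each saved panic was running in. The sketch below assumes the oops text has been collected into one string and that each oops header carries a "comm: <name>" field (as in "Pid: 4242, comm: smbd Not tainted ..."); that format varies between kernel versions, and the function names and sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumption: each oops header has a "comm: <process>" field. Adjust the
# pattern if your kernel prints the process name differently.
COMM_RE = re.compile(r"\bcomm:\s*(\S+)", re.IGNORECASE)


def tally_crashing_processes(log_text):
    """Count how often each process name appears in oops 'comm:' fields."""
    return Counter(m.group(1) for m in COMM_RE.finditer(log_text))


def summarize(counts):
    """Turn the tally into the rule of thumb from the thread."""
    if not counts:
        return "no comm: fields found"
    if len(counts) == 1:
        name = next(iter(counts))
        return f"all {counts[name]} panics in {name}: look at that software first"
    names = ", ".join(sorted(counts))
    return f"panics spread across {names}: suspect hardware/firmware"


# Hypothetical sample input, not taken from the real server.
sample = (
    "Pid: 4242, comm: smbd Not tainted\n"
    "Pid: 4311, comm: smbd Not tainted\n"
)
print(summarize(tally_crashing_processes(sample)))
```

This is only a heuristic: a single flaky DIMM can still hit the same busy process repeatedly, so a one-process tally does not rule hardware out.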
The LKML may be able to help you on your panic; please read their bug posting guidelines /BEFORE/ posting.
I have the Samba3x packages installed since we are beginning to introduce Win7 clients into our environment.
What happens if you take the Win7 clients away?
Googling "Kernel panic - not syncing: Fatal exception" and "CentOS"
That is the generic kernel panic message. It's going to be spectacularly unspecific.
produced many hits, but nothing that seemed to exactly match my problem. Since this is the only G7 server I have here right now, I can't reproduce the problem on another machine. The G6s I have running the identical version of CentOS have no problems.
I am trying to determine if this is pointing to a hardware or software issue. Some of the Google results suggested using a Centosplus kernel
- is this a good idea?
Dell have had numerous issues with recent server editions; it's possible HP have as well:
- If you haven't, configure the netconsole kernel module for kernel-enabled network logging of panics.
This is a great idea. I will work on that soonest.
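On the receiving side, netconsole just sends kernel messages as UDP datagrams, so any UDP listener on another machine can capture them. Below is a minimal sketch of such a listener in Python; the function name and the log format (sender IP followed by the message) are my own choices, and a stock syslog daemon or netcat would do the same job.

```python
import socket


def log_packets(sock, out, max_packets=None):
    """Append each UDP datagram received on `sock` to the binary stream `out`.

    Each record is prefixed with the sender's IP address. Runs forever when
    max_packets is None; otherwise returns after that many datagrams.
    """
    received = 0
    while max_packets is None or received < max_packets:
        data, (src_ip, _src_port) = sock.recvfrom(65535)
        out.write(src_ip.encode("ascii") + b" " + data)
        if not data.endswith(b"\n"):
            out.write(b"\n")
        out.flush()  # flush every record so nothing is lost if we are killed
        received += 1
    return received
```

To use it, create a socket with socket.socket(socket.AF_INET, socket.SOCK_DGRAM), bind it to ("0.0.0.0", 6666) on the log host (6666 is netconsole's default target port; change it if you configure the module differently), open a file in "ab" mode, and pass both to log_packets.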
Call HP and find out what the latest recommended BIOS and firmware upgrades for your system are. C-STATE has been a particular issue with Dell, and it's been disabled entirely in recent BIOS versions. I see below you've updated BIOS.
Scan logs for other messages, particularly panics and/or ECC issues.
I haven't seen anything ominous, although I have noticed a long time gap between the last entry in /var/log/messages and the actual crash. Such a gap in entries is very unusual.
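For comparing such gaps across several crashes, a short script can list every long silence in the log. This is a sketch that assumes the traditional syslog timestamp at the start of each line ("Mar 24 16:03:01 ..."), an English/C locale for the month names, and a year supplied by the caller, since that format does not record one. The 30-minute threshold and the sample lines are arbitrary illustrations.

```python
from datetime import datetime, timedelta


def find_gaps(lines, min_gap=timedelta(minutes=30), year=2011):
    """Return (previous_time, next_time, gap) for each silence >= min_gap.

    Assumes every log line starts with a 15-character syslog timestamp such
    as "Mar 24 16:03:01". Lines that do not parse are skipped.
    """
    gaps = []
    prev = None
    for line in lines:
        try:
            ts = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # continuation line or unexpected format
        if prev is not None and ts - prev >= min_gap:
            gaps.append((prev, ts, ts - prev))
        prev = ts
    return gaps


# Hypothetical sample lines, not taken from the real server.
sample_lines = [
    "Mar 24 10:00:00 host kernel: eth0: link up",
    "Mar 24 10:05:00 host crond[123]: job started",
    "Mar 24 15:45:00 host syslogd 1.4.1: restart.",
]
for before, after, gap in find_gaps(sample_lines):
    print(f"silent for {gap} between {before} and {after}")
```

On a real system the lines would come from open("/var/log/messages"). If the silence always starts well before the reboot, the box may be hanging for a while before it finally panics.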
If you can stand the downtime, run memtest86+ at least overnight on your RAM. A reboot indicates a failed test.
Otherwise: try running with half your RAM swapped out.
Check/reseat all DIMMs, sockets, and cables. Some folks caution against this on the basis of connector wear, but if you've got a problem, this may help resolve it, and I've seen boxes shipped with components poorly or even un-cabled.
We have one DIMM of 4 GB RAM, so I can't swap it out or run with half. I have reseated it and inspected the contacts, and it looks OK. I will look at anything else with connectors.
- Does a similarly equipped system exhibit the same problems?
The server is an HP DL380 G7 Server with 4 GB RAM (1 DIMM 1333 MHz), one 4-core CPU (2133 MHz), 4 built-in Broadcom "NetXtreme II BCM5709 II Gigabit Ethernet" NICs, and a P410 Smart Array Controller. The P410 and the system BIOS have both been updated to the latest levels to see if that fixes the crashes, with no change.
Ugh. Broadcom's gotten better but I prefer Intel NICs. Can't speak to the others. And OK, you've updated BIOS.
Thanks for your help!
Best Regards,
Dave Windsor
Robert Bosch LLC Team Leader, MES Database Infrastructure Group (AdP/TEF7.1) 4421 Highway 81 North Anderson, SC 29621 USA www.bosch.us
Tel: 1 (864) 260-8459 Fax: 1 (864) 260-8422 Dave.Windsor@us.bosch.com