[CentOS] Kernel Panic on HP/Compaq ProLiant G7

Thu Mar 24 21:05:14 UTC 2011
Dr. Ed Morbius <dredmorbius at gmail.com>

on 16:56 Thu 24 Mar, Windsor Dave L. (AdP/TEF7) (Dave.Windsor at us.bosch.com) wrote:
> On 3/24/2011 4:38 PM, Dr. Ed Morbius wrote:
> > Dave:
> >
> > on 16:03 Thu 24 Mar, Windsor Dave L. (AdP/TEF7.1) (Dave.Windsor at us.bosch.com) wrote:
> >> Hello Everyone,
> >>
> >> Code: 00 00 00 00 00 00 00 00 70 4d 4f 9d 00 81 ff ff 98 e4 4b dc
> >> RIP  [<ffff8100dc435cf0>]
> >>   RSP<ffff81001529fd18>
> >> CR2: ffff8100dc435cf0
> >>   <0>Kernel panic - not syncing: Fatal exception
> >>
> >>
> >> This suggests that something happened in a Samba process.

<...>

> >   - If you haven't, configure the netconsole kernel module for
> >     kernel-enabled network logging of panics.
>        This is a great idea.  I will work on that soonest.

It really is about four times as cool as it sounds.  Getting the actual
panic is hugely useful.
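
For reference, a minimal sketch of loading netconsole by hand.  The
parameter format is src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac; every
address, the interface name and the MAC below are placeholders, not
values from your network.  Anything listening on UDP 6666 on the
receiving box (a syslog daemon, netcat) should pick up the messages.

```shell
# Build the netconsole= module parameter:
#   src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac
# All values passed in below are placeholders, not taken from the real setup.
netconsole_param() {
    # $1 = local IP, $2 = local interface, $3 = log host IP, $4 = log host MAC
    printf '6665@%s/%s,6666@%s/%s\n' "$1" "$2" "$3" "$4"
}

param=$(netconsole_param 192.0.2.10 eth0 192.0.2.20 00:11:22:33:44:55)

# Print the command instead of running it; run it as root once the
# values match your network.
echo "modprobe netconsole netconsole=$param"
```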

> >   - Call HP and find out what the latest recommended BIOS and firmware
> >     upgrades for your system are.  C-STATE has been a particular issue
> >     with Dell, and it's been disabled entirely in recent BIOS versions.
> >     I see below you've updated BIOS.
> >
> >   - Scan logs for other messages, particularly panics and/or ECC issues.
>        I haven't seen anything ominous, although I have noticed a long 
> time gap between the last entry in /var/log/messages and the actual 
> crash.  Such a gap in entries is very unusual.

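You can create a "timestamp" cron job.  Just a

    */10 * * * * root logger -- "--- TIMESTAMP ---"

... entry in /etc/crontab (the "root" user field is only valid there or
in /etc/cron.d, not in a per-user crontab).  The bare "--" keeps logger
from reading the message as an option.  At least you'll see any long dry
periods.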
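
Once the timestamps are going in, something like the following can
point out the dry periods for you.  It's only a sketch: find_log_gaps is
a made-up helper name, it assumes the traditional "Mon DD HH:MM:SS"
syslog prefix, and it ignores month rollover.

```shell
# Report gaps longer than a threshold between consecutive syslog lines.
# Assumes the "Mon DD HH:MM:SS" prefix and a log that stays within one month.
find_log_gaps() {
    # $1 = threshold in seconds; log lines are read from stdin
    awk -v limit="$1" '
    {
        split($3, t, ":")
        now = $2 * 86400 + t[1] * 3600 + t[2] * 60 + t[3]
        if (NR > 1 && now - prev > limit)
            printf "%d second gap before: %s\n", now - prev, $0
        prev = now
    }'
}
```

Usage would be along the lines of "find_log_gaps 1200 < /var/log/messages"
to flag anything quieter than 20 minutes.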

sar is also a useful utility to look at.  If sysstat is installed, it
should already be recording system state and resource utilization
levels, so you can report on the period just before the crash.
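
A rough example of pulling that data back out, assuming the stock
sysstat layout of one file per day of the month under /var/log/sa.  The
day and the time window are placeholders; substitute the period leading
up to your crash.

```shell
# Day of month and time window are placeholders, not the real crash time.
day=24
sa_file="/var/log/sa/sa${day}"

if command -v sar >/dev/null 2>&1 && [ -r "$sa_file" ]; then
    sar -u -f "$sa_file" -s 15:00:00 -e 16:10:00   # CPU utilization
    sar -r -f "$sa_file" -s 15:00:00 -e 16:10:00   # memory and swap usage
    sar -q -f "$sa_file" -s 15:00:00 -e 16:10:00   # run queue and load average
else
    echo "no readable sysstat data at $sa_file"
fi
```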

> >   - If you can stand the downtime, run memtest86+ at least overnight on
> >     your RAM.  A reboot indicates a failed test.
> >
> >   - Otherwise: try running with half your RAM swapped out.
> >
> >   - Check/reseat all DIMMs, sockets, and cables.  Some folks caution
> >     against this on the basis of connector wear, but if you've got a
> >     problem, this may help resolve it, and I've seen boxes shipped with
> >     components poorly or even un-cabled.
>        We have one DIMM of 4 GB RAM, so I can't swap it out or run with 
> half.  I have reseated it and inspected the contacts, and it looks OK. 
> I will look at anything else with connectors.

Actually, you can.  Setting 'mem=2G' at your boot prompt will cue the
kernel to use only half the RAM.  Now, you can't specify an offset to
use the high half, unfortunately.  You could also swap the DIMM with
another system if you've got it and see if you still have the problems
in this one (or start seeing them in the other).
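
To make that concrete, a sketch of where the option goes and how to
check it took effect.  The kernel line is shown with placeholders for
the version and root device, and has_mem_limit is just an illustrative
helper, not a standard tool.

```shell
# On CentOS 5 the option goes at the end of the kernel line in
# /boot/grub/grub.conf (or is added at the GRUB prompt for a one-off boot):
#
#   kernel /vmlinuz-<version> ro root=<root device> mem=2G
#
# After rebooting, confirm the kernel picked it up.

# Succeeds if the given kernel command line carries a mem= option.
has_mem_limit() {
    case " $1 " in
        *" mem="*) return 0 ;;
        *)         return 1 ;;
    esac
}

if [ -r /proc/cmdline ] && has_mem_limit "$(cat /proc/cmdline)"; then
    echo "mem= limit is active:"
    grep MemTotal /proc/meminfo    # expect a bit under 2 GB
else
    echo "no mem= option on the running kernel's command line"
fi
```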
 
-- 
Dr. Ed Morbius, Chief Scientist /            |
  Robot Wrangler / Staff Psychologist        | When you seek unlimited power
Krell Power Systems Unlimited                |                  Go to Krell!