[CentOS] Kernel errors, resulting in system crashes

Tue May 23 10:42:41 UTC 2006
William L. Maltby <BillsCentOS at triad.rr.com>

On Tue, 2006-05-23 at 08:27 +0000, Henri Cook wrote:
> 
> Hi Guys,
> 
> <snip>

> I run a webserver (remotely administered, no graphical user interface that
> I know of - certainly not one running all the time). In the past few days
> this error has started to occur; the system dies and needs a manual
> reboot after each occurrence:
> 
> ----------------
> 
> May 21 23:08:01 testmachine kernel: Page has mapping still set. This is a
> serious situation. However if you
> May 21 23:08:01 testmachine kernel: are using the NVidia binary only
> module please report this bug to
> 
> -----------------
> 
> The machine runs CentOS 3 and has done so for ~2 years. A technician at the
> datacentre ran Memtest earlier today, which returned no errors.
> 
> Hardware-wise it's an Intel 2.53GHz Celeron with 1.5GB RAM
> 
> I would really appreciate any help!

I don't know if I'll be any help, but I'm awake already and a lot of
folks aren't!  :-))

1) Any changes recently in software, hardware, location, computer room
   re-arranged?

I ask because you say it's running about two years. Since you list no
changes and it just starts happening, I thought I'd lead you down that
path.

For software,

    rpm -qa --last

gives a list sorted by install date, newest first, with timestamps, like
this:

    wine-jack-0.9.12-1.el4.kb           Sun 14 May 2006 04:34:15 PM EDT
    wine-cms-0.9.12-1.el4.kb            Sun 14 May 2006 04:34:15 PM EDT
    wine-tools-0.9.12-1.el4.kb          Sun 14 May 2006 04:34:14 PM EDT
    wine-devel-0.9.12-1.el4.kb          Sun 14 May 2006 04:34:08 PM EDT

That will show software changes made through yum, rpm, or up2date (I
guess).
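
If that list is long, something like the sketch below can narrow it to
one month. It assumes the date columns look like the sample above
(package, weekday, day, month, year), which depends on the locale, and
pkgs_in_month is just a name made up for illustration.

```shell
# Hypothetical helper: keep only the lines of `rpm -qa --last` output
# whose install date falls in the given month and year.
# Assumes the column layout of the sample above (package, weekday, day,
# month, year, ...); other locales may order the date differently.
pkgs_in_month() {
    awk -v m="$1" -v y="$2" '$4 == m && $5 == y'
}

# Usage (on the server):
#   rpm -qa --last | pkgs_in_month May 2006
```

Anything installed or updated shortly before the first crash is the
first thing to look at.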

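The last thing for software is to use rpm's verify mode (rpm -Va) to see
if any installed files have been corrupted. (The --checksig option only
checks the signature of a package file, not the files already installed.)
Remember that certain changes are expected, such as edited config files,
but you need to check each flagged file to be sure it is not an
executable that unexpectedly changed due to corruption.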
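
A rough way to cut down the noise, assuming the check is run with rpm's
verify mode (rpm -Va): keep only files whose checksum changed and that
are not marked as config files, plus anything reported missing. The
function name is made up, and the filter is only a sketch; it relies on
the usual verify output of a flag string, an optional "c" marker, and
the path.

```shell
# Hypothetical filter for `rpm -Va` output read from stdin.
# Prints lines where the flag string contains "5" (file digest differs)
# and the file is not marked "c" (config), and also "missing" lines.
suspicious_changes() {
    awk '$1 == "missing" || ($1 ~ /5/ && $2 != "c")'
}

# Usage (on the server, as root; this can take a while):
#   rpm -Va | suspicious_changes
```

A changed checksum on a binary or library that no update touched is
what deserves a closer look.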

The tech at your datacentre should be able to answer the other three
questions (hardware, location, room re-arrangement).

The reason for the hardware Q is that my experience indicates that a
couple of years of steady-state running that suddenly breaks, when no
changes in software have been made, is most often caused by something
that "jiggled" the hardware. Sometimes the jiggle was equivalent to a
Richter scale 7.6, sometimes much less.

Regardless, parts can work loose even if not jiggled. Small thermal
changes occurring constantly over time tend to unseat components. Ask
your hardware guy to open the case next time it goes down, re-seat all
components (memory sticks, PCI cards, ISA cards, ...) and check the
tightness of easily accessible screws. When he buttons it back up, he
should also check and re-seat the cables.

That's a starting point.

Have you checked the logs to see if any additional error messages are
available? Maybe what you see is the end result, not the cause? Anyway,
for the times it crashed, any common conditions? Every X hours? Logs
show certain program recently started and running when it happens? Etc.
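
To look for a pattern in the crash times, a sketch like this pulls the
timestamp of each occurrence out of the syslog. It assumes the usual
syslog line layout seen in your excerpt (month, day, time first) and
that the messages land in /var/log/messages; crash_times is a made-up
name.

```shell
# Hypothetical helper: read syslog lines on stdin and print the
# timestamp (month, day, time) of each "Page has mapping still set"
# kernel message. uniq drops adjacent repeats of the same timestamp.
crash_times() {
    awk '/Page has mapping still set/ { print $1, $2, $3 }' | uniq
}

# Usage (on the server):
#   cat /var/log/messages.1 /var/log/messages | crash_times
```

With the times in hand, grep -B 20 'Page has mapping still set'
/var/log/messages shows the 20 lines logged before each hit, which is
where an earlier error or a freshly started cron job would show up.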

Do you have an nvidia card there? If no other culprits appear, try
replacing it. Maybe it's gone bad?
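
If you are not sure whether there is an nvidia card or driver in the
box, the output of lspci and lsmod can be searched for it. A minimal
sketch, with a made-up function name:

```shell
# Hypothetical helper: print any line on stdin that mentions nvidia,
# ignoring case. Feed it lspci output (hardware) or lsmod output
# (loaded kernel modules).
nvidia_lines() {
    grep -i nvidia
}

# Usage (on the server):
#   /sbin/lspci | nvidia_lines
#   /sbin/lsmod | nvidia_lines
```

No output from either command suggests there is neither an nvidia
device nor the binary module on the machine.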

I'm sure other folks will suggest enabling kernel debugging, which can
give strong clues to those who know how to use the results.

> 
> Henri
> <snip sig stuff>

HTH
-- 
Bill