Kernel errors, resulting in system crashes

Henri Cook

23 May 2006

8:27 a.m.

Hi Guys,

After a chat with some helpful people on Freenode I've included as much information as I can think of about this error. If you need anything else, please don't hesitate to ask.

I run a webserver (remotely administered, with no graphical user interface that I know of, and certainly not one running all the time). In the past few days this error has started to occur; the system dies and needs a manual reboot after each occurrence:

----------------

May 21 23:08:01 testmachine kernel: Page has mapping still set. This is a serious situation. However if you
May 21 23:08:01 testmachine kernel: are using the NVidia binary only module please report this bug to

-----------------

The machine runs CentOS 3 and has done so for ~2 years. A technician at the datacentre ran Memtest earlier today, which returned no errors.

Hardware-wise, it's an Intel 2.53GHz Celeron with 1.5GB RAM.

I would really appreciate any help!

Henri


William L. Maltby

23 May 2006

10:42 a.m.

On Tue, 2006-05-23 at 08:27 +0000, Henri Cook wrote:

...

Hi Guys,

<snip>

...

I run a webserver (remotely administered, with no graphical user interface that I know of, and certainly not one running all the time). In the past few days this error has started to occur; the system dies and needs a manual reboot after each occurrence:

May 21 23:08:01 testmachine kernel: Page has mapping still set. This is a serious situation. However if you
May 21 23:08:01 testmachine kernel: are using the NVidia binary only module please report this bug to

The machine runs CentOS 3 and has done so for ~2 years. A technician at the datacentre ran Memtest earlier today, which returned no errors.

Hardware-wise, it's an Intel 2.53GHz Celeron with 1.5GB RAM.

I would really appreciate any help!

I don't know if I'll be any help, but I'm awake already and a lot of folks aren't! :-))

1) Any changes recently in software, hardware, location, computer room re-arranged?

I ask because you say it's running about two years. Since you list no changes and it just starts happening, I thought I'd lead you down that path.

For software,

rpm -qa --last

will give a nice list, descending on install date, with timestamps. Like this.

wine-jack-0.9.12-1.el4.kb           Sun 14 May 2006 04:34:15 PM EDT
wine-cms-0.9.12-1.el4.kb            Sun 14 May 2006 04:34:15 PM EDT
wine-tools-0.9.12-1.el4.kb          Sun 14 May 2006 04:34:14 PM EDT
wine-devel-0.9.12-1.el4.kb          Sun 14 May 2006 04:34:08 PM EDT

That'll tell about software changes if you use yum or rpm or update (I guess).

Last thing for software is to use the rpm query with the --checksig option to see if there has been corruption. Remember that certain changes are expected but you need to check each flagged file to be sure it is not an executable that unexpectedly changed due to corruption.
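A note on the verification step above: as far as I know, `rpm --checksig` (also `rpm -K`) checks the signature and digest of a package file, while the per-file flags described here come from rpm's verify mode, `rpm -Va`. Below is a minimal sketch for narrowing that output, assuming the usual `rpm -V` line layout (a flag field, an optional `c` marker for config files, then the path). The function name `suspicious_changes` is hypothetical.

```shell
# Hypothetical helper: filter `rpm -Va` style output read from stdin.
# Keeps lines whose flag field contains "5" (file digest differs from the
# rpm database) and that are not marked "c" (config files, which are
# expected to change). Prints only the path.
# Simplification: paths containing spaces are not handled.
suspicious_changes() {
    awk '$1 ~ /5/ && $2 != "c" { print $NF }'
}

# Typical use on the affected machine (root is usually needed so that
# every file can be read):
#   rpm -Va | suspicious_changes
```

Anything this prints under a bin or lib directory is worth a closer look; changed config files are filtered out because they are normally edited on purpose.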

Your tech in the center should be able to answer the other three Qs.

The reason for the hardware Q is that my experience indicates that a couple of years of steady-state running that suddenly breaks, when no changes in software have been made, is most often caused by something that "jiggled" the hardware. Sometimes the jiggle was equivalent to a Richter scale 7.6, sometimes much less.

Regardless, parts can work loose even if not jiggled. Small thermal changes constantly occurring over time tend to unseat some components sometimes. Ask your hardware guy to open the case next time it goes down and re-seat all components (memory sticks, PCI cards, ISA cards, ...) and then check tightness of easily accessible screws. When he buttons it back up, check/re-seat cables, check accessible screws.

That's a starting point.

Have you checked the logs to see if any additional error messages are available? Maybe what you see is the end result, not the cause? Anyway, for the times it crashed, any common conditions? Every X hours? Logs show certain program recently started and running when it happens? Etc.
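The log questions above can be approached with plain grep. Here is a rough sketch, assuming the kernel messages land in a syslog file such as /var/log/messages (the path and the function names are assumptions, not something stated in the thread):

```shell
# Message text taken from the log excerpt in the original post.
PATTERN='kernel: Page has mapping still set'

# Print the timestamp (month, day, time) of every occurrence in the given
# syslog file, to look for a pattern such as "every X hours".
crash_times() {
    grep "$PATTERN" "$1" | awk '{ print $1, $2, $3 }'
}

# Show the lines logged just before each occurrence (default: 20 lines),
# in case the message is the end result rather than the cause.
context_before() {
    grep -B "${2:-20}" "$PATTERN" "$1"
}

# Example, assuming the default syslog location:
#   crash_times /var/log/messages
#   context_before /var/log/messages 50
```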

Do you have an nvidia card there? If no other culprits appear, try replacing it. Maybe it's gone bad?

I'm sure other folks will suggest enabling kernel debugging, which can give strong clues to those who know how to use the results.

...

Henri

<snip sig stuff>

HTH

-- Bill

Henri Cook

1:22 p.m.

Thanks for getting back to me. The replies are quoted below; I'll try to answer both of them.

I'm running kernel 2.4.21-40.EL - should have mentioned that before!

rpm -qa --last outputs:

libtiff-3.5.7-25.el3.1              Sun 21 May 2006 23:19:10 GMT
ethereal-0.99.0-EL3.2               Sun 21 May 2006 23:19:08 GMT
glib-1.2.10-11.1                    Sun 21 May 2006 23:19:01 GMT
kernel-utils-2.4-8.37.14            Wed 26 Apr 2006 09:48:00 GMT
kernel-pcmcia-cs-3.1.31-19          Wed 26 Apr 2006 09:47:59 GMT
kernel-2.4.21-40.EL                 Wed 26 Apr 2006 09:47:51 GMT

So the kernel was upgraded *relatively* recently, but this is the first time I've noticed any errors.

I run up2date/yum regularly (using 'up2date'). I don't use any web control panel software.

Unfortunately the datacentre charges 'remote hands' fees of £35 per half hour (!), so running memtest at regular intervals is probably not the hottest idea. It's also a production machine, so I'm quite hesitant to take it down for prolonged periods.

As far as I'm aware the machine hasn't been moved, and I've only updated things through up2date.

The problem hasn't recurred in the last twenty-four hours. I *was* seeing some glib errors, but a recently issued package update may have fixed that. Any more suggestions would be much appreciated. If the glib update has fixed it (that is, if it doesn't crash in the next 48 hours), I'll mail the list.

Kindest,

Henri

hey Henri,

Henri Cook wrote:

...

I run a webserver (remotely administered, with no graphical user interface that I know of, and certainly not one running all the time). In the past few days this error has started to occur; the system dies and needs a manual reboot after each occurrence:

you forgot to mention what kernel you are running and when was the last time you yum updated the machine ? are you also running some sort of web control panel that might ( bad idea ) be suppressing packages from being updated ?

while the machine is running - check lsmod, to see if any strange modules are being loaded ?

...

The machine runs CentOS 3 and has done so for ~2 years. A technician at the datacentre ran Memtest earlier today, which returned no errors.

normally you would run memtest for a while ( overnight 12 - 16 hr cycles work best ) before you have a real result.

- KB


Karanbir Singh

12:02 p.m.

hey Henri,

Henri Cook wrote:

...

I run a webserver (remotely administered, with no graphical user interface that I know of, and certainly not one running all the time). In the past few days this error has started to occur; the system dies and needs a manual reboot after each occurrence:

while the machine is running - check lsmod, to see if any strange modules are being loaded ?
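One way to spot strange modules is to compare the lsmod listing against a list of modules known to be expected on this machine. This is a minimal sketch; the function name and the file of expected names are hypothetical.

```shell
# Hypothetical helper: report loaded modules that are not on a known list.
# stdin: output of lsmod (first line is the column header)
# $1:    file with expected module names, one per line
unexpected_modules() {
    # Drop the header, keep the module name column, then print only the
    # names that do not exactly match (-x) any fixed string (-F) in the list.
    awk 'NR > 1 { print $1 }' | grep -v -x -F -f "$1"
}

# Example (the list file path is an assumption):
#   lsmod | unexpected_modules /root/expected-modules.txt
```

An `nvidia` entry in the result would be of particular interest here, given the wording of the kernel message.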

...

The machine runs CentOS 3 and has done so for ~2 years. A technician at the datacentre ran Memtest earlier today, which returned no errors.

normally you would run memtest for a while ( overnight 12 - 16 hr cycles work best ) before you have a real result.

- KB

-- Karanbir Singh : http://www.karan.org/ : 2522219@icq
