[CentOS] CentOS-5.8 - Problem booting remote host

Wed Jun 20 16:36:47 UTC 2012

Kernel 2.6.18-308.8.2.el5

I recently experienced an odd problem with a host at our warm-site
location.  The facility we use suffered an HVAC failure during
elevated ambient temperatures (30C+) on Monday past and the equipment
room reportedly cooked for some hours.  It was hot enough that our
equipment shut down.  In all probability this was due to an over-temp
condition, since the systems are all powered from a UPS, but possibly
there was an extended power outage instead.

Whatever the cause, one of our hosts did not restart after this
shutdown, which required my presence on site.  When it was
powered up today in situ, the host's console displayed the CentOS
splash screen with the message [press any key to enter menu] and then
the message "Booting in ....4 seconds".

However, the countdown timer never changed from the initial value and
the restart never took place.  When I entered the console menu and
selected the most recent kernel available the system booted normally. 
It had to do a lot of disk remediation on the first go through but all
that completed without untoward difficulty.  Subsequent shutdowns
displayed the same behaviour.  The splash screen displayed, the boot
timer message showed, and then nothing changed thereafter unless and
until I entered the boot menu.

Selecting the default kernel in the boot menu allowed the restart to
continue, this time without any unusual reports.  I repeated this
process several times more just to confirm that this was not a
transient effect.  Each time operator intervention from the console
was required to restart the system but once this was done no further
problems were noted.  I repeated the process and rebooted using each
of the older kernels present. As far as I could determine there was
nothing wrong with any of the boot images once past the auto-select
segment of the boot process.

I then went into /boot/grub/grub.conf and changed the default boot
from index 0 to index 1 so as to use the previous kernel.  After this
configuration change the system restarted normally without operator
intervention.
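
For anyone wanting to check which entry the "default" index actually
points at, here is a minimal sketch in Python.  The grub.conf text in
it is a hypothetical sample, not a copy of the file on this host: only
the 2.6.18-308.8.2.el5 version string comes from the logs in this
post, and the second entry, the timeout value, and the device names
are placeholders.  It only handles a numeric default, not "default
saved".

```python
import re

# Hypothetical grub.conf contents for illustration only. The second
# entry and all device/label values are placeholders.
SAMPLE_GRUB_CONF = """\
default=1
timeout=5
splashimage=(hd0,0)/grub/splash.xpm.gz
hiddenmenu
title CentOS (2.6.18-308.8.2.el5)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-308.8.2.el5 ro root=LABEL=/
        initrd /initrd-2.6.18-308.8.2.el5.img
title CentOS (previous kernel, placeholder)
        root (hd0,0)
        kernel /vmlinuz-PREVIOUS ro root=LABEL=/
        initrd /initrd-PREVIOUS.img
"""


def default_entry(conf_text):
    """Return (index, title) of the menu entry selected by `default`."""
    # Menu entries are counted from 0 in the order their title lines appear.
    titles = re.findall(r"^title\s+(.*)$", conf_text, flags=re.MULTILINE)
    match = re.search(r"^default[=\s]\s*(\d+)\s*$", conf_text,
                      flags=re.MULTILINE)
    # GRUB legacy falls back to entry 0 when no default is given.
    index = int(match.group(1)) if match else 0
    return index, titles[index]


index, title = default_entry(SAMPLE_GRUB_CONF)
print("default index %d -> %s" % (index, title))
```

Because entries are counted from 0 in file order, index 1 is the
second title line, which is normally the previously installed kernel
when a new kernel has been added at the top of the file.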

The problem kernel was installed from Updates on June 13 and was
running from that date as shown in the log entries below.  This was a
remote restart and evidently it completed without any problem.

Jun 13 10:15:59 inet04 shutdown[19274]: shutting down for system reboot
. . .
Jun 13 10:16:25 inet04 exiting on signal 15
Jun 13 10:18:35 inet04 syslogd 1.4.1: restart.
Jun 13 10:18:35 inet04 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Jun 13 10:18:35 inet04 kernel: Linux version 2.6.18-308.8.2.el5 (mockbuild@builder10.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-52)) #1 SMP Tue Jun 12 09:57:26 EDT 2012

The current syslog shows repeated restarts commencing at 18:12 on Jun
17 and ending at 19:08, after which the system no longer records any
activity whatsoever until I restarted it manually earlier today. 
These two syslog entries are adjacent in /var/log/messages:

Jun 18 19:08:48 inet04 kernel: PROBE_BLACKIST: IN=eth0 OUT=
MAC=00:1c:c4:a1:66:1e:00:18:73:e8:35:a1:08:00 SRC=126.67.126.141
DST=209.47.176.105 LEN=48 TOS=0x00 PREC=0x00 TTL=110 ID=56792 DF
PROTO=TCP SPT=2153 DPT=445 WINDOW=64240 RES=0x00 SYN URGP=0

Jun 20 12:11:31 inet04 ntpd[2480]: time reset +147721.498633 s
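
As a rough check on the two entries above, the ntpd offset can be
subtracted from the timestamp of the "time reset" line to estimate
what the system clock read just before it was stepped.  This is a
sketch under two assumptions: that the year is 2012 (syslog does not
record it) and that the ntpd line carries the already-corrected time.
The function name is mine.

```python
from datetime import datetime, timedelta


def clock_before_step(logged_at, offset_seconds):
    """Estimate the clock reading just before ntpd applied its step."""
    # Assumes logged_at is the corrected (post-step) time.
    return logged_at - timedelta(seconds=offset_seconds)


# Values taken from the ntpd log line; the year 2012 is assumed.
logged_at = datetime(2012, 6, 20, 12, 11, 31)
offset = 147721.498633

print("offset:", timedelta(seconds=offset))
print("clock before step:", clock_before_step(logged_at, offset))
```

The offset works out to about 41 hours, which puts the pre-step clock
at roughly 19:09 on Jun 18, less than a minute after the last kernel
entry logged at 19:08:48.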

Have any of you ever experienced anything like this?  Does anyone have
any idea what might have caused the corruption of the restart
mechanism or where the problem might be?

-- 
***          E-Mail is NOT a SECURE channel          ***
James B. Byrne                mailto:ByrneJB at Harte-Lyne.ca
Harte & Lyne Limited          http://www.harte-lyne.ca
9 Brockley Drive              vox: +1 905 561 1241
Hamilton, Ontario             fax: +1 905 561 0757
Canada  L8E 3C3