Kernel 2.6.18-308.8.2.el5
I recently experienced an odd problem with a host at our warm-site location. The facility we use suffered an HVAC failure during elevated ambient temperatures (30C+) on Monday past, and the equipment room reportedly cooked for some hours. It was sufficient that our equipment shut down. In all probability this was due to an over-temperature condition, since the systems are all powered from a UPS, but possibly there was an extended power outage instead.
Whatever the cause, one of our hosts did not restart after this shutdown, which required my presence on site. When it was powered up today in situ, the host's console displayed the CentOS splash screen with the message [press any key to enter menu] and then the message "Booting in ....4 seconds"
However, the countdown timer never changed from the initial value and the boot never proceeded. When I entered the console menu and selected the most recent kernel available, the system booted normally. It had to do a lot of disk remediation on the first pass, but all of that completed without untoward difficulty. Subsequent restarts displayed the same behaviour: the splash screen displayed, the boot timer message showed, and then nothing changed unless and until I entered the boot menu.
Selecting the default kernel in the boot menu allowed the restart to continue, this time without any unusual reports. I repeated this process several times more just to confirm that this was not a transient effect. Each time operator intervention from the console was required to restart the system but once this was done no further problems were noted. I repeated the process and rebooted using each of the older kernels present. As far as I could determine there was nothing wrong with any of the boot images once past the auto-select segment of the boot process.
I then went into /boot/grub/grub.conf and changed the default boot entry from index 0 to index 1 so as to use the previous kernel. Following this configuration change the system restarted normally without operator intervention.
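For anyone wanting to check which menu entry a given default index actually selects, here is a minimal sketch in Python. The grub.conf text in it is a made-up sample, not the file from this host: the second title is a placeholder because I am not quoting the older kernel's version here, and the function name default_boot_entry is purely illustrative. It assumes GRUB legacy syntax, where entries are counted from 0 in the order their title lines appear.

```python
import re

# Hypothetical grub.conf contents, for illustration only (not the real file).
SAMPLE_GRUB_CONF = """\
# sample only
default=1
hiddenmenu
title CentOS (2.6.18-308.8.2.el5)
        root (hd0,0)
title CentOS (older kernel, version assumed)
        root (hd0,0)
"""

def default_boot_entry(conf_text):
    """Return (index, title) of the entry a numeric default= line selects."""
    default_index = 0  # GRUB legacy uses entry 0 when no default is given
    titles = []
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Accept both "key=value" and "key value" forms.
        match = re.match(r"(\w+)[\s=]+(.*)", line)
        if not match:
            continue  # bare keywords such as "hiddenmenu"
        key, value = match.group(1), match.group(2).strip()
        if key == "default" and value.isdigit():
            default_index = int(value)
        elif key == "title":
            titles.append(value)
    if default_index >= len(titles):
        raise IndexError("default index has no matching title entry")
    return default_index, titles[default_index]

index, title = default_boot_entry(SAMPLE_GRUB_CONF)
print(f"default={index} selects: {title}")
```

With the sample text, index 1 resolves to the second title line, which is the effect of the change described above. Non-numeric values such as "default saved" are ignored by this sketch.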
The problem kernel was installed from Updates on June 13 and was running from that date as shown in the log entries below. This was a remote restart and evidently it completed without any problem.
Jun 13 10:15:59 inet04 shutdown[19274]: shutting down for system reboot
. . .
Jun 13 10:16:25 inet04 exiting on signal 15
Jun 13 10:18:35 inet04 syslogd 1.4.1: restart.
Jun 13 10:18:35 inet04 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Jun 13 10:18:35 inet04 kernel: Linux version 2.6.18-308.8.2.el5 (mockbuild@builder10.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-52)) #1 SMP Tue Jun 12 09:57:26 EDT 2012
The current syslog shows repeated restarts commencing at 18:12 on Jun 17 and ending at 19:08, after which the system no longer records any activity whatsoever until I restarted it manually earlier today. These two syslog entries are adjacent in /var/log/messages:
Jun 18 19:08:48 inet04 kernel: PROBE_BLACKIST: IN=eth0 OUT= MAC=00:1c:c4:a1:66:1e:00:18:73:e8:35:a1:08:00 SRC=126.67.126.141 DST=209.47.176.105 LEN=48 TOS=0x00 PREC=0x00 TTL=110 ID=56792 DF PROTO=TCP SPT=2153 DPT=445 WINDOW=64240 RES=0x00 SYN URGP=0
Jun 20 12:11:31 inet04 ntpd[2480]: time reset +147721.498633 s
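As a quick sanity check on those two entries, the gap between their timestamps can be compared with the step that ntpd reports. The sketch below does that arithmetic in Python. The year is an assumption (syslog lines do not carry one; 2012 is taken from the kernel build date in the earlier log excerpt), and the helper name parse_syslog_stamp is just for illustration.

```python
from datetime import datetime

# Assumption: syslog omits the year; 2012 comes from the kernel build date.
ASSUMED_YEAR = 2012

def parse_syslog_stamp(stamp, year=ASSUMED_YEAR):
    """Turn a 'Mon DD HH:MM:SS' syslog timestamp into a datetime."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")

last_entry = parse_syslog_stamp("Jun 18 19:08:48")   # final entry before the gap
ntpd_entry = parse_syslog_stamp("Jun 20 12:11:31")   # ntpd "time reset" entry
ntpd_step_seconds = 147721.498633                    # value reported by ntpd

# Wall-clock distance between the two adjacent log lines.
gap_seconds = (ntpd_entry - last_entry).total_seconds()

# How much of that gap is NOT accounted for by the ntpd step.
residual_seconds = gap_seconds - ntpd_step_seconds

print(f"gap between entries : {gap_seconds:.0f} s")
print(f"ntpd step           : {ntpd_step_seconds:.0f} s")
print(f"unaccounted residual: {residual_seconds:.1f} s")
```

The gap works out to 147763 s (about 41 hours), so the ntpd step accounts for all but roughly 41.5 s of it. That is at least consistent with the host's own clock having advanced very little between the last logged entry and the ntpd correction, though this is only an observation from the arithmetic and not a diagnosis.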
Have any of you ever experienced anything like this? Does anyone have any idea what might have caused the corruption of the restart mechanism or where the problem might be?