On Tue, 31 Aug 2010, fred smith wrote:
To: CentOS mailing list centos@centos.org
From: fred smith fredex@fcshome.stoneham.ma.us
Subject: Re: [CentOS] Centos 5.5, not booting latest kernel but older one instead
On Tue, Aug 31, 2010 at 08:18:26AM -0400, Robert Heller wrote:
At Mon, 30 Aug 2010 22:24:19 -0400 CentOS mailing list centos@centos.org wrote:
On Mon, Aug 30, 2010 at 08:41:31PM -0500, Larry Vaden wrote:
On Mon, Aug 30, 2010 at 8:18 PM, fred smith fredex@fcshome.stoneham.ma.us wrote:
Below is some info that shows the problem. Can anyone here provide helpful suggestions on (1) why it is doing this, and more importantly (2) how I can make it stop?
Is there a chance /boot is full (read: are all the menu'd kernels actually present in /boot)?
(IIRC something similar happened because the /boot partition was set at the recommended size of 100 MB).
/boot doesn't appear to be full, there appear to be 25.2 megabytes free with 20.1 available.
Another curious thing I just noticed: the list of kernels available at boot time (in the actual grub menu shown at boot) IS NOT THE SAME LIST THAT APPEARS IN GRUB.CONF. In the boot-time menu, the kernel it boots is the most recent one shown, and there are other, older ones that do not appear in grub.conf, while grub.conf lists several newer ones that do not appear on the boot-time grub menu.
most strange.
BTW, this is a raid-1 array using linux software raid, with two matching drives. Is there possibly some way the two drives could have gotten out of sync such that whichever one is the actual boot device has invalid info in /boot?
and while thinking along those lines, I see a number of mails in root's mailbox from "md" notifying us of a degraded array. these all appear to have happened, AFAICT, at system boot, over the last several months.
also, /var/log/messages contains a bunch of stuff like the below, also apparently at system boot, and I don't really know what it means, though the lines mentioning a device being "kicked" from the array seem ominous:
Aug 30 22:09:08 fcshome kernel: device-mapper: uevent: version 1.0.3
Aug 30 22:09:08 fcshome kernel: device-mapper: ioctl: 4.11.5-ioctl (2007-12-12) initialised: dm-devel@redhat.com
Aug 30 22:09:08 fcshome kernel: device-mapper: dm-raid45: initialized v0.2594l
Aug 30 22:09:08 fcshome kernel: md: Autodetecting RAID arrays.
Aug 30 22:09:08 fcshome kernel: md: autorun ...
Aug 30 22:09:08 fcshome kernel: md: considering sdb2 ...
Aug 30 22:09:08 fcshome kernel: md:  adding sdb2 ...
Aug 30 22:09:08 fcshome kernel: md: sdb1 has different UUID to sdb2
Aug 30 22:09:08 fcshome kernel: md:  adding sda2 ...
Aug 30 22:09:08 fcshome kernel: md: sda1 has different UUID to sdb2
Aug 30 22:09:08 fcshome kernel: md: created md1
Aug 30 22:09:08 fcshome kernel: md: bind<sda2>
Aug 30 22:09:08 fcshome kernel: md: bind<sdb2>
Aug 30 22:09:08 fcshome kernel: md: running: <sdb2><sda2>
Aug 30 22:09:08 fcshome kernel: md: kicking non-fresh sda2 from array!
Aug 30 22:09:08 fcshome kernel: md: unbind<sda2>
Aug 30 22:09:08 fcshome kernel: md: export_rdev(sda2)
Aug 30 22:09:08 fcshome kernel: raid1: raid set md1 active with 1 out of 2 mirrors
Aug 30 22:09:08 fcshome kernel: md: considering sdb1 ...
Aug 30 22:09:08 fcshome kernel: md:  adding sdb1 ...
Aug 30 22:09:08 fcshome kernel: md:  adding sda1 ...
Aug 30 22:09:08 fcshome kernel: md: created md0
Aug 30 22:09:08 fcshome kernel: md: bind<sda1>
Aug 30 22:09:08 fcshome kernel: md: bind<sdb1>
Aug 30 22:09:08 fcshome kernel: md: running: <sdb1><sda1>
Aug 30 22:09:08 fcshome kernel: md: kicking non-fresh sda1 from array!
Aug 30 22:09:08 fcshome kernel: md: unbind<sda1>
Aug 30 22:09:08 fcshome kernel: md: export_rdev(sda1)
Aug 30 22:09:08 fcshome kernel: raid1: raid set md0 active with 1 out of 2 mirrors
Aug 30 22:09:08 fcshome kernel: md: ... autorun DONE.
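For anyone wanting to pull the same information out of a longer /var/log/messages, here is a rough Python sketch that picks out which partitions were kicked from which array. The sample text is a subset of the lines posted above; the function name `summarize` and the regular expressions are my own and assume the kernel messages look exactly like these, so treat it as illustrative rather than a general-purpose parser.

```python
import re

# A subset of the log lines posted above (hostname and timestamps as posted).
SAMPLE_LOG = """\
Aug 30 22:09:08 fcshome kernel: md: created md1
Aug 30 22:09:08 fcshome kernel: md: bind<sda2>
Aug 30 22:09:08 fcshome kernel: md: bind<sdb2>
Aug 30 22:09:08 fcshome kernel: md: running: <sdb2><sda2>
Aug 30 22:09:08 fcshome kernel: md: kicking non-fresh sda2 from array!
Aug 30 22:09:08 fcshome kernel: raid1: raid set md1 active with 1 out of 2 mirrors
Aug 30 22:09:08 fcshome kernel: md: created md0
Aug 30 22:09:08 fcshome kernel: md: bind<sda1>
Aug 30 22:09:08 fcshome kernel: md: bind<sdb1>
Aug 30 22:09:08 fcshome kernel: md: running: <sdb1><sda1>
Aug 30 22:09:08 fcshome kernel: md: kicking non-fresh sda1 from array!
Aug 30 22:09:08 fcshome kernel: raid1: raid set md0 active with 1 out of 2 mirrors
"""

# Patterns are assumptions based only on the message formats seen above.
CREATED = re.compile(r"md: created (md\d+)")
KICKED = re.compile(r"md: kicking non-fresh (\w+) from array")
ACTIVE = re.compile(r"raid1: raid set (md\d+) active with (\d+) out of (\d+) mirrors")


def summarize(log_text):
    """Return {array: {"kicked": [members], "active": n, "total": m}}."""
    summary = {}
    current = None  # array named by the most recent "created" line
    for line in log_text.splitlines():
        m = CREATED.search(line)
        if m:
            current = m.group(1)
            summary[current] = {"kicked": [], "active": None, "total": None}
            continue
        m = KICKED.search(line)
        if m and current is not None:
            summary[current]["kicked"].append(m.group(1))
            continue
        m = ACTIVE.search(line)
        if m and m.group(1) in summary:
            summary[m.group(1)]["active"] = int(m.group(2))
            summary[m.group(1)]["total"] = int(m.group(3))
    return summary


if __name__ == "__main__":
    for array, info in sorted(summarize(SAMPLE_LOG).items()):
        print(array, "kicked:", info["kicked"],
              "active:", info["active"], "of", info["total"])
```

On the sample this reports sda1 kicked from md0 and sda2 kicked from md1, with each array running on one of its two mirrors.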
It looks like there is something wrong with sda... Your BIOS is booting grub from sda, grub is loading its conf, etc. from sda, but sda is not part of your raid sets of your running system. Your newer kernels are landing on sdb...
yeah, that sounds like a possibility.
I *think* you can fix this by using mdadm to add (mdadm --add ...) sda and make it rebuild sda1 and sda2 from sdb1 and sdb2. You may have to --fail and --remove it first.
I think you may be right. I'll give that a whirl at first opportunity.
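For reference, the mdadm suggestion above might look roughly like the following. This is a dry-run sketch in Python that only prints the commands rather than running them. The array-to-partition mapping (md0 = sda1 + sdb1, md1 = sda2 + sdb2) is read off the log earlier in the thread; the helper name `readd_commands` and the `fail_first` switch are made up for illustration. Whether the --fail/--remove step is needed depends on what `mdadm --detail` shows for the array, so check that before running anything as root.

```python
import shlex

# Mapping taken from the kernel log in this thread: the sda halves were kicked.
KICKED_MEMBERS = {"/dev/md0": "/dev/sda1", "/dev/md1": "/dev/sda2"}


def readd_commands(array, member, fail_first=False):
    """Build the mdadm command lines to re-add one kicked member (dry run only)."""
    cmds = []
    if fail_first:
        # Only needed if the array still lists the member as present or faulty.
        cmds.append(["mdadm", array, "--fail", member])
        cmds.append(["mdadm", array, "--remove", member])
    cmds.append(["mdadm", array, "--add", member])
    # Afterwards, inspect the array state and resync progress.
    cmds.append(["mdadm", "--detail", array])
    return cmds


def main():
    for array, member in sorted(KICKED_MEMBERS.items()):
        for argv in readd_commands(array, member):
            print(" ".join(shlex.quote(arg) for arg in argv))
    # Rebuild progress for all arrays is also visible here.
    print("cat /proc/mdstat")


if __name__ == "__main__":
    main()
```

Printing the commands first makes it easy to review them before pasting them into a root shell one at a time.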
After posting this last night I did further digging and found that the particular drives I'm using in this raid array are known to have long timeouts, causing raid controllers (though I don't know if that includes Linux's software RAID or not) to become confused and fail the mirror/drive when the timeout gets too long. There's apparently a WD utility (these are WD drives) to change a setting for that (the utility is wdtler.exe and the drive property is called TLER, for time-limited error recovery) which allegedly solves the problem. Other posters have pointed out that the newer drives OF THE SAME MODEL no longer let you set that. I haven't yet had the chance to find out if my drives allow it to be changed or not, but since they're somewhat over a year old I'm hopeful. Soon, I hope. Looks like I need to find a way to make a DOS bootable floppy (then add a floppy drive to the machine) so I can boot it up and give it a try.
Poking around with smartctl indicates NO drive errors on either drive, so I'm hopeful that the problem is "simply" as described above.
If I can't change the setting I may have to replace the drives. :( the entire REASON for buying two drives was so I would have some safety. doggone drive manufacturers!
Hi Fred. Somewhat OT but maybe of interest to you.
I had to replace some WD drives after 3 years' use.
One kept giving out SMART messages which I ignored, till the drive went AWOL. The other had no SMART error messages whatsoever.
That went down as well!
So I'm on Hitachi HDD now.
Reason being I had, and still have, a Hitachi 2.5" drive in one of my laptops. The SMART test report looks really bad.
The drive makes ominous clunking sounds when in use. I have been expecting it to fail for some time, but it just keeps going!
Hitachi provide a DOS test program for their hard drives.
http://www.hitachigst.com/support/downloads/#DFT
The specs for the Deskstar looked good. Plus they have a 3 year warranty.
[root@karsites ~]# smartctl -a /dev/sda
smartctl 5.39.1 2010-01-28 r3054 [i386-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi Deskstar P7K500 series
Device Model:     Hitachi HDP725050GLAT80
Serial Number:    xxxxxxxxxxxxx
Firmware Version: GM4OA42A
User Capacity:    500,107,862,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Tue Aug 31 15:29:30 2010 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Keith
-
---- Fred Smith -- fredex@fcshome.stoneham.ma.us -----------------------------
The eyes of the Lord are everywhere, keeping watch on the wicked and the good.
----------------------------- Proverbs 15:3 (niv) -----------------------------