On Tue, 31 Aug 2010, fred smith wrote:

> To: CentOS mailing list <centos at centos.org>
> From: fred smith <fredex at fcshome.stoneham.ma.us>
> Subject: Re: [CentOS] Centos 5.5, not booting latest kernel but older one instead
>
> On Tue, Aug 31, 2010 at 08:18:26AM -0400, Robert Heller wrote:
>> At Mon, 30 Aug 2010 22:24:19 -0400 CentOS mailing list <centos at centos.org> wrote:
>>
>>> On Mon, Aug 30, 2010 at 08:41:31PM -0500, Larry Vaden wrote:
>>>> On Mon, Aug 30, 2010 at 8:18 PM, fred smith
>>>> <fredex at fcshome.stoneham.ma.us> wrote:
>>>>>
>>>>> Below is some info that shows the problem. Can anyone here provide
>>>>> helpful suggestions on (1) why it is doing this and, more importantly,
>>>>> (2) how I can make it stop?
>>>>
>>>> Is there a chance /boot is full (read: are all the menu'd kernels
>>>> actually present in /boot)?
>>>>
>>>> (IIRC something similar happened because the /boot partition was set
>>>> at the recommended size of 100 MB.)
>>>
>>> /boot doesn't appear to be full: there appear to be 25.2 megabytes free,
>>> with 20.1 available.
>>>
>>> Another curious thing I just noticed is this: the list of kernels available
>>> at boot time (in the actual grub menu shown at boot) IS NOT THE SAME LIST
>>> THAT APPEARS IN GRUB.CONF. In the boot-time menu, the kernel it boots is
>>> the most recent one shown, and there are other older ones that do not
>>> appear in grub.conf, while grub.conf lists several newer ones that do not
>>> appear on the boot-time grub menu.
>>>
>>> Most strange.
>>>
>>> BTW, this is a RAID-1 array using Linux software RAID, with two matching
>>> drives. Is there possibly some way the two drives could have gotten out
>>> of sync, such that whichever one is the actual boot device has invalid
>>> info in /boot?
>>>
>>> And while thinking along those lines, I see a number of mails in root's
>>> mailbox from "md" notifying us of a degraded array.
>>> These all appear to have happened, AFAICT, at system boot, over the
>>> last several months.
>>>
>>> Also, /var/log/messages contains a bunch of stuff like the below, also
>>> apparently at system boot, and I don't really know what it means, though
>>> the lines mentioning a device being "kicked out" seem ominous:
>>>
>>> Aug 30 22:09:08 fcshome kernel: device-mapper: uevent: version 1.0.3
>>> Aug 30 22:09:08 fcshome kernel: device-mapper: ioctl: 4.11.5-ioctl (2007-12-12) initialised: dm-devel at redhat.com
>>> Aug 30 22:09:08 fcshome kernel: device-mapper: dm-raid45: initialized v0.2594l
>>> Aug 30 22:09:08 fcshome kernel: md: Autodetecting RAID arrays.
>>> Aug 30 22:09:08 fcshome kernel: md: autorun ...
>>> Aug 30 22:09:08 fcshome kernel: md: considering sdb2 ...
>>> Aug 30 22:09:08 fcshome kernel: md: adding sdb2 ...
>>> Aug 30 22:09:08 fcshome kernel: md: sdb1 has different UUID to sdb2
>>> Aug 30 22:09:08 fcshome kernel: md: adding sda2 ...
>>> Aug 30 22:09:08 fcshome kernel: md: sda1 has different UUID to sdb2
>>> Aug 30 22:09:08 fcshome kernel: md: created md1
>>> Aug 30 22:09:08 fcshome kernel: md: bind<sda2>
>>> Aug 30 22:09:08 fcshome kernel: md: bind<sdb2>
>>> Aug 30 22:09:08 fcshome kernel: md: running: <sdb2><sda2>
>>> Aug 30 22:09:08 fcshome kernel: md: kicking non-fresh sda2 from array!
>>> Aug 30 22:09:08 fcshome kernel: md: unbind<sda2>
>>> Aug 30 22:09:08 fcshome kernel: md: export_rdev(sda2)
>>> Aug 30 22:09:08 fcshome kernel: raid1: raid set md1 active with 1 out of 2 mirrors
>>> Aug 30 22:09:08 fcshome kernel: md: considering sdb1 ...
>>> Aug 30 22:09:08 fcshome kernel: md: adding sdb1 ...
>>> Aug 30 22:09:08 fcshome kernel: md: adding sda1 ...
>>> Aug 30 22:09:08 fcshome kernel: md: created md0
>>> Aug 30 22:09:08 fcshome kernel: md: bind<sda1>
>>> Aug 30 22:09:08 fcshome kernel: md: bind<sdb1>
>>> Aug 30 22:09:08 fcshome kernel: md: running: <sdb1><sda1>
>>> Aug 30 22:09:08 fcshome kernel: md: kicking non-fresh sda1 from array!
>>> Aug 30 22:09:08 fcshome kernel: md: unbind<sda1>
>>> Aug 30 22:09:08 fcshome kernel: md: export_rdev(sda1)
>>> Aug 30 22:09:08 fcshome kernel: raid1: raid set md0 active with 1 out of 2 mirrors
>>> Aug 30 22:09:08 fcshome kernel: md: ... autorun DONE.
>>
>> It looks like there is something wrong with sda... Your BIOS is booting
>> grub from sda, grub is loading its conf, etc. from sda, but sda is not
>> part of the raid sets of your running system. Your newer kernels are
>> landing on sdb...
>
> Yeah, that sounds like a possibility.
>
>> I *think* you can fix this by using mdadm to add (mdadm --add ...) sda
>> and make it rebuild sda1 and sda2 from sdb1 and sdb2. You may have to
>> --fail and --remove it first.
>
> I think you may be right. I'll give that a whirl at first opportunity.
>
> After posting this last night I did further digging and found that the
> particular drives I'm using in this RAID array are known to have long
> timeouts, causing RAID controllers (though I don't know whether that
> includes Linux's software RAID) to become confused and fail the mirror/
> drive when the timeout gets too long. There's apparently a WD utility
> (these are WD drives) to change a setting for that (the utility is
> wdtler.exe and the drive property is called TLER) which allegedly solves
> the problem. Other posters have pointed out that newer drives OF THE
> SAME MODEL no longer let you set it. I haven't yet had a chance to find
> out whether my drives allow it to be changed, but since they're somewhat
> over a year old I'm hopeful. Soon, I hope. Looks like I need to find a
> way to make a DOS bootable floppy (then add a floppy drive to the
> machine) so I can boot it up and give it a try.
>
> Poking around with smartctl indicates NO drive errors on either drive,
> so I'm hopeful that the problem is "simply" as described above.
>
> If I can't change the setting I may have to replace the drives.
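[For the archives: the --fail/--remove/--add sequence Robert describes might look like the sketch below. The md0 = sda1+sdb1 and md1 = sda2+sdb2 layout is assumed from the boot log above; check it against /proc/mdstat before doing anything. The script defaults to only printing the commands; set DRY_RUN=0 to actually run them, as root.]

```shell
#!/bin/sh
# Sketch of re-adding the kicked-out sda members to the degraded arrays,
# assuming md0 = sda1+sdb1 and md1 = sda2+sdb2 as in the boot log.
# DRY_RUN defaults to 1, so this only echoes the commands it would run.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"        # dry run: show the command instead of executing it
    else
        "$@"
    fi
}

rebuild_sda() {
    # If the sda partitions still show up as (failed) members, fail and
    # remove them first; mdadm refuses to --add a device that is present.
    run mdadm /dev/md0 --fail   /dev/sda1
    run mdadm /dev/md0 --remove /dev/sda1
    run mdadm /dev/md1 --fail   /dev/sda2
    run mdadm /dev/md1 --remove /dev/sda2
    # Re-add; the kernel then resyncs them from the good sdb members.
    run mdadm /dev/md0 --add /dev/sda1
    run mdadm /dev/md1 --add /dev/sda2
}

rebuild_sda
```

Resync progress can then be watched with `cat /proc/mdstat`.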
> :(
> The entire REASON for buying two drives was so I would have some safety.
> Doggone drive manufacturers!

Hi Fred. Somewhat OT, but maybe of interest to you.

I had to replace some WD drives after 3 years' use. One kept giving out
SMART messages, which I ignored, till the drive went AWOL. The other had
no SMART error messages whatsoever. That went down as well!

So I'm on Hitachi HDDs now. The reason being I had, and still have, a
Hitachi 2.5" drive in one of my laptops. The SMART test report looks
really bad. The drive makes ominous clunking sounds when in use. I have
been expecting it to fail for some time, but it just keeps going!

Hitachi provide a DOS test program for their hard drives:

http://www.hitachigst.com/support/downloads/#DFT

The specs for the Deskstar looked good. Plus they have a 3-year warranty.

[root at karsites ~]# smartctl -a /dev/sda
smartctl 5.39.1 2010-01-28 r3054 [i386-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi Deskstar P7K500 series
Device Model:     Hitachi HDP725050GLAT80
Serial Number:    xxxxxxxxxxxxx
Firmware Version: GM4OA42A
User Capacity:    500,107,862,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Tue Aug 31 15:29:30 2010 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Keith

> ---- Fred Smith -- fredex at fcshome.stoneham.ma.us -----------------------------
>  The eyes of the Lord are everywhere,
>  keeping watch on the wicked and the good.
> ----------------------------- Proverbs 15:3 (niv) -----------------------------
> _______________________________________________
> CentOS mailing list
> CentOS at centos.org
> http://lists.centos.org/mailman/listinfo/centos
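[A small addendum on the smartctl output above: for a quick health check you don't need the full "smartctl -a" report; "smartctl -H /dev/sda" prints a one-line overall-health verdict. The helper below is only a sketch of pulling that verdict out, demonstrated against a captured line of smartctl output rather than live hardware.]

```shell
#!/bin/sh
# Extract the PASSED/FAILED verdict from "smartctl -H"-style output.
# Demonstrated on a captured sample line, not a live drive.
health_verdict() {
    sed -n 's/.*overall-health self-assessment test result: //p'
}

sample="SMART overall-health self-assessment test result: PASSED"
echo "$sample" | health_verdict
```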