[CentOS] Centos 5.5, not booting latest kernel but older one instead

Tue Aug 31 14:35:35 UTC 2010
Keith Roberts <keith at karsites.net>

On Tue, 31 Aug 2010, fred smith wrote:

> To: CentOS mailing list <centos at centos.org>
> From: fred smith <fredex at fcshome.stoneham.ma.us>
> Subject: Re: [CentOS] Centos 5.5,
>     not booting latest kernel but older one instead
> 
> On Tue, Aug 31, 2010 at 08:18:26AM -0400, Robert Heller wrote:
>> At Mon, 30 Aug 2010 22:24:19 -0400 CentOS mailing list <centos at centos.org> wrote:
>>
>>>
>>> On Mon, Aug 30, 2010 at 08:41:31PM -0500, Larry Vaden wrote:
>>>> On Mon, Aug 30, 2010 at 8:18 PM, fred smith
>>>> <fredex at fcshome.stoneham.ma.us> wrote:
>>>>>
>>>>> Below is some info that shows the problem. Can anyone here provide
>>>>> helpful suggestions on (1) why it is doing this, and more importantly (2)
>>>>> how I can make it stop?
>>>>
>>>> Is there a chance /boot is full (read: are all the menu'd kernels
>>>> actually present in /boot)?
>>>>
>>>> (IIRC something similar happened because the /boot partition was set
>>>> at the recommended size of 100 MB).
>>>
>>> /boot doesn't appear to be full, there appear to be 25.2 megabytes free with 20.1 available.
>>>
>>> another curious thing I just noticed is this: the list of kernels available
>>> at boot time (in the actual grub menu shown at boot) IS NOT THE SAME LIST
>>> THAT APPEARS IN GRUB.CONF. in the boot-time menu, the kernel it boots is
>>> the most recent one shown, and there are other older ones that do not
>>> appear in grub.conf. while in grub.conf there are several newer ones that
>>> do not appear on the boot-time grub menu.
>>>
>>> most strange.
>>>
>>> BTW, this is a raid-1 array using linux software raid, with two matching
>>> drives. Is there possibly some way the two drives could have gotten out
>>> of sync such that whichever one is the actual boot device has invalid
>>> info in /boot?
>>>
>>> and while thinking along those lines, I see a number of mails in root's
>>> mailbox from "md" notifying us of a degraded array. these all appear to have
>>> happened, AFAICT, at system boot, over the last several months.
>>>
>>> also, /var/log/messages contains a bunch of stuff like the below, also
>>> apparently at system boot, and I don't really know what it means, though
>>> the lines mentioning a device being "kicked out" seem ominous:
>>>
>>> Aug 30 22:09:08 fcshome kernel: device-mapper: uevent: version 1.0.3
>>> Aug 30 22:09:08 fcshome kernel: device-mapper: ioctl: 4.11.5-ioctl (2007-12-12) initialised: dm-devel at redhat.com
>>> Aug 30 22:09:08 fcshome kernel: device-mapper: dm-raid45: initialized v0.2594l
>>> Aug 30 22:09:08 fcshome kernel: md: Autodetecting RAID arrays.
>>> Aug 30 22:09:08 fcshome kernel: md: autorun ...
>>> Aug 30 22:09:08 fcshome kernel: md: considering sdb2 ...
>>> Aug 30 22:09:08 fcshome kernel: md:  adding sdb2 ...
>>> Aug 30 22:09:08 fcshome kernel: md: sdb1 has different UUID to sdb2
>>> Aug 30 22:09:08 fcshome kernel: md:  adding sda2 ...
>>> Aug 30 22:09:08 fcshome kernel: md: sda1 has different UUID to sdb2
>>> Aug 30 22:09:08 fcshome kernel: md: created md1
>>> Aug 30 22:09:08 fcshome kernel: md: bind<sda2>
>>> Aug 30 22:09:08 fcshome kernel: md: bind<sdb2>
>>> Aug 30 22:09:08 fcshome kernel: md: running: <sdb2><sda2>
>>> Aug 30 22:09:08 fcshome kernel: md: kicking non-fresh sda2 from array!
>>> Aug 30 22:09:08 fcshome kernel: md: unbind<sda2>
>>> Aug 30 22:09:08 fcshome kernel: md: export_rdev(sda2)
>>> Aug 30 22:09:08 fcshome kernel: raid1: raid set md1 active with 1 out of 2 mirrors
>>> Aug 30 22:09:08 fcshome kernel: md: considering sdb1 ...
>>> Aug 30 22:09:08 fcshome kernel: md:  adding sdb1 ...
>>> Aug 30 22:09:08 fcshome kernel: md:  adding sda1 ...
>>> Aug 30 22:09:08 fcshome kernel: md: created md0
>>> Aug 30 22:09:08 fcshome kernel: md: bind<sda1>
>>> Aug 30 22:09:08 fcshome kernel: md: bind<sdb1>
>>> Aug 30 22:09:08 fcshome kernel: md: running: <sdb1><sda1>
>>> Aug 30 22:09:08 fcshome kernel: md: kicking non-fresh sda1 from array!
>>> Aug 30 22:09:08 fcshome kernel: md: unbind<sda1>
>>> Aug 30 22:09:08 fcshome kernel: md: export_rdev(sda1)
>>> Aug 30 22:09:08 fcshome kernel: raid1: raid set md0 active with 1 out of 2 mirrors
>>> Aug 30 22:09:08 fcshome kernel: md: ... autorun DONE.
>>
>> It looks like there is something wrong with sda... Your BIOS is booting
>> grub from sda, grub is loading its conf, etc. from sda, but sda is not
>> part of your raid sets of your running system.  Your newer kernels are
>> landing on sdb...
>
> yeah, that sounds like a possibility.
>>
>> I *think* you can fix this by using mdadm to add (mdadm --add ...)  sda
>> and make it rebuild sda1 and sda2 from sdb1 and sdb2. You may have to
>> --fail and --remove it first.
>
> I think you may be right. I'll give that a whirl at first opportunity.
>
> After posting this last night I did further digging and found that the
> particular drives I'm using in this raid array are known to have long
> timeouts, causing raid controllers (though I don't know if that includes
> Linux's software RAID or not) to become confused and fail the mirror/
> drive when the timeout gets too long. There's apparently a WD utility
> (these are WD drives) to change a setting for that (the utility is
> wdtler.exe and the drive property is called TLER) which allegedly solves
> the problem. Other posters have pointed out that the newer drives OF THE
> SAME MODEL no longer let you set that. I haven't yet had the chance to
> find out if my drives allow it to be changed or not, but since they're
> somewhat over a year old I'm hopeful. Soon, I hope.  Looks like I need
> to find a way to make a DOS bootable floppy (then add a floppy drive to
> the machine) so I can boot it up and give it a try.
>
> Poking around with smartctl indicates NO drive errors on either drive,
> so I'm hopeful that the problem is "simply" as described above.
>
> If I can't change the setting I may have to replace the drives. :(
> the entire REASON for buying two drives was so I would have some safety.
> doggone drive manufacturers!
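
On the quoted log: the "kicking non-fresh" lines can be paired with the
preceding "created mdN" lines to see which partition dropped out of which
array. Below is a minimal Python sketch of that pairing. The function names
are made up for illustration, and the mdadm commands are only printed, never
run. Whether a --fail and --remove is needed first, as Robert suggests, is
left open here.

```python
import re

# Patterns taken from the kernel messages quoted above.
CREATED = re.compile(r"md: created (md\d+)")
KICKED = re.compile(r"md: kicking non-fresh (\w+) from array!")


def kicked_members(log_lines):
    """Map each md array to the partitions the kernel kicked out of it."""
    current = None  # array named by the most recent "created" line
    kicked = {}
    for line in log_lines:
        created = CREATED.search(line)
        if created:
            current = created.group(1)
            continue
        dropped = KICKED.search(line)
        if dropped and current is not None:
            kicked.setdefault(current, []).append(dropped.group(1))
    return kicked


def readd_commands(kicked):
    """Build (but do not run) one mdadm --add command per kicked partition."""
    return [
        f"mdadm /dev/{array} --add /dev/{part}"
        for array in sorted(kicked)
        for part in kicked[array]
    ]


# A few lines copied from the log in the quoted message.
SAMPLE = [
    "Aug 30 22:09:08 fcshome kernel: md: created md1",
    "Aug 30 22:09:08 fcshome kernel: md: bind<sda2>",
    "Aug 30 22:09:08 fcshome kernel: md: bind<sdb2>",
    "Aug 30 22:09:08 fcshome kernel: md: kicking non-fresh sda2 from array!",
    "Aug 30 22:09:08 fcshome kernel: md: created md0",
    "Aug 30 22:09:08 fcshome kernel: md: kicking non-fresh sda1 from array!",
]

if __name__ == "__main__":
    for command in readd_commands(kicked_members(SAMPLE)):
        print(command)
```

For the sample lines this suggests re-adding sda1 to md0 and sda2 to md1,
which matches Robert's reading. It would be worth checking /proc/mdstat
before and after running anything for real.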

Hi Fred. Somewhat OT but maybe of interest to you.

I had to replace some WD drives after 3 years' use.

One kept giving out SMART messages which I ignored, till the 
drive went AWOL. The other had no SMART error messages whatsoever.

That went down as well!

So I'm on Hitachi HDD now.

Reason being I had, and still have a Hitachi 2.5" drive in 
one of my laptops. The SMART test report looks really bad.

The drive makes ominous clunking sounds when in use. I have 
been expecting it to fail for some time, but it just keeps 
going!

Hitachi provide a DOS test program for their hard drives.

http://www.hitachigst.com/support/downloads/#DFT

The specs for the Deskstar looked good. Plus they have a 3 
year warranty.

[root at karsites ~]# smartctl -a /dev/sda
smartctl 5.39.1 2010-01-28 r3054 [i386-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi Deskstar P7K500 series
Device Model:     Hitachi HDP725050GLAT80
Serial Number:    xxxxxxxxxxxxx
Firmware Version: GM4OA42A
User Capacity:    500,107,862,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Tue Aug 31 15:29:30 2010 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
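
If you want to compare the two mirror drives side by side, the information
section above is easy to pull apart. Here is a rough Python sketch that
assumes the text of "smartctl -a" has already been captured to a string. The
function name is hypothetical, and it only reads the "key: value" lines of
the information section, nothing about drive health.

```python
def parse_info_section(text):
    """Collect "key: value" lines from the smartctl information section.

    Values are kept in lists because some keys, such as
    "SMART support is", appear more than once.
    """
    info = {}
    in_section = False
    for line in text.splitlines():
        if line.startswith("=== START OF INFORMATION SECTION"):
            in_section = True
            continue
        if line.startswith("==="):
            # Any other section header ends the information section.
            in_section = False
            continue
        if in_section and ":" in line:
            key, _, value = line.partition(":")
            info.setdefault(key.strip(), []).append(value.strip())
    return info


# Sample text copied from the output above.
SMARTCTL_SAMPLE = """\
=== START OF INFORMATION SECTION ===
Model Family:     Hitachi Deskstar P7K500 series
Device Model:     Hitachi HDP725050GLAT80
Firmware Version: GM4OA42A
Local Time is:    Tue Aug 31 15:29:30 2010 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
"""

if __name__ == "__main__":
    for key, values in parse_info_section(SMARTCTL_SAMPLE).items():
        print(key, "->", "; ".join(values))
```

Running it on the output of both drives makes a firmware or model mismatch
easy to spot.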

Keith



