[CentOS] Random server reboot after update to CentOS 5.3

Sat May 23 16:55:24 UTC 2009
Ross Walker <rswwalker at gmail.com>

On Fri, May 22, 2009 at 1:17 PM, Peter Hopfgartner
<peter.hopfgartner at r3-gis.com> wrote:
> JohnS wrote:
>>
>> Now why in the world would you want to do that? You're running 5.3 as per
>> your earlier post and your uname shows you running the Xen kernel.
>> Always run the newest kernel *unless* there are very good reasons not to,
>> and I do not see that for your situation. Use the latest 5.3 non-Xen
>> kernel to test it with.
>
> A random kernel reboot on a production machine is a good reason, at
> least from my POV. It ran fine for months with 5.2 and now has problems
> running with 5.3. If it is not able to run Xen, then I have to trash the
> whole thing, since the ASP services hosted on the machine are within Xen
> guests. No Xen - no business. And it DID run fine before the update.
>
>> My suggestion is to unplug everything hooked to it but the power and
>> network cabling. Open it up while it is running, and shake the cables
>> lightly (don't jerk on them). If there is an external disk array, unplug
>> it also. USB floppies and CD drives, unplug them all.
>>
>> Is it under a heavy load? High CPU usage? Sometimes when a power supply
>> is on the verge of dying you don't really know until disk I/O climbs
>> really high, pulling loads of wattage. Pentium 4 and later CPUs are bad
>> about this also.
>
> No heavy load; it crashes even at times when there is almost no load at
> all. The power supplies are redundant and hardware monitoring tells me
> they are both fine, as is the rest of the machine's hardware.
>
>> Run memtest86 for a few hours, not just a minute or two before saying
>> "ah, it's OK". It takes time. Are there gaps in your log files, like
>> white space?
>
> No gaps. The machine simply restarts at a given moment. No shutdown, no
> traces of a kernel panic.
>
>> Hardware raid controller updated to latest firmware release?
>
> Indeed, updating firmware and maybe some drivers from Dell's support
> site will be the next actions.
>
>> OK, I guess
>> others can tack onto my list here as well. I wouldn't get too discouraged,
>> because sometimes it can take days to find the problem.

Something I have been meaning to try is to see if LVM can be leveraged
to perform something like Solaris' live upgrade (of course, without ZFS
it won't be as efficient): pin the system to its current sub-release
version (5.0, 5.1, 5.2, etc.), then clone the root LV, put in a new
grub entry for the new sub-release, boot into that cloned LV, increment
the version in the repo file and yum upgrade it to that version.
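
The pinning part is just a matter of hard-coding the sub-release in the
repo file instead of letting $releasever float. Something along these
lines should do it (shown against the vault archive, since the regular
mirrors only carry the current point release; adjust the baseurl to
wherever you actually pull from):

[base]
name=CentOS-5.2 - Base
baseurl=http://vault.centos.org/5.2/os/$basearch/
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-5

Repeat for the updates/extras stanzas; bumping the release then means
editing "5.2" to "5.3" in the clone and running 'yum upgrade' there.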

I suppose a new initrd will also need to be generated, but a script
could automate the whole thing; call it something like 'sysupgrade'. It
would clone the root LV, mount it, update the repo file, create a new
initrd, then add a grub entry.
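
Roughly what I have in mind, completely untested; the volume group name,
LV names, sizes and grub layout below are all made up, so treat it as a
sketch rather than a recipe:

#!/bin/sh
# 'sysupgrade' sketch: clone the root LV, bump its repo to the next
# sub-release, build an initrd for it and add a grub entry.
set -e

VG=VolGroup00          # volume group holding the root LV
OLD=root52             # LV the system is currently running from
NEW=root53             # clone that will get upgraded
MNT=/mnt/sysupgrade

# "Clone" the root: snapshot the live root (crash-consistent only),
# copy the snapshot into a fresh LV of the same size, drop the snapshot.
lvcreate -s -L 2G -n ${OLD}snap /dev/$VG/$OLD
lvcreate -L 8G -n $NEW $VG
dd if=/dev/$VG/${OLD}snap of=/dev/$VG/$NEW bs=1M
lvremove -f /dev/$VG/${OLD}snap

# Mount the clone, point its fstab at itself and bump the pinned
# sub-release in its repo file.
mkdir -p $MNT
mount /dev/$VG/$NEW $MNT
sed -i "s,/dev/$VG/$OLD,/dev/$VG/$NEW," $MNT/etc/fstab
sed -i 's/5\.2/5.3/g' $MNT/etc/yum.repos.d/CentOS-Base.repo

# Build an initrd against the clone's fstab so it activates the right
# root LV (check mkinitrd's man page for --fstab on your release), and
# append a grub stanza for it.
mkinitrd --fstab=$MNT/etc/fstab /boot/initrd-$(uname -r)-$NEW.img $(uname -r)
cat >> /boot/grub/grub.conf <<EOF
title CentOS 5.3 candidate (root on $NEW)
        root (hd0,0)
        kernel /vmlinuz-$(uname -r) ro root=/dev/$VG/$NEW
        initrd /initrd-$(uname -r)-$NEW.img
EOF

umount $MNT
# Reboot into the new entry, then run 'yum upgrade' from inside it.

The dd of a snapshot is a crude way to "clone" an LV, and both copies end
up with the same filesystem label/UUID, which is fine as long as fstab
and grub reference the LVs by device path.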

This way, if an upgrade doesn't work well for your application, you can
back out for a little while until whatever is broken is fixed, then
switch back to it once it is.
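
Backing out is then just a question of booting the old grub entry again,
or making it the default until the problem is fixed:

# /boot/grub/grub.conf: here 0 is the old root, 1 the cloned/upgraded one
default=0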

Keep the root LV comparatively small, say 8GB, and just keep the prior
version around. You definitely want to keep /home on a separate LV, and
possibly /var, depending on what apps you run.
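
For what it's worth, the sort of layout I am picturing (names and sizes
are just placeholders):

lvcreate -L 8G  -n root52 VolGroup00   # running root, pinned to 5.2
lvcreate -L 10G -n var    VolGroup00
lvcreate -L 50G -n home   VolGroup00
# leave at least another 8G free in the VG for the cloned root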

Of course this doesn't mean one shouldn't fully test each update
before rolling it into production. If your app is mission critical,
buy two systems instead of one, so the second can be used for
redundancy and testing. If management balks at that, just say fine,
then don't complain when the production systems are down due to
inadequately tested software updates.

-Ross