On Fri, May 22, 2009 at 1:17 PM, Peter Hopfgartner peter.hopfgartner@r3-gis.com wrote:
JohnS wrote:
Now why in the world would you want to do that??? You're running 5.3 as per your earlier post, and your uname shows you running the Xen kernel. Always run the newest kernel *unless* there are very good reasons not to, and I don't see one for your situation. Use the latest 5.3 non-Xen kernel to test it with.
A random kernel reboot on a production machine is a good reason, at least from my POV. It ran fine for months with 5.2 and now has problems running with 5.3. If it is not able to run Xen, then I have to trash the whole thing, since the ASP services hosted on the machine are within Xen guests. No Xen - no business. And it DID run fine before the update.
My suggestion is to unplug everything hooked to it but the power and network cabling. Open it up while it is running and shake the cables lightly (don't jerk on them). External disk array? Unplug it also. USB floppies and CD drives, unplug them all.
Is it under a heavy load? High CPU usage? Sometimes when a power supply is on the verge of dying you don't really know it until disk I/O climbs real high, pulling loads of wattage. Pentium 4 and up CPUs are bad about this too.
No heavy load; it crashes even at times when there is almost no load at all. The power supplies are redundant, and hardware monitoring tells me they are both fine, as is the rest of the hardware of the machine.
Run memtest86 for a few hours, not just a minute or two before saying "ahh, it's OK". It takes time. Are there gaps in your log files, like white space?
No gaps. The machine simply restarts at a given moment. No shutdown, no traces of a kernel panic.
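One quick way to confirm that from wtmp (a rough example only; the exact output varies with the version of last):

  # list reboot and shutdown records from wtmp; a reboot line with no
  # shutdown entry just before it means the box went down hard
  last -x reboot shutdown | head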
Hardware RAID controller updated to the latest firmware release?
Indeed, updating the firmware and maybe some drivers from Dell's support site will be the next actions.
OK, I guess others can tack onto my list here as well. I wouldn't get too discouraged, because sometimes it can take days to find the problem.
Something I have been meaning to try is to see whether LVM can be leveraged to perform something like Solaris' Live Upgrade (of course, without ZFS it won't be as efficient): pin each release to its respective sub-release version (5.0, 5.1, 5.2, etc.), then clone the root LV, put in a new grub entry for the new sub-release, boot into that cloned LV, increment the version in the repo file, and yum upgrade it to that version.
I suppose a new initrd will also need to be generated, but a script could automate the whole thing; call it something like 'sysupgrade'. It would clone the root LV, mount it, update the repo file, create a new initrd, then add a grub entry. Something like the rough sketch below.
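Completely untested; the VG/LV names, mount points, and kernel version are made up for illustration, and the clone's fstab (plus any LABEL= entries, since the clone carries the same filesystem label) would need fixing too:

  # clone the current root LV (better done from a snapshot or rescue
  # media, since the source is mounted read-write here)
  lvcreate -L 8G -n root_53 VolGroup00
  dd if=/dev/VolGroup00/root_52 of=/dev/VolGroup00/root_53 bs=4M

  # mount the clone and prepare a chroot
  mkdir -p /mnt/next
  mount /dev/VolGroup00/root_53 /mnt/next
  mount --bind /dev /mnt/next/dev
  mount -t proc proc /mnt/next/proc

  # edit /mnt/next/etc/fstab to point at the clone, then pin the
  # clone's repo file to the next sub-release and upgrade; installing
  # the new kernel package should build its initrd as a side effect
  vi /mnt/next/etc/fstab
  vi /mnt/next/etc/yum.repos.d/CentOS-Base.repo
  chroot /mnt/next yum -y upgrade
  # if not, rebuild it by hand:
  #   chroot /mnt/next mkinitrd /boot/initrd-<version>.img <version>

  # finally, add a grub stanza booting the cloned LV, e.g.:
  #   title CentOS 5.3 (cloned root)
  #       root (hd0,0)
  #       kernel /vmlinuz-2.6.18-128.el5 ro root=/dev/VolGroup00/root_53
  #       initrd /initrd-2.6.18-128.el5.img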
This way, if an upgrade doesn't work well for your application, you can back out for a little while until whatever is broken is fixed, then switch back to it.
Keep the root LV comparatively small, say 8 GB, and just keep the prior version around. You definitely want to keep /home on a separate LV, and possibly /var, depending on what apps you run.
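For example (sizes and names are only placeholders):

  lvcreate -L 8G   -n root VolGroup00   # small root, cheap to clone
  lvcreate -L 100G -n home VolGroup00   # user data survives upgrades untouched
  lvcreate -L 20G  -n var  VolGroup00   # optional, depending on the apps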
Of course this doesn't mean one shouldn't fully test each update before rolling it into production. If your app is mission critical, buy two systems instead of one, so the second can be used for redundancy and testing. If management balks at that, just say fine, then don't complain when the production systems are down due to inadequately tested software updates.
-Ross