Dear ML
We upgraded a Dell PowerEdge 1950 server on the 8th of May. Since then the server has rebooted 3 times without external cause (it is located in a server farm with redundant power supply etc.). Looking at the server's monitoring infrastructure with Dell's own OpenManage tools, I get strange errors:
[root@servernew ~]# omreport system esmlog
(....)
Severity      : Critical
Date and Time : Mon May 11 17:46:59 2009
Description   : System Software event: run-time critical stop was asserted

Severity      : Critical
Date and Time : Fri May 15 21:07:57 2009
Description   : System Software event: run-time critical stop was asserted

Severity      : Critical
Date and Time : Wed May 20 21:00:53 2009
Description   : System Software event: run-time critical stop was asserted
(...)
This class of errors has never occurred in the more than a year that the server has been running.
There is no mention of any anomaly in /var/log/messages, apart from the boot messages themselves.
The server runs the 64-bit flavor of CentOS, hosting some Xen virtual machines and some PostgreSQL and MySQL databases. It ran without any issues with CentOS 5.1 and 5.2.
I interpret these issues as a kernel/software-related problem, but I do not know how to diagnose the problem more accurately.
Can anybody give me a hint? Has anybody had a similar issue?
Regards,
Peter
On Thu, 2009-05-21 at 15:13 +0200, Peter Hopfgartner wrote:
<snip>
Hmm... you *definitely* want to take this one to the Dell Linux list. Having said that, I did some googling for:
omreport run-time critical stop was asserted
and found only one hit for someone that faced it in April 2007. And Dell told them that it may have been software. I'd start there. Some additional questions: What version of CentOS? What kernel version? What version of the Dell tools?
-I
Ian Forde wrote:
Hmm... you *definitely* want to take this one to the Dell Linux list.
Ok, done that: http://lists.us.dell.com/pipermail/linux-poweredge/2009-May/039257.html
Having said that, I did some googling for:
omreport run-time critical stop was asserted
and found only one hit for someone that faced it in April 2007.
Ok, now there are two in google search results!
And Dell told them that it may have been software. I'd start there. Some additional questions: What version of CentOS? What kernel version? What version of the Dell tools?
Indeed, since the only major thing that changed was a yum update going from 5.2 to 5.3, this is what I would guess, too. But this is where my expertise ends.
-I
CentOS mailing list CentOS@centos.org http://lists.centos.org/mailman/listinfo/centos
CentOS: 5.3 + PostgreSQL from CentOS Testing and some home-built GIS packages
uname -a: Linux servernew 2.6.18-128.1.10.el5xen #1 SMP Thu May 7 11:07:18 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
[root@servernew ~]# omreport about
Product name : Server Administrator Install Core (subscription)
Version      : 5.5.0
Copyright    : Copyright (C) Dell Inc. 1995-2008. All rights reserved.
Company      : Dell Inc.
Regards,
Peter
On Thu, May 21, 2009 at 11:13 AM, Peter Hopfgartner peter.hopfgartner@r3-gis.com wrote:
<snip>
Indeed, since the only major thing that changed was a yum update going from 5.2 to 5.3, this is what I would guess, too. But this is where my expertise ends.
It would have been helpful if the error message told you which system software. :-) The upgrade from 5.2 to 5.3 seems to have been problematic for some people on this list. When you did the upgrade, did you follow this sequence, per the CentOS 5.3 Release Notes?
yum clean all && yum update glibc* && yum update
If not, that may or may not have anything to do with the reboots and error messages you have received.
Lanny Marcus wrote:
It would have been helpful if the error message told you which system software. :-) The upgrade from 5.2 to 5.3 seems to have been problematic for some people on this list. When you did the upgrade, did you follow this sequence, per the CentOS 5.3 Release Notes?
yum clean all && yum update glibc* && yum update
If not, that may or may not have anything to do with the reboots and error messages you have received.
No, that would not have anything to do with the spontaneous reboots. The problem with glibc only concerns rpm, as the release notes clearly state.
Ralph
On Thu, 2009-05-21 at 20:07 +0200, Ralph Angenendt wrote:
No, that would not have anything to do with the spontaneous reboots. The problem with glibc only concerns rpm, as the release notes clearly state.
Since nobody else mentioned it to the OP, ...
Let us not forget that often the hardware chooses to act up around the same time that some kind of (software) upgrade is performed. I've wasted a lot of time in the past *assuming* that because the hardware was rock-solid in the past, it must have been some change (I made) to the software.
I suggest running diagnostics, or manually re-seating everything (especially if you had occasion to move the unit or open it recently).
Memory used to age, does it still? Memtest*86 might be in order.
Ralph
<snip sig stuff>
HTH
At Thu, 21 May 2009 14:30:42 -0400 CentOS mailing list centos@centos.org wrote:
Memory used to age, does it still? Memtest*86 might be in order.
It is not so much age; rather, contacts corrode and many components change value as they are heated and cooled (and expand or shrink). And yes, heating and cooling cause small dimensional changes in connectors, which can work them loose over time.
It is possible that software is *partly* to blame, if only because the software changes might cause little-used bits (literally bits!) of hardware (e.g. memory) to be used more than they were. You would never know about bad memory if it is not actually used. I used to have a K6500 that was 'fussy' about a specific 128 MB stick of RAM. The machine was perfectly fine until I did backups, which involved writing a very large tar file to a removable disk. The disk I/O would fail. There was nothing wrong with the disk(s) or the software. It was the extra 128 MB DIMM. And there was nothing really wrong with the DIMM either. It was just a wee bit too slow for the K6500 (e.g. it was PC99.99999 instead of PC100, or something like that). Only certain sorts of memory accesses would fail.
William L. Maltby wrote:
Since nobody else mentioned it to the OP, ...
Let us not forget that often the hardware chooses to act up around the same time that some kind of (software) upgrade is performed. I've wasted a lot of time in the past *assuming* that because the hardware was rock-solid in the past, it must have been some change (I made) to the software.
I suggest running diagnostics, or manually re-seating everything (especially if you had occasion to move the unit or open it recently).
Memory used to age, does it still? Memtest*86 might be in order.
The machine has ECC RAM and extensive hardware logging built in. It should be quite uncommon for the built-in server logging not to notice a memory failure. The server diagnostics from Dell do not show any hardware failure at all, only a software problem. In my experience, memory problems in most cases lead to application instability, but I have never seen a reboot like this, one that leaves absolutely no trace in the Linux logs. Take the reboot that happened tonight:
The Dell utility says:
Severity      : Critical
Date and Time : Thu May 21 21:16:16 2009
Description   : System Software event: run-time critical stop was asserted
From /var/log/messages:
May 21 10:58:38 servernew auditd[18962]: Init complete, auditd 1.7.7 listening for events (startup state enable)
May 21 21:18:58 servernew syslogd 1.4.1: restart.
May 21 21:18:58 servernew kernel: klogd 1.4.1, log source = /proc/kmsg started.
May 21 21:18:58 servernew kernel: Bootdata ok (command line is ro root=LABEL=/)
May 21 21:18:58 servernew kernel: Linux version 2.6.18-128.1.10.el5xen (mockbuild@builder10.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Thu May 7 11:07:18 EDT 2009
May 21 21:18:58 servernew kernel: BIOS-provided physical RAM map:
To sum up: nothing happens for hours and then, zack!, boom!, the server restarts.
Regards,
Peter
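The check described above (looking at what was logged immediately before each restart) can be sketched as a small shell function. This is only an illustration, assuming a syslog file in the format shown in the excerpt; the function name find_restarts is made up for this example and is not part of any tool mentioned in the thread.

```shell
# Print every "syslogd ... restart." line together with the line logged
# right before it, so one can see whether anything (a shutdown message,
# a panic trace) preceded the reboot.
# find_restarts is a hypothetical helper name, not an existing tool.
find_restarts() {
    # $1: path to a syslog file such as /var/log/messages
    awk '
        /syslogd [0-9.]+: restart\./ {
            if (prev != "") print "before:  " prev
            print "restart: " $0
        }
        { prev = $0 }
    ' "$1"
}
```

Run as `find_restarts /var/log/messages`. If the "before:" line is hours older than the restart and unrelated to a shutdown, that matches the behaviour described here: the machine went down without logging anything.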
Ralph Angenendt wrote:
No, that would not have anything to do with the spontaneous reboots. The problem with glibc only concerns rpm, as the release notes clearly state.
Ralph
Would it make sense to install the kernel from CentOS 5.2? Any contraindications?
Regards,
Peter
On Fri, 2009-05-22 at 09:44 +0200, Peter Hopfgartner wrote: <snip>
Would it make sense to install the kernel from CentOS 5.2? Any contraindications?
Regards,
Peter
--- Now why in the world would you want to do that??? You are running 5.3 as per your earlier post, and your uname shows you running the Xen kernel. Always run the newest kernel *unless* there are very good reasons not to, and I do not see that for your situation. Use the latest 5.3 non-Xen kernel to test it with.
My suggestion is to unplug everything hooked up to it except the power and network cabling. Open it up while it is running and shake the cables lightly (don't jerk on them). If there is an external disk array, unplug it as well. USB floppies and CD drives: unplug 'em all.
Is it under a heavy load? High CPU usage? Sometimes when a power supply is on the verge of dying, you don't really know until disk I/O climbs really high and pulls loads of wattage. Pentium 4 and later CPUs are bad about this also.
Run memtest86 for a few hours, not just a minute or two before saying it's OK. It takes time. Are there gaps in your log files, like white space? Is the hardware RAID controller updated to the latest firmware release? OK, I guess others can tack onto my list here as well. I wouldn't get too discouraged, because sometimes it can take days to find the problem.
JohnS wrote:
My suggestion is to unplug everything hooked up to it except the power and network cabling. Open it up while it is running and shake the cables lightly (don't jerk on them). If there is an external disk array, unplug it as well. USB floppies and CD drives: unplug 'em all.
Is it under a heavy load? High CPU usage? Sometimes when a power supply is on the verge of dying, you don't really know until disk I/O climbs really high and pulls loads of wattage.
That's my guess. I'd swap out the power supply.
My personal experience with RAM issues is either a kernel panic or filesystem funniness (sometimes resulting in filesystems being remounted read-only). My experience with disk I/O issues is that forcing fsck reveals filesystem errors with high frequency.
Machines rebooting is, in my experience, almost always a failing power supply (or a faulty power source; check your UPS, when they start to go bad they can cause issues).
If it were a kernel issue, I suspect more people would be experiencing it (unless it is caused by a third-party kmod).
Actually, I've also been experiencing this issue on two identical custom-built systems running 5.3 x64 with Xen. I experienced the issue under the same kernel that Peter is running and under the first kernel released with 5.3. In my particular instance, I'm attributing these random crashes to hardware problems, since I'm only experiencing the issues on these two systems and not on an older Dell PowerEdge 850 which is set up with the same software configuration.
Matt
-- Mathew S. McCarrell Clarkson University '10
mccarrms@gmail.com mccarrms@clarkson.edu
The guys on the Dell PowerEdge ML seem to favor the idea that it is a problem with the network adapter driver. On my machine, lspci gives me:
09:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
Does this match your adapter?
Regards,
Peter
Nope, all of my adapters are Intel.
0f:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)
0f:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)
20:04.0 Ethernet controller: Intel Corporation 82541PI Gigabit Ethernet Controller (rev 05)
Matt
-- Mathew S. McCarrell Clarkson University '10
mccarrms@gmail.com mccarrms@clarkson.edu
--
Dott. Peter Hopfgartner
R3 GIS Srl - GmbH Via Johann Kravogl-Str. 2 I-39012 Meran/Merano (BZ) Email: peter.hopfgartner@r3-gis.com Tel. : +39 0473 494949 Fax : +39 0473 069902 www : http://www.r3-gis.com
JohnS wrote:
On Fri, 2009-05-22 at 09:44 +0200, Peter Hopfgartner wrote:
<snip>
Would it make sense to install the kernel from CentOS 5.2? Any contraindications?
Regards,
Peter
Now why in the world would you want to do that??? You are running 5.3 as per your earlier post, and your uname shows you running the Xen kernel. Always run the newest kernel *unless* there are very good reasons not to, and I do not see that for your situation. Use the latest 5.3 non-Xen kernel to test it with.
A random kernel reboot on a production machine is a good reason, at least from my POV. It ran fine for months with 5.2 and now has problems running with 5.3. If it is not able to run Xen, then I have to trash the whole thing, since the ASP services hosted on the machine are within Xen guests. No Xen - no business. And it DID run fine before the update.
My suggestion is to unplug everything hooked up to it except the power and network cabling. Open it up while it is running and shake the cables lightly (don't jerk on them). If there is an external disk array, unplug it as well. USB floppies and CD drives: unplug 'em all.
Is it under a heavy load? High CPU usage? Sometimes when a power supply is on the verge of dying, you don't really know until disk I/O climbs really high and pulls loads of wattage. Pentium 4 and later CPUs are bad about this also.
No heavy load; it crashes even at times when there is almost no load at all. The power supplies are redundant and hardware monitoring tells me they are both fine, as is the rest of the hardware of the machine.
Run memtest86 for a few hours, not just a minute or two before saying it's OK. It takes time. Are there gaps in your log files, like white space?
No gaps. The machine simply restarts at a given moment. No shutdown, no traces of a kernel panic.
Hardware raid controller updated to latest firmware release?
Indeed, updating firmware and maybe some drivers from Dell's support site will be the next actions.
OK, I guess others can tack onto my list here as well. I wouldn't get too discouraged, because sometimes it can take days to find the problem.
Thank you for the information,
Peter
on 5-22-2009 10:17 AM Peter Hopfgartner spake the following:
A random kernel reboot on a production machine is a good reason, at least from my POV. It ran fine for months with 5.2 and now has problems running with 5.3. If it is not able to run Xen, then I have to trash the whole thing, since the ASP services hosted on the machine are within Xen guests. No Xen - no business. And it DID run fine before the update.
If the system was upgraded from 5.2, the 5.2 kernel should still be there. I would just run with the 5.2 Xen kernel long enough to see if the problem goes away. If it does, then there is something in the new kernel that just doesn't agree with your server and its hardware configuration. If the problem persists, then you have a good case for trying a power supply, but you could also have another card going south and generating NMIs. Sometimes it just isn't easy.
On Fri, May 22, 2009 at 1:17 PM, Peter Hopfgartner peter.hopfgartner@r3-gis.com wrote:
<snip>
Something I have been meaning to try is to see whether LVM can be leveraged to perform something like Solaris' Live Upgrade (of course, without ZFS it won't be as efficient). You pin each release to its respective sub-release version (5.0, 5.1, 5.2, etc.), then clone the root LV, add a grub entry for the new sub-release, boot into the cloned LV, increment the version in the repo file and yum upgrade it to that version.
I suppose a new initrd will also need to be generated, but a script could automate it, maybe called something like 'sysupgrade': it would clone the root LV, mount it, update the repo file, create a new initrd, then add a grub entry.
This way, if an upgrade doesn't work well for your application, you can back out for a little while until whatever is broken is fixed, then switch back to it.
Keep the root LV comparatively small, say 8 GB, and keep just the prior version. You definitely want to keep /home on a separate LV, and possibly /var, depending on what apps you run.
Of course this doesn't mean one shouldn't fully test each update before rolling it into production. If your app is mission critical, buy two systems instead of one, so the second can be used for redundancy and testing. If management balks at that, just say fine, then don't complain when the production systems are down due to inadequately tested software updates.
-Ross
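The 'sysupgrade' idea above is untested by its author, so what follows is only a dry-run sketch of how the steps might be laid out: it prints a plan and executes nothing. The function name, the volume names passed to it, the mount point and the 8G size are all hypothetical placeholders. Copying a mounted root filesystem with dd is not guaranteed to be consistent, and the clone carries the same filesystem label as the original, so a setup that boots with root=LABEL=/ would need extra care.

```shell
# Dry run only: prints the planned steps for cloning a root LV, runs nothing.
# plan_sysupgrade and every name passed to it are hypothetical placeholders.
plan_sysupgrade() {
    # $1: volume group, $2: current root LV, $3: name for the clone,
    # $4: release the clone should be upgraded to
    su_vg=$1; su_src=$2; su_new=$3; su_rel=$4
    su_mnt="/mnt/${su_new}"
    echo "lvcreate -L 8G -n ${su_new} ${su_vg}"
    echo "dd if=/dev/${su_vg}/${su_src} of=/dev/${su_vg}/${su_new} bs=4M"
    echo "mkdir -p ${su_mnt}"
    echo "mount /dev/${su_vg}/${su_new} ${su_mnt}"
    # The remaining steps depend on the local setup, so they are only
    # listed as reminders rather than spelled out as commands.
    echo "# by hand: point ${su_mnt}/etc/yum.repos.d/*.repo at release ${su_rel}"
    echo "# by hand: make ${su_mnt}/etc/fstab mount /dev/${su_vg}/${su_new} as /"
    echo "# by hand: build an initrd for the clone and add a grub.conf entry"
    echo "# after booting the clone: yum upgrade"
}
```

For example, `plan_sysupgrade VolGroup00 LogVol00 root53 5.3` prints the plan for a clone named root53; the volume names here are examples only. Printing instead of executing keeps the sketch safe to try on a production box.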
On Friday 22 May 2009, Peter Hopfgartner wrote: ...
Would it make sense to install the kernel from CentOS 5.2? Any contraindications?
As others have said, you should still have the 5.2 kernel around. Just change the grub.conf and reboot. It makes no sense to start swapping around hardware until you've tried to revert the kernel.
That said, we've seen hangs and strange kernel messages on several different server platforms (HP DL140g3: NMI-related messages logged, HP DL160g5: hangs semi-randomly) with the new 5.3 kernels. All of these problems could be worked around by booting with the kernel option "nmi_watchdog=0".
/Peter
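For anyone trying the "nmi_watchdog=0" workaround: after editing grub.conf and rebooting, it is worth confirming that the option actually reached the running kernel. A minimal sketch, assuming the kernel command line is readable from /proc/cmdline; the helper name has_kernel_opt is invented for this example, and where exactly the option belongs in a Xen boot entry is not covered in this thread.

```shell
# Succeeds (exit status 0) if the given option appears as a whole word
# on the kernel command line. has_kernel_opt is a hypothetical helper name.
has_kernel_opt() {
    # $1: option to look for, e.g. nmi_watchdog=0
    # $2: file holding the command line (defaults to /proc/cmdline)
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}
```

Usage: `has_kernel_opt nmi_watchdog=0 && echo "watchdog disabled on the command line"`.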
Epilogue:
I've tried to disable TSO (ethtool -K eth0 tso off), as was suggested on the poweredge list. This did not help.
I've configured the machine to start with the 5.2 kernel in /boot/grub/grub.conf, changing the default. It has been running for 6 1/2 days, now. I would say that this helped and is what I would suggest to others experiencing the same problem, right now.
Thus, current running kernel is 2.6.18-92.1.10.el5xen.
Regards and thanks for all replies,
Peter
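Changing the default in /boot/grub/grub.conf means setting default= to the index of the wanted entry, counting the title lines from 0. The following sketch lists the entries with their indexes and marks the current default; it assumes a GRUB legacy grub.conf as used on CentOS 5, and the function name list_grub_entries is made up for this example.

```shell
# List the boot entries of a GRUB legacy config with their index and
# mark the one selected by "default=". Entries are numbered from 0 in
# the order their "title" lines appear.
# list_grub_entries is a hypothetical helper name.
list_grub_entries() {
    # $1: path to grub.conf (defaults to /boot/grub/grub.conf)
    awk '
        BEGIN { def = 0 }
        /^[ \t]*default[ \t=]/ { sub(/^[ \t]*default[ \t=]+/, ""); def = $0 }
        /^[ \t]*title[ \t]/    { sub(/^[ \t]*title[ \t]+/, ""); titles[n++] = $0 }
        END {
            for (i = 0; i < n; i++)
                printf "%s %d: %s\n", (i == def + 0 ? "*" : " "), i, titles[i]
        }
    ' "${1:-/boot/grub/grub.conf}"
}
```

The index printed next to the 2.6.18-92.1.10.el5xen entry is the value to put after default= in order to boot the 5.2 kernel.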
on 6-3-2009 2:27 AM Peter Hopfgartner spake the following:
Epilogue:
I've tried to disable TSO (ethtool -K eth0 tso off), as was suggested on the poweredge list. This did not help.
I've configured the machine to start with the 5.2 kernel in /boot/grub/grub.conf, changing the default. It has been running for 6 1/2 days, now. I would say that this helped and is what I would suggest to others experiencing the same problem, right now.
Thus, current running kernel is 2.6.18-92.1.10.el5xen.
Regards and thanks for all replies,
Peter
That sure points to a machine/kernel conflict. You could try getting the source and rebuilding to see if that solves it, or maybe do a diff of the two kernel configs to see if something is different there. Maybe something is added or turned on in the new kernel that your system doesn't like.
Also, make sure your system's BIOSes are up to date. Not just the motherboard, but any other cards that have firmware that might have an update, like RAID card/SAS controllers or ???
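The config comparison could look roughly like this. It is only a sketch: config_diff is a hypothetical helper, and the sample files with their CONFIG_EXAMPLE_* options are invented so that it runs anywhere. On the real machine the inputs would be the two files under /boot/config-<version>, assuming the kernel packages installed their configs there as CentOS kernels normally do.

```shell
#!/bin/sh
# Hypothetical helper: show only the option lines that differ between
# two kernel config files. Lines are sorted first so that a mere
# reordering of options does not show up as a difference.
config_diff() {
    old_sorted=$(mktemp)
    new_sorted=$(mktemp)
    grep -E '^(# )?CONFIG_' "$1" | sort > "$old_sorted"
    grep -E '^(# )?CONFIG_' "$2" | sort > "$new_sorted"
    diff "$old_sorted" "$new_sorted"
    status=$?
    rm -f "$old_sorted" "$new_sorted"
    return "$status"
}

# Made-up sample configs; the option names are not real kernel options.
old_cfg=$(mktemp)
new_cfg=$(mktemp)
printf '%s\n' 'CONFIG_EXAMPLE_A=y' 'CONFIG_EXAMPLE_B=m' \
    '# CONFIG_EXAMPLE_C is not set' > "$old_cfg"
printf '%s\n' 'CONFIG_EXAMPLE_A=y' 'CONFIG_EXAMPLE_B=y' \
    'CONFIG_EXAMPLE_C=y' > "$new_cfg"

out=$(config_diff "$old_cfg" "$new_cfg")
printf '%s\n' "$out"
```

Lines prefixed with < come from the old config and lines prefixed with > from the new one. Both files are only read, so this can be run on a production box without a reboot.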
Scott Silva wrote:
That sure points to a machine/kernel conflict. You could try getting the source and rebuilding to see if that solves it, or maybe do a diff of the two kernel configs to see if something is different there. Maybe something is added or turned on in the new kernel that your system doesn't like.
Also, make sure your system's BIOSes are up to date. Not just the motherboard, but any other cards that have firmware that might have an update, like RAID card/SAS controllers or ???
CentOS mailing list CentOS@centos.org http://lists.centos.org/mailman/listinfo/centos
Dear Scott,
unfortunately the machine is in production. Any downtime is a real problem, since our customers see it directly. I would really like to make an active effort to isolate the problem, but my boss would cut my head off if I had to stop the machine. The firmware is not current, but according to Dell's web site I should stop almost every running service on the machine before upgrading the firmware, and in that case I would again have to watch out for my head. I do care about providing accurate bug reports to the OS projects that I use (I would guess that 90% of my reports lead to a quick fix), but in this case I have to make an exception and keep the machine running.
Thanks,
Peter
On Jun 8, 2009, at 9:18 AM, Peter Hopfgartner <peter.hopfgartner@r3-gis.com> wrote:
Dear Scott,
unfortunately the machine is in production. Any downtime is a real problem, since our customers see it directly. I would really like to make an active effort to isolate the problem, but my boss would cut my head off if I had to stop the machine. The firmware is not current, but according to Dell's web site I should stop almost every running service on the machine before upgrading the firmware, and in that case I would again have to watch out for my head. I do care about providing accurate bug reports to the OS projects that I use (I would guess that 90% of my reports lead to a quick fix), but in this case I have to make an exception and keep the machine running.
Do what works for now and think about a test box or VM setup for the future where you can test newer kernels before they go into production.
-Ross