Hi all,
we are running a CentOS 5.2 Xen virtualization system with the latest CentOS packages and a couple of VMs on Dell PowerEdge servers. "Sometimes" the whole machine freezes with nothing in the log files and nothing on the console. "Sometimes" really means we cannot determine why or when: sometimes the machine was idle with just one VM, sometimes quite busy with a couple of VMs.
Has anybody had the same experience? If yes, any hints on how to resolve it or how to trace the cause?
Thanks.
On Fri, Apr 3, 2009 at 4:18 PM, Maros TIMKO timko@pobox.sk wrote:
The complete freezing of a machine like that sounds like a hardware issue to me, most likely the memory. Does the machine unfreeze after a while, or do you have to power cycle the server when it happens? I would suggest running a memtest.
Regards, Tim
Yes,
memory and disk checks were also the first thing we did. But it happened on different machines (1950s and 2950s), with different BIOS versions and numbers of NICs. The freeze is unrecoverable: the machine replies to pings but writes nothing to the console. You cannot SSH to it; the only thing we could do is power down the machine. After that everything is fine.
Thanks.
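The symptom described above (host still answers pings but SSH is dead) can be told apart from a plain outage by a remote probe. The following is only a minimal sketch, not something taken from this thread: the function names, the timeout value and the idea of reading the SSH banner are assumptions. A bare TCP connect is answered by the kernel, so it may still succeed on a hung host; waiting for the identification line requires a working sshd.

```python
import socket


def read_ssh_banner(host, port=22, timeout=5.0):
    """Return the SSH identification line sent by the server, or None.

    A TCP connect alone is completed by the kernel, so it can succeed on a
    host whose userland is hung; the banner only arrives if sshd is alive.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            data = sock.recv(256)
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return None
    line = data.split(b"\n", 1)[0].strip()
    return line.decode("ascii", "replace") if line.startswith(b"SSH-") else None


def classify(ping_ok, banner):
    """Turn the two probe results into a coarse host state (hypothetical labels)."""
    if banner is not None:
        return "up"
    return "hung (answers ping, no sshd banner)" if ping_ok else "down"
```

The ping result is passed in as a boolean so the sketch does not depend on any particular ping tool; a monitoring host would call `classify()` periodically and alert on the "hung" state.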
2009/4/3 Tim Verhoeven tim.verhoeven.be@gmail.com
On Sat, Apr 4, 2009 at 3:43 AM, Maros Timko timkom@gmail.com wrote:
I've had Xen running for around 2 1/2 years on a couple of vintages of 2950s under both Ubuntu and CentOS and it's been very reliable in all combinations. The only issue I can remember having is when the dom0 kernel doesn't have enough memory available to it, it can seem a bit catatonic.
I've had to deal with issues like these in the past and I can say they always suck. Normally, the whole OS freezes due to a hardware issue. Isolating the cause is extremely time consuming. If it happens on a regular basis (i.e. every 60 or maybe 90 days), the most likely culprit is the DRAC card. There is a known issue where a virtual USB floppy or CD device spontaneously disappears from the OS, causing an OS freeze. I believe there is a kernel parameter to pass and a firmware upgrade to apply.
When addressing any hardware issue, the default response from the vendor will always be "have you upgraded the BIOS and firmware on all the cards"? In general, that will be your first step. The next canned reply will be "do you have any third-party cards or equipment?" (External USB drives, third-party memory, unsupported cards, etc.). If so, you will be told to remove them or you're on your own.
Check the controller card logs (BIOS, DRAC, RAID controller, etc.) and run the Dell diagnostic tools on the server. Make sure you run a full check on the memory. (You might also try swapping memory DIMM positions to see if the behavior changes.) The dmesg log is your friend.
Investigate setting up net-dump to create a crash dump file on a remote system.
A remote monitoring system that collects system logs and SNMP traps and performs active monitoring can be useful in identifying any events that lead up to the system freeze (e.g. memory slowly leaking away, processor spiking, etc.). If you have a DRAC or BMC card, configure it with an IP address and have it send SNMP traps to a monitoring system.
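As an illustration of watching for "memory slowly leaking away", here is a minimal sketch that parses `/proc/meminfo`-style text and flags a steadily falling MemFree series. The function names and the 1024 kB threshold are assumptions chosen for the example, not values from this thread, and a real monitor would likely look at more than one counter.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def is_leaking(samples, min_drop_kb=1024):
    """Return True if free memory fell at every step and by min_drop_kb overall."""
    if len(samples) < 2:
        return False
    falling = all(b < a for a, b in zip(samples, samples[1:]))
    return falling and (samples[0] - samples[-1]) >= min_drop_kb
```

On the Dom0 one would read `/proc/meminfo` at a fixed interval, append `parse_meminfo(...)["MemFree"]` to a list, and ship the result of `is_leaking()` to the remote monitoring system so that the last samples survive a freeze.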
Pay attention to any physical changes that coincide with the freeze (e.g. fans running full bore, which normally means some instruction ran into a loop).
Just a note, you really want your Xen system to be running bare bones. Do not install any unnecessary packages. It just complicates your troubleshooting in this instance.
Configuring the server to send syslog messages to tty12 (or to a serial console monitored from another system) can sometimes be helpful to see what the last write was supposed to be (if the disk is dying before a write). Add the following to syslog.conf and leave your console on tty12 (since you won't be able to change it after a freeze).
# Log everything to tty12
*.* /dev/tty12
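With everything going to tty12, a periodic marker in syslog makes it easier to narrow down when the machine stopped. This is only a sketch under assumptions: the heartbeat idea, the message format and the 10-second interval are not from this thread. The emitter is passed in, so the same loop can write to syslog or to anything else.

```python
import time


def heartbeat_line(seq, now=None):
    """Format one heartbeat message with a sequence number and UTC timestamp."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(now))
    return "heartbeat seq=%d time=%s" % (seq, stamp)


def run_heartbeat(emit, count, interval=10.0, sleep=time.sleep):
    """Call emit() with a heartbeat line `count` times, `interval` seconds apart."""
    for seq in range(count):
        emit(heartbeat_line(seq))
        if seq + 1 < count:
            sleep(interval)
```

To send the markers through the local syslog daemon, pass the standard library's `syslog.syslog` as the emitter, for example `run_heartbeat(syslog.syslog, count=6)` after `import syslog`. The last heartbeat visible on tty12 then bounds the time of the freeze.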
I thought I read that the PAE kernel is superfluous (since 5.x), but maybe that is with CentOS 5.3.
Maros TIMKO wrote:
Hi all,
thanks to all for the valuable replies. It seems we have identified the issue. We confirmed that it is not hardware related, as it was reproduced on different machines and platforms with different BIOS versions. We are running a system performance/statistics collector that regularly executes the "xentop" command on Dom0, and this is what causes the problem. If we execute
xentop -b -d 0.1 > /dev/null
in multiple instances, it will freeze the system. It was reproduced on a CentOS 5.3 (kernel-xen-2.6.18-128.1.6.el5) system. A bug has been filed for this issue: http://bugs.centos.org/view.php?id=3454
With regards,
Tino
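For anyone wanting to check whether their own Dom0 is affected, the reproduction described above can be scripted. The following is a hedged sketch: the function names are made up for illustration and the number of instances is an assumption, since the thread only says "multiple instances". Be aware that on an affected system this is expected to freeze the host, so it should only be run on a test machine.

```python
import subprocess


def xentop_command(delay="0.1"):
    """Build the batch-mode xentop command quoted in the thread."""
    return ["xentop", "-b", "-d", str(delay)]


def start_instances(count, delay="0.1", popen=subprocess.Popen):
    """Start `count` concurrent xentop processes with their output discarded."""
    procs = []
    for _ in range(count):
        procs.append(popen(xentop_command(delay), stdout=subprocess.DEVNULL))
    return procs


def stop_instances(procs):
    """Terminate every process started by start_instances() and reap it."""
    for proc in procs:
        proc.terminate()
    for proc in procs:
        proc.wait()
```

A test run would be something like `procs = start_instances(4)`, followed by `stop_instances(procs)` if the host survives. The `popen` parameter exists only so the launcher can be exercised without actually starting xentop.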
2009/4/3 Maros TIMKO timko@pobox.sk
Hey,
I'm wondering if it is possible that your problem is related to mine. Earlier today I had to restart one of our domUs on one of our systems. I used xm shutdown instead of xm destroy and then did xm list to determine whether the domU had shut down or not. Upon issuing xm list a second time, the entire server crashed and rebooted.
I've checked the logs and have yet to find anything. I've attached a transcript of the commands as I executed them on the server. The system is running CentOS 5.3 x64 w/Xen (kernel 2.6.18-128.1.6.el5xen).
Any thoughts?
Thanks, Matt
-- Mathew S. McCarrell Clarkson University '10
mccarrms@gmail.com mccarrms@clarkson.edu
2009/4/7 Maros Timko timkom@gmail.com
Hi Mathew,
I would say no. Our system froze completely; it did not reboot. Our issue was caused by concurrent access to a scheduler method that created a deadlock. I can see some out of memory messages; do you still have enough memory for Dom0?
2009/4/29 Mathew S. McCarrell mccarrms@gmail.com
Yeah, the Dom0 should have plenty of memory left since only 2-3 GB of memory is being used out of 12 GB installed. The out of memory messages were from the domU that I xm consoled into prior to shutting down that particular VM because it was out of memory.
Matt
-- Mathew S. McCarrell Clarkson University '10
mccarrms@gmail.com mccarrms@clarkson.edu
On Wed, Apr 29, 2009 at 3:09 PM, Maros Timko timkom@gmail.com wrote:
Well, I'm actually not using a PAExen kernel but I don't believe that I need to be since I'm running the 64-bit version of CentOS. Am I mistaken in that assumption?
Thanks, Matt
-- Mathew S. McCarrell Clarkson University '10
mccarrms@gmail.com mccarrms@clarkson.edu
On Wed, Apr 29, 2009 at 5:06 PM, Ljubomir Ljubojevic <office@plcomputers.net> wrote:
Mathew S. McCarrell wrote:
Matthew, you are right.
Also, the idea of running a PAE kernel on CentOS is not relevant
Karanbir, can you please, in short, explain to me the current status of 64-bit CentOS compared to i386? Is its maturity the same as that of i386?
I started to actively use CentOS when 4.2 was the latest version. My decision to use i386 only was based on issues with some (or many?) drivers like madwifi for the AR5007 and its unavailability for older PCs; my impression at that time was that it was not stable enough, and the main thing was that, since I had decided to create my own mirror of the main and third-party repositories for internal use, I went with i386.
What is the actual gain in using x86_64? Performance in %? Main advantages besides performance? The real question is, does it pay off to spend 20-30 GB of HDD space for x86_64 if i386 does the job nicely? Just a sentence or two would be most appreciated.
Karanbir Singh wrote:
I've discovered what the issue is.
The machine is rebooting when a sector error occurs on one of the drives that is part of a software RAID where the VMs are currently being stored.
Thanks for the help though.
Matt
-- Mathew S. McCarrell Clarkson University '10
mccarrms@gmail.com mccarrms@clarkson.edu
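Since the reboot was traced to a drive in a software RAID, a quick look at the array state is a reasonable routine check. This is a minimal sketch assuming the usual `/proc/mdstat` layout, where a status line such as `[2/1] [U_]` means two members expected and one active; the function name is made up for illustration. It only reports arrays that have already lost a member, so it would not catch a sector error that has not yet kicked the drive out of the array.

```python
import re


def degraded_arrays(mdstat_text):
    """Return names of md arrays with fewer active members than expected."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\w+)\s*:", line)
        if header:
            # Remember the array name; its counters follow on a later line.
            current = header.group(1)
            continue
        counts = re.search(r"\[(\d+)/(\d+)\]", line)
        if current and counts:
            expected, active = int(counts.group(1)), int(counts.group(2))
            if active < expected:
                degraded.append(current)
            current = None
    return degraded
```

Feeding it the contents of `/proc/mdstat` from cron and mailing any non-empty result is one simple way to hear about a failing member before it takes the VMs down.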
On Thu, Apr 30, 2009 at 3:19 PM, Ljubomir Ljubojevic <office@plcomputers.net> wrote:
So, I guess this wasn't just a hardware issue. I actually had another system crash.
This only appears to happen when I'm issuing xm commands over and over.
Any thoughts?
Thanks, Matt
-- Mathew S. McCarrell Clarkson University '10
mccarrms@gmail.com mccarrms@clarkson.edu
On Thu, Apr 30, 2009 at 6:27 PM, Mathew S. McCarrell mccarrms@gmail.com wrote: