I've had to deal with issues like these in the past, and I can say they always suck. Normally the whole OS freezes because of a hardware issue, and isolating the cause is extremely time consuming.

If it happens on a regular basis (e.g. every 60 or maybe 90 days), the most likely culprit is the DRAC card. There is a known issue where a virtual USB floppy or CD device spontaneously disappears from the OS, causing an OS freeze. I believe there is a kernel parameter to pass and a firmware upgrade to apply.

When addressing any hardware issue, the vendor's default response will always be "Have you upgraded the BIOS and the firmware on all the cards?" In general, that will be your first step. The next canned reply will be "Do you have any third-party cards or equipment?" (external USB drives, third-party memory, unsupported cards, etc.). If so, you will be told to remove them or you're on your own.

Check the controller logs (BIOS, DRAC, RAID controller, etc.) and run the Dell diagnostic tools on the server. Make sure you run a full check on the memory. (You might also try swapping the DIMM positions to see if the behavior changes.)

The dmesg log is your friend. Investigate setting up netdump to create a crash dump file on a remote system.

A remote monitoring system that collects system logs and SNMP traps and performs active monitoring can be useful in identifying events that lead up to the freeze (e.g. memory slowly leaking away, processor spiking, etc.). If you have a DRAC or BMC card, configure it with an IP address and have it send SNMP traps to a monitoring system.

Pay attention to any physical changes that coincide with the freeze (e.g. fans running full bore, which normally means something is spinning in a loop).

Just a note: you really want your Xen system to be running bare bones. Do not install any unnecessary packages; they just complicate your troubleshooting in this instance.
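
For the "memory slowly leaking away" case, one option is to have the box write a small health line to syslog every minute, so the trend is on record (and on the remote log host, if you forward syslog) when it freezes. The following is only a rough sketch, not something I have run on this setup: the "healthcheck" tag, the function names and the 60-second interval are my own assumptions, and it assumes a Linux /proc filesystem and a Python recent enough to support the "with" statement.

```python
import syslog
import time


def parse_meminfo(text):
    """Map /proc/meminfo field names to integer values (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def format_sample(meminfo, loadavg_text):
    """Build one log line from parsed meminfo and the raw /proc/loadavg text."""
    load1 = loadavg_text.split()[0]  # first field is the 1-minute load average
    return "healthcheck MemFree=%dkB SwapFree=%dkB load1=%s" % (
        meminfo.get("MemFree", -1),   # -1 marks a field that was not found
        meminfo.get("SwapFree", -1),
        load1,
    )


def main(interval_seconds=60):
    # Hypothetical tag; pick anything that is easy to grep for on the log host.
    syslog.openlog("healthcheck")
    while True:
        with open("/proc/meminfo") as f:
            meminfo = parse_meminfo(f.read())
        with open("/proc/loadavg") as f:
            loadavg_text = f.read()
        syslog.syslog(syslog.LOG_INFO, format_sample(meminfo, loadavg_text))
        time.sleep(interval_seconds)
```

Call main() from whatever starts it (an init script, for example). After a freeze, the last few "healthcheck" lines show whether free memory or swap was shrinking, or the load was climbing, in the minutes before the box stopped responding.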

Configuring the server to send syslog messages to tty12 (or to a serial console monitored from another system) can sometimes help you see what the last write was supposed to be (if the disk is dying before a write). Add the following to syslog.conf and leave your console on tty12, since you won't be able to switch to it after a freeze.

    # Log everything to tty12
    *.*    /dev/tty12

I thought I read that the PAE kernel is superfluous (since 5.x), but maybe that is with CentOS 5.3.

Maros TIMKO wrote:
> Hi all,
>
> we are running CentOS 5.2 Xen virtualization system with the latest
> CentOS packages with couple of VMs on DELL PowerEdge. "Sometimes" the
> whole machine freezes without anything in log files, anything on the
> console. "Sometimes" really means we cannot define why or when.
> Sometimes the machine was idle with just one VM, sometimes quite busy
> with couple of VMs.
>
> Has anybody had the same experience? If yes, any hints on how to
> resolve it or how to trace the cause?
>
> Thanks.