[CentOS-virt] Machine freeze

Fri Apr 3 15:18:43 UTC 2009
James Roman <james_roman at ssaihq.com>

I've had to deal with issues like these in the past, and I can say they
always suck. Normally, the whole OS freezes due to a hardware issue, and
isolating the cause is extremely time-consuming. If it happens on a
regular basis (e.g. every 60 or maybe 90 days), the most likely culprit
is the DRAC card. There is a known issue where a virtual USB floppy or
CD device spontaneously disappears from the OS, causing an OS freeze. I
believe there is a kernel parameter to pass and a firmware upgrade to
apply.

When addressing any hardware issue, the default response from the vendor
will always be "have you upgraded the BIOS and firmware on all the
cards?" In general, that will be your first step. The next canned reply
will be "do you have any third-party cards or equipment?" (external USB
drives, third-party memory, unsupported cards, etc.). If so, you will be
told to remove them or you're on your own.

Check the controller card logs (BIOS, DRAC, RAID controller, etc.) and
run the Dell diagnostic tools on the server. Make sure you run a full
check on the memory. (You might also try swapping memory DIMM positions
to see if the behavior changes.) The dmesg log is your friend.
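
As a rough illustration of combing through dmesg output, here is a
minimal Python sketch that filters kernel messages for lines that often
accompany hardware trouble. The pattern list is my own assumption, not
an exhaustive or vendor-supplied list, and it assumes a Python 3
interpreter is available.

```python
import re
import subprocess

# Illustrative patterns only (an assumption); extend them for your hardware.
SUSPECT_PATTERNS = [
    r"machine check",
    r"\bmce\b",
    r"\bedac\b",
    r"i/o error",
    r"usb disconnect",
    r"soft lockup",
]
SUSPECT_RE = re.compile("|".join(SUSPECT_PATTERNS), re.IGNORECASE)


def suspicious_lines(log_text):
    """Return the lines of log_text that match any suspect pattern."""
    return [line for line in log_text.splitlines() if SUSPECT_RE.search(line)]


def read_dmesg():
    """Return the kernel ring buffer as text (may require root)."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling suspicious_lines(read_dmesg()) after a reboot, or feeding it the
contents of a saved log file, gives a short list to start from. A clean
result does not rule out a hardware fault.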

Investigate setting up netdump to create a crash dump file on a remote
system.

A remote monitoring system that collects system logs and SNMP traps and
performs active monitoring can be useful in identifying any events that
lead up to the system freeze (e.g. memory slowly leaking away, processor
spiking, etc.). If you have a DRAC or BMC card, configure it with an IP
address and have it send SNMP traps to a monitoring system.
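
If no monitoring system is in place yet, even a crude local trend log
can show whether memory or load was drifting before a freeze. The
following is only a sketch under a few assumptions: a Python 3
interpreter is available on the host, and the log path and 60-second
interval are placeholders I picked for illustration.

```python
import os
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def sample_line(meminfo_text, loadavg_text, now=None):
    """Build one log line: timestamp, free memory, free swap, 1-minute load."""
    mem = parse_meminfo(meminfo_text)
    load1 = loadavg_text.split()[0]
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    return "%s MemFree=%skB SwapFree=%skB load1=%s" % (
        stamp, mem.get("MemFree", "?"), mem.get("SwapFree", "?"), load1)


def run(logfile="/var/log/freeze-trend.log", interval=60):
    """Append one sample every `interval` seconds, forcing each to disk."""
    with open(logfile, "a") as out:
        while True:
            with open("/proc/meminfo") as f:
                meminfo = f.read()
            with open("/proc/loadavg") as f:
                loadavg = f.read()
            out.write(sample_line(meminfo, loadavg) + "\n")
            out.flush()
            os.fsync(out.fileno())  # so the last sample survives a hard freeze
            time.sleep(interval)
```

Calling run() from a small wrapper script started at boot leaves a file
whose last few lines show the state shortly before the machine stopped
responding.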

Pay attention to any physical changes that coincide with the freeze
(e.g. fans running full bore, which normally means some code is stuck
in a loop).

Just a note, you really want your Xen system to be running bare bones.
Do not install any unnecessary packages. Extra packages just complicate
your troubleshooting in a case like this.

Configuring the server to send syslog messages to tty12 (or to a serial
console that you monitor from another system) can sometimes be helpful
to see what the last write was supposed to be (if the disk is dying
before a write). Add the following to syslog.conf and leave your console
on tty12 (since you won't be able to change it after a freeze).

# Log everything to tty12
*.*                                                     /dev/tty12
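
If a second machine is available, the same messages can also be captured
over the network so they survive the freeze. Below is a minimal sketch
of a UDP syslog receiver in Python 3. It is not a replacement for a real
syslog server, and the log file name is a placeholder. It assumes the
frozen host forwards its messages with a syslog.conf line of the form
"*.* @loghost", where loghost is a hypothetical name for the receiving
machine.

```python
import socket


def decode_priority(message):
    """Split a syslog datagram like '<13>text' into (facility, severity, text).

    Returns (None, None, message) if there is no leading <PRI> field.
    """
    if message.startswith("<"):
        end = message.find(">")
        if end > 1 and message[1:end].isdigit():
            pri = int(message[1:end])
            # PRI encodes facility * 8 + severity.
            return pri // 8, pri % 8, message[end + 1:]
    return None, None, message


def serve(bind_addr="0.0.0.0", port=514, logfile="remote-syslog.log"):
    """Receive UDP syslog datagrams and append them to logfile.

    Port 514 is the standard syslog port and normally requires root to bind.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_addr, port))
    with open(logfile, "a") as out:
        while True:
            data, (host, _port) = sock.recvfrom(8192)
            text = data.decode("utf-8", errors="replace").rstrip()
            facility, severity, body = decode_priority(text)
            out.write("%s facility=%s severity=%s %s\n"
                      % (host, facility, severity, body))
            out.flush()
```

Running serve() on the second machine and tailing its log file shows the
last messages the host managed to send before it locked up.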


I thought I read that the PAE kernel is superfluous (since 5.x), but
maybe that is with CentOS 5.3.

Maros TIMKO wrote:
>
> Hi all,
>
> we are running CentOS 5.2 Xen virtualization system with the latest 
> CentOS packages with couple of VMs on DELL PowerEdge. "Sometimes" the 
> whole machine freezes without anything in log files, anything on the 
> console. "Sometimes" really means we cannot define why or when. 
> Sometimes the machine was idle with just one VM, sometimes quite busy 
> with couple of VMs.
>
>  Has anybody had the same experience? If yes, any hints on how to 
> resolve it or how to trace the cause?
>
>  
>
> Thanks. 