[CentOS-virt] Machine freeze
James Roman
james_roman at ssaihq.com
Fri Apr 3 16:18:43 UTC 2009
I've had to deal with issues like these in the past and I can say they
always suck. Normally, the whole OS freezes due to a hardware issue.
Isolating the cause is extremely time-consuming. If it happens on a
regular basis (e.g. every 60 or maybe 90 days), the most likely culprit
is the DRAC card. There is a known issue where a virtual USB floppy or
CD device spontaneously disappears from the OS, causing an OS freeze. I
believe there is a kernel parameter to pass and a firmware upgrade to
apply.
When addressing any hardware issue, the default response from the vendor
will always be "have you upgraded the BIOS and firmware on all the
cards?" In general, that will be your first step. The next canned reply
will be "do you have any third-party cards or equipment?" (external USB
drives, third-party memory, unsupported cards, etc.). If so, you will be
told to remove them or you're on your own.
Check the controller card logs (BIOS, DRAC, RAID Controller, etc.) and
run the Dell diagnostic tools on the server. Make sure you run a full
check on the memory. (You might also try swapping memory DIMM positions to
see if the behavior changes.) The dmesg log is your friend.
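As a rough sketch of what "the dmesg log is your friend" can look like in
practice, the Python below filters kernel log text for lines that often
accompany hardware trouble. The keyword list and the name suspect_lines
are my own assumptions, not anything from Dell or CentOS, so adjust them
for your hardware and drivers.

```python
import re

# Hypothetical keyword list (an assumption, not an official catalogue).
# Tune it for the hardware and drivers on your own server.
SUSPECT_PATTERNS = [
    r"machine check",
    r"\bmce\b",
    r"\bedac\b",
    r"i/o error",
    r"usb disconnect",
    r"soft lockup",
]
SUSPECT_RE = re.compile("|".join(SUSPECT_PATTERNS), re.IGNORECASE)


def suspect_lines(log_text):
    """Return the lines of log_text that match any suspect pattern."""
    return [line for line in log_text.splitlines() if SUSPECT_RE.search(line)]
```

You would feed it the saved output of dmesg (or a copy of the kernel log
collected on another machine) and read whatever comes back. It only
narrows down where to look; it does not diagnose anything by itself.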
Investigate setting up netdump to create a crash dump file on a remote
system.
A remote monitoring system that collects system logs and SNMP traps and
performs active monitoring can be useful in identifying any events
that lead up to the system freeze (e.g. memory slowly leaking away,
processor spiking, etc.). If you have a DRAC or BMC card, configure it
with an IP address and have it send SNMP traps to a monitoring system.
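If you don't have a monitoring system yet, here is a minimal sketch of
the "memory slowly leaking away" check in Python. It samples MemFree from
/proc/meminfo and warns when it has only gone down across a window of
samples. The function names, the interval, the window and the threshold
are all assumptions of mine. MemFree alone is a crude signal, since Linux
also uses free memory for cache, so treat a warning as a hint to dig
further and nothing more.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: integer value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def is_steadily_falling(samples, min_drop_kb):
    """True if no sample is higher than the one before it and the drop
    from the first sample to the last is at least min_drop_kb."""
    if len(samples) < 2:
        return False
    non_increasing = all(b <= a for a, b in zip(samples, samples[1:]))
    return non_increasing and (samples[0] - samples[-1]) >= min_drop_kb


def watch(interval_s=60, window=10, min_drop_kb=100 * 1024):
    """Sample MemFree forever; print a warning on a steady decline.

    The default interval, window and threshold are arbitrary assumptions.
    """
    history = []
    while True:
        with open("/proc/meminfo") as f:
            free_kb = parse_meminfo(f.read()).get("MemFree", 0)
        history = (history + [free_kb])[-window:]
        if len(history) == window and is_steadily_falling(history, min_drop_kb):
            print("MemFree fell from %d kB to %d kB" % (history[0], history[-1]))
        time.sleep(interval_s)
```

Call watch() from a console or a small init script and send its output
somewhere off the box, since anything kept only on the frozen machine may
be lost.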
Pay attention to any physical changes that coincide with the freeze
(e.g. fans running full bore, which normally means some instruction
ran into a loop).
Just a note, you really want your Xen system to be running bare bones.
Do not install any unnecessary packages. It just complicates your
troubleshooting in this instance.
Configuring the server to send syslog messages to tty12 (or to a serial
console that you monitor from another system) can sometimes be helpful to
see what the last write was supposed to be (if the disk is dying before a
write). Add the following to syslog.conf and leave your console on tty12
(since you won't be able to change it after a freeze).
# Log everything to tty12
*.* /dev/tty12
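One thing that can make the tty12 screen more useful is a periodic
heartbeat, so the last line on the console gives you a rough time for the
freeze. The sketch below uses Python's standard syslog module. The
"heartbeat" ident, the message format and the 30 second interval are my
own assumptions.

```python
import syslog
import time


def heartbeat_message(seq, now):
    """Build a heartbeat line; seq is a counter, now is seconds since the epoch."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(now))
    return "heartbeat seq=%d utc=%s" % (seq, stamp)


def run(interval_s=30):
    """Send one heartbeat to the local syslog daemon every interval_s seconds."""
    # Facility user, priority info: the *.* rule above should match it.
    syslog.openlog("heartbeat", 0, syslog.LOG_USER)
    seq = 0
    while True:
        syslog.syslog(syslog.LOG_INFO, heartbeat_message(seq, time.time()))
        seq += 1
        time.sleep(interval_s)
```

Start run() in the background after restarting syslogd. After a freeze,
the highest seq still visible on tty12 tells you roughly when things
stopped.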
I thought I read that the PAE kernel is superfluous (since 5.x), but
maybe that is with CentOS 5.3.
Maros TIMKO wrote:
>
> Hi all,
>
> we are running CentOS 5.2 Xen virtualization system with the latest
> CentOS packages with couple of VMs on DELL PowerEdge. "Sometimes" the
> whole machine freezes without anything in log files, anything on the
> console. "Sometimes" really means we cannot define why or when.
> Sometimes the machine was idle with just one VM, sometimes quite busy
> with couple of VMs.
>
> Has anybody had the same experience? If yes, any hints on how to
> resolve it or how to trace the cause?
>
>
>
> Thanks.
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> CentOS-virt mailing list
> CentOS-virt at centos.org
> http://lists.centos.org/mailman/listinfo/centos-virt
>