I have a guest that keeps crashing and want to automatically reboot it when it crashes. See:
xen PV guest kernel 2.6.32 processes lock up in D state https://bugzilla.redhat.com/show_bug.cgi?id=550724
if you want to look at the details on the crashing.
Anyway, I boot the guest with the kernel command line parameter:
hung_task_panic=1
I have kernel.panic = 15 in the guest /etc/sysctl.conf
In the guest config file in dom0 I have:
on_poweroff = "destroy"
on_reboot = "restart"
on_crash = "restart"
The guest manages to panic when it detects the hung tasks and I get this on the guest console:
Kernel panic - not syncing: hung_task: blocked tasks
Rebooting in 15 seconds..
However, it never restarts. It just hangs around until I do a
xm destroy <guest>
xm create <guest>
Have I missed something?
----- "Norman Gaywood" ngaywood@une.edu.au wrote:
I have a guest that keeps crashing and want to automatically reboot it when it crashes. See:
xen PV guest kernel 2.6.32 processes lock up in D state https://bugzilla.redhat.com/show_bug.cgi?id=550724
if you want to look at the details on the crashing.
If your guest's state is corrupted, you can't rely on its behavior. For that reason, you should set up a watchdog and only rely on the panic behavior as a preliminary measure.
On Sat, Mar 13, 2010 at 10:21:14PM -0600, Christopher G. Stach II wrote:
----- "Norman Gaywood" ngaywood@une.edu.au wrote:
I have a guest that keeps crashing and want to automatically reboot it when it crashes. See:
xen PV guest kernel 2.6.32 processes lock up in D state https://bugzilla.redhat.com/show_bug.cgi?id=550724
if you want to look at the details on the crashing.
If your guest's state is corrupted, you can't rely on its behavior. For that reason, you should set up a watchdog and only rely on the panic behavior as a preliminary measure.
I figured that I had set up a watchdog with the hung_task_panic=1 guest parameter. Is there another way of setting up a watchdog in this case?
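A dom0-side check is one possible answer to that question. The following is only a sketch and not something proposed in the thread: it assumes the guest normally answers ping at a known address, and the function names, the `XM`, `PROBE`, and `RETRY_DELAY` variables, and the retry count are all hypothetical. If every probe fails, it runs the same `xm destroy` / `xm create` pair used by hand above.

```shell
#!/bin/sh
# Hypothetical dom0 watchdog sketch: restart a guest that stops answering a probe.
# All names and thresholds here are illustrative assumptions.

XM="${XM:-xm}"                      # domain management command
PROBE="${PROBE:-ping -c 1 -W 2}"    # liveness probe, given the guest address

# Return 0 if the guest at address $1 answers the probe.
guest_alive() {
    $PROBE "$1" >/dev/null 2>&1
}

# Destroy and recreate guest $1, mirroring the manual recovery in the thread.
restart_guest() {
    $XM destroy "$1"
    $XM create "$1"
}

# Probe address $2 up to $3 times; restart guest $1 only if every probe fails.
check_guest() {
    name=$1
    ip=$2
    tries=$3
    i=0
    while [ "$i" -lt "$tries" ]; do
        guest_alive "$ip" && return 0
        i=$((i + 1))
        sleep "${RETRY_DELAY:-10}"
    done
    restart_guest "$name"
}
```

Called periodically in dom0 (for example from a cron-driven script) as `check_guest myguest 192.0.2.10 3`, where the name and address are placeholders. A guest that is merely slow to answer would also be restarted, so the retry count and delay need care.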
On Sun, Mar 14, 2010 at 09:03:23AM +1100, Norman Gaywood wrote:
I have a guest that keeps crashing and want to automatically reboot it when it crashes. See:
xen PV guest kernel 2.6.32 processes lock up in D state https://bugzilla.redhat.com/show_bug.cgi?id=550724
if you want to look at the details on the crashing.
Btw please see: http://wiki.xensource.com/xenwiki/XenCommonProblems
Especially the chapter about debugging crashed guests. It would be very helpful to grab a stacktrace of the crashed guest to debug it.
So set on_crash=preserve for the guest, and then use xenctx (with the guest kernel System.map) to get the stack trace..
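A rough sketch of that procedure, under some assumptions: the guest config in dom0 has on_crash = "preserve" so the crashed domain stays around, and xenctx is installed at the path shown (it may live elsewhere on your system). The function name `dump_guest_stacks` and the `XENCTX` variable are made up for illustration.

```shell
#!/bin/sh
# Sketch: dump the context of every vcpu of a preserved, crashed guest.
# The xenctx path is an assumption; adjust it for your installation.

XENCTX="${XENCTX:-/usr/lib64/xen/bin/xenctx}"

# $1 = numeric domain id, $2 = the guest kernel's System.map, $3 = vcpu count
dump_guest_stacks() {
    domid=$1
    sysmap=$2
    nvcpus=$3
    v=0
    while [ "$v" -lt "$nvcpus" ]; do
        echo "== vcpu $v =="
        # -s points xenctx at the symbol table so addresses resolve to names
        $XENCTX -s "$sysmap" "$domid" "$v"
        v=$((v + 1))
    done
}
```

For a 12-vcpu guest this might be called as `dump_guest_stacks "$(xm domid <guest>)" /path/to/System.map 12`, with the System.map taken from the kernel the guest is actually running.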
Redhat bugzilla seems to be down, so I can't check the details about the bugreport. Is the guest single-vcpu or multi-vcpu? 32bit or 64bit?
-- Pasi
-- Norman Gaywood, Computer Systems Officer University of New England, Armidale, NSW 2351, Australia
ngaywood@une.edu.au Phone: +61 (0)2 6773 3337 http://mcs.une.edu.au/~norm Fax: +61 (0)2 6773 3312
Please avoid sending me Word or PowerPoint attachments. See http://www.gnu.org/philosophy/no-word-attachments.html
_______________________________________________
CentOS-virt mailing list
CentOS-virt@centos.org
http://lists.centos.org/mailman/listinfo/centos-virt
Thanks Pasi.
On Sun, Mar 14, 2010 at 03:16:13PM +0200, Pasi Kärkkäinen wrote:
On Sun, Mar 14, 2010 at 09:03:23AM +1100, Norman Gaywood wrote:
I have a guest that keeps crashing and want to automatically reboot it when it crashes. See:
xen PV guest kernel 2.6.32 processes lock up in D state https://bugzilla.redhat.com/show_bug.cgi?id=550724
if you want to look at the details on the crashing.
Btw please see: http://wiki.xensource.com/xenwiki/XenCommonProblems
Especially the chapter about debugging crashed guests. It would be very helpful to grab a stacktrace of the crashed guest to debug it.
I've already done this; there are some stack traces in the bugzilla entry.
Andrew Jones seems to think that the problem in this case is actually in the Xen hypervisor and not the guest. Problem is I seem to be the only one hitting this.
Redhat bugzilla seems to be down, so I can't check the details about the bugreport. Is the guest single-vcpu or multi-vcpu? 32bit or 64bit?
64bit multi-vcpu. The guest is quite heavyweight, 30GB of memory and 12vcpu. It's a LTSP server designed to handle lots of graphical logins for computer science students. This, I guess is not a common workload.
----- "Norman Gaywood" ngaywood@une.edu.au wrote:
64bit multi-vcpu. The guest is quite heavyweight, 30GB of memory and 12vcpu. It's a LTSP server designed to handle lots of graphical logins for computer science students. This, I guess is not a common workload.
I wouldn't say that it's an uncommon workload, or VM configuration, at all. However, it is an uncommon kernel. Is there any reason that you need to use that one? Can/Does it work with something more "approved"?
On 15 March 2010 10:12, Christopher G. Stach II cgs@ldsys.net wrote:
----- "Norman Gaywood" ngaywood@une.edu.au wrote:
64bit multi-vcpu. The guest is quite heavyweight, 30GB of memory and 12vcpu. It's a LTSP server designed to handle lots of graphical logins for computer science students. This, I guess is not a common workload.
I wouldn't say that it's an uncommon workload, or VM configuration, at all. However, it is an uncommon kernel. Is there any reason that you need to use that one? Can/Does it work with something more "approved"?
LTSP setups are either Fedora or Ubuntu which run about the same vintage of kernel.
Thing is that the 2.6.3? kernels that are supposed to work as a Xen PV guest have been around a long time now. None of them seem to work in my case. The oldest I tried was 2.6.30.
Installing, say, a CentOS kernel on Fedora does not look to be an option. Ubuntu seems to have pretty much the same software/kernel as Fedora. It's a lot of work to build something like our current setup on Ubuntu only to discover it probably has the same problem. Note also that this problem was not reproduced in testing until very recently (see the bugzilla).
Another way would be to install Fedora/Ubuntu on the bare metal like we have on older versions of this system. But sigh, I hit a different bug going that way with "Enterprise Hardware" not supported in modern kernels.
So these 2.6.3? kernels are supposed to work as PV guests. And in any event, it looks like the problem might be in the Xen hypervisor anyway. The bugzilla was moved from a Fedora bug to a RHEL bug (not by me).
My thinking at the moment for a way forward is to look at switching my guest to KVM. But that's even more bleeding edge.
----- "Norman Gaywood" ngaywood@une.edu.au wrote:
LTSP setups are either Fedora or Ubuntu which run about the same vintage of kernel.
There isn't a whole lot special happening on the server side for LTSP. You shouldn't be tied to any specific kernel version as long as you have NFS, dhcpd, tftpd, and an X server running and accepting connections. I ran it with a CentOS 5 server in the past without any problems native to LTSP. (The only issue you may run into is a build environment if you muck around with the client binaries, and you don't need Xen for that, nor do you want a build environment on your production system.) You could even roll your own with something like Thinstation or NX (if you want headaches). All in all, it's not a big deal if you build it completely from scratch, no matter what kind of clients you are serving. I actually prefer it nice and simple this way, because LTSP development has been more than a little bit messed up for years and they now tie you to their line of thinking and the associated problems.
Now, if you want something specific to a recent Fedora version on the server, then you're just asking for trouble. Fedora is nowhere near stable and you're just going to tear your hair out maintaining it after you tear your hair out getting it to run. If that's the case, avoid Xen and any sort of stability-oriented "enterprise" stuff and run it on the compatible bare metal (after you acquire it.) Is it a need and is it really worth it to do it that way instead of getting your Fedora environment dependencies and building them on something stable or getting them from EPEL or another repo?
Installing say a Centos kernel on fedora does not look to be an option.
Yeah, not unless you go back to FC6. :)
On Sun, Mar 14, 2010 at 07:33:43PM -0500, Christopher G. Stach II wrote:
----- "Norman Gaywood" ngaywood@une.edu.au wrote:
LTSP setups are either Fedora or Ubuntu which run about the same vintage of kernel.
There isn't a whole lot special happening on the server side for LTSP.
[deleted fine description of basic LTSP requirements]
Actually we already have the LTSP basic server requirements separated out and running on bare metal. As you say, there is not much to do here.
The unstable VM is running the desktops for the LTSP. Also a lightly loaded httpd and a lightly loaded samba server. There is also a light load of NFS which I have almost disabled while this problem persists.
Now, if you want something specific to a recent Fedora version on the server, then you're just asking for trouble.
Well yes, seems I've hit trouble this time. Was expecting some trouble with the desktop apps and was happy to deal with that. I was not expecting significant kernel trouble however.
Rebuilding a modern desktop distro with an older kernel is a lot of work. That's what distributions are for, right?
On Mon, Mar 15, 2010 at 11:06:01AM +1100, Norman Gaywood wrote:
On 15 March 2010 10:12, Christopher G. Stach II cgs@ldsys.net wrote:
----- "Norman Gaywood" ngaywood@une.edu.au wrote:
64bit multi-vcpu. The guest is quite heavyweight, 30GB of memory and 12vcpu. It's a LTSP server designed to handle lots of graphical logins for computer science students. This, I guess is not a common workload.
I wouldn't say that it's an uncommon workload, or VM configuration, at all. However, it is an uncommon kernel. Is there any reason that you need to use that one? Can/Does it work with something more "approved"?
LTSP setups are either Fedora or Ubuntu which run about the same vintage of kernel.
Thing is that the 2.6.3? kernels that are supposed to work as a Xen PV guest have been around a long time now. None of them seem to work in my case. The oldest I tried was 2.6.30.
Installing, say, a CentOS kernel on Fedora does not look to be an option. Ubuntu seems to have pretty much the same software/kernel as Fedora. It's a lot of work to build something like our current setup on Ubuntu only to discover it probably has the same problem. Note also that this problem was not reproduced in testing until very recently (see the bugzilla).
Another way would be to install Fedora/Ubuntu on the bare metal like we have on older versions of this system. But sigh, I hit a different bug going that way with "Enterprise Hardware" not supported in modern kernels.
So these 2.6.3? kernels are supposed to work as PV guests. And in any event, it looks like the problem might be in the Xen hypervisor anyway. The bugzilla was moved from a Fedora bug to a RHEL bug (not by me).
Sorry I can't remember if I already asked if you tried upgrading to Xen 3.4.2 from http://gitco.de/repo/ ?
-- Pasi
64bit multi-vcpu. The guest is quite heavyweight, 30GB of memory and 12vcpu. It's a LTSP server designed to handle lots of graphical logins for computer science students. This, I guess is not a common workload.
I was wondering if you wouldn't mind describing the hardware this runs on?
On Sun, Mar 14, 2010 at 06:35:02PM -0600, compdoc wrote:
64bit multi-vcpu. The guest is quite heavyweight, 30GB of memory and 12vcpu. It's a LTSP server designed to handle lots of graphical logins for computer science students. This, I guess is not a common workload.
I was wondering if you wouldn't mind describing the hardware this runs on?
Sure, more detail at:
https://bugzilla.redhat.com/show_bug.cgi?id=550724
this cut'n'pasted from there (comment #13):
This hardware is relatively new, just over 6 months old. The main idea of the system is to be a development environment for math/comp sci students. It's set up to deal with up to 60 LTSP (Linux Terminal Server Project) terminals and nxclient/ssh connections. It replaces a 4 year old HP server with Fedora 10 that did the same thing. The old HP setup ran Fedora 10, at its end of life, on the bare metal. The new server was supposed to make use of virtualization.
The hardware of the new dom0 server is an IBM x3850 M2 with 4 Xeon Quad Core E7330 80W processors, 64GB of memory. Two IBM 73.4GB 2.5in 10K HS SAS HDD make up the system storage for dom0.
At the moment we are running CentOS 5.4 with the latest kernel I could find: kernel-xen-2.6.18-186.el5
SAS attached for main storage is an IBM DS3200 with 12 750GB SATA HDD configured as one large raid 6 drive. We break up the large drive using LV.
Various attachments of config and dmesg of dom0 to follow.
I see no strange error messages in the dom0 (including /var/log/messages) except for the:
(XEN) traps.c:1878:d5 Domain attempted WRMSR 000000000000008b from 00000021:00000000 to 00000000:00000000.
reported by "xm dmesg"
We don't use NFS in dom0 and the network around here is pretty much stable now.
One thing to note. Originally we had hoped to run a fedora kernel as a dom0. However we struck bug #541615 (Calgary: DMA error on CalIOC2 PHB 0x3) and so were unable to get the attached storage to pass disk tests. RH enterprise/Centos is rock solid as a dom0 and passes any disk tests we can throw at it.
Try this in the guest:
echo "1" > /proc/sys/kernel/panic_on_oops
echo "5" > /proc/sys/kernel/panic
Thanks Dan,
On Sun, Mar 14, 2010 at 05:06:32PM -0400, Dan Hrabarchuk wrote:
Try this in the guest:
echo "1" > /proc/sys/kernel/panic_on_oops
This might be a bit too much. There is another "harmless" oops that occasionally happens related to ext4 quotas. I wouldn't want it to reboot on those.
The hung_task_panic=1 does in fact cause the panic when I want it. However the problem seems to be that it does not reboot after the panic.
echo "5" > /proc/sys/kernel/panic
I already have this, although I have a value of 15.
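To confirm which values are actually live in the running guest, rather than only present in /etc/sysctl.conf or on the kernel command line, something like the following could be used. It is a minimal sketch: the function name `show_panic_settings` is hypothetical, and it simply reads the panic-related entries under /proc/sys/kernel, reporting any that the running kernel does not expose.

```shell
#!/bin/sh
# Sketch: print the panic-related sysctl values currently in effect.
# $1 optionally overrides the directory to read (defaults to /proc/sys/kernel).

show_panic_settings() {
    dir=${1:-/proc/sys/kernel}
    for key in panic panic_on_oops hung_task_panic hung_task_timeout_secs; do
        if [ -r "$dir/$key" ]; then
            printf 'kernel.%s = %s\n' "$key" "$(cat "$dir/$key")"
        else
            # Entry missing, e.g. the kernel lacks hung task detection
            printf 'kernel.%s is not available\n' "$key"
        fi
    done
}
```

Running `show_panic_settings` with no arguments in the guest should show kernel.panic = 15 and kernel.hung_task_panic = 1 if the settings described above took effect.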