[CentOS] Xen crash

Mon Oct 8 22:42:04 UTC 2007
Nicolas Sahlqvist <nicco77 at gmail.com>

Hi,

I'm new to this list. I joined because I volunteer as a tech admin
for a non-profit organization called CouchSurfing (.com), where we
tried to move the web servers to Xen zones. This has proven quite
unstable: our defined zones tend to crash on a daily basis with the
latest CentOS 5 Xen updates. The physical boxes have 2 quad-core
1.6GHz Xeon CPUs and 4 GB RAM. There is currently only 1 domain on
each box, configured with 2.2GB RAM.

Domain 0:
[root@nd10254 ~]# rpm -qa | grep xen
xen-libs-3.0.3-25.0.4.el5
kernel-xen-2.6.18-8.1.8.el5
kernel-xen-2.6.18-8.1.14.el5
xen-3.0.3-25.0.4.el5

Web1:
[root@web1 ~]# rpm -qa | grep xen
kernel-xen-2.6.18-8.1.8.el5
kernel-xen-2.6.18-8.1.14.el5

The Domain0 zone is indeed rock stable, while Web1 etc. are crashing
daily with the 2.6.18-8.1.14 Xen kernel. The stack trace we see after
a few hours is as follows:

BUG: soft lockup detected on CPU#5!

Call Trace:
  <IRQ>  [<ffffffff802a76ad>] softlockup_tick+0xdb/0xed
 [<ffffffff8026ba66>] timer_interrupt+0x396/0x3f2
 [<ffffffff80210a87>] handle_IRQ_event+0x2d/0x60
 [<ffffffff802a79ec>] __do_IRQ+0xa4/0x105
 [<ffffffff802699b3>] do_IRQ+0xe7/0xf5
 [<ffffffff8038dde8>] evtchn_do_upcall+0x86/0xe0
 [<ffffffff8025cc1a>] do_hypervisor_callback+0x1e/0x2c
 <EOI>  [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
 [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
 [<ffffffff8026afe2>] raw_safe_halt+0x84/0xa8
 [<ffffffff802684f8>] xen_idle+0x38/0x4a
 [<ffffffff80247bcd>] cpu_idle+0x97/0xba

BUG: soft lockup detected on CPU#7!

And so on. This does not crash the Xen zones directly, but it clogs
up the Xen web1 console etc. We did not see this when running the
2.6.18-8.1.8 Xen kernel; instead the Xen zones crashed less
frequently, with an out-of-memory problem, as follows:

Call Trace:
 [<ffffffff802aeefc>] out_of_memory+0x4e/0x1d3
 [<ffffffff8020efe8>] __alloc_pages+0x229/0x2b2
 [<ffffffff8023fd5b>] __lock_page+0x5e/0x64
 [<ffffffff80232637>] read_swap_cache_async+0x42/0xd1
 [<ffffffff802b32a2>] swapin_readahead+0x4e/0x77
 [<ffffffff8020929d>] __handle_mm_fault+0xae3/0xf46
 [<ffffffff80260709>] _spin_lock_irqsave+0x9/0x14
 [<ffffffff80262fe8>] do_page_fault+0xe48/0x11dc
 [<ffffffff80207138>] kmem_cache_free+0x77/0xca
 [<ffffffff8025cb6f>] error_exit+0x0/0x6e

We think the whole problem is how the kernel fails to handle resource
clogging (too many interrupts, heavy CPU and memory usage in the
defined zones etc.). This is based on stumbling on some fuzzy posts on
the net, for example:

http://article.gmane.org/gmane.comp.emulators.xen.user/26617

Is this problem known to you, or is it new? Any ideas on how to
resolve it?
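
To gauge how often each CPU reports a lockup, the console output can
be tallied with a small script. The following is only a minimal
sketch, assuming the console log has been saved to a file. The
function name count_soft_lockups and the command-line usage are
hypothetical and not part of any Xen or CentOS tooling.

```python
import re
import sys
from collections import Counter

# Matches lines such as "BUG: soft lockup detected on CPU#5!"
LOCKUP_RE = re.compile(r"BUG: soft lockup detected on CPU#(\d+)!")


def count_soft_lockups(lines):
    """Return a Counter mapping CPU number -> number of soft lockup reports."""
    counts = Counter()
    for line in lines:
        match = LOCKUP_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python count_lockups.py saved-console.log
    with open(sys.argv[1], errors="replace") as log:
        for cpu, n in sorted(count_soft_lockups(log).items()):
            print(f"CPU#{cpu}: {n}")
```

If the counts cluster on a few CPUs, or rise with load, that would be
worth including in any follow-up report.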


Regards,
Nicolas Sahlqvist
CouchSurfing.com