Hi,
I'm new to this list; I joined because I am volunteering as a tech admin for a non-profit organization called CouchSurfing (.com). We have tried to move our web servers into Xen domains, but this has proven quite unstable: the domUs tend to crash on a daily basis with the latest CentOS 5 Xen updates. The physical boxes have two quad-core 1.6 GHz Xeon CPUs and 4 GB RAM, and there is currently only one domU on each box, configured with 2.2 GB RAM.
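For completeness, the domU definition is essentially the stock CentOS one; a minimal sketch of the relevant settings (the file name, disk and vif lines here are illustrative placeholders, not copied from our box):

# /etc/xen/web1 -- illustrative sketch, not the exact file
name       = "web1"
memory     = 2200                  # MB, as described above
vcpus      = 8                     # all 8 cores; implied by "CPU#7" in the traces below
bootloader = "/usr/bin/pygrub"
disk       = [ "phy:/dev/VolGroup00/web1,xvda,w" ]   # placeholder device
vif        = [ "bridge=xenbr0" ]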
Domain 0:
[root@nd10254 ~]# rpm -qa | grep xen
xen-libs-3.0.3-25.0.4.el5
kernel-xen-2.6.18-8.1.8.el5
kernel-xen-2.6.18-8.1.14.el5
xen-3.0.3-25.0.4.el5
Web1:
[root@web1 ~]# rpm -qa | grep xen
kernel-xen-2.6.18-8.1.8.el5
kernel-xen-2.6.18-8.1.14.el5
Domain-0 itself is rock solid, while Web1 and the other domUs crash daily with the 2.6.18-8.1.14 Xen kernel. The stack trace we see after a few hours is as follows:
BUG: soft lockup detected on CPU#5!
Call Trace:
 <IRQ>  [<ffffffff802a76ad>] softlockup_tick+0xdb/0xed
 [<ffffffff8026ba66>] timer_interrupt+0x396/0x3f2
 [<ffffffff80210a87>] handle_IRQ_event+0x2d/0x60
 [<ffffffff802a79ec>] __do_IRQ+0xa4/0x105
 [<ffffffff802699b3>] do_IRQ+0xe7/0xf5
 [<ffffffff8038dde8>] evtchn_do_upcall+0x86/0xe0
 [<ffffffff8025cc1a>] do_hypervisor_callback+0x1e/0x2c
 <EOI>  [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
 [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
 [<ffffffff8026afe2>] raw_safe_halt+0x84/0xa8
 [<ffffffff802684f8>] xen_idle+0x38/0x4a
 [<ffffffff80247bcd>] cpu_idle+0x97/0xba
BUG: soft lockup detected on CPU#7!
And so on. This does not crash the domUs outright, but it clogs up the web1 console. We did not see this when running the 2.6.18-8.1.8 Xen kernel; with that kernel the domUs instead crashed, less frequently, with an out-of-memory problem as follows:
Call Trace:
 [<ffffffff802aeefc>] out_of_memory+0x4e/0x1d3
 [<ffffffff8020efe8>] __alloc_pages+0x229/0x2b2
 [<ffffffff8023fd5b>] __lock_page+0x5e/0x64
 [<ffffffff80232637>] read_swap_cache_async+0x42/0xd1
 [<ffffffff802b32a2>] swapin_readahead+0x4e/0x77
 [<ffffffff8020929d>] __handle_mm_fault+0xae3/0xf46
 [<ffffffff80260709>] _spin_lock_irqsave+0x9/0x14
 [<ffffffff80262fe8>] do_page_fault+0xe48/0x11dc
 [<ffffffff80207138>] kmem_cache_free+0x77/0xca
 [<ffffffff8025cb6f>] error_exit+0x0/0x6e
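Since that trace shows the domU already swapping hard (read_swap_cache_async / swapin_readahead), the obvious checks from both sides, in case it helps anyone reproduce or rule things out, are just the standard tools:

# inside the domU: memory and swap pressure
free -m
vmstat 5          # sustained si/so columns confirm heavy swapping

# from dom0: what each domain is actually allocated
xm list
xentop            # live per-domain CPU and memory usage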
Based on some fuzzy posts we stumbled on around the net, we think the whole problem is that the kernel fails to handle resource clogging (too many interrupts, heavy CPU and memory usage in the domUs, etc.), for example:
http://article.gmane.org/gmane.comp.emulators.xen.user/26617
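If that theory is right, one workaround we are considering (untested, and the numbers below are illustrative) is to keep the domU from starving dom0 of CPU by pinning and capping its vcpus with the credit scheduler:

# from dom0; assumes the default credit scheduler in Xen 3.0.3
xm vcpu-pin Domain-0 0 0          # keep dom0's vcpu 0 on physical CPU 0 (repeat per dom0 vcpu)
xm vcpu-pin web1 0 1-7            # keep web1 off CPU 0; repeat for each of web1's vcpus
xm sched-credit -d web1 -w 256    # relative scheduling weight (illustrative value)
xm sched-credit -d web1 -c 700    # cap web1 at ~7 cores so dom0 always gets cycles

Simply lowering vcpus in the domU config (e.g. from 8 to 4) would be an even simpler variant of the same idea.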
Is this problem known to you, or is it new? Any ideas on how to resolve it?
Regards,
Nicolas Sahlqvist
CouchSurfing.com