Hi CentOS-virt,
I'm new to this list and joined because I am volunteering as a tech admin for a non-profit organization called CouchSurfing (.com). We tried to move the web servers into Xen zones (domUs), but this has proven quite unstable: the zones we defined tend to crash on a daily basis with the latest CentOS 5 Xen updates. The physical boxes have 2 quad-core 1.6 GHz Xeon CPUs and 4 GB of RAM; there is currently only 1 domain on each box, configured with 2.2 GB of RAM.
Domain 0:
[root@nd10254 ~]# rpm -qa | grep xen
xen-libs-3.0.3-25.0.4.el5
kernel-xen-2.6.18-8.1.8.el5
kernel-xen-2.6.18-8.1.14.el5
xen-3.0.3-25.0.4.el5
Web1:
[root@web1 ~]# rpm -qa | grep xen
kernel-xen-2.6.18-8.1.8.el5
kernel-xen-2.6.18-8.1.14.el5
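For reference, the web1 guest is defined more or less like this. This is only a sketch from memory: the disk and vif lines are illustrative rather than the real values, and the vcpus count is an assumption based on the 8 cores showing up in the traces below.

# /etc/xen/web1 (sketch, not the exact file)
name       = "web1"
memory     = 2200                                    # the 2.2 GB mentioned above
vcpus      = 8                                       # assumption: all 8 physical cores exposed to the guest
bootloader = "/usr/bin/pygrub"
disk       = [ "phy:/dev/VolGroup00/web1,xvda,w" ]   # hypothetical volume name
vif        = [ "bridge=xenbr0" ]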
The Domain0 zone is indeed rock solid, while Web1 and the other guests crash daily with the 2.6.18-8.1.14 Xen kernel; the stack trace we see after a few hours is as follows:
BUG: soft lockup detected on CPU#5!
Call Trace:
 <IRQ> [<ffffffff802a76ad>] softlockup_tick+0xdb/0xed
 [<ffffffff8026ba66>] timer_interrupt+0x396/0x3f2
 [<ffffffff80210a87>] handle_IRQ_event+0x2d/0x60
 [<ffffffff802a79ec>] __do_IRQ+0xa4/0x105
 [<ffffffff802699b3>] do_IRQ+0xe7/0xf5
 [<ffffffff8038dde8>] evtchn_do_upcall+0x86/0xe0
 [<ffffffff8025cc1a>] do_hypervisor_callback+0x1e/0x2c
 <EOI> [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
 [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
 [<ffffffff8026afe2>] raw_safe_halt+0x84/0xa8
 [<ffffffff802684f8>] xen_idle+0x38/0x4a
 [<ffffffff80247bcd>] cpu_idle+0x97/0xba
BUG: soft lockup detected on CPU#7!
And so on. It does not crash the Xen zones directly, but it clogs up the web1 console and the like. We did not see this when running the 2.6.18-8.1.8 Xen kernel; instead, the zones crashed less frequently with an out-of-memory problem as follows:
Call Trace:
 [<ffffffff802aeefc>] out_of_memory+0x4e/0x1d3
 [<ffffffff8020efe8>] __alloc_pages+0x229/0x2b2
 [<ffffffff8023fd5b>] __lock_page+0x5e/0x64
 [<ffffffff80232637>] read_swap_cache_async+0x42/0xd1
 [<ffffffff802b32a2>] swapin_readahead+0x4e/0x77
 [<ffffffff8020929d>] __handle_mm_fault+0xae3/0xf46
 [<ffffffff80260709>] _spin_lock_irqsave+0x9/0x14
 [<ffffffff80262fe8>] do_page_fault+0xe48/0x11dc
 [<ffffffff80207138>] kmem_cache_free+0x77/0xca
 [<ffffffff8025cb6f>] error_exit+0x0/0x6e
From stumbling on some vague posts on the net, we think the whole problem is that the kernel fails to handle resource congestion (too many interrupts, heavy CPU and memory usage in the defined zones, etc.), for example:
http://article.gmane.org/gmane.comp.emulators.xen.user/26617
Is this problem already known to you, or is it new? Any ideas on how to resolve it?
Regards, Nicolas Sahlqvist, CouchSurfing.com
Hi Daniel,
So you are saying that we should run more zones with fewer CPUs each to avoid the problem? Well, that would not go too well with our RAM budget, so is there any work on a fix that you are aware of?
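If I understand the suggestion, it would mean splitting the load over more, smaller guests with a single vcpu each, roughly like this (the names and sizes below are only illustrative):

# /etc/xen/web1a, /etc/xen/web1b, ... each with something like:
memory = 1024
vcpus  = 1
# or, to just drop the vcpu count on the existing guest without reinstalling:
xm vcpu-set web1 1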
Regards, Nicolas Sahlqvist CouchSurfing.com
On 10/9/07, Daniel de Kok danieldk@pobox.com wrote:
Hi,
On Tue, 2007-10-09 at 00:40 +0200, Nicolas Sahlqvist wrote:
BUG: soft lockup detected on CPU#5!
I have seen this on domUs with vcpus set higher than 1.
-- Daniel
On Tue, 2007-10-09 at 10:47 +0200, Nicolas Sahlqvist wrote:
So you are saying that we should run more zones with fewer CPUs each to avoid the problem? Well, that would not go too well with our RAM budget, so is there any work on a fix that you are aware of?
Well, I was rather wondering if it is the same bug. Even if so, it should really be filed in our and the upstream bug trackers. I should have done that myself, but have had little time to do so.
-- Daniel
Hi Daniel,
I also have a full-time job, so I know how it is. At what URL can I find the bug tracker?
I think there are 2 bugs: one is the CPU lockup, and the 2nd is where Xen hangs under high load. This is what I saw when it hung last time:
printk: 220 messages suppressed.
httpd invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0
Call Trace:
 [<ffffffff802aee7d>] out_of_memory+0x4e/0x1d3
 [<ffffffff8020efe8>] __alloc_pages+0x229/0x2b2
 [<ffffffff8021298b>] __do_page_cache_readahead+0xd0/0x21c
 [<ffffffff802284d8>] sync_page+0x0/0x42
 [<ffffffff88082c61>] :dm_mod:dm_any_congested+0x38/0x3f
 [<ffffffff80213240>] filemap_nopage+0x148/0x322
 [<ffffffff80208b94>] __handle_mm_fault+0x3da/0xf46
 [<ffffffff802606e9>] _spin_lock_irqsave+0x9/0x14
 [<ffffffff80262fc8>] do_page_fault+0xe48/0x11dc
 [<ffffffff8022c50c>] mntput_no_expire+0x19/0x89
 [<ffffffff80245fbe>] sys_chdir+0x55/0x62
 [<ffffffff8025cb6f>] error_exit+0x0/0x6e
DMA per-cpu:
cpu 0 hot: high 186, batch 31 used:170
cpu 0 cold: high 62, batch 15 used:55
cpu 1 hot: high 186, batch 31 used:28
cpu 1 cold: high 62, batch 15 used:37
cpu 2 hot: high 186, batch 31 used:24
cpu 2 cold: high 62, batch 15 used:47
cpu 3 hot: high 186, batch 31 used:14
cpu 3 cold: high 62, batch 15 used:52
Call Trace:
 [<ffffffff802aee7d>] out_of_memory+0x4e/0x1d3
 [<ffffffff8020efe8>] __alloc_pages+0x229/0x2b2
 [<ffffffff8021298b>] __do_page_cache_readahead+0xd0/0x21c
 [<ffffffff8025f528>] __wait_on_bit_lock+0x5b/0x66
 [<ffffffff88082c61>] :dm_mod:dm_any_congested+0x38/0x3f
 [<ffffffff80213240>] filemap_nopage+0x148/0x322
 [<ffffffff80208b94>] __handle_mm_fault+0x3da/0xf46
 [<ffffffff802606e9>] _spin_lock_irqsave+0x9/0x14
 [<ffffffff80262fc8>] do_page_fault+0xe48/0x11dc
 [<ffffffff80233a31>] do_setitimer+0x45f/0x4c7
 [<ffffffff80245fbe>] sys_chdir+0x55/0x62
 [<ffffffff8025cb6f>] error_exit+0x0/0x6e
DMA per-cpu:
cpu 0 hot: high 186, batch 31 used:170
cpu 0 cold: high 62, batch 15 used:55
cpu 1 hot: high 186, batch 31 used:28
cpu 1 cold: high 62, batch 15 used:37
cpu 2 hot: high 186, batch 31 used:24
cpu 2 cold: high 62, batch 15 used:47
cpu 3 hot: high 186, batch 31 used:14
cpu 3 cold: high 62, batch 15 used:52
cpu 4 hot: high 186, batch 31 used:25
cpu 4 cold: high 62, batch 15 used:13
cpu 5 hot: high 186, batch 31 used:21
cpu 5 cold: high 62, batch 15 used:54
cpu 6 hot: high 186, batch 31 used:17
cpu 6 cold: high 62, batch 15 used:45
cpu 7 hot: high 186, batch 31 used:17
cpu 7 cold: high 62, batch 15 used:45
DMA32 per-cpu: empty
Normal per-cpu: empty
HighMem per-cpu: empty
Free pages: 6020kB (0kB HighMem)
Active:300020 inactive:204495 dirty:0 writeback:0 unstable:0 free:1505 slab:7395 mapped:2 pagetables:24274
DMA free:6020kB min:6020kB low:7524kB high:9028kB active:1200080kB inactive:817980kB present:2265088kB pages_scanned:19172661 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Normal free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
HighMem free:0kB min:128kB low:128kB high:128kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
DMA: 65*4kB 0*8kB 0*16kB 8*32kB 2*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 1*4096kB = 6020kB
DMA32: empty
Normal: empty
HighMem: empty
Swap cache: add 217269, delete 217269, find 2978/10987, race 0+1
Free swap = 0kB
Total swap = 557048kB
All swap is used, so there is no memory free. That would make a non-virtual box very slow (almost dead), but it would not crash like this, so could something be going wrong when swap pages are moved back and forth that causes this problem?
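For what it is worth, this is roughly what we check and what we could try as a stopgap; the sizes are only examples, and xm mem-set will only help if the guest's maxmem allows it:

# inside the guest: confirm swap really is exhausted
free -m
swapon -s
# stopgap 1: add a swap file inside the guest (example: 1 GB)
dd if=/dev/zero of=/swapfile bs=1M count=1024
mkswap /swapfile
swapon /swapfile
# stopgap 2: from dom0, give the guest more memory if maxmem permits
xm mem-set web1 3072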
Regards, Nicolas Sahlqvist CouchSurfing.com
On 10/9/07, Daniel de Kok danieldk@pobox.com wrote:
On Tue, 2007-10-09 at 10:47 +0200, Nicolas Sahlqvist wrote:
So you are saying that we should run more zones with fewer CPUs each to avoid the problem? Well, that would not go too well with our RAM budget, so is there any work on a fix that you are aware of?
Well, I was rather wondering if it is the same bug. Even if so, it should really be filed in our and the upstream bug trackers. I should have done that myself, but have had little time to do so.
-- Daniel