[CentOS-virt] VMs died due to hanging httpd processes

Sun Dec 12 14:40:35 UTC 2010
Dennis Jacobfeuerborn <dennisml at conversis.de>

About an hour ago, two web-serving VMs died at the same time with the
following error on the console:

INFO: task httpd:4304 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
httpd         D 00af1f714d1112e2     0  4304  22471          4305  4303 (NOTLB)
  ffff88006574bdc8  0000000000000282  00000000000041f8  ffff88006574bea8
  000000000000000a  ffff88009747b820  ffffffff804f4b00  00000000001a5eee
  ffff88009747ba08  ffff880095be5015
Call Trace:
  [<ffffffff8022d03c>] mntput_no_expire+0x19/0x89
  [<ffffffff8020eeae>] link_path_walk+0xa6/0xb2
  [<ffffffff80263a7e>] __mutex_lock_slowpath+0x60/0x9b
  [<ffffffff80223f33>] __path_lookup_intent_open+0x56/0x97
  [<ffffffff80263ac8>] .text.lock.mutex+0xf/0x14
  [<ffffffff8021b52d>] open_namei+0xea/0x6d5
  [<ffffffff8029cb30>] set_process_cpu_timer+0xc7/0xd2
  [<ffffffff80227caa>] do_filp_open+0x1c/0x38
  [<ffffffff8021a364>] do_sys_open+0x44/0xbe
  [<ffffffff802602f9>] tracesys+0xab/0xb6

Monitoring shows that within about 3 minutes the load on the systems
shot up to over 400 before they died. Since MaxClients is set to 512, I
suspect that the processes had a mass lockup, with each blocked process
constantly contributing 1 to the load average (similar to what happens
when a process hangs on an NFS mount point). One of the two VMs acts as
an NFS server and exports directories to the other VM (but doesn't mount
any external NFS sources itself).
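
To check the mass-lockup theory on a system that is still responsive, one could count the processes sitting in uninterruptible sleep (state D), since each of them adds 1 to the load average. The following is only a rough sketch, assuming Python 3 is available on the guest; the helper names parse_stat and count_blocked are made up for illustration, and the only thing it relies on is the documented layout of /proc/<pid>/stat.

```python
import os
from collections import Counter


def parse_stat(text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = text.index("(")
    right = text.rindex(")")
    pid = int(text[:left].strip())
    comm = text[left + 1:right]
    state = text[right + 1:].split()[0]
    return pid, comm, state


def count_blocked(proc_root="/proc"):
    """Count processes in state D (uninterruptible sleep), grouped by name."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                _, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or stat file was unreadable
        if state == "D":
            counts[comm] += 1
    return counts


if __name__ == "__main__":
    for comm, n in count_blocked().most_common():
        print(f"{n:5d} {comm}")
```

If several hundred httpd entries showed up here while the load climbed, that would be consistent with the suspicion above, though it would not by itself say what they are blocked on.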

What is strange is that both systems locked up at the same time, given
that they are running on two different physical hosts. The hosts run
CentOS 5.3 while the VMs run CentOS 5.5 as PV Xen guests.

Since the call trace looks identical in both cases, I wonder if anyone has
an idea what exactly went wrong here?
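
For anyone wanting to compare such traces mechanically rather than by eye, here is a small sketch, again assuming Python 3. It strips the addresses and offsets from each frame and keeps only the symbol names, so that two traces from different machines can be compared as plain lists. The function name trace_symbols is hypothetical, and the pattern only covers the frame format shown in the console output above.

```python
import re

# Matches frames like "[<ffffffff8022d03c>] mntput_no_expire+0x19/0x89"
# and captures only the symbol name.
FRAME = re.compile(r"\[<[0-9a-f]+>\]\s+(\S+?)\+0x[0-9a-f]+/0x[0-9a-f]+")


def trace_symbols(text):
    """Return the symbol names of a kernel call trace, in order."""
    return FRAME.findall(text)


if __name__ == "__main__":
    sample = """
      [<ffffffff80263a7e>] __mutex_lock_slowpath+0x60/0x9b
      [<ffffffff80263ac8>] .text.lock.mutex+0xf/0x14
      [<ffffffff8021b52d>] open_namei+0xea/0x6d5
    """
    print(trace_symbols(sample))
```

Two traces can then be treated as the same code path when trace_symbols returns equal lists for both, even if the addresses differ between the guests.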