[CentOS-virt] GFS2 hangs after one node going down

Thu Mar 21 23:34:09 UTC 2013

On 03/21/2013 02:09 PM, Maurizio Giungato wrote:
> On 21/03/2013 18:48, Maurizio Giungato wrote:
>> On 21/03/2013 18:14, Digimer wrote:
>>> On 03/21/2013 01:11 PM, Maurizio Giungato wrote:
>>>> Hi guys,
>>>>
>>>> My goal is to create a reliable virtualization environment using CentOS
>>>> 6.4 and KVM. I have three nodes and a clustered GFS2 file system.
>>>>
>>>> The environment is up and working, but I'm worried about its
>>>> reliability. If I turn the network interface down on one node to
>>>> simulate a crash (for example on the node "node6.blade"):
>>>>
>>>> 1) GFS2 hangs (processes go into D state) until node6.blade gets fenced
>>>> 2) not only node6.blade gets fenced, but also node5.blade!
>>>>
>>>> Help me to save my last neurons!
>>>>
>>>> Thanks
>>>> Maurizio
>>>
>>> DLM, the distributed lock manager provided by the cluster, is
>>> designed to block when a known node goes into an unknown state. It does
>>> not unblock until that node is confirmed to be fenced. This is by
>>> design. GFS2, rgmanager and clustered LVM all use DLM, so they will
>>> all block as well.
>>>
>>> As for why two nodes get fenced, you will need to share more about
>>> your configuration.
>>>
>> My configuration is very simple; I attached the cluster.conf and hosts
>> files. This is the line I added to /etc/fstab:
>> /dev/mapper/KVM_IMAGES-VL_KVM_IMAGES /var/lib/libvirt/images gfs2 defaults,noatime,nodiratime 0 0
>>
>> I also set fallback_to_local_locking = 0 in lvm.conf (but nothing changed)
>>
>> PS: I had two virtualization environments working like a charm on
>> OCFS2, but since CentOS 6.x I'm not able to install it. Is there some
>> way to achieve the same results with GFS2? With GFS2 I sometimes get a
>> crash after only a "service network restart" (I have many interfaces,
>> so this operation takes more than 10 seconds); with OCFS2 I never
>> had this problem.
>>
>> Thanks
> I attached my logs from /var/log/cluster/*

The configuration itself seems ok, though I think you can safely take 
qdisk out to simplify things. That's neither here nor there though.

This concerns me:

Mar 21 19:00:14 fenced fence lama6.blade dev 0.0 agent fence_bladecenter result: error from agent
Mar 21 19:00:14 fenced fence lama6.blade failed
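
If it helps to spot every failed fence attempt in the logs, something
like the following rough sketch could be used. It assumes the fenced
messages all look like the two lines quoted above; the regular
expression and the function name failed_fences are my own illustration,
not part of any cluster tool, and other fenced messages may be shaped
differently.

```python
import re

# Matches lines shaped like the two fenced messages quoted above.
# Assumption: timestamp, the literal "fenced fence", a node name, then
# free-form text. Other fenced messages may not fit this pattern.
FENCE_RE = re.compile(
    r"^(?P<when>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) fenced fence "
    r"(?P<node>\S+) (?P<rest>.*)$"
)


def failed_fences(lines):
    """Return (timestamp, node) pairs for fence attempts logged as failed."""
    failures = []
    for line in lines:
        match = FENCE_RE.match(line.strip())
        if match and match.group("rest") == "failed":
            failures.append((match.group("when"), match.group("node")))
    return failures


# The two log lines from this thread, each joined onto a single line.
sample = [
    "Mar 21 19:00:14 fenced fence lama6.blade dev 0.0 agent "
    "fence_bladecenter result: error from agent",
    "Mar 21 19:00:14 fenced fence lama6.blade failed",
]

for when, node in failed_fences(sample):
    print(f"{when}: fencing {node} failed")
```

To run it against real logs, pass an open file from /var/log/cluster/
to failed_fences() instead of the sample list.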

How are you triggering the failure(s)? The failed fence would certainly 
help explain the delays. As I mentioned earlier, DLM is designed to 
block when a node is in an unknown state (failed but not yet 
successfully fenced).
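
To make the blocking rule concrete, here is a toy model of it. This is
only a sketch of the behaviour described above, not how DLM is actually
implemented; the class ToyLockManager, its methods, and the third node
name (node4.blade) are all hypothetical.

```python
from enum import Enum


class NodeState(Enum):
    MEMBER = "member"    # healthy cluster member
    UNKNOWN = "unknown"  # contact lost, not yet successfully fenced
    FENCED = "fenced"    # fence agent reported success


class ToyLockManager:
    """Toy model of the blocking rule only; not a real DLM interface."""

    def __init__(self, nodes):
        self.state = {node: NodeState.MEMBER for node in nodes}

    def node_lost(self, node):
        # The cluster lost contact with the node: its state is now unknown.
        self.state[node] = NodeState.UNKNOWN

    def fence_result(self, node, success):
        # Only a successful fence clears the unknown state.
        # A failed fence leaves the node unknown, so blocking continues.
        if success:
            self.state[node] = NodeState.FENCED

    def can_grant_locks(self):
        # Lock requests block while any node is in the unknown state.
        return all(s is not NodeState.UNKNOWN for s in self.state.values())


# node4.blade is a made-up name for the third node.
lm = ToyLockManager(["node4.blade", "node5.blade", "node6.blade"])
print(lm.can_grant_locks())            # all members: locks flow
lm.node_lost("node6.blade")
print(lm.can_grant_locks())            # node unknown: everything blocks
lm.fence_result("node6.blade", False)  # fence agent returned an error
print(lm.can_grant_locks())            # still blocked
lm.fence_result("node6.blade", True)   # fence finally succeeds
print(lm.can_grant_locks())            # unblocked
```

The point of the model is the third step: a fence attempt that fails
changes nothing, so GFS2 and everything else that uses DLM stay hung
until a fence attempt succeeds.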

As an aside, I run my HA VMs using clustered LVM LVs as the backing 
storage behind the VMs. GFS2 is an excellent file system, but it is 
expensive in terms of overhead. Putting your VMs directly on LVs takes 
GFS2 out of the equation.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?