[CentOS-virt] GFS2 hangs after one node going down

Fri Mar 22 15:27:53 UTC 2013
Digimer <lists at alteeve.ca>

On 03/22/2013 11:21 AM, Maurizio Giungato wrote:
> On 22/03/2013 00:34, Digimer wrote:
>> On 03/21/2013 02:09 PM, Maurizio Giungato wrote:
>>> On 21/03/2013 18:48, Maurizio Giungato wrote:
>>>> On 21/03/2013 18:14, Digimer wrote:
>>>>> On 03/21/2013 01:11 PM, Maurizio Giungato wrote:
>>>>>> Hi guys,
>>>>>> my goal is to create a reliable virtualization environment using
>>>>>> CentOS 6.4 and KVM. I have three nodes and a clustered GFS2.
>>>>>> The environment is up and working, but I'm worried about its
>>>>>> reliability. If I turn the network interface down on one node to
>>>>>> simulate a crash (for example on the node "node6.blade"):
>>>>>> 1) GFS2 hangs (processes go into D state) until node6.blade gets fenced
>>>>>> 2) not only node6.blade gets fenced, but also node5.blade!
>>>>>> Help me to save my last neurons!
>>>>>> Thanks
>>>>>> Maurizio
>>>>> DLM, the distributed lock manager provided by the cluster, is
>>>>> designed to block when a known node goes into an unknown state. It does
>>>>> not unblock until that node is confirmed to be fenced. This is by
>>>>> design. GFS2, rgmanager and clustered LVM all use DLM, so they will
>>>>> all block as well.
>>>>> As for why two nodes get fenced, you will need to share more about
>>>>> your configuration.
>>>> My configuration is very simple; I attached the cluster.conf and hosts
>>>> files.
>>>> This is the row I added in /etc/fstab:
>>>> /dev/mapper/KVM_IMAGES-VL_KVM_IMAGES /var/lib/libvirt/images gfs2 defaults,noatime,nodiratime 0 0
>>>> I also set fallback_to_local_locking = 0 in lvm.conf (but nothing
>>>> changed)
>>>> PS: I had two virtualization environments working like a charm on
>>>> OCFS2, but since CentOS 6.x I'm not able to install it. Is there some
>>>> way to achieve the same results with GFS2? With GFS2 I sometimes get a
>>>> crash after only a "service network restart" (I have many interfaces,
>>>> so this operation takes more than 10 seconds); with OCFS2 I've never
>>>> had this problem.
>>>> Thanks
>>> I attached my logs from /var/log/cluster/*
>> The configuration itself seems ok, though I think you can safely take
>> qdisk out to simplify things. That's neither here nor there though.
>> This concerns me:
>> Mar 21 19:00:14 fenced fence lama6.blade dev 0.0 agent
>> fence_bladecenter result: error from agent
>> Mar 21 19:00:14 fenced fence lama6.blade failed
>> How are you triggering the failure(s)? The failed fence would
>> certainly help explain the delays. As I mentioned earlier, DLM is
>> designed to block when a node is in an unknown state (failed but not
>> yet successfully fenced).
>> As an aside; I do my HA VMs using clustered LVM LVs as the backing
>> storage behind the VMs. GFS2 is an excellent file system, but it is
>> expensive. Putting your VMs directly on the LV takes GFS2 out of the
>> equation.
> I used 'service network stop' to simulate the failure; the node gets
> fenced through fence_bladecenter (BladeCenter HW).
> Anyway, I took qdisk out and put GFS2 aside, and now I have my VMs on LVM
> LVs. I've been trying for many hours to reproduce the issue:
> - only the node where I execute 'service network stop' gets fenced
> - using fallback_to_local_locking = 0 in lvm.conf, the LVM LVs remain
> writable even while fencing takes place
> All seems to work like a charm now.
> I'd like to understand what was happening. I'll test for some days before
> trusting it.
> Thank you so much.
> Maurizio

Testing testing testing. It's good that you plan to test before 
trusting. I wish everyone had that philosophy!

The clustered locking for LVM comes into play for 
activating/inactivating, creating, deleting, resizing and so on. It does 
not affect what happens in an LV. That's why an LV remains writeable 
when a fence is pending. However, I feel this is safe because rgmanager 
won't recover a VM on another node until the lost node is fenced.
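
To make that distinction concrete, here is a rough toy model in Python. 
It is only a sketch of the rule described above, not how DLM, clvmd or 
rgmanager are actually implemented. The class and method names are made 
up for illustration, and "node4.blade" is a hypothetical name for the 
third node.

```python
# Toy model only: this is NOT real DLM, clvmd or rgmanager code.
from dataclasses import dataclass, field

# LVM operations that need the cluster-wide lock (the list given above).
METADATA_OPS = {"activate", "deactivate", "create", "delete", "resize"}


@dataclass
class ToyCluster:
    members: set
    # Nodes that have been lost but are not yet confirmed fenced.
    pending_fence: set = field(default_factory=set)

    def node_lost(self, node):
        self.pending_fence.add(node)

    def fence_succeeded(self, node):
        self.pending_fence.discard(node)
        self.members.discard(node)

    def can_run(self, op):
        """Metadata operations wait for fencing; I/O inside an LV does not."""
        if op in METADATA_OPS:
            return not self.pending_fence
        return True

    def can_recover_vm(self, lost_node):
        """A VM is restarted elsewhere only once its old host is fenced."""
        return (lost_node not in self.pending_fence
                and lost_node not in self.members)


# "node4.blade" is a hypothetical name; the thread names only node5 and node6.
cluster = ToyCluster(members={"node4.blade", "node5.blade", "node6.blade"})
cluster.node_lost("node6.blade")
print(cluster.can_run("write"))               # True: I/O in an active LV
print(cluster.can_run("resize"))              # False: needs the cluster lock
print(cluster.can_recover_vm("node6.blade"))  # False: fence still pending
cluster.fence_succeeded("node6.blade")
print(cluster.can_run("resize"))              # True
print(cluster.can_recover_vm("node6.blade"))  # True
```

The point of the model is the last method: writes continue while a fence 
is pending, but nothing restarts the VM elsewhere until the fence is 
confirmed, so two copies of the VM should never write to the same LV.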


Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?