[CentOS-virt] GFS2 hangs after one node going down
Maurizio Giungato
m.giungato at pixnamic.com
Fri Mar 22 15:21:51 UTC 2013
On 22/03/2013 00:34, Digimer wrote:
> On 03/21/2013 02:09 PM, Maurizio Giungato wrote:
>> On 21/03/2013 18:48, Maurizio Giungato wrote:
>>> On 21/03/2013 18:14, Digimer wrote:
>>>> On 03/21/2013 01:11 PM, Maurizio Giungato wrote:
>>>>> Hi guys,
>>>>>
>>>>> My goal is to create a reliable virtualization environment using
>>>>> CentOS 6.4 and KVM. I have three nodes and a clustered GFS2.
>>>>>
>>>>> The environment is up and working, but I'm worried about its
>>>>> reliability. If I turn the network interface down on one node to
>>>>> simulate a crash (for example on the node "node6.blade"):
>>>>>
>>>>> 1) GFS2 hangs (processes go into D state) until node6.blade gets fenced
>>>>> 2) not only node6.blade gets fenced, but also node5.blade!
>>>>>
>>>>> Help me to save my last neurons!
>>>>>
>>>>> Thanks
>>>>> Maurizio
>>>>
>>>> DLM, the distributed lock manager provided by the cluster, is
>>>> designed to block when a known node goes into an unknown state. It does
>>>> not unblock until that node is confirmed to be fenced. This is by
>>>> design. GFS2, rgmanager and clustered LVM all use DLM, so they will
>>>> all block as well.
>>>>
>>>> As for why two nodes get fenced, you will need to share more about
>>>> your configuration.
>>>>
>>> My configuration is very simple; I attached the cluster.conf and hosts
>>> files.
>>> This is the line I added to /etc/fstab:
>>> /dev/mapper/KVM_IMAGES-VL_KVM_IMAGES /var/lib/libvirt/images gfs2 defaults,noatime,nodiratime 0 0
>>>
>>> I also set fallback_to_local_locking = 0 in lvm.conf (but nothing
>>> changed).
>>>
>>> PS: I had two virtualization environments working like a charm on
>>> OCFS2, but since CentOS 6.x I'm not able to install it. Is there some
>>> way to achieve the same results with GFS2? (With GFS2 I sometimes get a
>>> crash after only a "service network restart" [I have many interfaces,
>>> so this operation takes more than 10 seconds]; with OCFS2 I never
>>> had this problem.)
>>>
>>> Thanks
>> I attached my logs from /var/log/cluster/*
>
> The configuration itself seems ok, though I think you can safely take
> qdisk out to simplify things. That's neither here nor there though.
>
> This concerns me:
>
> Mar 21 19:00:14 fenced fence lama6.blade dev 0.0 agent
> fence_bladecenter result: error from agent
> Mar 21 19:00:14 fenced fence lama6.blade failed
>
> How are you triggering the failure(s)? The failed fence would
> certainly help explain the delays. As I mentioned earlier, DLM is
> designed to block when a node is in an unknown state (failed but not
> yet successfully fenced).
>
> As an aside; I do my HA VMs using clustered LVM LVs as the backing
> storage behind the VMs. GFS2 is an excellent file system, but it is
> expensive. Putting your VMs directly on the LV takes them out of the
> equation.
I used 'service network stop' to simulate the failure; the node gets
fenced through fence_bladecenter (BladeCenter HW).
Anyway, I took qdisk out and put GFS2 aside, and now I have my VMs on LVM
LVs. I have been trying for many hours to reproduce the issue:
- only the node where I execute 'service network stop' gets fenced
- with fallback_to_local_locking = 0 in lvm.conf, the LVM LVs remain
  writable even while fencing takes place
All seems to work like a charm now.
I'd like to understand what was happening. I'll test it for some days
before trusting it.
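
For reference, the blocking behaviour Digimer described in the quoted text (lock requests stall while a node is failed but not yet successfully fenced) can be modelled roughly as below. This is only a toy sketch in Python, not the real DLM or fenced interface: the class ToyLockManager, its methods, and the node name "other.blade" are all hypothetical, invented for illustration.

```python
import threading


class ToyLockManager:
    """Toy model only. This is NOT the real DLM API."""

    def __init__(self, nodes):
        self.nodes = set(nodes)
        # Nodes that have failed but whose fencing is not yet confirmed.
        self.unfenced_failures = set()
        self._cond = threading.Condition()

    def node_failed(self, node):
        # A known node has gone into an unknown state.
        with self._cond:
            self.unfenced_failures.add(node)

    def fence_result(self, node, success):
        # Only a *successful* fence clears the failure and wakes waiters.
        # A failed fence ("error from agent") leaves everything blocked.
        with self._cond:
            if success:
                self.unfenced_failures.discard(node)
                self.nodes.discard(node)
                self._cond.notify_all()

    def acquire(self, timeout=None):
        # Block until no failed node is waiting to be fenced.
        # Returns True if the lock was granted, False on timeout.
        with self._cond:
            return self._cond.wait_for(
                lambda: not self.unfenced_failures, timeout
            )


lm = ToyLockManager(["node5.blade", "node6.blade", "other.blade"])

lm.node_failed("node6.blade")
blocked = not lm.acquire(timeout=0.1)        # request stalls

lm.fence_result("node6.blade", success=False)
still_blocked = not lm.acquire(timeout=0.1)  # failed fence: still stalled

lm.fence_result("node6.blade", success=True)
granted = lm.acquire(timeout=0.1)            # fence confirmed: granted
```

In this model a failed fence attempt, like the fence_bladecenter error in the log above, keeps every lock request waiting, which is consistent with the processes stuck in D state on GFS2.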
Thank you so much.
Maurizio