Hi guys,
my goal is to create a reliable virtualization environment using CentOS 6.4 and KVM. I have three nodes and a clustered GFS2 filesystem.
The environment is up and working, but I'm worried about its reliability. If I turn the network interface down on one node to simulate a crash (for example on the node "node6.blade"):
1) GFS2 hangs (processes go into the D state) until node6.blade gets fenced
2) not only node6.blade gets fenced, but also node5.blade!
Help me to save my last neurons!
Thanks Maurizio
On 03/21/2013 01:11 PM, Maurizio Giungato wrote:
DLM, the distributed lock manager provided by the cluster, is designed to block when a node goes into an unknown state. It does not unblock until that node is confirmed to be fenced. This is by design. GFS2, rgmanager and clustered LVM all use DLM, so they will all block as well.
As for why two nodes get fenced, you will need to share more about your configuration.
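If you want to watch this happening, the usual cman-stack tools show the pending fence and the blocked lockspaces from a surviving node. A rough sketch (these are the standard EL6 cluster tools; the exact output varies):

  fence_tool ls   # fence domain membership and whether a fence is pending
  dlm_tool ls     # DLM lockspaces (clvmd, each GFS2 mount) and their state
  group_tool ls   # combined fence/dlm/gfs group view

Once fenced reports the victim as successfully fenced, the lockspaces recover and the D-state processes continue.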
On 21/03/2013 18:14, Digimer wrote:
My configuration is very simple; I attached the cluster.conf and hosts files. This is the line I added to /etc/fstab:
/dev/mapper/KVM_IMAGES-VL_KVM_IMAGES /var/lib/libvirt/images gfs2 defaults,noatime,nodiratime 0 0
I also set fallback_to_local_locking = 0 in lvm.conf (but nothing changed).
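For reference, the relevant part of lvm.conf looks roughly like this (a sketch assuming the usual clvmd setup; the fallback_to_local_locking line is the only change I made):

  # /etc/lvm/lvm.conf, global section (sketch)
  global {
      locking_type = 3                 # clustered locking via clvmd
      fallback_to_local_locking = 0    # do not silently fall back to local locking
  }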
PS: I had two virtualization environments working like a charm on OCFS2, but since CentOS 6.x I'm not able to install it. Is there some way to achieve the same results with GFS2? (With GFS2 I sometimes get a crash after just a "service network restart" [I have many interfaces, so this operation takes more than 10 seconds]; with OCFS2 I never had this problem.)
Thanks
On 21/03/2013 18:48, Maurizio Giungato wrote:
I attached my logs from /var/log/cluster/*
On 03/21/2013 02:09 PM, Maurizio Giungato wrote:
The configuration itself seems ok, though I think you can safely take qdisk out to simplify things. That's neither here nor there though.
This concerns me:
Mar 21 19:00:14 fenced fence lama6.blade dev 0.0 agent fence_bladecenter result: error from agent
Mar 21 19:00:14 fenced fence lama6.blade failed
How are you triggering the failure(s)? The failed fence would certainly help explain the delays. As I mentioned earlier, DLM is designed to block when a node is in an unknown state (failed but not yet successfully fenced).
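A quick way to rule out the fence device itself is to run the agent by hand from a healthy node and see whether it can query the blade. This is only a sketch; the option letters are the standard fence-agent ones, and the address, credentials and blade number are placeholders, not values from your cluster.conf:

  # hypothetical manual test of the BladeCenter fence agent
  fence_bladecenter -a <amm-address> -l <login> -p <password> -n <blade-number> -o status

If "status" works reliably but the fence action still errors out during a real failure, the problem is more likely timing or connectivity to the management module at that moment.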
As an aside, I do my HA VMs using clustered LVM LVs as the backing storage behind the VMs. GFS2 is an excellent file system, but it is expensive. Putting your VMs directly on LVs takes GFS2 out of the equation.
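Roughly, the VM definition then points its disk at the LV device node instead of an image file on GFS2. A sketch, with made-up VG/LV names:

  <!-- libvirt domain XML fragment: disk backed directly by a clustered LV (example names) -->
  <disk type='block' device='disk'>
    <driver name='qemu' type='raw' cache='none'/>
    <source dev='/dev/KVM_VG/vm01_disk'/>
    <target dev='vda' bus='virtio'/>
  </disk>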
On 22/03/2013 00:34, Digimer wrote:
I used 'service network stop' to simulate the failure; the node gets fenced through fence_bladecenter (BladeCenter hardware).
Anyway, I took qdisk out, put GFS2 aside, and now I have my VMs on LVM LVs. I have been trying for many hours to reproduce the issue:
- only the node where I execute 'service network stop' gets fenced
- with fallback_to_local_locking = 0 in lvm.conf, the LVM LVs remain writable even while fencing takes place
All seems to work like a charm now.
I'd like to understand what was happening. I'll test it for some days before trusting it.
Thank you so much. Maurizio
On 03/22/2013 11:21 AM, Maurizio Giungato wrote:
Testing testing testing. It's good that you plan to test before trusting. I wish everyone had that philosophy!
The clustered locking for LVM comes into play for activating/deactivating, creating, deleting, resizing and so on. It does not affect what happens inside an LV. That's why an LV remains writeable while a fence is pending. However, I feel this is safe because rgmanager won't recover a VM on another node until the lost node is fenced.
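To make that concrete, it is metadata and activation operations like the following that go through clvmd/DLM and will block while a fence is outstanding (the VG/LV names are only illustrative):

  vgcreate -cy KVM_VG /dev/mapper/shared_lun    # -cy marks the VG as clustered
  lvcreate -L 20G -n vm01_disk KVM_VG           # create an LV for a VM
  lvchange -aey KVM_VG/vm01_disk                # activate it exclusively on this node
  lvextend -L +5G KVM_VG/vm01_disk              # resize

Plain reads and writes to an already-activated LV do not take new DLM locks, which is why the guests keep running.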
Cheers
On 22/03/2013 16:27, Digimer wrote:
Thank you very much! The cluster continues to work like a charm. Failure after failure, I mean :)
We are not using rgmanager's fault management because it doesn't check memory availability on the destination node, so we prefer to manage this situation with a custom script we wrote.
Last questions:
- do you have any advice to improve the tolerance against network failures?
- to avoid having a GFS2 just for the VMs' XML files, I've thought of keeping them on each node synced with rsync. Any alternatives?
- if I want to have only clustered LVM and no other functions, can you advise on a minimal configuration? (for example, I think rgmanager is not necessary)
Thank you in advance
On 03/25/2013 08:44 AM, Maurizio Giungato wrote:
For network redundancy, I use two switches and bonded (mode=1) links with one link going to either switch. This way, losing a NIC or a switch won't break the cluster. Details here:
https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Network
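As a rough sketch of what that looks like with EL6 ifcfg files (device names and the address are placeholders; the tutorial above has the full version):

  # /etc/sysconfig/network-scripts/ifcfg-bond0 (sketch)
  DEVICE=bond0
  BONDING_OPTS="mode=1 miimon=100"
  BOOTPROTO=none
  ONBOOT=yes
  IPADDR=10.20.0.5
  NETMASK=255.255.255.0

  # /etc/sysconfig/network-scripts/ifcfg-eth0 (one slave; eth1 is the same but cabled to the second switch)
  DEVICE=eth0
  MASTER=bond0
  SLAVE=yes
  BOOTPROTO=none
  ONBOOT=yes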
Using rsync to keep the XML files in sync is fine, if you really don't want to use GFS2.
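Something along these lines would do it (the path is the libvirt default for persistent domain XML; the target host is just an example taken from your thread):

  # push this node's VM definitions to a peer (sketch; run from cron or a hook script)
  rsync -av /etc/libvirt/qemu/ node5.blade:/etc/libvirt/qemu/

Keep in mind libvirtd generally won't notice files dropped into that directory until they are 'virsh define'd there (or libvirtd is restarted).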
You do not need rgmanager for clvmd to work. All you need is the base cluster.conf (and working fencing, as you've seen).
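A stripped-down cluster.conf for cman + fencing + clvmd only would look roughly like this (a sketch: the cluster name, AMM address and credentials are placeholders, and I'm reusing your node names purely for illustration):

  <?xml version="1.0"?>
  <cluster name="blade-cluster" config_version="1">
    <clusternodes>
      <clusternode name="node5.blade" nodeid="1">
        <fence>
          <method name="1">
            <device name="bladecenter" port="5"/>
          </method>
        </fence>
      </clusternode>
      <clusternode name="node6.blade" nodeid="2">
        <fence>
          <method name="1">
            <device name="bladecenter" port="6"/>
          </method>
        </fence>
      </clusternode>
      <!-- third node omitted for brevity; same pattern -->
    </clusternodes>
    <fencedevices>
      <fencedevice name="bladecenter" agent="fence_bladecenter" ipaddr="amm.example" login="USERID" passwd="secret"/>
    </fencedevices>
  </cluster>

With that plus 'service cman start' and 'service clvmd start' you have clustered LVM and nothing else.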
If you are over-provisioning VMs and need to worry about memory on target systems, then you might want to take a look at pacemaker. It's currently in tech preview and is expected to replace rgmanager in RHEL 7 (well, nothing is guaranteed 'til release day). Pacemaker is designed, as I understand it, to handle conditions like yours. Further, it is *much* better tested than anything you roll yourself. You can use clvmd with pacemaker by tying cman into pacemaker.
digimer
On 25/03/2013 17:49, Digimer wrote:
Perfect, I have the same network configuration. On the other cluster I have four switches, so I could create two bonds and dedicate one to corosync; that's why I was afraid a single bond here would be too little ;)
Thank you again
It's not related to your problem, just a note: when you use the noatime mount option in fstab, you do not need nodiratime as well, because noatime takes care of both.
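So the fstab line from earlier in the thread could simply become:

  /dev/mapper/KVM_IMAGES-VL_KVM_IMAGES /var/lib/libvirt/images gfs2 defaults,noatime 0 0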
Zoltan