On 11/04/2012 10:48 AM, Cris Rhea wrote:
One of the nodes will be barking about trying to fence the "failed" node (expected, as I don't have real fencing).
This is your problem. Without fencing, DLM (which is required for clustered LVM, GFS2 and rgmanager) is designed to block when a fence is called and stay blocked until the fence succeeds. Why it was called is secondary; even if a human calls the fence and everything is otherwise working fine, the cluster will hang.
This is by design. "A hung cluster is better than a corrupt cluster".
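You can watch that blocked state from the shell while it's hung; a rough sketch, assuming the stock EL5 cman tools on one of the surviving nodes:

  group_tool ls      # the fence group will show a non-"none" state while the fence is pending
  cman_tool nodes    # the "failed" node will be listed as down even though it's still running
  fence_ack_manual -e -n <failed node>   # what you're already doing to clear it by hand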
Nothing (physically) has actually failed. All 4 blades are running their VMs and accessing the shared back-end storage. No network burps, etc. CLVM is unusable and all LVM commands hang.
Here's where it gets really strange.....
Go to the blade that has "failed" and shut down all the VMs (just to be safe-- they are all running fine).
Go to the node that tried to fence the failed node and run: fence_ack_manual -e -n <failed node>
The 3 "surviving" nodes are instantly fine. group_tool reports "none" for state and all CLVM commands work properly.
OK, now reboot the "failed" node. It reboots and rejoins the cluster. CLVM commands work, but are slow. Lots of these errors: openais[7154]: [TOTEM] Retransmit List: fe ff
Only time I've seen this happen is when something starves a node (slow network, loaded cpu, insufficient ram...).
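If you want to rule that out while the retransmits are happening, nothing cluster-specific is needed; a sketch, with eth0 standing in for whichever interface carries the cluster traffic:

  vmstat 1 10                               # look for swapping, run-queue spikes or blocked processes
  ifconfig eth0 | grep -i -e error -e drop  # RX/TX errors or drops on the cluster interface
  ethtool -S eth0 | grep -i -e err -e drop  # per-driver counters, if the driver exposes them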
- This goes on for about 10 minutes and the whole cycle repeats (one of the other nodes will "fail"...)
If a totem packet fails to return from a node within a set period of time more than a set number of times in a row, the node is declared lost and a fence action is initiated.
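For what it's worth, both knobs have names in openais terms: the timeout is "token" and the retry count is "token_retransmits_before_loss_const". On EL5 I believe the token timeout can be set in cluster.conf and cman passes it through to openais; a sketch only, the value is an example and not a recommendation:

  <totem token="30000"/>   <!-- token timeout in milliseconds -->

Raising the token timeout only papers over whatever is delaying the packets, though; it's not a fix.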
What I've tried:
- Switches have IGMP snooping disabled. This is a simple config, so no switch-to-switch multicast is needed (all cluster/multicast traffic stays on the blade enclosure switch). I've had the cluster messages use the front-end net and the back-end net (different switch model)-- no change in behavior.
Are the multicast groups static? Is STP disabled?
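Worth checking what the nodes have actually joined, too; a quick sketch:

  cman_tool status    # among other things, shows the multicast address the cluster is using
  cat /proc/net/igmp  # per-interface IGMP group membership as the kernel sees it
  netstat -g          # the same membership info in friendlier form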
Running the latest RH patched openais from their Beta channel: openais-0.80.6-37.el5 (yes, I actually have RH licenses, just prefer CentOS for various reasons).
Tested multicast by enabling multicast/ICMP and running multicast pings. Ran with no data loss for > 10 minutes. (Per an IBM Tech web site article-- the end of which says it's almost always a network problem.)
I'm leaning to a network problem, too.
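If you want to re-run that test while the cluster is loaded, it's roughly this; the group address and interface below are placeholders-- use whatever cman_tool status reports:

  # on every node, allow replies to broadcast/multicast echo requests
  sysctl -w net.ipv4.icmp_echo_ignore_broadcasts=0
  # from one node, ping the cluster's multicast group out the cluster interface
  ping -I eth0 239.192.1.1

Every other node should answer (shown as DUP! replies); gaps or missing nodes point at the switch.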
- Tried various configuration changes such as defining two rings. No change (or it got worse-- the dual-ring config triggers a kernel bug).
I don't think RRP worked in EL5... Maybe it does now?
I've read about every article I can find... they usually fall into two camps:
- Multiple years old, so no idea whether they're accurate with today's versions of the software.
- People trying to do 2-node clusters and wondering why they lose quorum.
My feeling is that this is an openais issue-- the ring gets out of sync and can't fix itself. This system needs to go into "production" in a matter of weeks, so I'm not sure I want to dive into doing some sort of custom-compiled CLVM/Corosync/Pacemaker config. Since this software has been around for years, I believe it's something simple I'm just missing.
Thanks for any help/ideas...
--- Cris
First and foremost: get fencing working. At the very least, a lost node will reboot and the cluster will recover as designed. It's amazing how many problems "just go away" once fencing is properly configured. Please read this:
https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Concept.3B_Fencing
It's for EL6, but it applies to EL5 exactly the same (corosync being the functional replacement for openais).
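Since these are blades, the enclosure's management module is usually the easiest fence device. A sketch only-- the agent, address and credentials below are placeholders for whatever your chassis actually provides (fence_bladecenter shown because you mentioned IBM; fence_ipmilan or similar would do for other hardware):

  <!-- fragments of cluster.conf -->
  <clusternode name="node1" nodeid="1">
    <fence>
      <method name="1">
        <device name="chassis" port="1"/>   <!-- port = blade bay; check the agent's man page -->
      </method>
    </fence>
  </clusternode>
  <!-- ...repeat the <fence> block for the other three nodes... -->
  <fencedevices>
    <fencedevice name="chassis" agent="fence_bladecenter" ipaddr="10.0.0.10" login="USERID" passwd="PASSW0RD"/>
  </fencedevices>

Test it with 'fence_node <node name>' from another node before trusting it, and make sure every node can reach the management module even when its main network is unhappy.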