On 11/04/2012 10:48 AM, Cris Rhea wrote:
> One of the nodes will be barking about trying to fence the "failed" node
> (expected, as I don't have real fencing).

This is your problem. Without fencing, DLM (which is required for clustered
LVM, GFS2 and rgmanager) is designed to block when a fence is called and to
stay blocked until the fence succeeds. Why the fence was called is secondary;
even if a human calls the fence and everything is otherwise working fine, the
cluster will hang. This is by design: "A hung cluster is better than a
corrupt cluster".

> Nothing (physically) has actually failed. All 4 blades are running their
> VMs and accessing the shared back-end storage. No network burps, etc.
> CLVM is unusable and any LVM commands hang.
>
> Here's where it gets really strange.....
>
> 1. Go to the blade that has "failed" and shut down all the VMs (just to be
>    safe-- they are all running fine).
>
> 2. Go to the node that tried to fence the failed node and run:
>    fence_ack_manual -e -n <failed node>
>
> 3. The 3 "surviving" nodes are instantly fine. group_tool reports "none"
>    for state and all CLVM commands work properly.
>
> 4. OK, now reboot the "failed" node. It reboots and rejoins the cluster.
>    CLVM commands work, but are slow. Lots of these errors:
>    openais[7154]: [TOTEM] Retransmit List: fe ff

The only time I've seen this happen is when something starves a node (slow
network, loaded CPU, insufficient RAM...).

> 5. This goes on for about 10 minutes and the whole cycle repeats (one of
>    the other nodes will "fail"...)

If a totem packet fails to return from a node within a set period of time,
more than a set number of times in a row, the node is declared lost and a
fence action is initiated. (A sketch of the relevant totem settings is
further down in this reply.)

> What I've tried:
>
> 1. Switches have IGMP snooping disabled. This is a simple config, so
>    no switch-to-switch multicast is needed (all cluster/multicast traffic
>    stays on the blade enclosure switch). I've had the cluster
>    messages use the front-end net and the back-end net (different switch
>    model)-- no change in behavior.

Are the multicast groups static? Is STP disabled?

> 2. Running the latest RH patched openais from their Beta channel:
>    openais-0.80.6-37.el5 (yes, I actually have RH licenses, just prefer
>    CentOS for various reasons).
>
> 3. Tested multicast by enabling multicast/ICMP and running multicast
>    pings. Ran with no data loss for > 10 minutes. (IBM Tech web site
>    article-- where the end of the article says it's almost always a
>    network problem.)

I'm leaning toward a network problem, too.

> 4. Tried various configuration changes such as defining two rings. No
>    change (or got worse-- the dual-ring config triggers a kernel bug).

I don't think RRP worked in EL5... Maybe it does now?

> I've read about every article I can find... they usually fall into
> two camps:
>
> 1. Multiple years old, so no idea if they're accurate with today's
>    versions of the software.
>
> 2. People trying to do 2-node clusters and wondering why they lose quorum.
>
> My feeling is that this is an openais issue-- the ring gets out of sync
> and can't fix itself. This system needs to go "production" in a matter of
> weeks, so I'm not sure I want to dive into doing some sort of
> custom-compiled CLVM/Corosync/Pacemaker config. Since this software has
> been around for years, I believe it's something simple I'm just missing.
>
> Thanks for any help/ideas...
>
> --- Cris

First and foremost: get fencing working. At the very least, a lost node will
reboot and the cluster will recover as designed.
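As a rough illustration only (hostnames, device names, IPs and credentials
below are placeholders, and I'm assuming the blades expose IPMI -- an agent
like fence_bladecenter may map better to a blade chassis, so check its man
page for the per-blade options), the fencing bits of
/etc/cluster/cluster.conf look something like this:

  <clusternodes>
    <clusternode name="node1.example.com" nodeid="1">
      <fence>
        <method name="1">
          <!-- points at the matching fencedevice entry below -->
          <device name="ipmi_node1"/>
        </method>
      </fence>
    </clusternode>
    <!-- repeat for the other three blades -->
  </clusternodes>
  <fencedevices>
    <!-- one entry per node's management/IPMI interface -->
    <fencedevice agent="fence_ipmilan" name="ipmi_node1"
                 ipaddr="10.20.0.1" login="admin" passwd="secret"/>
  </fencedevices>

Bump config_version, push it out with 'ccs_tool update
/etc/cluster/cluster.conf', then test by fencing a node on purpose with
'fence_node <nodename>' and confirm it really power-cycles.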
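On the totem timings mentioned above: the knobs behind that "declared lost"
logic are the token timeout and the retransmit count. They are normally left
at their defaults, and I believe cluster.conf passes something like the
following through to openais (the values here are examples only, not a
recommendation -- check your version's cluster.conf schema):

  <totem token="21000" token_retransmits_before_loss_const="4"/>

Raising the token timeout only buys time, though; steady "Retransmit List"
messages almost always point back at the network or at a starved node, so
treat tuning as a band-aid at best.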
It's amazing how many problems "just go away" once fencing is properly
configured.

Please read this:

https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Concept.3B_Fencing

It's for EL6, but it applies to EL5 exactly the same (corosync being the
functional replacement for openais).

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?