On 11/04/2012 10:48 AM, Cris Rhea wrote:
> One of the nodes will be barking about trying to fence the "failed" node
> (expected, as I don't have real fencing).

This is your problem. Without fencing, DLM (which is required for clustered
LVM, GFS2 and rgmanager) is designed to block when a fence is called and to
stay blocked until the fence succeeds. Why the fence was called is secondary;
even if a human calls the fence and everything is otherwise working fine, the
cluster will hang. This is by design: "A hung cluster is better than a
corrupt cluster".

> Nothing (physically) has actually failed. All 4 blades are running their
> VMs and accessing the shared back-end storage. No network burps, etc.
> CLVM is unusable and any LVM commands hang.
>
> Here's where it gets really strange.....
>
> 1. Go to the blade that has "failed" and shut down all the VMs (just to be
>    safe-- they are all running fine).
>
> 2. Go to the node that tried to fence the failed node and run:
>    fence_ack_manual -e -n <failed node>
>
> 3. The 3 "surviving" nodes are instantly fine. group_tool reports "none"
>    for state and all CLVM commands work properly.
>
> 4. OK, now reboot the "failed" node. It reboots and rejoins the cluster.
>    CLVM commands work, but are slow. Lots of these errors:
>    openais[7154]: [TOTEM] Retransmit List: fe ff

The only time I've seen this happen is when something starves a node (slow
network, loaded CPU, insufficient RAM...).

> 5. This goes on for about 10 minutes and the whole cycle repeats (one of
>    the other nodes will "fail"...)

If a totem packet fails to return from a node within a set period of time,
more than a set number of times in a row, the node is declared lost and a
fence action is initiated. (A sketch of the relevant totem settings is
further down in this reply.)

> What I've tried:
>
> 1. Switches have IGMP snooping disabled. This is a simple config, so
>    no switch-to-switch multicast is needed (all cluster/multicast traffic
>    stays on the blade enclosure switch). I've had the cluster
>    messages use the front-end net and the back-end net (different switch
>    model)-- no change in behavior.

Are the multicast groups static? Is STP disabled?

> 2. Running the latest RH patched openais from their Beta channel:
>    openais-0.80.6-37.el5 (yes, I actually have RH licenses, just prefer
>    CentOS for various reasons).
>
> 3. Tested multicast by enabling multicast/ICMP and running multicast
>    pings. Ran with no data loss for > 10 minutes. (IBM Tech web site
>    article-- where the end of the article says it's almost always a
>    network problem.)

I'm leaning toward a network problem, too.

> 4. Tried various configuration changes such as defining two rings. No
>    change (or got worse-- the dual-ring config triggers a kernel bug).

I don't think RRP worked in EL5... Maybe it does now?

> I've read about every article I can find... they usually fall into
> two camps:
>
> 1. Multiple years old, so no idea if they're accurate with today's
>    versions of the software.
>
> 2. People trying to do 2-node clusters and wondering why they lose quorum.
>
> My feeling is that this is an openais issue-- the ring gets out of sync
> and can't fix itself. This system needs to go "production" in a matter of
> weeks, so I'm not sure I want to dive into doing some sort of
> custom-compiled CLVM/Corosync/Pacemaker config. Since this software has
> been around for years, I believe it's something simple I'm just missing.
>
> Thanks for any help/ideas...
>
> --- Cris

First and foremost: get fencing working. At the very least, a lost node will
reboot and the cluster will recover as designed.
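As a rough illustration only (hostnames, device names, IPs and credentials
below are placeholders, and I'm assuming the blades expose IPMI -- an agent
like fence_bladecenter may map better to a blade chassis, so check its man
page for the per-blade options), the fencing bits of
/etc/cluster/cluster.conf look something like this:

  <clusternodes>
    <clusternode name="node1.example.com" nodeid="1">
      <fence>
        <method name="1">
          <!-- points at the matching fencedevice entry below -->
          <device name="ipmi_node1"/>
        </method>
      </fence>
    </clusternode>
    <!-- repeat for the other three blades -->
  </clusternodes>
  <fencedevices>
    <!-- one entry per node's management/IPMI interface -->
    <fencedevice agent="fence_ipmilan" name="ipmi_node1"
                 ipaddr="10.20.0.1" login="admin" passwd="secret"/>
  </fencedevices>

Bump config_version, push it out with 'ccs_tool update
/etc/cluster/cluster.conf', then test by fencing a node on purpose with
'fence_node <nodename>' and confirm it really power-cycles.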
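On the totem timings mentioned above: the knobs behind that "declared lost"
logic are the token timeout and the retransmit count. They are normally left
at their defaults, and I believe cluster.conf passes something like the
following through to openais (the values here are examples only, not a
recommendation -- check your version's cluster.conf schema):

  <totem token="21000" token_retransmits_before_loss_const="4"/>

Raising the token timeout only buys time, though; steady "Retransmit List"
messages almost always point back at the network or at a starved node, so
treat tuning as a band-aid at best.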
It's amazing how many problems "just go away" once fencing is properly
configured.

Please read this:

https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Concept.3B_Fencing

It's for EL6, but it applies to EL5 exactly the same (corosync being the
functional replacement for openais).

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?