[CentOS] Problem with CLVM (really openais)

I'm desperately looking for more ideas on how to debug what's going on
with our CLVM cluster. 

Background:

4-node "cluster": the machines are Dell blades with Dell M6220/M6348 switches.
The sole purpose of the Cluster Suite tools is to run CLVM against an iSCSI
storage array.

Machines are running CentOS 5.8 with the Xen kernels. These blades host
various VMs for a project. The iSCSI back-end storage hosts the disk
images for the VMs. The blades themselves run from local disk.

Each blade has 3 active networks:

-- Front-end, public net.
-- Back-end net (backups, Database connections to external servers,
                cluster communication)
-- iSCSI net

Front and Back nets are on Xen Bridges and available/used by the VMs. 
iSCSI net only used by Dom0/blades.

Originally got CLVM working by installing Luci/Ricci. Cluster config
is dead-simple:

<?xml version="1.0"?>
<cluster alias="Alliance Blades" config_version="6" name="Alliance Blades">
    <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
    <clusternodes>
	<clusternode name="calgb-blade1-mn.mayo.edu" nodeid="1" votes="1">
	    <fence/>
	</clusternode>
	<clusternode name="calgb-blade2-mn.mayo.edu" nodeid="3" votes="1">
	    <fence/>
	</clusternode>
	<clusternode name="calgb-blade3-mn.mayo.edu" nodeid="4" votes="1">
	    <fence/>
	</clusternode>
	<clusternode name="calgb-blade4-mn.mayo.edu" nodeid="2" votes="1">
	    <fence/>
	</clusternode>
    </clusternodes>
    <cman/>
    <fencedevices/>
    <rm>
	<failoverdomains/>
	<resources/>
    </rm>
</cluster>

All the basics are covered... LVM locking set to "Cluster", etc.

This all worked fine for a period (before 5.8)... at some point, an update
broke it.

I can bring up the entire cluster and it will work for about 10 minutes.
During that time, I'll get the following on several of the nodes:

Nov  3 17:28:18 calgb-blade2 openais[7154]: [TOTEM] Retransmit List: fe
Nov  3 17:28:49 calgb-blade2 last message repeated 105 times
Nov  3 17:29:50 calgb-blade2 last message repeated 154 times
Nov  3 17:30:51 calgb-blade2 last message repeated 154 times
Nov  3 17:31:52 calgb-blade2 last message repeated 154 times
Nov  3 17:32:53 calgb-blade2 last message repeated 154 times
Nov  3 17:33:54 calgb-blade2 last message repeated 154 times
Nov  3 17:34:55 calgb-blade2 last message repeated 154 times
Nov  3 17:35:56 calgb-blade2 last message repeated 154 times
Nov  3 17:36:36 calgb-blade2 last message repeated 105 times
Nov  3 17:36:36 calgb-blade2 openais[7154]: [TOTEM] Retransmit List: fe ff
Nov  3 17:37:07 calgb-blade2 last message repeated 104 times
Nov  3 17:38:08 calgb-blade2 last message repeated 154 times
Nov  3 17:39:09 calgb-blade2 last message repeated 154 times
Nov  3 17:40:10 calgb-blade2 last message repeated 154 times
Nov  3 17:41:11 calgb-blade2 last message repeated 154 times
Nov  3 17:42:12 calgb-blade2 last message repeated 154 times
Nov  3 17:43:13 calgb-blade2 last message repeated 154 times
Nov  3 17:44:14 calgb-blade2 last message repeated 154 times
Nov  3 17:44:24 calgb-blade2 last message repeated 26 times
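
To put a number on the retransmit rate, the syslog excerpt above can be
tallied with a small script. This is only a sketch: it assumes a log from a
single host, so that every "last message repeated N times" line refers to the
message directly before it. The function name is made up for illustration.

```python
import re

# An openais TOTEM retransmit message as it appears in syslog.
RETRANSMIT = re.compile(r"\[TOTEM\] Retransmit List:")
# syslog's compression line for consecutive identical messages.
REPEATED = re.compile(r"last message repeated (\d+) times")


def count_retransmits(lines):
    """Count retransmit messages, expanding 'repeated N times' lines."""
    total = 0
    last_was_retransmit = False
    for line in lines:
        if RETRANSMIT.search(line):
            total += 1
            last_was_retransmit = True
            continue
        match = REPEATED.search(line)
        if match:
            # Only count repeats of a retransmit message.
            if last_was_retransmit:
                total += int(match.group(1))
            continue  # a repeat line does not change the "last message"
        last_was_retransmit = False
    return total


sample = [
    "Nov  3 17:28:18 calgb-blade2 openais[7154]: [TOTEM] Retransmit List: fe",
    "Nov  3 17:28:49 calgb-blade2 last message repeated 105 times",
    "Nov  3 17:29:50 calgb-blade2 last message repeated 154 times",
]
print(count_retransmits(sample))  # 260
```

On the three sample lines this gives 260 retransmit messages in roughly 90
seconds.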

Around the 10 minute mark, one of the nodes (it is not always the same node)
will do:

Nov  3 17:44:24 calgb-blade1 openais[7179]: [TOTEM] FAILED TO RECEIVE
Nov  3 17:44:24 calgb-blade1 openais[7179]: [TOTEM] entering GATHER state from 6.
Nov  3 17:44:34 calgb-blade1 openais[7179]: [TOTEM] Creating commit token because I am the rep.
Nov  3 17:44:34 calgb-blade1 openais[7179]: [TOTEM] Storing new sequence id for ring 34
Nov  3 17:44:34 calgb-blade1 openais[7179]: [TOTEM] entering COMMIT state.
Nov  3 17:44:34 calgb-blade1 openais[7179]: [TOTEM] entering RECOVERY state.
Nov  3 17:44:34 calgb-blade1 openais[7179]: [TOTEM] position [0] member 192.168.226.161:
Nov  3 17:44:34 calgb-blade1 openais[7179]: [TOTEM] previous ring seq 48 rep 192.168.226.161 
Nov  3 17:44:34 calgb-blade1 openais[7179]: [TOTEM] aru fd high delivered fd received flag 1 
Nov  3 17:44:34 calgb-blade1 openais[7179]: [TOTEM] Did not need to originate any messages in recovery.
Nov  3 17:44:34 calgb-blade1 openais[7179]: [TOTEM] Sending initial ORF token
Nov  3 17:44:34 calgb-blade1 openais[7179]: [CLM  ] CLM CONFIGURATION CHANGE
Nov  3 17:44:34 calgb-blade1 openais[7179]: [CLM  ] New Configuration:
Nov  3 17:44:34 calgb-blade1 openais[7179]: [CLM  ]     r(0) ip(192.168.226.161)
Nov  3 17:44:34 calgb-blade1 openais[7179]: [CLM  ] Members Left:
Nov  3 17:44:34 calgb-blade1 clurgmgrd[9933]: <emerg> #1: Quorum Dissolved
Nov  3 17:44:34 calgb-blade1 kernel: dlm: closing connection to node 2
Nov  3 17:44:34 calgb-blade1 openais[7179]: [CLM  ]     r(0) ip(192.168.226.162)
Nov  3 17:44:34 calgb-blade1 kernel: dlm: closing connection to node 3
Nov  3 17:44:34 calgb-blade1 openais[7179]: [CLM  ]     r(0) ip(192.168.226.163)
Nov  3 17:44:34 calgb-blade1 kernel: dlm: closing connection to node 4
Nov  3 17:44:34 calgb-blade1 openais[7179]: [CLM  ]     r(0) ip(192.168.226.164)
Nov  3 17:44:34 calgb-blade1 openais[7179]: [CLM  ] Members Joined:
Nov  3 17:44:34 calgb-blade1 openais[7179]: [CMAN ] quorum lost, blocking activity
Nov  3 17:44:34 calgb-blade1 openais[7179]: [CLM  ] CLM CONFIGURATION CHANGE
Nov  3 17:44:34 calgb-blade1 openais[7179]: [CLM  ] New Configuration:
Nov  3 17:44:34 calgb-blade1 openais[7179]: [CLM  ]     r(0) ip(192.168.226.161)
Nov  3 17:44:34 calgb-blade1 openais[7179]: [CLM  ] Members Left:
Nov  3 17:44:34 calgb-blade1 openais[7179]: [CLM  ] Members Joined:
Nov  3 17:44:34 calgb-blade1 openais[7179]: [SYNC ] This node is within the primary component and will provide service.
Nov  3 17:44:34 calgb-blade1 openais[7179]: [TOTEM] entering OPERATIONAL state.
...
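
The "quorum lost" message on the isolated node is what the vote arithmetic
predicts. Below is a minimal sketch, assuming cman's usual rule of
quorum = expected_votes / 2 + 1 (integer division) and that expected_votes is
simply the sum of the node votes, since the <cman/> element above sets nothing
explicitly. The function names are illustrative only.

```python
def quorum_threshold(expected_votes):
    """Votes needed for quorum, assuming quorum = expected_votes // 2 + 1."""
    return expected_votes // 2 + 1


def is_quorate(member_votes, expected_votes):
    """True if the members of a partition hold enough votes for quorum."""
    return sum(member_votes) >= quorum_threshold(expected_votes)


# Four nodes with votes="1" each, as in the cluster.conf above.
node_votes = [1, 1, 1, 1]
expected = sum(node_votes)

print(quorum_threshold(expected))       # 3
print(is_quorate([1], expected))        # False: a lone node loses quorum
print(is_quorate([1, 1, 1], expected))  # True: the other three keep it
```

So the single node that drops out of the ring is inquorate, while the
remaining three still hold quorum and block only because they are waiting on
fencing.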

When this happens, the node that lost connections kills CMAN. The other 3 
nodes get into this state:

[root@calgb-blade2 ~]# group_tool
type             level name       id       state       
fence            0     default    00010001 FAIL_START_WAIT
[1 2 3]
dlm              1     clvmd      00010003 FAIL_ALL_STOPPED
[1 2 3 4]
dlm              1     rgmanager  00020003 FAIL_ALL_STOPPED
[1 2 3 4]

One of the nodes will be barking about trying to fence the "failed" node
(expected, as I don't have real fencing).  

Nothing (physically) has actually failed. All 4 blades are running their 
VMs and accessing the shared back-end storage. No network burps, etc.
CLVM is unusable and any LVM commands hang.

Here's where it gets really strange.....

1. Go to the blade that has "failed" and shut down all the VMs (just to be
   safe-- they are all running fine).

2. Go to the node that tried to fence the failed node and run:
   fence_ack_manual -e -n <failed node>

3. The 3 "surviving" nodes are instantly fine. group_tool reports "none" for
   state and all CLVM commands work properly.

4. OK, now reboot the "failed" node. It reboots and rejoins the cluster. 
   CLVM commands work, but are slow. Lots of these errors:
	openais[7154]: [TOTEM] Retransmit List: fe ff

5. This goes on for about 10 minutes and the whole cycle repeats (one of 
   the other nodes will "fail"...)

What I've tried:

1. Switches have IGMP snooping disabled. This is a simple config, so
   no switch-to-switch multicast is needed (all cluster/multicast traffic
   stays on the blade enclosure switch).  I've had the cluster
   messages use the front-end net and the back-end net (different switch
   model)-- no change in behavior.

2. Running the latest RH patched openais from their Beta channel: 
   openais-0.80.6-37.el5  (yes, I actually have RH licenses, just prefer
   CentOS for various reasons).

3. Tested multicast by enabling multicast/ICMP and running multicast
   pings. Ran with no data loss for > 10 minutes. (This follows an IBM
   Tech web site article, which ends by saying it's almost always a
   network problem.)

4. Tried various configuration changes such as defining two rings. No change
   (or things got worse: the dual-ring config triggers a kernel bug).
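
Since multicast pings exercise ICMP while openais sends its ring traffic as
UDP multicast, a UDP-level probe may be a closer test than item 3. The
following is a sketch only: the group address, port, and interface IP in the
usage comments are placeholders rather than the cluster's real settings, and
all the helper names are made up.

```python
import socket
import struct


def membership_request(group, iface_ip):
    """Build the ip_mreq structure used with IP_ADD_MEMBERSHIP."""
    return struct.pack("4s4s", socket.inet_aton(group),
                       socket.inet_aton(iface_ip))


def make_receiver(group, port, iface_ip):
    """UDP socket joined to the multicast group on the given interface."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(group, iface_ip))
    return sock


def make_sender(iface_ip, ttl=1):
    """UDP socket that sends multicast out of the given interface."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_IF,
                    socket.inet_aton(iface_ip))
    return sock


def missing_sequences(received, sent_count):
    """Sequence numbers in 0..sent_count-1 that never arrived."""
    return sorted(set(range(sent_count)) - set(received))


# Usage sketch (placeholder values, run sender and receiver on two blades):
#   rx = make_receiver("239.192.0.1", 5405, "192.168.226.162")
#   tx = make_sender("192.168.226.161")
#   tx.sendto(struct.pack("!I", seq), ("239.192.0.1", 5405))
#   seq = struct.unpack("!I", rx.recvfrom(4)[0])[0]
print(missing_sequences([0, 1, 3], 5))  # [2, 4]
```

Numbering the datagrams and checking for gaps on the receiving side would
show whether UDP multicast is being dropped even when the ICMP test is clean.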

I've read about every article I can find... they usually fall into
two camps:

1. Multiple years old, so no idea if it's accurate with today's versions of
   the software. 

2. People trying to do 2-node clusters and wondering why they lose quorum.

My feeling is that this is an openais issue-- the ring gets out of sync
and can't fix itself. This system needs to go "production" in a matter of 
weeks, so I'm not sure I want to dive into doing some sort of
custom compiled CLVM/Corosync/Pacemaker config. Since this software
has been around for years, I believe it's something simple I'm just missing.

Thanks for any help/ideas...

--- Cris

-- 
 Cristopher J. Rhea
 Mayo Clinic - Research Computing Facility
 200 First St SW, Rochester, MN 55905
 crhea at Mayo.EDU
 (507) 284-0587