[CentOS] HA cluster - strange communication between nodes

Wed Jan 15 23:29:28 UTC 2014
Leon Fauster <leonfauster at googlemail.com>

On 15.01.2014 at 11:56, Martin Moravcik <centos at datalock.sk> wrote:
> 
> Thanks for your interest and for your help.
> Here is the output from command (pcs config show)
> 
> [root@lb1 ~]# pcs config show
> Cluster Name: LB.STK
> Corosync Nodes:
> 
> Pacemaker Nodes:
>  lb1.asol.local lb2.asol.local
> 
> Resources:
>  Group: LB
>   Resource: LAN.VIP (class=ocf provider=heartbeat type=IPaddr2)
>    Attributes: ip=172.16.139.113 cidr_netmask=24 nic=eth1
>    Operations: monitor interval=15s (LAN.VIP-monitor-interval-15s)
>   Resource: WAN.VIP (class=ocf provider=heartbeat type=IPaddr2)
>    Attributes: ip=172.16.139.110 cidr_netmask=24 nic=eth0
>    Operations: monitor interval=15s (WAN.VIP-monitor-interval-15s)
>   Resource: OPENVPN (class=lsb type=openvpn)
>    Operations: monitor interval=20s (OPENVPN-monitor-interval-20s)
>                start interval=0s timeout=20s (OPENVPN-start-timeout-20s)
>                stop interval=0s timeout=20s (OPENVPN-stop-timeout-20s)
> 
> Stonith Devices:
> Fencing Levels:
> 
> Location Constraints:
> Ordering Constraints:
> Colocation Constraints:
> 
> Cluster Properties:
>  cluster-infrastructure: cman
>  dc-version: 1.1.10-14.el6_5.1-368c726
>  stonith-enabled: false
> 
> 
> When I start the cluster after a reboot of both nodes, everything looks fine.
> But when I run the command "pcs resource delete OPENVPN" from node lb1,
> these lines start to appear in the log:
> Jan 15 13:56:37 corosync [TOTEM ] Retransmit List: 202
> Jan 15 13:57:08 corosync [TOTEM ] Retransmit List: 202 203
> Jan 15 13:57:38 corosync [TOTEM ] Retransmit List: 202 203 204
> Jan 15 13:58:08 corosync [TOTEM ] Retransmit List: 202 203 204 206
> Jan 15 13:58:38 corosync [TOTEM ] Retransmit List: 202 203 204 206 208
> Jan 15 13:59:08 corosync [TOTEM ] Retransmit List: 202 203 204 206 208 209
> 
> I also noticed that these retransmit entries start to appear even
> some time (7 minutes) after a fresh cluster start, without any
> change to or manipulation of the cluster.


There are multicast issues on virtual nodes, so your bridged network
will not operate reliably out of the box for HA setups.

Try:

echo 1 > /sys/class/net/YOURDEVICE/bridge/multicast_querier
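
A slightly more defensive variant of the command above, as a rough sketch
only: it checks that the device really exposes the bridge knob, shows the
value before and after, and then sets it. The function name enable_querier,
the SYSFS_NET override and the device name br0 are assumptions made for
illustration; substitute your own bridge device.

```shell
#!/bin/sh
# Sketch: enable the multicast querier on a Linux bridge if the knob exists.
# SYSFS_NET is a hypothetical override (not a kernel or distro setting) that
# lets the function be tried against a scratch directory first.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

enable_querier() {
    # $1: bridge device name (br0 below is only an example)
    knob="$SYSFS_NET/$1/bridge/multicast_querier"
    if [ ! -e "$knob" ]; then
        echo "no multicast_querier knob for $1 (not a bridge?)" >&2
        return 1
    fi
    echo "before: $(cat "$knob")"
    echo 1 > "$knob"
    echo "after: $(cat "$knob")"
}

# Usage (as root, on the host that carries the bridge):
#   enable_querier br0
```

Values written under /sys do not survive a reboot, so the setting has to be
reapplied at boot if it turns out to help.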

--
LF