I went a bit further...

lvs1# service keepalived stop
lvs2# service keepalived stop
lvs1# service network restart
lvs2# service network restart

Clean start

lvs1# service keepalived start
Feb 25 15:03:18 lvs1 Keepalived: Starting Keepalived v1.1.16 (02/17,2009)
Feb 25 15:03:18 lvs1 Keepalived: Starting Healthcheck child process, pid=9511
Feb 25 15:03:18 lvs1 Keepalived_healthcheckers: Using MII-BMSR NIC polling thread...
Feb 25 15:03:18 lvs1 Keepalived_healthcheckers: Netlink reflector reports IP 192.168.28.226 added
Feb 25 15:03:18 lvs1 Keepalived_healthcheckers: Netlink reflector reports IP 10.0.0.1 added
Feb 25 15:03:18 lvs1 Keepalived_healthcheckers: Registering Kernel netlink reflector
Feb 25 15:03:18 lvs1 Keepalived_healthcheckers: Registering Kernel netlink command channel
Feb 25 15:03:18 lvs1 Keepalived: Starting VRRP child process, pid=9512
Feb 25 15:03:18 lvs1 Keepalived_vrrp: Using MII-BMSR NIC polling thread...
Feb 25 15:03:18 lvs1 Keepalived_vrrp: Netlink reflector reports IP 192.168.28.226 added
Feb 25 15:03:18 lvs1 Keepalived_vrrp: Netlink reflector reports IP 10.0.0.1 added
Feb 25 15:03:18 lvs1 Keepalived_vrrp: Registering Kernel netlink reflector
Feb 25 15:03:18 lvs1 Keepalived_vrrp: Registering Kernel netlink command channel
Feb 25 15:03:18 lvs1 Keepalived_vrrp: Registering gratutious ARP shared channel
Feb 25 15:03:18 lvs1 Keepalived_healthcheckers: Opening file '/etc/keepalived/keepalived.conf'.
Feb 25 15:03:18 lvs1 Keepalived_healthcheckers: Configuration is using : 13235 Bytes
Feb 25 15:03:18 lvs1 Keepalived_healthcheckers: Activating healtchecker for service [10.0.0.11:80]
Feb 25 15:03:18 lvs1 Keepalived_healthcheckers: Activating healtchecker for service [10.0.0.12:80]
Feb 25 15:03:18 lvs1 Keepalived_vrrp: Opening file '/etc/keepalived/keepalived.conf'.
Feb 25 15:03:18 lvs1 Keepalived_vrrp: Configuration is using : 34062 Bytes
Feb 25 15:03:18 lvs1 Keepalived_vrrp: VRRP sockpool: [ifindex(2), proto(112), fd(10,11)]

No VIP and no checks on the web servers...

lvs2# service keepalived start
Feb 25 15:05:23 lvs2 Keepalived: Starting Keepalived v1.1.16 (02/17,2009)
Feb 25 15:05:23 lvs2 Keepalived_healthcheckers: Using MII-BMSR NIC polling thread...
Feb 25 15:05:23 lvs2 Keepalived: Starting Healthcheck child process, pid=8718
Feb 25 15:05:23 lvs2 Keepalived_vrrp: Using MII-BMSR NIC polling thread...
Feb 25 15:05:23 lvs2 Keepalived: Starting VRRP child process, pid=8719
Feb 25 15:05:23 lvs2 Keepalived_healthcheckers: Netlink reflector reports IP 192.168.28.227 added
Feb 25 15:05:23 lvs2 Keepalived_healthcheckers: Netlink reflector reports IP 10.0.0.2 added
Feb 25 15:05:23 lvs2 Keepalived_healthcheckers: Registering Kernel netlink reflector
Feb 25 15:05:23 lvs2 Keepalived_healthcheckers: Registering Kernel netlink command channel
Feb 25 15:05:23 lvs2 Keepalived_vrrp: Netlink reflector reports IP 192.168.28.227 added
Feb 25 15:05:23 lvs2 Keepalived_vrrp: Netlink reflector reports IP 10.0.0.2 added
Feb 25 15:05:23 lvs2 Keepalived_vrrp: Registering Kernel netlink reflector
Feb 25 15:05:23 lvs2 Keepalived_vrrp: Registering Kernel netlink command channel
Feb 25 15:05:23 lvs2 Keepalived_vrrp: Registering gratutious ARP shared channel
Feb 25 15:05:23 lvs2 Keepalived_healthcheckers: Opening file '/etc/keepalived/keepalived.conf'.
Feb 25 15:05:23 lvs2 Keepalived_healthcheckers: Configuration is using : 13233 Bytes
Feb 25 15:05:23 lvs2 Keepalived_healthcheckers: Activating healtchecker for service [10.0.0.11:80]
Feb 25 15:05:23 lvs2 Keepalived_healthcheckers: Activating healtchecker for service [10.0.0.12:80]
Feb 25 15:05:23 lvs2 Keepalived_vrrp: Opening file '/etc/keepalived/keepalived.conf'.
Feb 25 15:05:23 lvs2 Keepalived_vrrp: Configuration is using : 34060 Bytes
Feb 25 15:05:23 lvs2 Keepalived_vrrp: VRRP_Instance(VI_1) Entering BACKUP STATE
Feb 25 15:05:23 lvs2 Keepalived_vrrp: VRRP sockpool: [ifindex(2), proto(112), fd(10,11)]

No VIP and only one check on the web servers...

lvs1# service keepalived stop
Feb 25 15:07:30 lvs1 Keepalived: Terminating on signal
Feb 25 15:07:30 lvs1 Keepalived: Stopping Keepalived v1.1.16 (02/17,2009)
Feb 25 15:07:30 lvs1 Keepalived_vrrp: Terminating VRRP child process on signal
Feb 25 15:07:30 lvs1 Keepalived_healthcheckers: Terminating Healthchecker child process on signal

And nothing else (lvs2 does not become MASTER)...

lvs1# service keepalived start
Nothing much...

lvs2# service keepalived stop
lvs2# service keepalived start
Nothing and no checks on the web servers...

lvs1# service keepalived stop
lvs1# service keepalived start
Nothing and no checks on the web servers...

lvs1# service keepalived stop
lvs1# service keepalived start
Nothing and only one check on the web servers...

Always stuck on "VRRP sockpool"

By the way, a restart or a stop+restart done too quickly often leads to a failed start with "daemon is already running"

lvs1# service keepalived restart
Nothing and no checks on the web servers...

lvs1# service keepalived restart
Nothing and no checks on the web servers...

lvs1# service keepalived restart
Nothing and no checks on the web servers...
lvs1# service keepalived restart
Baam, suddenly many vrrp packets, and one web server check

Feb 25 15:15:11 lvs1 Keepalived_vrrp: VRRP_Instance(VI_1) Received lower prio advert, forcing new election
Feb 25 15:15:11 lvs1 Keepalived_vrrp: VRRP_Instance(VI_1) Sending gratuitous ARPs on eth0 for 192.168.16.123
Feb 25 15:15:11 lvs1 Keepalived_vrrp: VRRP_Instance(VI_1) Sending gratuitous ARPs on eth0 for 192.168.16.123
Feb 25 15:15:16 lvs1 Keepalived_vrrp: VRRP_Instance(VI_1) Received lower prio advert, forcing new election
Feb 25 15:15:16 lvs1 Keepalived_vrrp: VRRP_Instance(VI_1) Sending gratuitous ARPs on eth0 for 192.168.16.123

Feb 25 15:14:50 lvs2 Keepalived_vrrp: VRRP_Instance(VI_1) Transition to MASTER STATE
Feb 25 15:14:50 lvs2 Keepalived_vrrp: VRRP_Instance(VI_1) Received higher prio advert
Feb 25 15:14:50 lvs2 Keepalived_vrrp: VRRP_Instance(VI_1) Entering BACKUP STATE
Feb 25 15:14:55 lvs2 Keepalived_vrrp: VRRP_Instance(VI_1) Transition to MASTER STATE
Feb 25 15:14:55 lvs2 Keepalived_vrrp: VRRP_Instance(VI_1) Received higher prio advert
Feb 25 15:14:55 lvs2 Keepalived_vrrp: VRRP_Instance(VI_1) Entering BACKUP STATE

The web servers are correctly accessed from outside in rr, but there are still no web checks from the keepalived daemons...
lvs1# ipvsadm
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.16.123:http rr
  -> 10.0.0.12:http               Route   1      0          28
  -> 10.0.0.11:http               Route   1      0          28

lvs2# ipvsadm
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.16.123:http rr
  -> 10.0.0.12:http               Route   1      0          0
  -> 10.0.0.11:http               Route   1      0          0

lvs1# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:04:23:9e:f3:74 brd ff:ff:ff:ff:ff:ff
    inet 192.168.28.226/20 brd 192.168.31.255 scope global eth0
    inet 192.168.16.123/32 scope global eth0
    inet6 fe80::204:23ff:fe9e:f374/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 100
    link/ether 00:04:23:9e:f3:75 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.1/8 brd 10.255.255.255 scope global eth1
    inet6 fe80::204:23ff:fe9e:f375/64 scope link
       valid_lft forever preferred_lft forever
4: sit0: <NOARP> mtu 1480 qdisc noop
    link/sit 0.0.0.0 brd 0.0.0.0

No VIP on lvs2 (BACKUP state)

lvs1# service keepalived stop
Feb 25 15:29:06 lvs2 Keepalived_vrrp: VRRP_Instance(VI_1) Transition to MASTER STATE
tcpdump => VRRP.MCAST.NET: VRRPv2, Advertisement, vrid 51, prio 0, authtype none, intvl 1s, length 20

No VIP on lvs1 and lvs2, ARP resolution for VIP incomplete...
lvs2# ip a add dev eth0 local 192.168.16.123/32 scope global
Baam, suddenly vrrp packets, and one round (only) of web server checks
15:33:18.639546 IP lvs2.iper > VRRP.MCAST.NET: VRRPv2, Advertisement, vrid 51, prio 99, authtype none, intvl 1s, length 20
15:33:19.641002 IP lvs2.iper > VRRP.MCAST.NET: VRRPv2, Advertisement, vrid 51, prio 99, authtype none, intvl 1s, length 20

lvs1# service keepalived start
Nothing...

lvs2# service keepalived stop
Baam, suddenly vrrp packets, and one round (only) of web server checks
The web servers are correctly accessed from outside in rr...

lvs2# service keepalived start
Nothing, other than Entering BACKUP STATE
Both lvs have the VIP up...

lvs1# service keepalived stop
Same as above, except the VIP is up on lvs2 and down on lvs1, and no webchecks...
The web servers are correctly accessed from outside in rr...

lvs1# service keepalived start
Nothing... lvs1 "stuck" on VRRP sockpool, while lvs2 is still MASTER
VIP down on lvs1 and up on lvs2

lvs2# service keepalived stop
Baam, suddenly vrrp packets, no web server checks at all
The web servers are correctly accessed from outside in rr...
Both lvs have the VIP up

lvs1# service keepalived stop
lvs1# service keepalived start
lvs2# service keepalived stop
Same as above except that there are webchecks from lvs1 now...

lvs2# service keepalived start
backup state, no webchecks from lvs2

lvs1# service keepalived stop
lvs2 => MASTER
VIP is up on lvs2, down on lvs1
Everything is stuck for like 30s... and then web servers are accessible.

lvs1# service keepalived start
Nothing... lvs1 "stuck" on VRRP sockpool, while lvs2 is still MASTER
VIP down on lvs1 and up on lvs2

lvs2# service network restart
baam, vrrp packets, lvs1 transitions to MASTER and sends ARPs
And I get regular webchecks from both lvs...
And if I bring down one web server, it is correctly removed from the services.
2 minutes later, no more web checks...

lvs1# service keepalived stop
lvs2 => MASTER
VIP is down on both lvs...
ARP is incomplete.
Everything is stuck for ever...

lvs2# ip a add dev eth0 local 192.168.16.123/32 scope global
baam, vrrp packets, lvs1 entering MASTER state and sends ARPs

I caught this:
Netlink: error: File exists, type=(20), seq=1235574458, pid=0

Looking for errors in the logs, I found:

Feb 23 16:20:20 lvs1 Keepalived_vrrp: Netlink: filter function error
Feb 23 16:20:20 lvs1 Keepalived_healthcheckers: Netlink: filter function error
Feb 23 16:42:58 lvs1 Keepalived_vrrp: Netlink: filter function error
Feb 23 16:42:58 lvs1 Keepalived_healthcheckers: Netlink: filter function error
Feb 25 12:00:50 lvs1 kernel: IPVS: ip_vs_send_async error
Feb 25 12:12:04 lvs1 Keepalived_vrrp: SIOCGMIIREG on eth1 failed: Input/output error
Feb 25 12:12:04 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth1 failed: Input/output error
Feb 25 12:12:05 lvs1 Keepalived_vrrp: SIOCGMIIREG on eth0 failed: Input/output error
Feb 25 12:12:05 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth0 failed: Input/output error
Feb 25 12:12:05 lvs1 Keepalived_vrrp: SIOCGMIIREG on eth1 failed: Input/output error
Feb 25 12:12:05 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth1 failed: Input/output error
Feb 25 12:12:06 lvs1 Keepalived_vrrp: SIOCGMIIREG on eth0 failed: Input/output error
Feb 25 12:12:06 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth0 failed: Input/output error
Feb 25 12:12:06 lvs1 Keepalived_vrrp: SIOCGMIIREG on eth1 failed: Input/output error
Feb 25 12:12:06 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth1 failed: Input/output error
Feb 25 12:12:07 lvs1 Keepalived_vrrp: SIOCGMIIREG on eth0 failed: Input/output error
Feb 25 12:12:07 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth0 failed: Input/output error
Feb 25 12:12:07 lvs1 Keepalived_vrrp: SIOCGMIIREG on eth1 failed: Input/output error
Feb 25 12:12:07 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth1 failed: Input/output error
Feb 25 12:12:08 lvs1 Keepalived_vrrp: SIOCGMIIREG on eth0 failed: Input/output error
Feb 25 12:12:08 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth0 failed: Input/output error
Feb 25 12:12:08 lvs1 Keepalived_vrrp: SIOCGMIIREG on eth1 failed: Input/output error
Feb 25 12:12:08 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth1 failed: Input/output error
Feb 25 12:12:09 lvs1 Keepalived_vrrp: SIOCGMIIREG on eth0 failed: Input/output error
Feb 25 12:12:09 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth0 failed: Input/output error
Feb 25 12:12:09 lvs1 Keepalived_vrrp: SIOCGMIIREG on eth1 failed: Input/output error
Feb 25 12:12:09 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth1 failed: Input/output error
Feb 25 12:12:10 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth0 failed: Input/output error
Feb 25 12:12:10 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth1 failed: Input/output error
Feb 25 12:12:11 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth0 failed: Input/output error
Feb 25 12:12:11 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth1 failed: Input/output error
Feb 25 12:12:12 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth0 failed: Input/output error
Feb 25 12:12:12 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth1 failed: Input/output error
Feb 25 12:12:13 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth0 failed: Input/output error
Feb 25 12:12:13 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth1 failed: Input/output error
Feb 25 12:12:14 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth0 failed: Input/output error
Feb 25 12:12:14 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth1 failed: Input/output error
Feb 25 12:12:15 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth0 failed: Input/output error
Feb 25 12:12:16 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth1 failed: Input/output error
Feb 25 12:12:16 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth0 failed: Input/output error
Feb 25 12:12:17 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth1 failed: Input/output error
Feb 25 12:33:39 lvs1 Keepalived_vrrp: Netlink: error: File exists, type=(20), seq=1235561506, pid=0
Feb 25 12:39:11 lvs1 Keepalived_vrrp: Netlink: error: File exists, type=(20), seq=1235561507, pid=0
Feb 25 12:40:10 lvs1 Keepalived_vrrp: Netlink: error: File exists, type=(20), seq=1235561508, pid=0
Feb 25 12:40:52 lvs1 Keepalived_vrrp: Netlink: error: File exists, type=(20), seq=1235561509, pid=0

Feb 23 16:20:16 lvs2 Keepalived_vrrp: Netlink: filter function error
Feb 23 16:20:16 lvs2 Keepalived_healthcheckers: Netlink: filter function error
Feb 23 16:42:46 lvs2 Keepalived_vrrp: Netlink: filter function error
Feb 23 16:42:46 lvs2 Keepalived_healthcheckers: Netlink: filter function error
Feb 23 17:35:36 lvs2 Keepalived_healthcheckers: Netlink: filter function error
Feb 23 17:35:36 lvs2 Keepalived_vrrp: Netlink: filter function error
Feb 25 12:25:22 lvs2 Keepalived_vrrp: Netlink: error: File exists, type=(20), seq=1235560956, pid=0
Feb 25 12:30:50 lvs2 Keepalived_vrrp: Netlink: error: File exists, type=(20), seq=1235561435, pid=0
Feb 25 15:33:18 lvs2 Keepalived_vrrp: Netlink: error: File exists, type=(20), seq=1235570954, pid=0
Feb 25 16:12:02 lvs2 Keepalived_vrrp: Netlink: error: Cannot assign requested address, type=(21), seq=1235574457, pid=0
Feb 25 16:29:11 lvs2 Keepalived_vrrp: Netlink: error: File exists, type=(20), seq=1235574458, pid=0

Do you have any idea about what could be causing these problems?

Thx,
JD