[CentOS] Keepalive...

Wed Feb 25 15:40:52 UTC 2009
John Doe <jdmls at yahoo.com>

I went a bit further...

lvs1# service keepalived stop
lvs2# service keepalived stop
lvs1# service network restart
lvs2# service network restart

Clean start

lvs1# service keepalived start

Feb 25 15:03:18 lvs1 Keepalived: Starting Keepalived v1.1.16 (02/17,2009) 
Feb 25 15:03:18 lvs1 Keepalived: Starting Healthcheck child process, pid=9511
Feb 25 15:03:18 lvs1 Keepalived_healthcheckers: Using MII-BMSR NIC polling thread...
Feb 25 15:03:18 lvs1 Keepalived_healthcheckers: Netlink reflector reports IP 192.168.28.226 added
Feb 25 15:03:18 lvs1 Keepalived_healthcheckers: Netlink reflector reports IP 10.0.0.1 added
Feb 25 15:03:18 lvs1 Keepalived_healthcheckers: Registering Kernel netlink reflector
Feb 25 15:03:18 lvs1 Keepalived_healthcheckers: Registering Kernel netlink command channel
Feb 25 15:03:18 lvs1 Keepalived: Starting VRRP child process, pid=9512
Feb 25 15:03:18 lvs1 Keepalived_vrrp: Using MII-BMSR NIC polling thread...
Feb 25 15:03:18 lvs1 Keepalived_vrrp: Netlink reflector reports IP 192.168.28.226 added
Feb 25 15:03:18 lvs1 Keepalived_vrrp: Netlink reflector reports IP 10.0.0.1 added
Feb 25 15:03:18 lvs1 Keepalived_vrrp: Registering Kernel netlink reflector
Feb 25 15:03:18 lvs1 Keepalived_vrrp: Registering Kernel netlink command channel
Feb 25 15:03:18 lvs1 Keepalived_vrrp: Registering gratutious ARP shared channel
Feb 25 15:03:18 lvs1 Keepalived_healthcheckers: Opening file '/etc/keepalived/keepalived.conf'. 
Feb 25 15:03:18 lvs1 Keepalived_healthcheckers: Configuration is using : 13235 Bytes
Feb 25 15:03:18 lvs1 Keepalived_healthcheckers: Activating healtchecker for service [10.0.0.11:80]
Feb 25 15:03:18 lvs1 Keepalived_healthcheckers: Activating healtchecker for service [10.0.0.12:80]
Feb 25 15:03:18 lvs1 Keepalived_vrrp: Opening file '/etc/keepalived/keepalived.conf'. 
Feb 25 15:03:18 lvs1 Keepalived_vrrp: Configuration is using : 34062 Bytes
Feb 25 15:03:18 lvs1 Keepalived_vrrp: VRRP sockpool: [ifindex(2), proto(112), fd(10,11)]

No VIP and no checks on the web servers...

lvs2# service keepalived start

Feb 25 15:05:23 lvs2 Keepalived: Starting Keepalived v1.1.16 (02/17,2009) 
Feb 25 15:05:23 lvs2 Keepalived_healthcheckers: Using MII-BMSR NIC polling thread...
Feb 25 15:05:23 lvs2 Keepalived: Starting Healthcheck child process, pid=8718
Feb 25 15:05:23 lvs2 Keepalived_vrrp: Using MII-BMSR NIC polling thread...
Feb 25 15:05:23 lvs2 Keepalived: Starting VRRP child process, pid=8719
Feb 25 15:05:23 lvs2 Keepalived_healthcheckers: Netlink reflector reports IP 192.168.28.227 added
Feb 25 15:05:23 lvs2 Keepalived_healthcheckers: Netlink reflector reports IP 10.0.0.2 added
Feb 25 15:05:23 lvs2 Keepalived_healthcheckers: Registering Kernel netlink reflector
Feb 25 15:05:23 lvs2 Keepalived_healthcheckers: Registering Kernel netlink command channel
Feb 25 15:05:23 lvs2 Keepalived_vrrp: Netlink reflector reports IP 192.168.28.227 added
Feb 25 15:05:23 lvs2 Keepalived_vrrp: Netlink reflector reports IP 10.0.0.2 added
Feb 25 15:05:23 lvs2 Keepalived_vrrp: Registering Kernel netlink reflector
Feb 25 15:05:23 lvs2 Keepalived_vrrp: Registering Kernel netlink command channel
Feb 25 15:05:23 lvs2 Keepalived_vrrp: Registering gratutious ARP shared channel
Feb 25 15:05:23 lvs2 Keepalived_healthcheckers: Opening file '/etc/keepalived/keepalived.conf'. 
Feb 25 15:05:23 lvs2 Keepalived_healthcheckers: Configuration is using : 13233 Bytes
Feb 25 15:05:23 lvs2 Keepalived_healthcheckers: Activating healtchecker for service [10.0.0.11:80]
Feb 25 15:05:23 lvs2 Keepalived_healthcheckers: Activating healtchecker for service [10.0.0.12:80]
Feb 25 15:05:23 lvs2 Keepalived_vrrp: Opening file '/etc/keepalived/keepalived.conf'. 
Feb 25 15:05:23 lvs2 Keepalived_vrrp: Configuration is using : 34060 Bytes
Feb 25 15:05:23 lvs2 Keepalived_vrrp: VRRP_Instance(VI_1) Entering BACKUP STATE
Feb 25 15:05:23 lvs2 Keepalived_vrrp: VRRP sockpool: [ifindex(2), proto(112), fd(10,11)]

No VIP and only one check on the web servers...

lvs1# service keepalived stop

Feb 25 15:07:30 lvs1 Keepalived: Terminating on signal
Feb 25 15:07:30 lvs1 Keepalived: Stopping Keepalived v1.1.16 (02/17,2009) 
Feb 25 15:07:30 lvs1 Keepalived_vrrp: Terminating VRRP child process on signal
Feb 25 15:07:30 lvs1 Keepalived_healthcheckers: Terminating Healthchecker child process on signal

And nothing else (lvs2 does not become MASTER)...

lvs1# service keepalived start

Nothing much...

lvs2# service keepalived stop
lvs2# service keepalived start

Nothing and no checks on the web servers...

lvs1# service keepalived stop
lvs1# service keepalived start

Nothing and no checks on the web servers...

lvs1# service keepalived stop
lvs1# service keepalived start

Nothing and only one check on the web servers...
Always stuck on "VRRP sockpool"

By the way, restarting (or stopping and starting) too quickly or too often leads to a failed start with "daemon is already running".
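
If that failed start comes from the old process not having fully exited yet (an assumption, not confirmed here), a small helper can wait for the old PID to disappear before starting again. This is only a sketch: wait_pid_gone is a hypothetical helper, and the pidfile path in the usage comment is assumed.

```shell
# wait_pid_gone PID [TIMEOUT_SECONDS]
# Returns 0 once PID no longer exists, 1 if it is still alive after the timeout.
# Meant to be run as root: "kill -0" also fails on a permission error,
# which this sketch would misread as "gone".
wait_pid_gone() {
    pid=$1
    timeout=${2:-10}
    waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Usage (pidfile location is an assumption):
#   pid=$(cat /var/run/keepalived.pid)
#   service keepalived stop && wait_pid_gone "$pid" 10 && service keepalived start
```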

lvs1# service keepalived restart

Nothing and no checks on the web servers...

lvs1# service keepalived restart

Nothing and no checks on the web servers...

lvs1# service keepalived restart

Nothing and no checks on the web servers...

lvs1# service keepalived restart

Baam, suddenly many vrrp packets, and one web server check

Feb 25 15:15:11 lvs1 Keepalived_vrrp: VRRP_Instance(VI_1) Received lower prio advert, forcing new election
Feb 25 15:15:11 lvs1 Keepalived_vrrp: VRRP_Instance(VI_1) Sending gratuitous ARPs on eth0 for 192.168.16.123
Feb 25 15:15:11 lvs1 Keepalived_vrrp: VRRP_Instance(VI_1) Sending gratuitous ARPs on eth0 for 192.168.16.123
Feb 25 15:15:16 lvs1 Keepalived_vrrp: VRRP_Instance(VI_1) Received lower prio advert, forcing new election
Feb 25 15:15:16 lvs1 Keepalived_vrrp: VRRP_Instance(VI_1) Sending gratuitous ARPs on eth0 for 192.168.16.123

Feb 25 15:14:50 lvs2 Keepalived_vrrp: VRRP_Instance(VI_1) Transition to MASTER STATE
Feb 25 15:14:50 lvs2 Keepalived_vrrp: VRRP_Instance(VI_1) Received higher prio advert
Feb 25 15:14:50 lvs2 Keepalived_vrrp: VRRP_Instance(VI_1) Entering BACKUP STATE
Feb 25 15:14:55 lvs2 Keepalived_vrrp: VRRP_Instance(VI_1) Transition to MASTER STATE
Feb 25 15:14:55 lvs2 Keepalived_vrrp: VRRP_Instance(VI_1) Received higher prio advert
Feb 25 15:14:55 lvs2 Keepalived_vrrp: VRRP_Instance(VI_1) Entering BACKUP STATE

The web servers are correctly accessed from outside in rr, but there are still no web checks from the keepalived instances...

lvs1# ipvsadm

IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.16.123:http rr
  -> 10.0.0.12:http               Route   1      0          28
  -> 10.0.0.11:http               Route   1      0          28

lvs2# ipvsadm

IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.16.123:http rr
  -> 10.0.0.12:http               Route   1      0          0         
  -> 10.0.0.11:http               Route   1      0          0

lvs1# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:04:23:9e:f3:74 brd ff:ff:ff:ff:ff:ff
    inet 192.168.28.226/20 brd 192.168.31.255 scope global eth0
    inet 192.168.16.123/32 scope global eth0
    inet6 fe80::204:23ff:fe9e:f374/64 scope link 
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 100
    link/ether 00:04:23:9e:f3:75 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.1/8 brd 10.255.255.255 scope global eth1
    inet6 fe80::204:23ff:fe9e:f375/64 scope link 
       valid_lft forever preferred_lft forever
4: sit0: <NOARP> mtu 1480 qdisc noop 
    link/sit 0.0.0.0 brd 0.0.0.0

No VIP on lvs2 (BACKUP state)
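
A quick way to check for the VIP on each box without eyeballing the ip addr output. This is a sketch only: has_vip is a hypothetical helper that reads ip addr output on stdin and succeeds if the given IPv4 address is configured.

```shell
# has_vip ADDRESS
# Reads "ip addr" output on stdin; exit status 0 if ADDRESS appears on an
# "inet" line (prefix length ignored), 1 otherwise.
has_vip() {
    vip=$1
    awk -v vip="$vip" '
        $1 == "inet" {
            split($2, a, "/")          # "192.168.16.123/32" -> "192.168.16.123"
            if (a[1] == vip) found = 1
        }
        END { exit !found }
    '
}

# Usage:
#   ip addr show dev eth0 | has_vip 192.168.16.123 && echo "VIP is up"
```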

lvs1# service keepalived stop

Feb 25 15:29:06 lvs2 Keepalived_vrrp: VRRP_Instance(VI_1) Transition to MASTER STATE
tcpdump => VRRP.MCAST.NET: VRRPv2, Advertisement, vrid 51, prio 0, authtype none, intvl 1s, length 20
No VIP on lvs1 and lvs2, ARP resolution for VIP incomplete...
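
For reference, in VRRPv2 (RFC 3768) an advertisement with priority 0 means the current master is giving up the role, so the prio 0 packet above is presumably the farewell from lvs1 on shutdown rather than an advertisement from lvs2. To see who advertises what over a capture, the tcpdump lines can be tallied by vrid and prio. A rough sketch: vrrp_prio_summary is a hypothetical helper, and it assumes the tcpdump text format shown above.

```shell
# vrrp_prio_summary
# Reads tcpdump text output on stdin and prints one line per (vrid, prio)
# pair with the number of advertisements seen.
vrrp_prio_summary() {
    sed -n 's/.*vrid \([0-9][0-9]*\), prio \([0-9][0-9]*\),.*/vrid=\1 prio=\2/p' |
        sort |
        uniq -c |
        awk '{ print $2, $3, "count=" $1 }'
}

# Usage (captures 20 VRRP packets, then summarizes):
#   tcpdump -n -c 20 -i eth0 'ip proto 112' | vrrp_prio_summary
```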

lvs2# ip a add dev eth0 local 192.168.16.123/32 scope global

Baam, suddenly vrrp packets, and one round (only) of web server checks
15:33:18.639546 IP lvs2.iper > VRRP.MCAST.NET: VRRPv2, Advertisement, vrid 51, prio 99, authtype none, intvl 1s, length 20
15:33:19.641002 IP lvs2.iper > VRRP.MCAST.NET: VRRPv2, Advertisement, vrid 51, prio 99, authtype none, intvl 1s, length 20

lvs1# service keepalived start

Nothing...

lvs2# service keepalived stop

Baam, suddenly vrrp packets, and one round (only) of web server checks
The web servers are correctly accessed from outside in rr...

lvs2# service keepalived start

Nothing, other than Entering BACKUP STATE
Both lvs have the VIP up...

lvs1# service keepalived stop

Same as above, except the VIP is up on lvs2 and down on lvs1, and no webchecks...
The web servers are correctly accessed from outside in rr...

lvs1# service keepalived start

Nothing...
lvs1 "stuck" on VRRP sockpool, while lvs2 is still MASTER
VIP down on lvs1 and up on lvs2

lvs2# service keepalived stop

Baam, suddenly vrrp packets, no web server checks at all
The web servers are correctly accessed from outside in rr...
Both lvs have the VIP up

lvs1# service keepalived stop
lvs1# service keepalived start
lvs2# service keepalived stop

Same as above except that there are webchecks from lvs1 now...

lvs2# service keepalived start

backup state, no webchecks from lvs2

lvs1# service keepalived stop

lvs2 => MASTER
VIP is up on lvs2, down on lvs1
Everything is stuck for about 30s... and then the web servers are accessible.

lvs1# service keepalived start

Nothing...
lvs1 "stuck" on VRRP sockpool, while lvs2 is still MASTER
VIP down on lvs1 and up on lvs2

lvs2# service network restart

baam, vrrp packets, lvs1 transitions to MASTER and sends ARPs
And I get regular webchecks from both lvs...
And if I bring down one web server, it is correctly removed from the services.
2 minutes later, no more web checks...

lvs1# service keepalived stop

lvs2 => MASTER
VIP is down on both lvs...  ARP is incomplete.
Everything is stuck forever...

lvs2# ip a add dev eth0 local 192.168.16.123/32 scope global

baam, vrrp packets, lvs1 enters MASTER state and sends ARPs
I caught this: Netlink: error: File exists, type=(20), seq=1235574458, pid=0

Looking for errors in the logs, I found:

Feb 23 16:20:20 lvs1 Keepalived_vrrp: Netlink: filter function error
Feb 23 16:20:20 lvs1 Keepalived_healthcheckers: Netlink: filter function error
Feb 23 16:42:58 lvs1 Keepalived_vrrp: Netlink: filter function error
Feb 23 16:42:58 lvs1 Keepalived_healthcheckers: Netlink: filter function error
Feb 25 12:00:50 lvs1 kernel: IPVS: ip_vs_send_async error
Feb 25 12:12:04 lvs1 Keepalived_vrrp: SIOCGMIIREG on eth1 failed: Input/output error
Feb 25 12:12:04 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth1 failed: Input/output error
Feb 25 12:12:05 lvs1 Keepalived_vrrp: SIOCGMIIREG on eth0 failed: Input/output error
Feb 25 12:12:05 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth0 failed: Input/output error
Feb 25 12:12:05 lvs1 Keepalived_vrrp: SIOCGMIIREG on eth1 failed: Input/output error
Feb 25 12:12:05 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth1 failed: Input/output error
Feb 25 12:12:06 lvs1 Keepalived_vrrp: SIOCGMIIREG on eth0 failed: Input/output error
Feb 25 12:12:06 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth0 failed: Input/output error
Feb 25 12:12:06 lvs1 Keepalived_vrrp: SIOCGMIIREG on eth1 failed: Input/output error
Feb 25 12:12:06 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth1 failed: Input/output error
Feb 25 12:12:07 lvs1 Keepalived_vrrp: SIOCGMIIREG on eth0 failed: Input/output error
Feb 25 12:12:07 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth0 failed: Input/output error
Feb 25 12:12:07 lvs1 Keepalived_vrrp: SIOCGMIIREG on eth1 failed: Input/output error
Feb 25 12:12:07 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth1 failed: Input/output error
Feb 25 12:12:08 lvs1 Keepalived_vrrp: SIOCGMIIREG on eth0 failed: Input/output error
Feb 25 12:12:08 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth0 failed: Input/output error
Feb 25 12:12:08 lvs1 Keepalived_vrrp: SIOCGMIIREG on eth1 failed: Input/output error
Feb 25 12:12:08 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth1 failed: Input/output error
Feb 25 12:12:09 lvs1 Keepalived_vrrp: SIOCGMIIREG on eth0 failed: Input/output error
Feb 25 12:12:09 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth0 failed: Input/output error
Feb 25 12:12:09 lvs1 Keepalived_vrrp: SIOCGMIIREG on eth1 failed: Input/output error
Feb 25 12:12:09 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth1 failed: Input/output error
Feb 25 12:12:10 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth0 failed: Input/output error
Feb 25 12:12:10 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth1 failed: Input/output error
Feb 25 12:12:11 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth0 failed: Input/output error
Feb 25 12:12:11 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth1 failed: Input/output error
Feb 25 12:12:12 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth0 failed: Input/output error
Feb 25 12:12:12 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth1 failed: Input/output error
Feb 25 12:12:13 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth0 failed: Input/output error
Feb 25 12:12:13 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth1 failed: Input/output error
Feb 25 12:12:14 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth0 failed: Input/output error
Feb 25 12:12:14 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth1 failed: Input/output error
Feb 25 12:12:15 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth0 failed: Input/output error
Feb 25 12:12:16 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth1 failed: Input/output error
Feb 25 12:12:16 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth0 failed: Input/output error
Feb 25 12:12:17 lvs1 Keepalived_healthcheckers: SIOCGMIIREG on eth1 failed: Input/output error
Feb 25 12:33:39 lvs1 Keepalived_vrrp: Netlink: error: File exists, type=(20), seq=1235561506, pid=0
Feb 25 12:39:11 lvs1 Keepalived_vrrp: Netlink: error: File exists, type=(20), seq=1235561507, pid=0
Feb 25 12:40:10 lvs1 Keepalived_vrrp: Netlink: error: File exists, type=(20), seq=1235561508, pid=0
Feb 25 12:40:52 lvs1 Keepalived_vrrp: Netlink: error: File exists, type=(20), seq=1235561509, pid=0

Feb 23 16:20:16 lvs2 Keepalived_vrrp: Netlink: filter function error
Feb 23 16:20:16 lvs2 Keepalived_healthcheckers: Netlink: filter function error
Feb 23 16:42:46 lvs2 Keepalived_vrrp: Netlink: filter function error
Feb 23 16:42:46 lvs2 Keepalived_healthcheckers: Netlink: filter function error
Feb 23 17:35:36 lvs2 Keepalived_healthcheckers: Netlink: filter function error
Feb 23 17:35:36 lvs2 Keepalived_vrrp: Netlink: filter function error
Feb 25 12:25:22 lvs2 Keepalived_vrrp: Netlink: error: File exists, type=(20), seq=1235560956, pid=0
Feb 25 12:30:50 lvs2 Keepalived_vrrp: Netlink: error: File exists, type=(20), seq=1235561435, pid=0
Feb 25 15:33:18 lvs2 Keepalived_vrrp: Netlink: error: File exists, type=(20), seq=1235570954, pid=0
Feb 25 16:12:02 lvs2 Keepalived_vrrp: Netlink: error: Cannot assign requested address, type=(21), seq=1235574457, pid=0
Feb 25 16:29:11 lvs2 Keepalived_vrrp: Netlink: error: File exists, type=(20), seq=1235574458, pid=0
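
To make the error lists above easier to compare between the two boxes, the lines can be collapsed by host and message, dropping the timestamp and the changing seq number. A sketch: summarize_errors is a hypothetical helper and assumes the usual syslog layout (month, day, time, host, tag, message). If type=(N) is the rtnetlink message type (an assumption about how keepalived logs it), then 20 is RTM_NEWADDR and 21 is RTM_DELADDR, so "File exists" would be an attempt to add an address that is already there, and "Cannot assign requested address" an attempt to delete one that is not.

```shell
# summarize_errors
# Reads syslog lines on stdin and prints "COUNT HOST TAG: MESSAGE",
# most frequent first. The ", seq=NNN" field is removed so that otherwise
# identical Netlink errors are counted together.
summarize_errors() {
    awk 'NF >= 5 {
        host = $4
        msg = ""
        for (i = 5; i <= NF; i++)
            msg = msg (i > 5 ? " " : "") $i
        sub(/, seq=[0-9]+/, "", msg)
        count[host " " msg]++
    }
    END {
        for (k in count)
            print count[k], k
    }' | sort -k1,1nr -k2
}

# Usage:
#   grep -h -e 'Netlink' -e 'SIOCGMIIREG' /var/log/messages | summarize_errors
```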

Do you have any idea about what could be causing these problems?

Thx,
JD