[CentOS] Keepalive...

Tue Feb 24 16:11:49 UTC 2009
John Doe <jdmls at yahoo.com>

> Test setup:
>
>       main switch (192.168.16.0/20)
>                     |
>         eth0: 192.168.28.[226|227]
>             VIP=192.168.16.123
>          2 lvs/keepalived servers
>             eth1: 10.0.0.[1|2]
>                     |
>         test switch (10.0.0.0/8)
>                     |
>               10.0.0.[11|12]
>             VIP=192.168.16.123
>         test servers (real servers)
>            192.168.16.[228|229]
>                     |
>            back to main switch

Hi again,

I stopped the servers for the weekend, restarted them on Monday and... it did not work anymore.
I tried 1.1.15 as suggested, with the same result.
So I installed the keepalived-1.1.16-1.el5.hrb rpm that David kindly built.
And it is not really better.
My config more or less works... More 'less' than 'more', sadly...
I have many "random" problems and weird behaviors that fix themselves after a few restarts/reboots, without my changing anything in my conf.  And they come back at the next restart...

Sometimes it is the VRRP part that does not seem to work.
I say "seem" because, even when tcpdump does not show any VRRP packets (at other times it does), the backup sometimes notices that the master was brought down and switches to master state. At other times, neither of them detects anything at all...  A few restarts and it works again until the next failure.  And at other times, I can see the packets...

There were times when both would be master...

At other times, keepalived does not seem to check the web servers as regularly as usual.
Again I say "seem" because, while the access log of my web server does not show any recent entry from the keepalived (hash) checks, keepalived still detects that one web server was brought down and temporarily removes it from its list...  And I see nothing about this in the keepalived logs...  Except once in a while.
For example, right now lvs1 is master, and I see only lvs2's checks in my web logs.
But if I bring down web1, lvs1 catches it and removes it until I bring it back up...
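
To rule out the hash check itself, the digest can be recomputed by hand from each lvs node. The sketch below assumes the check is an HTTP GET whose response body is hashed with MD5 (which is what keepalived's genhash tool reports); the URL path "/" and the helper names are assumptions and are not taken from the configuration discussed here.

```python
import hashlib
import urllib.request


def page_digest(body: bytes) -> str:
    """Return the MD5 hex digest of a response body."""
    return hashlib.md5(body).hexdigest()


def fetch_digest(url: str, timeout: float = 3.0) -> str:
    """Fetch `url` and return the MD5 hex digest of its body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return page_digest(resp.read())


if __name__ == "__main__":
    # Real server addresses from the test setup; the path "/" is assumed.
    for host in ("10.0.0.11", "10.0.0.12"):
        try:
            print(host, fetch_digest(f"http://{host}/"))
        except OSError as exc:
            print(host, "check failed:", exc)
```

Running this from both lvs1 and lvs2 and comparing the result with the configured digest should at least show whether the page content is what the checker expects.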

At other times, the ARP resolution for the VIP on my client is "incomplete".  This gets fixed after a few restarts.
Many times, the master gets stuck on "VRRP sockpool".

Each time there is a problem, I check, and both my web servers are accessible from the 2 lvs servers and from outside through their exit IPs (192.168.16.[228|229]).

Also, when I use service restart, it will fail once out of 3 times with "Keepalived: daemon is already running"...
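
One thing that might be worth checking when that message appears is whether the pid file points at a live process or is left over from the previous run. This is a minimal sketch, assuming the pid file lives at /var/run/keepalived.pid (the path and the helper names are assumptions, so adjust them to the init script in use); it is not a statement of what the init script actually does.

```python
import os

PIDFILE = "/var/run/keepalived.pid"  # assumed location, check your init script


def pid_is_running(pid: int) -> bool:
    """Return True if a process with this pid currently exists."""
    try:
        os.kill(pid, 0)          # signal 0 only checks existence, sends nothing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True              # process exists but belongs to another user
    return True


def pidfile_state(path: str = PIDFILE) -> str:
    """Classify a pid file as 'missing', 'invalid', 'running' or 'stale'."""
    try:
        with open(path) as fh:
            pid = int(fh.read().strip())
    except FileNotFoundError:
        return "missing"
    except ValueError:
        return "invalid"
    if pid <= 0:
        return "invalid"
    return "running" if pid_is_running(pid) else "stale"


if __name__ == "__main__":
    print(PIDFILE, "->", pidfile_state())
```

A "running" result right after the stop phase would suggest the old daemon had not exited yet when the start phase ran; a "stale" result would point at a leftover pid file instead.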

Am I the only one having all these instabilities?

Thx,
JD