[CentOS] Keepalived - spurious failovers

Wed Nov 12 21:04:01 UTC 2014
John Horne <john.horne at plymouth.ac.uk>

On Wed, 2014-11-12 at 15:44 +0000, Richard Mann wrote:
> 
> +1 to your logrotate thought; I'd dig deeper there.
> 
> check /var/lib/logrotate.status; see if it doesn't match up with days
> the failover happens, that different httpd logs are rotating.  
>
Given that failover only occurs if Apache, Tomcat or the NIC fail, I
can't find anything in log rotation that could cause this effect. For
failover to occur the Apache/Tomcat process must be non-existent (in our
case keepalived checks for them using pgrep). We have secondary
monitoring of these processes (Xymon using checks of 'ps'), and that
shows no such failure. Simply logging into the servers and running ps
shows that they are running. I would hope that something would be logged
by either process in the appropriate log file, but nothing is seen. Of
course it could be something dire that simply kills the process dead,
but again we do not see that at all (ps shows they are present). So that
leaves the NIC. Again, I cannot think of any process (day or night) that
would cause the NIC to fail (or restart); that would be a serious
problem in itself. Moreover, keepalived should log the failure and put
itself into a FAULT state. I tested this on a test server, and it worked
as described.
We, however, see no such fault state or log messages on our live
servers.
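
For illustration only, the failover conditions described above (a missing
Apache or Tomcat process, or a failed NIC) can be sketched in Python. This
is not the actual keepalived configuration or check script. The process
names "httpd" and "java", the interface name "eth0", and all function
names are assumptions made for the example; the NIC test reads the
kernel's operstate file under /sys/class/net.

```python
import subprocess


def process_exists(name, run=subprocess.run):
    """Return True if pgrep finds a process whose name matches exactly."""
    # pgrep exits 0 when at least one process matched, non-zero otherwise.
    result = run(
        ["pgrep", "-x", name],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def nic_is_up(iface, sysfs="/sys/class/net"):
    """Return True if the kernel reports the interface operstate as 'up'."""
    # Some virtual interfaces report 'unknown'; this sketch treats
    # anything other than 'up' (or a missing interface) as not up.
    try:
        with open(f"{sysfs}/{iface}/operstate") as fh:
            return fh.read().strip() == "up"
    except OSError:
        return False


def failover_reasons(processes, iface, run=subprocess.run, sysfs="/sys/class/net"):
    """List every condition that, per the description above, would trigger failover."""
    reasons = [
        f"process {name} not found"
        for name in processes
        if not process_exists(name, run)
    ]
    if not nic_is_up(iface, sysfs):
        reasons.append(f"interface {iface} not up")
    return reasons
```

Calling failover_reasons(["httpd", "java"], "eth0") on a healthy server
would be expected to return an empty list; any entries name the condition
that failed at that moment.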

So, I am very much stumped as to the problem. I'm hoping that if
keepalived fails over tonight, then the cron jobs I have set up may give
a clue.
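
The contents of those cron jobs are not given here. As a purely
hypothetical sketch of the kind of snapshot such a job might record, the
Python below counts watched processes in the output of "ps -eo comm=" and
appends a timestamped line to a log. The watched names ("httpd", "java"),
the log path, and the function names are all assumptions for the example.

```python
import subprocess
from datetime import datetime, timezone

WATCHED = ("httpd", "java")  # assumed process names for Apache and Tomcat


def snapshot_line(now, ps_output, watched=WATCHED):
    """Format one log line: a timestamp plus a count for each watched name."""
    # 'ps -eo comm=' prints one command name per line with no header.
    names = [line.strip() for line in ps_output.splitlines()]
    counts = " ".join(f"{w}={names.count(w)}" for w in watched)
    return f"{now.isoformat()} {counts}"


def record_snapshot(log_path="/tmp/keepalived-snapshot.log"):
    """Run ps once and append the resulting snapshot line to log_path."""
    ps = subprocess.run(
        ["ps", "-eo", "comm="],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    with open(log_path, "a") as fh:
        fh.write(snapshot_line(datetime.now(timezone.utc), ps) + "\n")
```

Run once a minute from cron, record_snapshot() would leave a trail showing
whether either process count dropped to zero around the time of a
failover.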




John.

-- 
----------------------------------------------------
John Horne                   Tel: +44 (0)1752 587287
Plymouth University, UK