[CentOS] High Availability using 2 sites

Les Mikesell lesmikesell at gmail.com
Thu Jan 5 18:42:40 UTC 2006


On Thu, 2006-01-05 at 11:42, Bryan J. Smith wrote:

> > Web browsers (IE at least) tend to be very good about
> > handling failures if you give out multiple IP addresses for
> > a name and one or more locations does not respond.
> 
> Er, um, er, it's still a little arbitrary and not exactly
> correct.  Furthermore, default NT5.x (2000+) operation is to
> "hold down" DNS names for a default of 2 minutes, even ones
> that are round-robin, if just 1 doesn't resolve.  It's a
> really messy default in the Windows client that causes a lot
> of issues.

The 'round-robin' concept just means that the server will
rotate the order of the addresses in the answer.  All addresses
are still visible to the client and in the caches.  Try
'nslookup www.ibm.com'  to see the effect of multiple
A records for the same name. 
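
For anyone who would rather script the check than read nslookup
output, here is a minimal sketch in Python. The function name
all_ipv4_addresses is made up for illustration. It goes through the
system resolver, so the order you see may reflect a local cache rather
than the server's rotation.

```python
import socket

def all_ipv4_addresses(name, port=80):
    """Return every distinct IPv4 address the resolver gives for name,
    in the order the answer arrived."""
    infos = socket.getaddrinfo(name, port,
                               family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    addresses = []
    for _family, _socktype, _proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]          # sockaddr is (ip, port) for AF_INET
        if ip not in addresses:   # drop duplicates, keep answer order
            addresses.append(ip)
    return addresses

# Example use (needs working DNS):
#   for ip in all_ipv4_addresses("www.ibm.com"):
#       print(ip)
```

A name with several A records should print several lines, and repeated
runs may show them in a different order.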

IE will try them all.  Try setting up multiple A records
in your DNS with one pointing to a working web server and
one not and see if you even notice a difference when connecting
to that name.  On the other hand, if you've given it a single
IP address in the first DNS lookup and then change the DNS
response, you'll have to close all instances of IE to make it
pick up the change.

> I think you might be thinking of ADS name resolution, which
> is a little different than DNS (even though Microsoft says
> it's DNS ;-).  I could be wrong though, but that's what my
> experience suggests.

No, I mean multiple A records.  Most apps are dumb and
only try the first one in the list returned, so the round-robin
rotation on the server side gives statistical load balancing,
but apps other than web browsers tend to fail if the first
address doesn't respond.
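
The "statistical load balancing" effect can be illustrated with a toy
simulation. This is only a sketch under simplifying assumptions: the
server rotates the list by exactly one position per query, there is no
caching in between, and every client uses only the first address. The
addresses are documentation-range placeholders, not real servers.

```python
from collections import Counter, deque

def simulate_round_robin(addresses, queries):
    """Count which address a first-only client ends up using when the
    server rotates the answer order on every query."""
    answer = deque(addresses)
    picks = Counter()
    for _ in range(queries):
        picks[answer[0]] += 1   # a "dumb" client takes the first address
        answer.rotate(-1)       # server rotates the list for the next query
    return picks

picks = simulate_round_robin(["192.0.2.1", "192.0.2.2", "192.0.2.3"], 9)
for address, count in sorted(picks.items()):
    print(address, count)
```

Each address gets an equal share of the clients, but if one of the
three sites is down, the clients that were handed it first simply fail.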

> > There are expensive commercial DNS servers like F5's
> > 3dns that will test for service availability and modify
> > the response if a location is down.   Some free variations
> > may also be available.
> 
> But that still doesn't solve the propagation issue.  The most
> you could hope for is to find a partner who can seed the
> major caching servers of the major providers.  But there's
> still the downstream issue.

F5 uses a 30 second TTL by default on responses that can
change dynamically.  It works well enough through normal
caches but apps normally keep their first answer until
you restart them.
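
If you write the client yourself, one way to avoid holding the first
answer forever is to re-resolve once the answer is older than some
limit. The sketch below is an assumption-laden illustration, not how
any particular product works: the class name ExpiringResolver is
hypothetical, and because the standard socket calls do not expose the
real TTL, it uses a fixed max_age (30 seconds here, to match the
default mentioned above).

```python
import socket
import time

def _system_lookup(name):
    # gethostbyname_ex returns (hostname, aliases, ip_list)
    return socket.gethostbyname_ex(name)[2]

class ExpiringResolver:
    """Cache DNS answers, but look them up again after max_age seconds."""

    def __init__(self, max_age=30.0, resolve=_system_lookup,
                 clock=time.monotonic):
        self._max_age = max_age
        self._resolve = resolve   # injectable so it can be tested offline
        self._clock = clock
        self._cache = {}          # name -> (time fetched, address list)

    def lookup(self, name):
        now = self._clock()
        entry = self._cache.get(name)
        if entry is None or now - entry[0] >= self._max_age:
            entry = (now, self._resolve(name))
            self._cache[name] = entry
        return entry[1]
```

A long-running client would call lookup() before each new connection
instead of resolving once at startup.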

> Again, the repeat theme here is that it must be solved at the
> layer-3/IP level.  You can't hope to solve it at the
> application levels, like with DNS.

On the contrary, the app is the best place to deal with it
if you can.  That is, always return all possible IP addresses
in the DNS query (or at least all working sites) and let
the app walk through the list until it gets a connection
that works.  I have quite a bit of experience with this and
that approach is even better than trying to juggle DNS
dynamically except for the case where you want to force
clients to one location or the other.  For example, you
might temporarily have local routing problems at some
location that make it impossible to connect to one site
or the other that no other test could detect, and if the
app has both IP addresses it can still get to the one that
works.  However, it only works for web apps and ones where
you write the client yourself. The standard library 'connect'
routines will try one address and give up.
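
For a client you write yourself, the "walk through the list" logic is
short. Here is a minimal sketch in Python, assuming IPv4 and TCP; the
function name connect_first_working and the 3 second timeout are
illustrative choices, not anything standard.

```python
import socket

def connect_first_working(candidates, timeout=3.0):
    """Try each (ip, port) pair in order and return the first socket
    that connects. Raise OSError if none of them answer."""
    last_error = None
    for address in candidates:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(timeout)
        try:
            sock.connect(address)
            return sock            # first site that answers wins
        except OSError as exc:     # refused, unreachable, or timed out
            last_error = exc
            sock.close()
    raise OSError(f"no candidate address reachable: {last_error}")

def connect_by_name(name, port, timeout=3.0):
    """Resolve every A record for name and try them all in answer order."""
    infos = socket.getaddrinfo(name, port,
                               family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    candidates = [info[4] for info in infos]   # info[4] is (ip, port)
    return connect_first_working(candidates, timeout)
```

The timeout matters: a site that is unreachable rather than refusing
connections will otherwise stall the client for a long time before it
moves on to the next address.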

-- 
  Les Mikesell
    lesmikesell at gmail.com




