On Thu, 2006-01-05 at 11:42, Bryan J. Smith wrote:
Web browsers (IE at least) tend to be very good about handling failures if you give out multiple IP addresses for a name and one or more locations do not respond.
Er, um, er, it's still a little arbitrary and not exactly correct. Furthermore, default NT5.x (2000+) operation is to "hold down" DNS names for a default of 2 minutes, even ones that are round-robin, if just 1 doesn't resolve. It's a really messy default in the Windows client that causes a lot of issues.
The 'round-robin' concept just means that the server will rotate the order of the addresses in the answer. All addresses are still visible to the client and in the caches. Try 'nslookup www.ibm.com' to see the effect of multiple A records for the same name.
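If you'd rather see the same thing from a program than from nslookup, here is a minimal Python sketch. The function name `list_addresses` and the `resolve` parameter are just names I made up for illustration; by default it uses the stock `socket.getaddrinfo`. Note that the local resolver library may reorder the answers, so the order you see isn't necessarily the order the server sent.

```python
import socket

def list_addresses(name, port=80, resolve=socket.getaddrinfo):
    """Return every distinct address the resolver hands back, in answer order.

    `resolve` is a hypothetical hook so a fake resolver can be swapped in;
    it defaults to the real socket.getaddrinfo.
    """
    seen = []
    for _family, _type, _proto, _canon, sockaddr in resolve(
            name, port, type=socket.SOCK_STREAM):
        address = sockaddr[0]          # sockaddr is (host, port, ...)
        if address not in seen:        # drop duplicates, keep order
            seen.append(address)
    return seen

# Example (needs working DNS): print(list_addresses("www.ibm.com"))
```

A name with several A records should come back as a list with more than one entry, which is the point: the client gets all of them, not just one.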
IE will try them all. Try setting up multiple A records in your DNS with one pointing to a working web server and one not, and see if you even notice a difference when connecting to that name. On the other hand, if you've given it a single IP address in the first DNS lookup and then change the DNS response, you'll have to close all instances of IE to make it pick up the change.
I think you might be thinking of ADS name resolution, which is a little different than DNS (even though Microsoft says it's DNS ;-). I could be wrong though, but that's what my experience suggests.
No, I mean multiple A records. Most apps are dumb and only try the first one in the list returned, so the round-robin rotation on the server side gives statistical load balancing, but apps other than web browsers tend to fail if the first address doesn't respond.
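To illustrate why rotation alone spreads the load for first-address-only clients, here is a toy Python simulation. It is only a sketch: `simulate_round_robin` is a made-up name, the addresses are documentation-range placeholders, and it ignores caching entirely, which a real deployment can't.

```python
from collections import Counter, deque

def simulate_round_robin(addresses, queries):
    """Count which address lands first in the answer when the server
    rotates the record order by one position per query."""
    ring = deque(addresses)
    first_picks = Counter()
    for _ in range(queries):
        answer = list(ring)            # the full list still goes to the client
        first_picks[answer[0]] += 1    # a "dumb" app only uses the first entry
        ring.rotate(-1)                # the server rotates for the next query
    return first_picks

picks = simulate_round_robin(["192.0.2.1", "192.0.2.2", "192.0.2.3"], 300)
print(picks)
```

Each address ends up first an equal share of the time, but every client that only looks at entry zero is stuck if that one address happens to be dead.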
There are expensive commercial DNS servers like F5's 3dns that will test for service availability and modify the response if a location is down. Some free variations may also be available.
But that still doesn't solve the propagation issue. The most you could hope for is to find a partner who can seed the major caching servers of the major providers. But there's still the downstream issue.
F5 uses a 30-second TTL by default on responses that can change dynamically. It works well enough through normal caches, but apps normally keep their first answer until you restart them.
Again, the repeat theme here is that it must be solved at the layer-3/IP level. You can't hope to solve it at the application levels, like with DNS.
On the contrary, the app is the best place to deal with it if you can. That is, always return all possible IP addresses in the DNS query (or at least all working sites) and let the app walk through the list until it gets a connection that works. I have quite a bit of experience with this, and that approach is even better than trying to juggle DNS dynamically, except for the case where you want to force clients to one location or the other. For example, you might temporarily have local routing problems at some location that make it impossible to connect to one site or the other, which no other test could detect, and if the app has both IP addresses it can still get to the one that works. However, it only works for web apps and ones where you write the client yourself. The standard library 'connect' routines will try one address and give up.
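For a client you write yourself, walking the list looks roughly like the Python sketch below. The name `connect_any`, the `resolve` hook and the 3-second timeout are my own choices for illustration, not anything standard, and it tries the addresses one after another rather than in parallel.

```python
import socket

def connect_any(host, port, timeout=3.0, resolve=socket.getaddrinfo):
    """Try every address returned for host and hand back the first
    socket that connects; raise the last error if none of them work.

    `resolve` is a hypothetical hook (defaults to socket.getaddrinfo).
    """
    last_error = None
    for family, socktype, proto, _canon, sockaddr in resolve(
            host, port, type=socket.SOCK_STREAM):
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)       # don't hang on a dead location
        try:
            sock.connect(sockaddr)
            return sock                # first working address wins
        except OSError as exc:
            last_error = exc           # remember the failure, try the next
            sock.close()
    raise last_error or OSError("no addresses returned for %s" % host)
```

Python's own `socket.create_connection` does much the same loop over the `getaddrinfo` results, so there you get this behaviour for free; the sketch just makes the loop visible. The timeout matters: with a long default, a dead first address can still stall the client for a while before it moves on.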