We currently have a backup site at a different location from our main site. This backup site mirrors (as closely as possible) our main services, particularly web serving. Is there a way to have the backup site act as a failover for the main site using something like Linux-HA? They are on separate internet connections with different IP ranges.
Thanks -- Tim Edwards
Tim Edwards tim@registriesltd.com.au wrote:
We currently have a backup site at a different location from our main site. This backup site mirrors (as closely as possible) our main services, particularly web serving. Is there a way to have the backup site act as a failover for the main site using something like Linux-HA? They are on separate internet connections with different IP ranges.
Yes and no.
Yes in that you have a couple of options -- one common, one pretty much a hack.
The common one is to have your own autonomous system number and run BGP. That way you control your IP assignments, failover, etc... in ways that are efficient and quickly propagated.
The hack is to put routers and/or 1-to-1 NAT devices at each site, which can redirect traffic to the other site. That is less efficient and can cause some headaches.
No in that there's really no "software" or "service" facility to deal with this. Round robin DNS does nothing to solve this, and name propagation is always an issue.
So it's something you can only address at the IP-level -- either by having your own, Internet-recognized autonomous system number, or redirecting IPs from each site to the other when servers/sites go down.
I've been exploring high-uptime availability solutions for our own, database-driven ASP. We have two sites, much as original poster describes, and 5-minute DNS, but many larger providers (EG: SBC, AOL) have DNS servers that seem to ignore TTL. Complicating the DNS issues are ones around keeping a database replicated (using Postgres 8.1) and a filesystem synchronized. (we actually have a "dark" 3rd, non-public site used for business continuity in a totally worst-case scenario)
So, I've been in a quandary on this very same issue. We had a problem about 2 years ago where we had to switch to the failover in an emergency. From our end, we were "back up" in < 3 hours (far less than the 6 hours allowed by contract), but it took over 48 hours for availability to approach 100%, due to the aforementioned DNS issues. (I hate you, SBC!)
So, do you know of a "getting started" for how to get an autonomous system number and run BGP? My skills as a network admin are a distant second to my primary skills...
-Ben
On Wednesday 04 January 2006 22:21, Bryan J. Smith wrote:
The common one is to have your own autonomous system number and run BGP. That way you control your IP assignments, failover, etc... in ways that are efficient and quickly propagated.
Benjamin Smith lists@benjamindsmith.com wrote:
I've been exploring high-uptime availability solutions for our own, database-driven ASP. We have two sites, much as original poster describes, and 5-minute DNS, but many larger providers (EG: SBC, AOL) have DNS servers that seem to ignore TTL.
Apparently many others here serve content to users on networks other than those on AOL, SBC, etc... Or at least their comments seem to repeat that. ;->
It's gotta be either that, or it's the reality that they keep testing when their servers are using DNS that talk directly to (or are the) authority for the domain.
I suspect the latter. ;->
So, I've been in a quandary on this very same issue. We had a problem about 2 years ago where we had to switch to the failover in an emergency. From our end, we were "back up" in < 3 hours (far less than the 6 hours allowed by contract), but it took over 48 hours for availability to approach 100%, due to the aforementioned DNS issues. (I hate you, SBC!)
Which is why you need 1-to-1 NAT for near-immediate uptime.
Of course, that doesn't help you if the provider of those IPs can't reach your 1-to-1 NAT equipment. That's why it's not a true failover.
An ideal, although bandwidth-hungry, solution is to keep your router/1-to-1 NAT equipment at different locations. E.g., you have 4 sites -- 2 router/NAT, 2 servers. Only if and when you lost 3 sites would you go down.
But that gets mighty expensive. Which brings us to the next concept ...
So, do you know of a "getting started" for how to get an autonomous system number and run BGP? My skills as a network admin are a distant second to my primary skills...
In the US, start with the authority, ARIN -- http://www.arin.net/ They will give you the "do it yourself" cost.
I suspect you're below that, so you need to talk to your provider(s) about a solution. Unless you have just 1 provider (which means you're putting all your eggs in their backbone basket), it's pretty tough to do without ARIN at some point.
That's why it's typically better to rely on a partner who already has their own AS, and ties into 3+ providers. Again, this isn't something that you can do on your own, unless you have a lot of dough.
Again, this is where the small-to-medium ASP finds him/herself at the point where they either have to make a major investment to go bigger, partner to go bigger (although they will always be smaller than the partner), etc... It is *NEVER* something you can do in software, and that's the chronic, dead-wrong assumption.
Benjamin Smith wrote:
I've been exploring high-uptime availability solutions for our own, database-driven ASP. We have two sites, much as original poster describes, and 5-minute DNS, but many larger providers (EG: SBC, AOL) have DNS servers that seem to ignore TTL.
It is important to use TTLs for the various services within a specified range. Too short will get you ignored, at least after a while. Too long.. same thing. I've never had a problem with setting the TTLs low for a few days before a transfer or some such, but then set them back up into acceptable ranges after the move.
Good help on the proper ranges can be found on http://dnsreport.com
Five minutes will more than likely get you ignored after a few days or weeks. Imagine if everyone set their TTL to five minutes.. the root nameservers would be looking up every record on the net once every five minutes... a pretty arduous task for 13 servers. And if you want to find out what happens if you don't use cached DNS, try turning it off at the router level sometime for fun.... s--------l----------o----------w!!!! Heck, 1200bps dialups act like T-1s compared to no caching.
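The caching behavior being argued over here can be sketched with a tiny TTL-honoring resolver cache. This is a hypothetical illustration in Python -- the class and the 300-second TTL are made up for the example, not any real resolver's API:

```python
import time

class DnsCache:
    """Minimal sketch of a caching resolver's TTL behavior:
    answers are reused until their TTL expires, then re-fetched."""

    def __init__(self, now=time.monotonic):
        self._now = now
        self._store = {}  # name -> (addresses, expires_at)

    def get(self, name, resolve):
        """Return cached addresses while the TTL is live; otherwise
        call resolve(name) -> (addresses, ttl_seconds) upstream."""
        entry = self._store.get(name)
        if entry and entry[1] > self._now():
            return entry[0]          # cache hit: upstream never queried
        addresses, ttl = resolve(name)
        self._store[name] = (addresses, self._now() + ttl)
        return addresses
```

With a 300-second TTL, every cache worldwide re-queries the authority at most once per five minutes per name, which is the load concern raised above; a longer TTL directly reduces upstream queries at the cost of slower failover.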
A side point, I think some have mentioned DNS on Windows boxes. There is caching of DNS on windows machines as well. This is not a function of MSIE nor Outlook or whatever application you use, but instead is in some way central to the network on the system, to which MSIE/etc. makes calls for DNS. Botched DNS (corruption) will at least require a reboot of the system.. or a cache flush from the command line. I've seen some things get cached that require an edit of a file on the system.. but I don't remember the location or the name of that file.
Best, John Hinton
On Thursday 05 January 2006 13:15, John Hinton wrote:
It is important to use TTLs for the various services within a specified range. Too short will get you ignored, at least after a while. Too long.. same thing. I've never had a problem with setting the TTLs low for a few days before a transfer or some such, but then set them back up into acceptable ranges after the move.
Good help on the proper ranges can be found on http://dnsreport.com
Tried that. Suggestions? http://dnsreport.com/tools/dnsreport.ch?domain=schoolpathways.com
Five minutes will more than likely get you ignored after a few days or weeks. Imagine if everyone set their TTL to five minutes.. the root nameservers would be looking up every record on the net once every five minutes... a pretty arduous task for 13 servers. And if you want to find out what happens if you don't use cached DNS, try turning it off at the router level sometime for fun.... s--------l----------o----------w!!!! Heck, 1200bps dialups act like T-1s compared to no caching.
That's fine - but how do I minimize downtime in a failover scenario? (Thus, my questions about BGP, which you don't seem to mention)
In the past, when I 'cut down' the TTL to 5 minutes, I did so about 1 week before the switch. (that was the TTL on the domains, so it was the shortest I could do it.) I still had the aforementioned problem.
-Ben
On Thu, 2006-01-05 at 00:10, Tim Edwards wrote:
We currently have a backup site at a different location from our main site. This backup site mirrors (as closely as possible) our main services, particularly web serving. Is there a way to have the backup site act as a failover for the main site using something like Linux-HA? They are on separate internet connections with different IP ranges.
Web browsers (IE at least) tend to be very good about handling failures if you give out multiple IP addresses for a name and one or more locations does not respond. When both work the load will balance across them. If you provide the client software for other services you can build in similar robustness by getting the list from DNS and trying each until you get a connection (don't retry too fast if you expect to have a lot of clients...).
There are expensive commercial DNS servers like F5's 3dns that will test for service availability and modify the response if a location is down. Some free variations may also be available. For a few services you could probably write your own fairly easily - you just have to use a short TTL on the DNS records. However, most applications cache the DNS response internally regardless of the TTL and won't automatically pick up a change unless you exit the app and restart it. IE does this too, but if you have given out 2 addresses and one subsequently stops working it seems to do the right thing where if you give out one address first then change it you have to exit and restart to pick up the new one.
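The "try each address until you get a connection" behavior described above can be sketched briefly. This is a hypothetical illustration using Python's standard socket module (the function name is made up for the example):

```python
import socket

def connect_first_working(host, port, timeout=2.0):
    """Resolve every address for `host` and try each in turn,
    returning the first socket that connects -- a sketch of the
    fallback behavior the better browsers approximate."""
    last_error = None
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    for family, socktype, proto, _name, addr in infos:
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(addr)
            return sock              # first reachable site wins
        except OSError as exc:
            sock.close()
            last_error = exc
    raise last_error or OSError("no addresses for %s" % host)
```

Note that Python's own socket.create_connection already iterates over the getaddrinfo results this way, which is the point being made in the thread: clients that do this get multi-site failover almost for free, while clients that grab only the first address do not.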
-- Les Mikesell lesmikesell@gmail.com
Les Mikesell lesmikesell@gmail.com wrote:
Web browsers (IE at least) tend to be very good about handling failures if you give out multiple IP addresses for a name and one or more locations does not respond.
Er, um, er, it's still a little arbitrary and not exactly correct. Furthermore, default NT5.x (2000+) operation is to "hold down" DNS names for a default of 2 minutes, even ones that are round-robin, if just 1 doesn't resolve. It's a really messy default in the Windows client that causes a lot of issues.
I think you might be thinking of ADS name resolution, which is a little different than DNS (even though Microsoft says it's DNS ;-). I could be wrong though, but that's what my experience suggests.
There are expensive commercial DNS servers like F5's 3dns that will test for service availability and modify the response if a location is down. Some free variations may also be available.
But that still doesn't solve the propagation issue. The most you could hope for is to find a partner who can seed the major caching servers of the major providers. But there's still the downstream issue.
However, most applications cache the DNS response internally regardless of the TTL and won't automatically pick up a change unless you exit the app and restart it.
Exactomundo, let alone if the OS/resolver or whatever "cached value" at the "non-authority" honors the TTL in the first place.
Again, the repeat theme here is that it must be solved at the layer-3/IP level. You can't hope to solve it at the application levels, like with DNS.
On Thu, 2006-01-05 at 11:42, Bryan J. Smith wrote:
Web browsers (IE at least) tend to be very good about handling failures if you give out multiple IP addresses for a name and one or more locations does not respond.
Er, um, er, it's still a little arbitrary and not exactly correct. Furthermore, default NT5.x (2000+) operation is to "hold down" DNS names for a default of 2 minutes, even ones that are round-robin, if just 1 doesn't resolve. It's a really messy default in the Windows client that causes a lot of issues.
The 'round-robin' concept just means that the server will rotate the order of the addresses in the answer. All addresses are still visible to the client and in the caches. Try 'nslookup www.ibm.com' to see the effect of multiple A records for the same name.
IE will try them all. Try setting up multiple A records in your DNS with one pointing to a working web server and one not and see if you even notice a difference when connecting to that name. On the other hand, if you've given it a single IP address in the first DNS lookup, then change the DNS response you'll have to close all instances of IE to make it pick up the change.
I think you might be thinking of ADS name resolution, which is a little different than DNS (even though Microsoft says it's DNS ;-). I could be wrong though, but that's what my experience suggests.
No, I mean multiple A records. Most apps are dumb and only try the first one in the list returned so the round robin rotation on the server side gives statistical load balancing but apps other than web browsers tend to fail if the first address doesn't respond.
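The server-side rotation described here is simple to picture: the full A-record list is always returned, only the order changes between queries. A hypothetical sketch (not any real DNS server's code):

```python
from collections import deque

class RoundRobinAnswers:
    """Sketch of round-robin DNS on the server side: every query
    gets the whole address list, rotated one position each time,
    so dumb clients that only use the first entry still spread
    statistically across the servers."""

    def __init__(self, addresses):
        self._addrs = deque(addresses)

    def answer(self):
        result = list(self._addrs)
        self._addrs.rotate(-1)   # next query starts one entry later
        return result
```

This is exactly why round robin gives load balancing but not failover for most apps: a client that tries only `answer()[0]` hits a dead server on a fraction of queries, while a client that walks the list always finds a live one.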
There are expensive commercial DNS servers like F5's 3dns that will test for service availability and modify the response if a location is down. Some free variations may also be available.
But that still doesn't solve the propogation issue. The most you could hope for is to find a partner who can seed the major caching servers of the major providers. But there's still the downstream issue.
F5 uses a 30 second TTL by default on responses that can change dynamically. It works well enough through normal caches but apps normally keep their first answer until you restart them.
Again, the repeat theme here is that it must be solved at the layer-3/IP level. You can't hope to solve it at the application levels, like with DNS.
On the contrary, the app is the best place to deal with it if you can. That is, always return all possible IP addresses in the DNS query (or at least all working sites) and let the app walk through the list until it gets a connection that works. I have quite a bit of experience with this and that approach is even better than trying to juggle DNS dynamically except for the case where you want to force clients to one location or the other. For example, you might temporarily have local routing problems at some location that make it impossible to connect to one site or the other that no other test could detect, and if the app has both IP addresses it can still get to the one that works. However, it only works for web apps and ones where you write the client yourself. The standard 'connect' library routines will try one address and give up.
Les Mikesell lesmikesell@gmail.com wrote:
The 'round-robin' concept just means that the server will rotate the order of the addresses in the answer. All addresses are still visible to the client and in the caches. Try 'nslookup www.ibm.com' to see the effect of multiple A records for the same name.
Yes, I know how it works. What I'm saying is that I don't think the Windows resolver, before they even get to MS IE, works as you believe. At least not in an Internet environment. The Windows resolver is very, very different from most UNIX resolvers, including a "hold down" for not just failed resolution, but failed access.
IE will try them all. Try setting up multiple A records in your DNS with one pointing to a working web server and one not and see if you even notice a difference when connecting to that name.
Furthermore, I made the additional point that I think you're crossing some attributes of DNS with those of ActiveDirectory Server (ADS) integrated DNS.
This is the Windows Resolver at work, not so much MS IE, although ADS-integrated DNS and ADS-integrated applications like MS IE do some interesting things very _differently_ and _separately_ from how the Windows resolver works for _Internet_ addresses. ;->
On the other hand, if you've given it a single IP address in the first DNS lookup, then change the DNS response you'll have to close all instances of IE to make it pick up the change.
Again, there's a lot of logic at the Windows resolver at work that you're not considering. And then there are resolution issues both at the Windows resolver and the application that work very differently than MS IE.
No, I mean multiple A records.
But on what server?
A true BIND or similar DNS server or Windows DNS Server?
Most apps are dumb and only try the first one in the list returned so the round robin rotation on the server side gives statistical load balancing but apps other than web browsers tend to fail if the first address doesn't respond.
I think you're crossing some concepts that MS IE doesn't handle, but the Windows resolver does. And then there are ADS considerations as well.
F5 uses a 30 second TTL by default on responses that can change dynamically. It works well enough through normal caches but apps normally keep their first answer until you restart them.
But there is a lot of arbitrary cache/resolution between their authority and your end-usage. That's always going to be an issue.
On the contrary, the app is the best place to deal with it if you can. That is, always return all possible IP addresses in the DNS query (or at least all working sites) and let the app walk through the list until it gets a connection that works.
Again, arbitrary and you can not only _not_ trust the apps to work that way, but worse yet, there's a lot of cache/resolution between you, the authority, and the end system.
If you're going directly to the authority (especially if you are the authority), then yeah, it can and will work. But for any arbitrary Internet user, there is a lot left to chance and layers between the authority and them.
IP address is the only guarantee. That's why people get AS numbers. You have to appear to be a single point from the standpoint of the Internet, even if you're getting your connections from 2-3 different providers.
I have quite a bit of experience with this and that approach is even better than trying to juggle DNS dynamically except for the case where you want to force clients to one location or the other. For example, you might temporarily have local routing problems at some location that make it impossible to connect to one site or the other that no other test could detect, and if the app has both IP addresses it can still get to the one that works.
Yes, that works when _you_ can _guarantee_ that all clients will talk _directly_ to the authority, or control intermediate caches/non-authorities that guarantee adherence to the TTL. That's why it works for intranets as well as Internet networks _you_ control.
But everything changes when you have people who don't access the authority of the domain. And to rely on an application is rather arbitrary, especially how I've seen both the Windows resolver and MS IE act.
However, it only works for web apps and ones where you write the client yourself. The standard 'connect' library routines will try one address and give up.
Yes, which is why you can't trust it. Even if you do write it, you're making the assumptions. What if the service is not acting like you assume? DNS does not provide what it seems from the standpoint of different utilities (let alone versions), and Microsoft's ADS-integrated DNS works very, very differently, making matters worse.
On Thu, 2006-01-05 at 13:01, Bryan J. Smith wrote:
Les Mikesell lesmikesell@gmail.com wrote:
The 'round-robin' concept just means that the server will rotate the order of the addresses in the answer. All addresses are still visible to the client and in the caches. Try 'nslookup www.ibm.com' to see the effect of multiple A records for the same name.
Yes, I know how it works. What I'm saying is that I don't think the Windows resolver, before they even get to MS IE, works as you believe. At least not in an Internet environment. The Windows resolver is very, very different from most UNIX resolvers, including a "hold down" for not just failed resolution, but failed access.
You are missing the point. If you put multiple A records for the same name in DNS, all clients will see them all the time whether they work or not and whether anything caches them or not.
IE will try them all. Try setting up multiple A records in your DNS with one pointing to a working web server and one not and see if you even notice a difference when connecting to that name.
Furthermore, I made the addition point that I think you're crossing some attributes of DNS with those of ActiveDirectory Server (ADS) integrated DNS.
The DNS server is irrelevant here. Any server should be able to serve multiple A records for a name, all the time.
This is the Windows Resolver at work, not so much MS IE, although ADS-integrated DNS and ADS-integrated applications like MS IE do some interesting things very _differently_ and _separately_ from how the Windows resolver works for _Internet_ addresses. ;->
And the resolver is irrelevant as well. Any client should be able to see the list of addresses, all the time.
No, I mean multiple A records.
But on what server?
A true BIND or similar DNS server or Windows DNS Server?
It doesn't matter. Assign multiple A records, you get a list of IP addresses as the answer.
Most apps are dumb and only try the first one in the list returned so the round robin rotation on the server side gives statistical load balancing but apps other than web browsers tend to fail if the first address doesn't respond.
I think you're crossing some concepts that MS IE doesn't handle, but the Windows resolver does. And then there are ADS considerations as well.
No, there are issues with the stock connect() routine.
F5 uses a 30 second TTL by default on responses that can change dynamically. It works well enough through normal caches but apps normally keep their first answer until you restart them.
But there is a lot of arbitrary cache/resolution between their authority and your end-usage. That's always going to be an issue.
In the dynamic scenario, you have a possible problem of cache admins configuring to use a minimum time of their own choice rather than following the spec, but that is rare. And it doesn't affect an unchanging list.
On the contrary, the app is the best place to deal with it if you can. That is, always return all possible IP addresses in the DNS query (or at least all working sites) and let the app walk through the list until it gets a connection that works.
Again, arbitrary and you can not only _not_ trust the apps to work that way, but worse yet, there's a lot of cache/resolution between you, the authority, and the end system.
If you write the app you can trust it to work the way you wrote it and you don't have to worry about anyone's cache. That's why I suggest doing it that way. Always give out multiple IP addresses and don't change DNS. Write the app to walk the list of returned addresses itself if the first one it tries doesn't respond. This seems to already be done in the common web browsers.
IP address is the only guarantee. That's why people get AS numbers. You have to appear to be a single point from the standpoint of the Internet, even if you're getting your connections from 2-3 different providers.
Not really. If you can't control the app you might have to live with this. Otherwise you can give out several IP addresses for a name and let the app decide which one is reachable from its location.
I have quite a bit of experience with this and that approach is even better than trying to juggle DNS dynamically except for the case where you want to force clients to one location or the other. For example, you might temporarily have local routing problems at some location that make it impossible to connect to one site or the other that no other test could detect, and if the app has both IP addresses it can still get to the one that works.
Yes, that works when _you_ can _guarantee_ that all clients will talk _directly_ to the authority, or control intermediate caches/non-authorities that guarantee adherence to the TTL. That's why it works for intranets as well as Internet networks _you_ control.
Not true for the case of supplying multiple A records that don't change. The DNS servers/resolvers may change the order of the list but nothing else.
But everything changes when you have people who don't access the authority of the domain. And to rely on an application is rather arbitrary, especially how I've seen both the Windows resolver and MS IE act.
If you can find a repeatable case where IE does the wrong thing with multiple A records where some work and some don't please let me know. I don't claim to understand how it works but it seems very robust in those circumstances.
However, it only works for web apps and ones where you write the client yourself. The standard 'connect' library routines will try one address and give up.
Yes, which is why you can't trust it. Even if you do write it, you're making the assumptions. What if the service is not acting like you assume?
How can DNS not work according to the specifications at least at the 'A' record level?
DNS does not provide what it seems from the standpoint of different utilities (let alone versions), and Microsoft's ADS-integrated DNS works very, very differently, making matters worse.
It doesn't matter. Any DNS server should be able to take multiple A records for one name and any DNS client should get a list of addresses as the response. The client app just needs to know enough to try more than the first one in the list. Actually I think some versions of Windows will try to figure out which one to try from their route table, but that doesn't seem very predictable.
Les Mikesell lesmikesell@gmail.com wrote:
You are missing the point.
It's very clear both you and I are talking about 2 entirely different things. I don't disagree with many of the concepts you are covering, I know how round robin DNS works. But how these concepts work with respect to high availability is what I'm taking major issue with.
The DNS server is irrelevant here.
It's _very_relevant_ if MS-RPC calls are being used and resolution changes from standard DNS at the _client_! That was my point!
In the dynamic scenario, you have a possible problem of cache admins configuring to use a minimum time of their own choice rather than following the spec, but that is rare. And it doesn't affect an unchanging list.
Sigh, you're picking and choosing the context you wish to discuss. When you're providing server failover, you can't rely on applications or DNS, but you must make the IP appear as the same.
On one site, that is doable with NAT -- be it 1-to-1 or destination, with additional considerations. Across sites you have to get far more involved. This, of course, assumes you're using stateless sessions (like HTTP), and changes radically (and NAT won't work) if you are using stateful sessions (like RPC, NFS, etc...).
You are _not_ going to address it with DNS. It might work for you if you can guarantee all systems hit the true authority, like you can on a LAN or corporate intranet. It might also work if you're using an extended DNS server that uses alternative services -- such as how ADS and MS IE interoperate with each other (yes, even when it "seems" you're using "standard DNS" you're actually not).
If you write the app you can trust it to work the way you wrote it and you don't have to worry about anyone's cache. That's why I suggest doing it that way. Always give out multiple IP addresses and don't change DNS. Write the app to walk the list of returned addresses itself if the first one it tries doesn't respond.
We're talking about web services spread across 2 sites. What the heck does this context have anything to do with it?
This seems to already be done in the common web browsers.
Not the logic you're presenting, no. I think you're mega-oversimplifying things, and have the Windows resolver/MS IE logic _wrong_ on DNS -- other than the basics of how round robin works.
Not really. If you can't control the app you might have to live with this.
Is that _not_ the context of this _entire_ thread?
Not true for the case of supplying multiple A records that don't change. The DNS servers/resolvers may change the order of the list but nothing else.
Again, you're continuing to make the assumption on the applications used, and that they magically handle this logic as you want them to arbitrarily do so.
If you can find a repeatable case where IE does the wrong thing with multiple A records where some work and some don't please let me know. I don't claim to understand how it works but it seems very robust in those circumstances.
And I would differ on that assessment, very much so.
I often have to hack the Windows registry just to get MS IE to work correctly for corporate intranets, much less the Internet (with far more variables).
How can DNS not work according to the specifications at least at the 'A' record level?
Sigh, I'm not opening up that can of worms (don't get me started ;-).
I also think you're referring extended operations of ADS, and not DNS, with MS IE. When you think you're just doing simple DNS resolution, there are MS-RPC calls being made if you have ADS for your DNS and MS IE for your client.
Actually I think some versions of windows will try to figure out which to try from their route table but that doesn't seem very predictable.
Just about everything you have discussed here has been rather "arbitrary" and not very well understood.
As I mentioned before, I purposely have to hack the Windows registry (typically pushed via GPOs) just to get MS IE to stop doing some really stupid things on an intranet. I seriously doubt it works so perfectly as you describe over the Internet with its resolution -- quite the opposite.
The "hold downs" on various things are my biggest issue. Especially when it comes to non-availability.
On Thu, 2006-01-05 at 14:21, Bryan J. Smith wrote:
It's very clear both you and I are talking about 2 entirely different things. I don't disagree with many of the concepts you are covering, I know how round robin DNS works. But how these concepts work with respect to high availability is what I'm taking major issue with.
The DNS server is irrelevant here.
It's _very_relevant_ if MS-RPC calls are being used and resolution changes from standard DNS at the _client_! That was my point!
Then we agree. Don't change DNS.
In the dynamic scenario, you have a possible problem of cache admins configuring to use a minimum time of their own choice rather than following the spec, but that is rare. And it doesn't affect an unchanging list.
Sigh, you're picking and choosing the context you wish to discuss. When you're providing server failover, you can't rely on applications or DNS, but you must make the IP appear as the same.
Or let the client connect to its choice of multiple IPs which can be in different locations.
On one site, that is doable with NAT -- be it 1-to-1 or destination, with additional considerations. Across sites you have to get far more involved. This, of course, assumes you're using stateless sessions (like HTTP), and changes radically (and NAT won't work) if you are using stateful sessions (like RPC, NFS, etc...).
The client app needs to know how to pick up after a failed connection if you want it to be transparent to the user. With stateless http that just means that you make another connection. Stateful sessions can sort-of be made to work if you mirror the session data between sites but that's probably going to break along with whatever takes the one site offline anyway. With anything else things will break unless the client makes the necessary steps to get back in sync. That's why this logic is best included in the client app along with the reconnect logic that tries the other address(s) that DNS provided. Even if you pretend that some other machine had the same address most apps aren't going be graceful about restarting their broken connections.
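That reconnect-and-retry logic can be sketched roughly. This is a hypothetical illustration in Python; the class, the endpoint-list shape, and the resend-on-failure policy are made up for the example, not any particular product's client:

```python
import socket

class ReconnectingClient:
    """Sketch of client-side failover: keep the whole DNS-supplied
    endpoint list, and on a broken connection walk the list again
    to reach whichever site is still up."""

    def __init__(self, endpoints, timeout=2.0):
        self.endpoints = list(endpoints)  # (host, port) pairs from DNS
        self.timeout = timeout
        self.sock = None

    def _connect(self):
        for host, port in self.endpoints:
            try:
                self.sock = socket.create_connection(
                    (host, port), timeout=self.timeout)
                return               # first reachable site wins
            except OSError:
                continue
        raise OSError("all sites unreachable")

    def send(self, data):
        if self.sock is None:
            self._connect()
        try:
            self.sock.sendall(data)
        except OSError:
            self.sock.close()        # connection broke: fail over,
            self._connect()          # resync would also happen here
            self.sock.sendall(data)
```

As the paragraph above notes, for anything stateful the reconnect path would also need to resynchronize session state with the new site before resending; for stateless HTTP-style traffic, simply making a new connection is enough.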
You are _not_ going to address it with DNS.
You can if you always offer distributed locations and let the client choose the address.
If you write the app you can trust it to work the way you wrote it, and you don't have to worry about anyone's cache. That's why I suggest doing it that way. Always give out multiple IP addresses and don't change DNS. Write the app to walk the list of returned addresses itself if the first one it tries doesn't respond.
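A minimal sketch of that client-side walk, in Python. The function name is our own invention, not from any particular library; the point is simply that `getaddrinfo` hands back every address DNS supplied, and the client tries each in turn:

```python
import socket

def connect_first_responding(host, port, timeout=3.0):
    """Resolve every address DNS returns for host and try each one
    in order, returning the first socket that connects.  Raises
    OSError only if every address fails."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    last_err = None
    for family, socktype, proto, _, sockaddr in infos:
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(sockaddr)
            sock.settimeout(None)
            return sock  # first address that answers wins
        except OSError as err:
            last_err = err
            sock.close()
    raise last_err if last_err else OSError("no addresses for %s" % host)
```

With this approach a dead A record costs the client one connect timeout, after which it quietly moves on to the next address -- no DNS change required.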
We're talking about web services spread across 2 sites. What the heck does this context have to do with it?
Web browsers already do that.
Not true for the case of supplying multiple A records that don't change. The DNS servers/resolvers may change the order of the list but nothing else.
Again, you're continuing to make assumptions about the applications used, and that they magically handle this logic the way you want them to.
If you write the app you can make it work that way. I agree that there are a lot of ways it can go wrong. Ours does it right, so it can be done...
If you can find a repeatable case where IE does the wrong thing with multiple A records where some work and some don't, please let me know. I don't claim to understand how it works, but it seems very robust in those circumstances.
And I would differ on that assessment, very much so.
And that repeatable case I asked for would be???
How can DNS not work according to the specifications at least at the 'A' record level?
Sigh, I'm not opening up that can of worms (don't get me started ;-).
How would any service work over the internet if you can't resolve A records?
I also think you're referring to the extended operations of ADS, and not DNS, with MS IE. When you think you're just doing simple DNS resolution, there are MS-RPC calls being made if you have ADS for your DNS and MS IE for your client.
I'm not sure what you are talking about. We have two colo sites with an assortment of web and proprietary services. No ADS in sight. I have F5 3dns boxes as the primary DNS servers but normally let them give out both addresses for all services, all the time. IE mostly just works. Our own client software takes care of failover using the addresses supplied by DNS. It has its own heartbeat on the server connection and will reconnect anytime it notices a problem with the connection, trying every address in the list. When it reconnects it refreshes certain things from the new server connection. If a site goes completely off line, the F5 will remove the address from the DNS list but that is mostly irrelevant to our own software which would ignore the failing address anyway.
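The heartbeat-and-reconnect pattern described above can be sketched roughly as follows. This is a hypothetical illustration with invented names and a made-up one-byte ping protocol -- the real client software and the F5 boxes are not involved here:

```python
import socket

def heartbeat_ok(sock):
    """Send a one-byte ping and expect a one-byte echo back.
    Any error or an empty read means the connection is dead."""
    try:
        sock.sendall(b"\x00")
        return sock.recv(1) != b""
    except OSError:
        return False

def reconnect(addresses, timeout=2.0):
    """Walk the (host, port) pairs DNS supplied and return a
    socket to the first server that accepts, or None if none do."""
    for addr in addresses:
        try:
            return socket.create_connection(addr, timeout=timeout)
        except OSError:
            continue
    return None
```

The real client would loop: sleep, check `heartbeat_ok`, and on failure call `reconnect` and then re-fetch whatever per-connection state it needs from the new server -- the "refreshes certain things" step. A failing site's address just gets skipped, whether or not DNS ever removes it.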
As I mentioned before, I purposely have to hack the Windows registry (typically pushed via GPOs) just to get MS IE to stop doing some really stupid things on an intranet. I seriously doubt it works as perfectly as you describe over the Internet with its resolution -- quite the opposite.
Try it. If you are resolving names with netbios you might see something different. Put a name in dns that doesn't exist anywhere else to test it.
The "hold downs" on various things are my biggest issue. Especially when it comes to non-availability.
Being able to get all the addresses from multiple A records doesn't have anything to do with hold downs.
Les Mikesell lesmikesell@gmail.com wrote:
You can if you always offer distributed locations and let the client choose the address.
The problem with that is it is too arbitrary.
Web browsers already do that.
I think we disagree there. And I think you are stretching some things to fit web browsers that are simply not true.
I'm not sure what you are talking about. We have two colo sites with an assortment of web and proprietary services. No ADS in sight.
Okay, no ADS. I was waiting for that confirmation.
I have F5 3dns boxes as the primary DNS servers but normally let them give out both addresses for all services, all the time.
Once again, you're looking at it from your perspective very close to the authority. That's completely different than any arbitrary user who may be several non-authoritative resolutions away.
IE mostly just works. Our own client software takes care of failover using the addresses supplied by DNS. It has its own heartbeat on the server connection and will reconnect anytime it notices a problem with the connection, trying every address in the list. When it reconnects it refreshes certain things from the new server connection.
Whoa! Whoa! Whoa!!!
You're talking about heartbeats and other "keep alives" that are not common to web servers handling many, many connections from many, many web clients. You're almost approaching a stateful client/connection when you do that, along with the associated, added traffic.
So, again, your context is _very_different_ than what I understand the need to be here for generic web servers and browsers.
Try it. If you are resolving names with netbios you might see something different.
*SMACK* ;-> Right there, you don't understand a thing about how ADS-DNS works. No offense. ;->
It is _not_ NetBIOS. MS IE does some nasty stuff when it has ADS. MS IE does some stupid stuff when it doesn't as well.
Anyone who has maintained a very large enterprise network will tell you about all of the nasty and/or stupid stuff MS IE does for both intranet and Internet resolution and requests.
I've had to write some really "fun" GPOs as a result.
Being able to get all the addresses from multiple A records doesn't have anything to do with hold downs.
You should read up on how the Windows resolver works as well as how MS IE operates both with and without ADS-integrated DNS. ;->
On Thu, 2006-01-05 at 18:48, Bryan J. Smith wrote:
You can if you always offer distributed locations and let the client choose the address.
The problem with that is it is too arbitrary.
No, that's why letting the client decide is the best approach. Nothing else can know for sure whether a connection is possible to the given IP addresses.
Web browsers already do that.
I think we disagree there. And I think you are stretching some things to fit web browsers that are simply not true.
Have you tried the test I suggested yet?
I have F5 3dns boxes as the primary DNS servers but normally let them give out both addresses for all services, all the time.
Once again, you're looking at it from your perspective very close to the authority. That's completely different than any arbitrary user who may be several non-authoritative resolutions away.
Yes, I control it from the registered primary DNS servers for the zone, but the users are scattered over the world behind all sorts of intermediate DNS servers. That doesn't matter. You put 2 A records in the servers. The clients get 2 IP addresses. No amount of caching changes that.
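That claim is easy to check from any client machine: the stub resolver hands back the whole address list, cached or not. A small sketch (the function name is ours, invented for illustration):

```python
import socket

def all_addresses(host, port=80):
    """Return every distinct IP address the local resolver reports
    for host, in the order the resolver returned them."""
    addrs = []
    for *_, sockaddr in socket.getaddrinfo(host, port,
                                           type=socket.SOCK_STREAM):
        ip = sockaddr[0]
        if ip not in addrs:
            addrs.append(ip)
    return addrs
```

Running something like `all_addresses("www.ibm.com")` from behind any caching DNS server should show the full set of published addresses, matching the `nslookup www.ibm.com` example mentioned later in the thread.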
IE mostly just works. Our own client software takes care of failover using the addresses supplied by DNS. It has its own heartbeat on the server connection and will reconnect anytime it notices a problem with the connection, trying every address in the list. When it reconnects it refreshes certain things from the new server connection.
Whoa! Whoa! Whoa!!!
You're talking about heartbeats and other "keep alives" that are not common to web servers handling many, many connections from many, many web clients. You're almost approaching a stateful client/connection when you do that, along with the associated, added traffic.
Web clients just make a new connection whenever they need one. If there is a visible problem, the user will hit the reload button to force it. Other apps tend to be stateful, which is why you need to build in the logic to fix things when they reconnect. This will be the case even if you fudge the failover with expensive hardware tricks instead of making the app smart enough to do it on the client side.
So, again, your context is _very_different_ than what I understand the need to be here for generic web servers and browsers.
I think the original question was about web and other services. In the 'other' case it might be their own program where they can make it work.
Try it. If you are resolving names with netbios you might see something different.
*SMACK* ;-> Right there, you don't understand a thing about how ADS-DNS works. No offense. ;->
No I don't, but if you can't put in two A records and have any client's DNS lookup receive them (as demonstrated by the 'nslookup www.ibm.com' example) it is broken.
It is _not_ NetBIOS. MS IE does some nasty stuff when it has ADS. MS IE does some stupid stuff when it doesn't as well.
It can't be bad enough that other zone's A records disappear or you wouldn't be able to use the internet.
On Jan 5, 2006, at 1:10 AM, Tim Edwards wrote:
We currently have a backup site at a different location to our main site. This backup site mirrors (as closely as possible) our main services, particularly web serving. Is there a way to have the backup site act as a failover for the main site using something like Linux-HA? They are on seperate internet connections with different IP ranges.
Take a look at Super Sparrow, which I believe does what you're trying to do.
I haven't seen it widely deployed, but I have seen it occasionally in production. Here's a paper on how it works:
http://www.supersparrow.org/ss_paper/index.html
-steve
--- If this were played upon a stage now, I could condemn it as an improbable fiction. - Fabian, Twelfth Night, III,v