DNS round robin for web server failover
Because we were concerned about the reliability of our webservers, we elected to experiment with the following procedure to "round-robin" between our two servers. After 7 months of service we have seen no downside, in spite of much agitation by knowledgeable people that DNS is unsuited to this task. We are particularly interested in what problems those experts anticipated; in our discussions they were not able or willing to articulate the exact problems they foresaw.
What we did
We added a second A record for www.example.com, pointing to the present backup webserver, so that the name now resolves to the addresses of both servers.
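One way to confirm that a name really carries more than one A record is to ask the resolver for every address it returns. The following is a minimal sketch in Python using the standard `socket` module; the function names `a_records` and `distinct_addresses` are hypothetical helpers chosen for this illustration, and the lookup only reports what the local resolver hands back, which may be cached.

```python
import socket


def distinct_addresses(infos):
    """Collect the distinct IP addresses from a getaddrinfo() result list,
    keeping the order in which the resolver returned them."""
    seen = []
    for _family, _socktype, _proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]  # sockaddr is (ip, port) for IPv4
        if ip not in seen:
            seen.append(ip)
    return seen


def a_records(hostname, port=80):
    """Return the distinct IPv4 addresses the resolver reports for hostname.

    This performs a real DNS lookup, so it needs network access.
    """
    infos = socket.getaddrinfo(
        hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return distinct_addresses(infos)
```

Calling `a_records("www.example.com")` for a name set up as described should return a list with two entries, one per server. The order may change between lookups, since many name servers rotate the records.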
Successes with modern browsers
Using recent (2009) versions of MSIE (version 8), Opera, Safari 4.04, Firefox (3.4), Chrome and Konqueror, the worst result was a delay of about 30 seconds before the browser elected to retry at the other IP address and loaded the page. Other than the pause, the process is transparent to the user. The pause occurs only if the first server tried times out, and only for the first page requested from our site in any browser session.
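The behavior observed in these browsers can be sketched as a loop over the addresses returned for the name: try the first, wait for a timeout, then move on to the next. The Python below illustrates that idea only and is not how any particular browser implements it. The function name `connect_with_failover` is hypothetical, and the 30 second default simply mirrors the delay we saw.

```python
import socket


def connect_with_failover(addresses, port=80, timeout=30.0,
                          connect=socket.create_connection):
    """Try each address in order and return (address, connection) for the
    first one that accepts a connection.

    `connect` is injectable so the logic can be exercised without a network;
    by default it is socket.create_connection from the standard library.
    """
    last_error = None
    for address in addresses:
        try:
            conn = connect((address, port), timeout=timeout)
            return address, conn
        except OSError as exc:  # covers timeouts and refused connections
            last_error = exc
    raise OSError("all addresses failed: %r" % (addresses,)) from last_error
```

A shorter `timeout` shortens the pause a user sees when the first address is dead, at the cost of giving up sooner on a server that is merely slow.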
Failures with obsolete browsers
I had access to a few older browsers, such as FF 2.0 and Lynx, that never switched. I was able to test Safari 4.0, FF 2.0 and Chrome 3.0 at Adobe BrowserLab, and those also failed to switch. So I can't say exactly when the ability to switch to a working A record was added to each browser, but it has been added.
Is there a downside?
During periods when one server is down, users of non-switching browsers have a 50% chance of getting the bad server in any given browser session. But with two servers, the chance that one of them is down is about double the chance that a single server is down, so the two effects roughly cancel. This is close to a wash for the older browsers and a pretty big win for the newer ones, which fail only when both servers are down at once. It is true that a user with an older browser could close the browser and wait 5 minutes (our DNS TTL) for another chance, but most users probably wouldn't do that.
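The arithmetic behind this argument can be written out. The sketch below assumes that each server is down with the same probability `p`, that outages are independent, and that a non-switching browser lands on either address with equal chance. These are simplifying assumptions made for illustration, not measurements from our servers, and the function name `failure_chances` is hypothetical.

```python
def failure_chances(p):
    """Return the chance that a browser session fails to reach the site.

    p is the assumed probability that any one server is down, with the
    two servers failing independently.
    Returns (single_server, non_switching, switching).
    """
    single_server = p                    # one server, no round robin
    exactly_one_down = 2 * p * (1 - p)   # either server down, the other up
    both_down = p * p
    # A non-switching browser fails half the time when exactly one server
    # is down, and always when both are down.
    non_switching = 0.5 * exactly_one_down + both_down
    # A switching browser fails only when both servers are down.
    switching = both_down
    return single_server, non_switching, switching
```

Under these assumptions the non-switching case simplifies to p(1 - p) + p² = p, the same as with a single server, while the switching case drops to p².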
This does split our logs over two systems, but that has not been a problem for us, and could be addressed if it were. You might think that this is something better handled by SRV records. We and some others agree. But browser authors have resisted SRV records, and round robin A records have improved our reliability greatly without introducing any new hardware, software or single point of failure, and without much complication of our configuration either. So we like them.
Thirty seconds is much longer than necessary. See: this posting from the ISC
Comments?
A Czech translation of this article is posted here
Daniel Feenberg
feenberg@nber.org
11 October 2010