Distributed Systems Design — Part 1/4, posted on April 9th, reviewed common terminology, Brewer’s/CAP Theorem, and basic concepts in distributed system design. Distributed Systems Design — Part 2/4 covered general guidelines for distributed programming.
Now that you better understand the factors involved in making your application run in a distributed manner while the network is healthy, we will discuss how to make your application do the right thing when problems start happening.
One of the more common problems you are going to encounter with a distributed system is a communication failure between sites. Unfortunately, when this happens the problem often lies in some router half-way between your sites: each site-local cluster of machines is operating normally and may even be able to speak to a sizable portion of the internet. In this situation, continuing to operate both sites can be dangerous if a high degree of consistency is required for any portion of the data being shared between sites.
For those familiar with highly available clusters, the above situation is analogous to a partial failure in the cluster.
Partial failures are among the hardest events a designer can plan for, simply because there are innumerable ways a system can partially fail. As such, almost all highly available applications attempt to turn partial failures of components into much more predictable and workable complete failures of those components.
In a normal site-local highly available cluster of machines, the system ensures partial failures become complete failures through the use of STONITH, which stands for "shoot the other node in the head." When implemented correctly, each node in a (for example) two-node cluster has direct access to the other node's power supply. If either server detects a problem with the other, the still-functioning server can guarantee the other node experiences a highly predictable complete failure by physically powering it off. The link to the power supply is usually a highly reliable serial cable or trusted network segment, so in practice cluster designers almost never have to worry about this all-important link failing.
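To make the idea concrete, here is a minimal sketch of a STONITH-style monitor in Python. Everything specific in it is an assumption: the heartbeat address, the missed-beat threshold, and especially the `pdu-power-off` command, which stands in for whatever out-of-band power control (serial-attached PDU, IPMI, and so on) your cluster hardware actually provides.

```python
import socket
import subprocess
import time

# Hypothetical addresses and commands -- substitute whatever your hardware provides.
PEER_HEARTBEAT_ADDR = ("10.0.0.2", 9000)            # heartbeat endpoint on the other node
PEER_POWER_OFF_CMD = ["pdu-power-off", "outlet-3"]  # out-of-band power control for the peer
HEARTBEAT_TIMEOUT = 5                               # seconds to wait for each heartbeat
MISSED_BEATS_BEFORE_FENCE = 3                       # consecutive misses before we act


def peer_is_alive():
    """Return True if the peer answers a simple TCP heartbeat within the timeout."""
    try:
        with socket.create_connection(PEER_HEARTBEAT_ADDR, timeout=HEARTBEAT_TIMEOUT):
            return True
    except OSError:
        return False


def fence_peer():
    """Turn the peer's unpredictable partial failure into a predictable complete
    failure by powering it off over the (hypothetical) out-of-band link."""
    subprocess.run(PEER_POWER_OFF_CMD, check=True)


def monitor():
    missed = 0
    while True:
        missed = 0 if peer_is_alive() else missed + 1
        if missed >= MISSED_BEATS_BEFORE_FENCE:
            fence_peer()
            break
        time.sleep(HEARTBEAT_TIMEOUT)


if __name__ == "__main__":
    monitor()
```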
Unfortunately, when one tries to use this same concept to eliminate partial failures in a cluster whose nodes are separated by over 2000 miles, the STONITH methodology does not work anymore. Nobody makes a 2000+ mile serial cable, nor would such a thing be reliable if anyone were to make it. Instead, each site must have some mechanism for detecting when it has been disconnected from the rest of the distributed system.
Thankfully, this problem has already been mostly solved. Using the tried-and-true methods explored in cluster theory, it is possible for a site to detect when it can no longer reach more than 50% of the other nodes in the cluster and voluntarily refuse to operate in a production role (i.e., fence itself off).
The key here is to have enough independent points of reference that each site can reliably determine whether it still belongs to a majority (a quorum) of the overall system.
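As a minimal sketch (assuming each site knows the addresses of the other voting nodes and that each node exposes a hypothetical health-check port), the self-fencing decision can be as simple as:

```python
import socket

# Hypothetical peer list: every other voting node in the distributed system.
PEERS = ["site-b.example.com", "site-c.example.com", "witness.example.com"]
HEALTH_PORT = 8500   # hypothetical health-check port exposed by each peer
TIMEOUT = 3          # seconds to wait for each connection attempt


def reachable(host):
    """Return True if the peer accepts a TCP connection on its health port."""
    try:
        with socket.create_connection((host, HEALTH_PORT), timeout=TIMEOUT):
            return True
    except OSError:
        return False


def have_quorum():
    """Keep operating in a production role only while this site can reach more
    than half of the other nodes; otherwise it should fence itself off."""
    visible = sum(1 for peer in PEERS if reachable(peer))
    return visible > len(PEERS) / 2


if __name__ == "__main__":
    if have_quorum():
        print("quorum present: OK to operate in a production role")
    else:
        print("no quorum: voluntarily fencing this site off from production")
```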
Forcing each of our customers running multi-site applications to also run their own cluster of servers spread across many data centers seems prohibitive, especially when those servers would almost always be devoted to answering the same question. Blue Box is therefore in the process of deploying a simple "site-availability" service at each of its production data centers. At a high level, this service consists of a set of connectivity checks run from each data center against the other Blue Box data centers and against a selection of well-known points on the public internet.
We hope this service will provide enough automatic local intelligence to quickly determine whether a partition event is happening and which site has the best connection to the rest of the internet at the moment. Customers who want to determine local-site and remote-site connectivity to the internet will be able to query this service in order to make automated decisions about whether to trigger a fail-over event. Specific details on how this works will be forthcoming in another document.
The goal of the above site-availability service is to determine whether a given site has the minimal connectivity it needs in order to be considered connected to the rest of the distributed system. The above checks say nothing about whether a given site has enough bandwidth or server capacity to take on load, nor whether any individual client applications are in a state where they can safely run. These additional factors must be determined by the individual client applications themselves.
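Since the real interface has not been published yet, the following is only a hypothetical sketch of what querying such a service from a client application might look like; the URL and the JSON response shape are placeholders, not the actual API:

```python
import json
import urllib.request

# Hypothetical endpoint and response shape; the real interface has not been published.
SITE_AVAILABILITY_URL = "http://site-availability.local/status"


def local_site_connected(timeout=5):
    """Ask the (hypothetical) site-availability service whether this site still has
    the minimal connectivity it needs to be considered part of the distributed system."""
    try:
        with urllib.request.urlopen(SITE_AVAILABILITY_URL, timeout=timeout) as resp:
            status = json.load(resp)
    except (OSError, ValueError):
        # If we cannot reach or parse the local service, assume the worst.
        return False
    return bool(status.get("connected", False))


if __name__ == "__main__":
    print("connected" if local_site_connected() else "partitioned")
```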
Of course, a partition is just one thing which can make a fail-over necessary. It is still important to query the status of the other site's application cluster in order to decide locally whether the local cluster should initiate a fail-over.
In any case, in the logic you write to determine whether to do a fail-over, we suggest checking at least the following: the local site-availability service, the status of the other site's application cluster, and the health of your own local application components.
The above should be used in formulating an overall "should I be running in production?" decision for the whole application. Making this into a simple command-line interface which can be used from cron jobs, backgrounded tasks, and the production site code when evaluating requests is an exceedingly good idea. All components of the site-local cluster should be using the same logic for this decision. This should also be tied into the health checks the external DNS provider uses for determining site availability.
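Here is a rough sketch of what such a command-line check might look like. The helper functions are hypothetical placeholders for the checks suggested above; the one real contract is the exit code (0 means "yes, run in production", non-zero means "no"), so cron jobs, background tasks, request-handling code, and DNS health checks can all consume the same decision.

```python
#!/usr/bin/env python3
"""should_i_run.py -- a single shared answer to "should I be running in production?"

A sketch only: the helper functions below are hypothetical placeholders for the
site-availability and application health checks described in the text.
"""
import sys


def local_site_connected():
    # Placeholder: query the local site-availability service.
    return True


def local_application_healthy():
    # Placeholder: check local databases, queues, and other components.
    return True


def should_run_in_production():
    # Every component of the site-local cluster should call this same logic so
    # they all agree on whether the local site currently holds the production role.
    return local_site_connected() and local_application_healthy()


if __name__ == "__main__":
    sys.exit(0 if should_run_in_production() else 1)
```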
What "fail-over" means for a distributed system can vary a lot depending on which of the patterns above you implemented. In most cases, "fail-over" only has meaning for those business CAP-theorem decisions where consistency is required. Specifically, "fail-over" means switching which site is authoritative for that data (e.g., where you are doing your sensitive inserts and updates).
We specifically want our logic to follow the idea of creating the predictable "complete failure" mode of the local cluster if it becomes disconnected from too great a portion of the internet. The worst thing that can happen when consistency is required is for two disconnected sites to both think they control the authoritative source of information at the same time. Any scripts handling the fail-over should be written with this in mind.
For the following patterns, this is what "fail-over" means:
Active + read-only standby: If the standby site has no state information (it is effectively a read-only site), then fail-over doesn't really mean anything. Your application will have reduced functionality until the active production site comes back online. Note that the standby will potentially serve stale data, which is acceptable for many applications.
Active + full-functionality standby: All "authoritative" sources of information move to the standby site. Steps should be taken on the formerly "active" site to prevent an automatic fail-back (i.e., fencing), as fail-back will almost always need to be done carefully after engineers verify a synchronization step has occurred.
Active + active: All "authoritative" sources of information move off the local cluster which has detected it is disconnected from the rest of the distributed system. Automatic steps should be taken to ensure that a fail-back does not happen automatically unless it can be safely scripted (i.e., fencing); a sketch of one such fencing mechanism follows.
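One simple way to implement that fencing is a persistent "do not fail back" marker that the production-role logic always honors and that only an engineer (or a carefully reviewed script) removes after the resynchronization has been verified. A sketch, with a hypothetical marker path:

```python
import os
import sys

# Hypothetical location for the fencing marker; anything persistent and local works.
FENCE_FLAG = "/var/lib/myapp/fenced-no-failback"


def fence_local_site(reason):
    """Written during fail-over on the site giving up the authoritative role."""
    with open(FENCE_FLAG, "w") as f:
        f.write(reason + "\n")


def is_fenced():
    """Checked by the production-role logic before ever taking authority back."""
    return os.path.exists(FENCE_FLAG)


def clear_fence():
    """Run manually (or by a carefully reviewed script) only after engineers have
    verified that resynchronization with the other site is complete."""
    os.remove(FENCE_FLAG)


if __name__ == "__main__":
    if is_fenced():
        print("site is fenced: refusing to take the authoritative role")
        sys.exit(1)
    print("site is not fenced")
```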
The fail-back is the most dangerous transition you will have to handle during the no-partition -> partition -> no-partition cycle. Almost all our planning goes into handling the fail-over event, where we go from a highly predictable no-partition distributed system to a highly predictable partitioned distributed system. The fail-back, by contrast, is less predictable because in almost all systems a certain amount of cleanup and resynchronization needs to happen before a once-disconnected site can safely be the authoritative source for information again. Not every site failure is the result of a partitioning event, so it is important that the cause of the failure be well understood and corrected, and that the important data be fully resynchronized, before a safe fail-back can occur.
As with a fail-over event, the fail-back event has different meanings depending on which distributed design pattern was followed:
Active + read-only standby: Fail-back has less meaning if the standby site is completely stateless. Administrators' entire focus will be on getting the primary site back online as soon as possible.
Active + full-functionality standby: By way of a generic checklist, a fail-back usually includes understanding and correcting the cause of the original failure, resynchronizing the important data from the currently authoritative site, verifying that the synchronization completed, removing the fencing put in place during the fail-over, and only then moving the authoritative role back.
Active + active: In this pattern, fail-back may share many of the same elements as the Active + full-functionality standby pattern. The hope is that in active + active, the resynchronization steps have enough automation around them that automatic fail-back may be an option. Before a failed site is allowed to go back into production in this model, it must have all important data brought back into a consistent state with the other sites.
One of the few bits of information you can almost categorically trust not to diverge too quickly from a highly consistent state, and to remain available during a partition, is your system's clock. Many of the algorithms used for automatic resolution of conflicting updates in eventually consistent systems rely heavily on timestamps being very accurate. With the ubiquity of public, highly accurate time sources (e.g., NTP) on the internet, there is no excuse for not keeping your servers' clocks strictly within a couple dozen milliseconds of atomic-clock-accurate time.
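To see why this matters, here is a minimal sketch of the timestamp-based "last write wins" style of conflict resolution many eventually consistent systems use (the data structure and values are illustrative only): a clock skew larger than the gap between two conflicting writes is enough to make the older write win.

```python
from dataclasses import dataclass


@dataclass
class VersionedValue:
    value: str
    timestamp: float   # seconds since the epoch, as recorded by the writing site


def last_write_wins(a: VersionedValue, b: VersionedValue) -> VersionedValue:
    """Resolve a conflicting update by keeping whichever write carries the later
    timestamp.  This is only as correct as the clocks that produced the timestamps,
    which is why keeping every server tightly synced to an accurate time source matters."""
    return a if a.timestamp >= b.timestamp else b


# Illustration: a 100 ms clock skew flips the outcome for two writes that
# actually happened 50 ms apart.
first = VersionedValue("older write", timestamp=1700000000.000)
second = VersionedValue("newer write", timestamp=1700000000.050 - 0.100)  # skewed clock
print(last_write_wins(first, second).value)   # prints "older write"; the newer write loses
```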
Border Gateway Protocol (BGP) reconvergences often take between two and five minutes to complete, during which packets are routed sub-optimally and often lost. BGP reconvergence events are also constantly happening on the real-world internet. The more logical hops two geographically distant sites have to go through to get from site A to site B, the more likely inter-site communications are to get interrupted by this.
At the same time, a distributed system's fail-over and fail-back procedures can be both risky and expensive, and generally shouldn't be triggered by an unexpected event unless absolutely necessary.
On top of this, it should be noted that cluster topology changes (e.g., when using the site-availability service) are not always detected by all nodes in a cluster at the same time. A standby site may detect the primary site has gone offline up to a minute before the primary site realizes this.
Due to these factors, and depending on your distributed system pattern, it may make sense to put a forced delay in the fail-over process, both to ensure the disconnection is not a brief BGP reconvergence event and to make sure the primary site truly is offline before the standby site attempts to take over. (The detection of a partition event starts a countdown to the point of no return, when the standby site goes into production.) Especially in the case of BGP reconvergence, it is often less expensive and risky to accept a less-than-five-minute outage in your distributed application due to external network factors than to trigger an expensive and risky fail-over process, which may take five minutes or longer to complete anyway, depending on exactly what you are doing.
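A sketch of that kind of dampening follows, assuming a hypothetical partition_detected() check built from the pieces described earlier; the standby commits to a fail-over only if the partition persists, without interruption, for longer than a typical BGP reconvergence.

```python
import time

GRACE_PERIOD = 5 * 60     # seconds: longer than a typical 2-5 minute BGP reconvergence
CHECK_INTERVAL = 15       # seconds between re-checks during the countdown


def partition_detected():
    # Placeholder: combine the site-availability query and the remote-site
    # application status check described earlier.
    return False


def partition_outlasts_grace_period():
    """Return True only if the partition persists for the whole grace period.
    Any single healthy check resets the decision; nothing irreversible happens
    until the countdown to the point of no return actually expires."""
    deadline = time.monotonic() + GRACE_PERIOD
    while time.monotonic() < deadline:
        if not partition_detected():
            return False            # connectivity came back; do not fail over
        time.sleep(CHECK_INTERVAL)
    return True


if __name__ == "__main__":
    if partition_outlasts_grace_period():
        print("partition persisted past the grace period: initiating fail-over")
    else:
        print("transient event (possibly a BGP reconvergence): no fail-over")
```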
If you have done a decent job of keeping your various application components separate, you may find it beneficial to have some components of your application follow one of the design patterns above while others follow a different pattern. As far as the fail-over and fail-back transitions (and even the working model of these components) are concerned, it can be beneficial to think of them as separate applications. Even if you use one pattern throughout your application, you will probably have some components which favor consistency and some which favor availability.
Using different DNS sub-domains provides a good logical separation between these components. It gives you the ability to do fail-over and fail-back differently on a per-component basis at the DNS level. Plus, keeping everything under a single parent domain means you can still share cookies and security features between components.
Even the simplest distributed application is going to have fairly complicated fail-over and fail-back procedures. Especially for those parts of the process which require manual intervention, it is an excellent idea to write down a few simple-to-understand checklists covering these transitions. This ensures that when you or your team members actually have to perform the process during a real partition event, you do not have to recall everything from memory.
These checklists will also become the scripts you follow in testing your fail-over and fail-back plans. They become the summary of exactly which parts of your process you should look into automating as much as possible. You should plan on sharing your checklists with all the members of your team and any vendors who may need to be involved in a fail-over or fail-back. A good checklist also needs to pass the 3:00am test: it must be direct and simple enough that anyone on your team with sufficient access can follow it while running at 10% capacity (having been woken up at 3:00am after minimal sleep).
Even if you are moving toward fully automated fail-overs and fail-backs, we still recommend keeping functional checklists up-to-date describing a manual equivalent of the automated process. This will not only help you to document what the automated processes are supposed to be doing, but will also help you know where and how you can intervene if an automated fail-over or fail-back process breaks down because of a failure scenario you didn't anticipate.
*All references are provided at the end of Distributed Systems Design — Part 1/4.