Distributed Systems Design — Part 1/4, posted on April 9th, reviewed common terminology, Brewer’s/CAP Theorem, and basic concepts in distributed system design.
Now that we have gone through the fundamental concepts and familiarized you with the key differences to think about when programming and designing for a distributed system, you are probably wondering how to take an application that works well in a single cluster in a single data center and make it work well in a distributed fashion. This section of the document is aimed at giving you some pointers to get started.
Note that there are many common design patterns used in this space (and not every pattern is covered below). Which one is right for you will depend largely on business decisions around budget, willingness to alter code, and the perceived cost of downtime or reduced functionality during a partition.
If you cannot get your application to work when the network is healthy, you have no chance of getting it to work in a fail-over or partitioned environment. As such, we will discuss achieving that baseline before we talk about handling partition events.
We highly recommend ensuring that any given site-specific cluster which makes up part of your distributed application is fully redundant, with no single points of failure. A local fail-over due to a failing piece of hardware is far easier to deal with than a complete fail-over to a remote site caused by a single point of failure on the local network. The most dangerous times for a distributed application are the fail-over and fail-back stages, so minimizing how often you actually have to go through them is an excellent idea.
There are a few standard patterns that get used here, and a couple of variations of these patterns worth discussing. Please note that we will devote the majority of the discussion below to patterns where fail-over events are automatic.
Pattern zero is not actually a distributed pattern at all. Rather, it is the term we are giving to the minimal cluster design we recommend you achieve at a single site before you can realistically start to contemplate creating a distributed application. The key characteristics of pattern zero are that all, or almost all, single points of failure within the cluster design have been addressed, and that regular off-site backups are being made of all the data you would need to set up a similar cluster elsewhere.
It should be noted that unless you are making regular off-site backups of your key data, you are taking a business-continuity risk. If a major data-destroying disaster occurs at your data center without off-site backups, there will be no way to restore the lost data. If you are dependent on your application data to do business (as most are), this could spell doom for your business.
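For illustration only, here is a minimal sketch of the off-site backup piece of pattern zero, assuming a PostgreSQL database dumped with pg_dump and copied to an S3-style bucket with the AWS CLI. The database name, bucket, and paths are hypothetical placeholders, and this is just one of many reasonable approaches.

```python
#!/usr/bin/env python3
"""Minimal off-site backup sketch: dump a PostgreSQL database and copy it
to remote object storage. Database name, bucket, and paths are hypothetical."""

import datetime
import subprocess

DB_NAME = "appdb"                                 # hypothetical database name
OFFSITE_BUCKET = "s3://example-offsite-backups"   # hypothetical bucket


def run(cmd):
    """Run a command and raise if it fails, so cron/alerting notices the problem."""
    subprocess.run(cmd, check=True)


def main():
    stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
    dump_path = f"/var/backups/{DB_NAME}-{stamp}.dump"

    # 1. Take a compressed, custom-format dump of the database.
    run(["pg_dump", "-Fc", "-f", dump_path, DB_NAME])

    # 2. Copy the dump off-site; this copy, not the local dump, is what
    #    protects you from a site-destroying event.
    run(["aws", "s3", "cp", dump_path, f"{OFFSITE_BUCKET}/{DB_NAME}/"])


if __name__ == "__main__":
    main()
```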
This pattern's characteristics are:

- All, or almost all, single points of failure within the site-local cluster design have been addressed.
- Regular off-site backups are being made of all the data you would need to set up a similar cluster elsewhere.
- Nothing about the design is distributed yet; a site-wide failure still means downtime until the site is restored or rebuilt from backups.
In a fully-manual fail-over pattern, enough of an application footprint is kept in standby at a remote data center to resume some semblance of normal business operations in a relatively short amount of time. However, all the steps for transitioning to the standby data center are performed manually. This is the most basic system design that can be considered distributed. Except for the fact that fail-over and fail-back happen only with administrator intervention, this pattern can otherwise closely resemble either the active + reduced functionality standby or active + full functionality standby patterns discussed below.
If your goal is to turn a single-site (pattern zero) application into a distributed application with automated fail-over, designing the system first for fully-manual fail-over can be a logical step in the process.
In coming up with your design, we highly recommend you consider each task which needs to happen for a successful fail-over or fail-back event and develop detailed checklists or scripts for administrators to follow in case a manual fail-over becomes necessary. These scripts also become your guide for which tasks you need to automate, if you want to ultimately achieve some degree of an automatic fail-over pattern.
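As a rough illustration of treating checklists as scripts, the Python sketch below walks an administrator through a hypothetical fail-over runbook. The step descriptions, hostnames, and commands are placeholders for your own environment; steps expressed as commands are the obvious first candidates for later automation.

```python
#!/usr/bin/env python3
"""Sketch of a manual fail-over runbook as code. Each step is either a prompt
for the administrator or a shell command; step names, hosts, and commands are
hypothetical placeholders."""

import subprocess

CHECKLIST = [
    ("Confirm the primary site is actually down (not just a monitoring blip)", None),
    ("Freeze deploys and notify on-call / stakeholders", None),
    ("Promote the standby database to read-write",
     ["ssh", "standby-db", "pg_ctl", "promote", "-D", "/var/lib/postgresql/data"]),
    ("Point DNS at the standby site (or verify the health-check failover fired)", None),
    ("Smoke-test the standby site end to end", None),
]


def main():
    for description, command in CHECKLIST:
        print(f"\n== {description} ==")
        if command is None:
            input("Press Enter once this manual step is complete...")
        else:
            # Steps expressed as commands are the first candidates for automation.
            print("Running:", " ".join(command))
            subprocess.run(command, check=True)
    print("\nFail-over checklist complete.")


if __name__ == "__main__":
    main()
```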
This pattern's characteristics are:

- Enough of an application footprint is kept in standby at a remote data center to resume some semblance of normal business operations relatively quickly.
- All fail-over and fail-back steps require administrator intervention, ideally driven by detailed checklists or scripts.
- Aside from the manual transition, the design can closely resemble either the active + reduced functionality standby or the active + full functionality standby patterns below.
In the active + reduced functionality standby pattern, an external DNS service is used which has the capability of doing health checks against the full-production site. It will automatically send traffic to the standby site if the health checks detect a failure at the production site. Under normal running conditions, all traffic gets sent to just the primary / active / full-production site. The standby site itself can range from a single service which displays a "Site is down right now" message to clients, all the way up to something closely resembling production with only the most sensitive features of the site disabled.
In general, with this pattern, we recommend concentrating on making the standby site do as much as possible without requiring state information from the production site. In other words, the standby site runs in "read only" mode. This way, failing over to the standby site and failing back can happen without much synchronization intelligence needing to be built in. Most of the thinking and logic that has already gone into designing an application which functions in a non-distributed way does not need to change, since the standby site is essentially stateless.
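A minimal sketch of what "read only" mode can look like in application code, assuming a single site-role switch (here a hypothetical SITE_ROLE environment variable and ReadOnlyError type) that every state-changing code path consults:

```python
"""Sketch of a standby site running in "read only" mode. The SITE_ROLE variable
and ReadOnlyError type are hypothetical; the point is that every write path
checks a single site-role switch."""

import os

SITE_ROLE = os.environ.get("SITE_ROLE", "standby")  # "active" or "standby"


class ReadOnlyError(RuntimeError):
    """Raised when a write is attempted on the standby site."""


def require_writable():
    if SITE_ROLE != "active":
        raise ReadOnlyError("This site is running as a read-only standby.")


def place_order(order):
    # Writes go through the guard; during a fail-over window the standby
    # can still serve the catalog, but order placement is disabled.
    require_writable()
    ...  # normal write path (database insert, queue publish, etc.)


def get_catalog():
    # Reads behave the same on the active and standby sites.
    return load_catalog_from_local_replica()


def load_catalog_from_local_replica():
    return []  # placeholder for a read against the local (possibly stale) replica
```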
In a nutshell, this pattern's characteristics are:

- An external DNS service performs health checks against the full-production site and automatically redirects traffic to the standby site when those checks fail.
- Under normal running conditions, all traffic goes to the primary / active / full-production site.
- The standby site runs in "read only" mode, ranging from a simple "Site is down right now" page to a near-production deployment with the most sensitive features disabled.
- Because the standby site is essentially stateless, little synchronization logic is needed for fail-over and fail-back.
In the active + full-functionality standby pattern, an external DNS service is used which has the capability of doing health checks against the active site. It will automatically send traffic to the standby site if health checks fail against the active site. Under normal running conditions, all traffic gets sent to just the active site. The difference between this pattern and the one above is the standby site in this case synchronizes state information with the active site so that functionality is not reduced when fail-overs happen.
In this pattern, significantly more intelligence needs to be built around the fail-over and fail-back functionality. Most of the work in getting the fail-over to happen will be in making sure the right data is being synchronized in the first place; if that is done correctly, the fail-over itself tends to be fairly simple. Otherwise, this pattern still allows software engineers to reuse much of the code and thinking that went into single-site application programming, because under normal conditions only one site is active at a time and, therefore, less state data needs to be constantly or immediately shared between sites.
The fail-back process can be fairly involved depending on the specific technologies used in the application in the first place. We generally do not recommend making fail-back an automated event in this environment.
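One concrete piece of that intelligence is knowing how far behind the standby's state is before you fail over or fail back. The sketch below, assuming PostgreSQL streaming replication and the psycopg2 driver, reports the standby's replay lag; the DSN and lag threshold are hypothetical.

```python
"""Sketch: before treating the standby as safe to promote (or before failing
back), check how far behind its replicated state is. Connection settings are
hypothetical; the query uses standard PostgreSQL streaming-replication functions."""

import psycopg2

MAX_ACCEPTABLE_LAG_SECONDS = 30  # hypothetical business threshold


def standby_lag_seconds(dsn="host=standby-db dbname=appdb user=monitor"):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))"
        )
        (lag,) = cur.fetchone()
        # NULL means this server is not replaying WAL (i.e., it is not a standby).
        return float(lag) if lag is not None else float("inf")


if __name__ == "__main__":
    lag = standby_lag_seconds()
    if lag > MAX_ACCEPTABLE_LAG_SECONDS:
        print(f"Standby is {lag:.0f}s behind; switching now would lose recent writes.")
    else:
        print(f"Standby lag is {lag:.0f}s; within tolerance.")
```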
This pattern's characteristics are:

- An external DNS service performs health checks against the active site and automatically redirects traffic to the standby site when those checks fail.
- Under normal running conditions, all traffic goes to the active site.
- The standby site synchronizes state with the active site, so functionality is not reduced when fail-overs happen.
- Fail-back can be involved and is generally not recommended as an automated event.
In the active + active pattern, production site traffic is sent to all sites making up the distributed application at all times. It is the most difficult pattern to implement when coming from a legacy application written for a single-site environment. It is also the one most capable of dealing with network instability. This is the pattern all the big guys out there use, including Google, eBay, and Amazon. It is often the one most clients would like to achieve with their distributed application.
As we can send traffic to many sites at the same time with this pattern, it opens up a number of other interesting options with regard to Global Server Load Balancing (GSLB). As with the other patterns discussed here, an external DNS service capable of at least doing health checks is required. With this pattern we may be able to use more advanced features of an external DNS service as well.
With the active + active pattern, a lot more work generally needs to be done within the application to get it to behave correctly in this environment. Specifically, work needs to be done, or at least evaluations made, around every point in the application where state data is touched.
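For state that can tolerate eventual consistency, one common answer is a data structure whose replicas can be merged in any order. The sketch below shows a grow-only counter (G-counter); the site names are hypothetical, and a real deployment would layer replication transport and persistence on top.

```python
"""Sketch of one answer for state that tolerates eventual consistency in an
active + active deployment: a grow-only counter (G-counter). Each site
increments only its own slot and merges asynchronously; the merge is
order-independent, so replication delay or a partition never corrupts the value.
Site names are hypothetical."""


class GCounter:
    def __init__(self, site_id):
        self.site_id = site_id
        self.counts = {}  # site_id -> highest count observed from that site

    def increment(self, amount=1):
        # Each site only ever writes its own entry.
        self.counts[self.site_id] = self.counts.get(self.site_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other_counts):
        # Called when replication traffic arrives from another site, in any order.
        for site, count in other_counts.items():
            self.counts[site] = max(self.counts.get(site, 0), count)


# Example: both sites accept traffic during a partition, then reconcile.
seattle = GCounter("sea")
virginia = GCounter("iad")
seattle.increment(3)
virginia.increment(5)
seattle.merge(virginia.counts)
virginia.merge(seattle.counts)
assert seattle.value() == virginia.value() == 8
```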
This pattern's characteristics are:

- Production traffic is sent to all sites making up the distributed application at all times.
- It is the most difficult pattern to adopt from a legacy single-site application, and the most capable of dealing with network instability.
- An external DNS service with health checks is still required, and more advanced GSLB features become usable.
- Every point in the application where state data is touched must be evaluated, and often reworked, for multi-site operation.
If your fail-over strategy includes spinning up the standby site on the fly during a partition event, please note that we generally do not recommend attempting to auto-scale the standby site, for the following reasons:
When considering how to take your single-site application to an active + active distributed pattern, you need to concentrate on two key areas: stateful/shared data and cache usage.
Any time you touch cache, stateful, or shared data in your application, you need to make the CAP theorem business decision around that interaction.
Ideally, you want every aspect of your application to be able to operate under an eventual-consistency model, at least to a limited degree (since latency alone can act like a small partition). Wherever consistency of a given piece of data from one node of the cluster to another is not absolutely necessary, your code should be able to treat that data as potentially inconsistent.
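As a sketch of making that decision explicit at a single cache interaction, the hypothetical lookup below chooses availability for this particular piece of data: if the authoritative source is unreachable, it serves stale cached data rather than failing. The function names, key format, and age threshold are all assumptions.

```python
"""Sketch of an explicit CAP decision at one cache interaction: if the
authoritative (possibly remote) source is unreachable, serve stale data rather
than fail. Names and thresholds are hypothetical."""

import time

_cache = {}              # key -> (value, fetched_at)
MAX_PREFERRED_AGE = 60   # seconds; beyond this we would rather refresh


def get_product(product_id, fetch_from_source):
    key = f"product:{product_id}"
    cached = _cache.get(key)

    # Fresh enough: no cross-site call needed at all.
    if cached and time.time() - cached[1] < MAX_PREFERRED_AGE:
        return cached[0]

    try:
        value = fetch_from_source(product_id)  # may cross a WAN link
        _cache[key] = (value, time.time())
        return value
    except ConnectionError:
        # The business decision for this piece of data: stale is acceptable,
        # unavailable is not. Other data (e.g., account balances) may decide
        # differently. Real code would likely catch a broader set of
        # network/timeout errors here.
        if cached:
            return cached[0]
        raise
```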
To help you see where your code is already making CAP theorem decisions, look at how each piece of shared state is handled today: anywhere the application already tolerates stale or asynchronously updated data, it is effectively choosing availability; anywhere it insists on an immediately consistent, synchronous read or write, it is choosing consistency. The pieces in that second category need to be very carefully evaluated to determine whether consistency truly is required there, or whether asynchronous processing can be used instead.
We suggest approaching these problems as follows:
When you cannot sacrifice consistency, active + active can still work as long as no partition exists. For example, if your "master" PostgreSQL server lives in Seattle, then under normal running conditions your application servers in Virginia must connect to that "master" PostgreSQL server in Seattle (over an SSH or VPN tunnel) in order to perform these sensitive operations. Try to minimize the occurrence of this kind of activity, as the latency of the coast-to-coast link will already add a lot to the response time of your Virginia application servers.
When partitions happen, you must be prepared to sacrifice availability of those site features where consistency is absolutely necessary.
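Putting those two points together, the sketch below routes a consistency-critical write to the single Seattle primary and deliberately disables the feature when the primary cannot be reached, while reads continue to be served from a local replica. It assumes psycopg2; the hostnames, DSNs, table layout, and FeatureUnavailable type are hypothetical.

```python
"""Sketch of the "consistency over availability" path in an active + active
deployment: consistency-critical writes always go to the single primary in
Seattle (over the VPN/SSH tunnel), local replicas serve reads, and during a
partition the write path is disabled rather than allowed to diverge.
Hostnames, DSNs, and the FeatureUnavailable type are hypothetical."""

import psycopg2

PRIMARY_DSN = "host=pg-primary.sea.internal dbname=appdb"        # Seattle primary (via tunnel)
LOCAL_REPLICA_DSN = "host=pg-replica.iad.internal dbname=appdb"  # Virginia replica


class FeatureUnavailable(RuntimeError):
    """Raised when a consistency-critical feature is disabled during a partition."""


def transfer_funds(src, dst, amount):
    try:
        conn = psycopg2.connect(PRIMARY_DSN, connect_timeout=3)
    except psycopg2.OperationalError as exc:
        # Partition (or primary outage): sacrifice availability of this feature.
        raise FeatureUnavailable("Transfers are temporarily disabled.") from exc
    try:
        with conn, conn.cursor() as cur:  # single transaction on the primary
            cur.execute("UPDATE accounts SET balance = balance - %s WHERE id = %s",
                        (amount, src))
            cur.execute("UPDATE accounts SET balance = balance + %s WHERE id = %s",
                        (amount, dst))
    finally:
        conn.close()


def get_balance(account_id):
    # Reads can be served from the local, possibly slightly stale replica.
    with psycopg2.connect(LOCAL_REPLICA_DSN) as conn, conn.cursor() as cur:
        cur.execute("SELECT balance FROM accounts WHERE id = %s", (account_id,))
        return cur.fetchone()[0]
```

The design choice worth noting is the short connect timeout: during a partition you want the feature to fail fast and clearly, not hang while the rest of the site stays responsive.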
*All references are provided at the end of Distributed Systems Design — Part 1/4.