Stephen Balukoff, BBG’s Principal Technologist, was surprised by the lack of practical information available on the subject. He authored this document to help fill the gap and hopefully start an exchange of information within the community.
In recent months more and more Blue Box customers have been asking about the possibility of setting up their applications to be distributed across two or more geographically distant data centers. At first glance this seems like it ought to be easily accomplished by modeling the distributed system after a cluster with no single point of failure on a local LAN In practice, however, there are a lot more concerns which need to be addressed in a geographically distributed system. As such, creating a well-functioning distributed system is probably a lot more difficult than you think.
This document was written with the goal of giving you a place to start to understand the concepts and concerns involved, as well as to give some practical advice as to "where to start" if you are trying to turn an application which exists happily in a single datacenter into an application which exists happily spread across two or more data centers. Unfortunately there is no panacea here which can accomplish this for any given application for reasons which will become evident below. Educating oneself as to the rules of the game and suggested best practices are the first steps in avoiding becoming a casualty. This document was also not intended to be a comprehensive guide on designing distributed systems.
In the interests of full disclosure it should be noted the authors of this document, while very skilled in systems administration, network and cluster design, and application design and support, do not to claim to be experts in distributed systems design. Distributed systems design is a relatively new area of computer science, and there are actually few companies in the grand scheme of things who have successfully engineered large-scale distributed systems. Those who have tend to be fairly tight-lipped about it, guarding their trade secrets in this arena. There are certainly more intelligent and experienced people in this arena than us, and we encourage the reader to examine other sources of information when formulating a plan for implementing a distributed system.
The sources of information used in formulating this document range from various whitepapers available online, wikipedia articles, scientific research papers, consultation with other experts in the field, experience gleaned from working with customers who have implemented such systems, our cumulative understanding of the implications of all of the above, and the occasional educated guess. In general, we've found that there are very few practical how-tos available for anyone looking to create a distributed application (especially if they're starting from an non-distributed application). We wrote this document in the hopes it would be helpful to anyone else facing this problem but do not guarantee the accuracy, practicality, or any other value to the information contained herein.
In their most generic form, the basic design requirements of any distributed system can be summarized as follows:
Changes to data in one part of a distributed system are immediately represented throughout the whole of the system. For example, if I add a pair of shoes to my shopping cart by sending a web request to a server in Seattle, if my next request ends up going to a server in Virginia, I expect the server there to know about the pair of shoes in my shopping cart.
Every request sent to a distributed system should get a response, no matter which individual server in the distributed system I happen to be talking to. Continuing the above example, it should not matter whether I sent my request to a server in Seattle or Virginia, I expect to eventually get a response back.
The distributed system should still be able to function even if arbitrary messages are lost in the system. Again, continuing the example, even if the servers in Seattle and Virginia are not able to talk to each other, the system should still work.
While the above are arguably essential requirements for any web application regardless of whether the servers which make up the system are located within a single data center or spread across the whole internet, the problem with the above is that it turns out it is impossible to guarantee all of the above for any system. As the saying goes, “Pick any two. You can’t have all three.”
Specifically, in a keynote speech to the Association for Computing Machinery in 2000, Eric Brewer conjectured it is impossible for a distributed web system to guarantee both consistency, availability, and partition-tolerance at the same time. Two years later, two very smart computer scientists at MIT named Seth Gilbert and Nancy Lynch formally, mathematically proved this.
It is important to understand the magnitude of impact of that proof. It formally establishes CAP theorem as a theorem: This means it speaks to the fundamental nature of the universe in which we live, just as much as 1 + 1 = 2, or the law of gravity, general relativity or the speed of light. When we say "pick any two, you can't have all three," we’re not trying to be glib. It is physically impossible for us or anyone to guarantee a distributed system will be able to guarantee both consistency, availability and partition-tolerance all at the same time.
It is actually a little worse than that. In real world applications (or until quantum entanglement sees a real computing application), one cannot choose to never have partitions. Network failures happen, bugs in the system happen, user error happens, and there will always (and usually frequently on globally distributed systems) be times when server A and server B in your network are not going to be able to talk to each other. In a real-world distributed system one of the choices is always going to be made for you. Again, blame the universe if you don't like this. Of the two choices you get, partition-tolerance has to be one of them.
So what it really boils down to is: for your business application, which is more important to you, consistency or availability? When a partition happens, you are forced to choose one or the other. It is physically impossible to have both.
The good news is you do not necessarily have to choose between consistency and availability for your entire application all of the time. For example, it may make sense to choose consistency for those site features which are most sensitive (ex. bidding in an auction, modifying account balances, etc.) but choose availability for those site features which are not very sensitive (eg. browsing items in a store, reading forum posts, etc.).
Further, for brief partitions (ex. communication interruption that only lasts a couple of minutes), it may make sense to live with inconsistency for a short period (perhaps operating on the assumption that data won't get very out of sync very fast). Then later you can choose to drop availability if the partition event lasts too long. It all depends on the nature of the application and the business decisions dictating appropriate application behavior.
The choice of consistency versus availability is a business decision. Therefore, the appropriate action that a given cluster or node within a cluster should take during a partition event needs to be dictated by business logic. Partially due to the fact that businesses must choose for their application to behave in different ways according to their business needs-- and that even different parts of any given application in a distributed setup may need to behave differently depending on the specifics of the circumstances-- this means that handling a partition event gracefully is necessarily much more complicated than typical fail-over scenarios which occur within a single site or localized cluster.
The only exception to the CAP theorem is the one case where consistency doesn't matter. If your application is able to run completely stateless (ie. no data preserved server-side of any kind), then by definition no data exists which needs to be kept consistent across the distributed system. Technically, this is not an exception to CAP theorem. It is the one case where one can safely choose availability and partition-tolerance all of the time.
Before we go too much futher into the implications of CAP theorem, we need to say something about eventual consistency. In particular, one clarification about the proof of CAP theorem is that by 'consistency' what is meant here is "atomic consistency." The theorem allows for the possibility of a thing called "eventual consistency" (or "weak consistency") which for some applications is a better alternative than losing availability during a partition event. There is an implied business decision with eventual consistency configurations to sacrifice consistency in the short term in favor of preserving availability. The idea behind eventual consistency is that partitioned servers in a distributed system are allowed to become inconsistent during a partition event, and that there's pre-defined business logic within the software to resolve the data inconsistencies and conflicts this creates once the partition is resolved.
The problem is the devil is really in the details here. Further research into the topic of eventual consistency shows that for any given algorithm here, it may not be possible to resolve these inconsistencies in a reasonable amount of time (if ever). Plus, "partitioning" here is not always a total communication failure.
One other important thing to point out here is that up until now we treated partitioning as an "event" in the network. The definition of what a partition is, however, can be more broadly defined in real-world (i.e. mostly asynchronous) networking environments as the time it takes data to propagate across all the servers in the distributed system, if such systems are designed to retransmit dropped messages. During this window, depending on the specifics of how data is transferred, implied business decisions have been made about consistency vs. availability.
For example, suppose I have database server A in Seattle, and database server B in Virginia which are in a MySQL master-master replication configuration. Typical asynchronous MySQL replication is actually one form of "eventual consistency". There is a significant amount of time between when an update to the data on server A propagates to server B. The best case scenario is around 80 milliseconds, worst case is never-- if server B is unable to process data at the same rate as server A and is continuously falling behind in replication. (See the above section about the half-truth of eventual consistency). Between the time an update to the data is made on server A, and the time when server B processes the same update, one could say server A and server B are effectively partitioned. Furthermore, because they will return different results if a client makes a query to each server during this window-- but each server *will* return a result-- this means that consistency has been sacrificed in favor of availability (at least until that update makes it through). The business decision here was made when the choice was made in favor of using MySQL asynchronous replication.
Take the same scenario above but replace MySQL master-master replication with PostgreSQL master-slave replication in synchronous mode. In this case, the PostgreSQL synchronous replication system, with its acknowledgements and ties into transactions, ensures both server A and server B strictly have the same data set at all times. If we update data on server A, during the window in which this is replicated to server B, querying the data from server A will show the same thing as server B. The update to the data has not yet been applied. In fact, the update query will hang indefinitely if server B happens to be down. In this way, one can see the business decision has been made to sacrifice availability for consistency.
Given the above examples, if you are starting to understand some of the implications of the CAP theorem, you should start to see how it applies to all kinds of situations in various information networks, ranging from CDN caches all the way down to the write-back versus write-through buffer in a RAID array. In particular, CAP theorem applies on a local LAN within a cluster but we largely ignore it there simply because partitions at this level are rare enough, and latency-related partitioning is so small, that the effects can mostly be ignored. CAP theorem applies whenever there is more than one store for information where the possibility of partitioning exists.
There is one more characteristic of CAP theorem that occurs when applied to real-world technology. Until now we implied implementors of a given distributed system always get a business choice as to what they want to sacrifice when partitioning becomes an issue. In reality, some implementations of technology are incapable of allowing this choice in the event of a partition. The nature of the software is such that it makes the choice of availability or consistency for you.
Along with the implications of CAP theorem as applied to a geographically distributed system, there are a few other common pitfalls for anyone embarking on this road for the first time. Please note all of these assumptions are false and must be accounted for in any distributed system:
The corollary to the above fallacies are the truths about networks in a distributed environment:
Everything we have discussed thus far leads to the following best practices for designing a distributed system:
Every component of your application is going to fail at some point, especially those components which communicate over the widely-distributed network. It is a good idea to write down the most common failures which will happen in your application and design acceptable fallback behaviors for those periods. It is also generally better to lose partial functionality of your application than the whole application when failures happen. In other words, fail gracefully.
The less data you need to share between sites, the fewer opportunities there are for failure. Also, given that bandwidth limitations may occasionally happen, being minimal here can help a lot. Traffic traversing long-haul links is much more expensive than local traffic.
The less data needing to be preserved between requests to your application, the less needs to be transferred between sites. For every bit of state-ful data that needs to be shared between sites, a business CAP decision needs to be made about how the application behaves when a partition happens. The fewer of these you have to deal with, the easier time you will have dealing with partitions. Using client-side session stores can help. It should be noted, though, that some browsers / proxying solutions / web servers will limit the size such client-side data can be. In any case, a client-side session cache will be less reliable and more subject to security problems than a server-side cache under normal running conditions.
There are many reasons to maintain separation of application components even outside the context of distributed systems. Reasons range from better testability, modular functionality leading to code reuse, more predictable code behavior in failure scenarios, easier to find and fix bugs, better control and logging. In the context of distributed systems, modular systems are far easier to make:
You simply cannot trust the internet like you can a local network. Back-filling in a security solution after a breach occurs can often prove to be extremely difficult and costly. It is better to start secure and stay there. This can be achieved through the use of a VPN, though we generally recommend simply setting up self-healing / self-spawning SSH tunnels or application-specific authentication and encryption between systems as this usually scales horizontally much better and eliminates the VPN as a single point of failure and bottleneck for an entire data center of machines.
With the goal of minimizing cross-site traffic, caching anything which has to go over this link is generally a good idea. In a situation where a partition happens, it is often a good idea to flush caches.
In a partition event, most applications work well with a certain degree of inconsistency in order to preserve availability. This is most obviously done with transactions which are essentially read-only. In order to preserve the illusion of consistency for the client for write-type transactions during a partition, using asynchronous processing is often the only viable solution. It is better to simply design your application to use these systems even without partitions so that operating with a partition more closely resembles "normal" running conditions.
There is a significant amount of network latency that applies when transmitting data between geographically distant sites. Blame the speed of light for this. In any case, doing 50 individual request-response transactions over such a link will take a lot longer than batching all these requests together and getting the responses back in one large operation. If this can be done asynchronously, all the better.
Idempotency in the context of algorithms, methods, etc. means you can run the same thing many times without changing the outcome any more than running that code once would have done. Especially if you have to design your code to deal with eventual consistency, it is very likely you will end up accidentally running the same procedure on the same bit of data more than once at some point in your application. This can be extremely bad for business especially in the case of monetary transactions.
All the extra work / automated failover plans you will come up with are virtually guaranteed not to work if they are not tested. Unfortunately, it is often very difficult to simulate in a lab the kinds of network failures you can get in a distributed system. It is always a good idea to set up a fully-testable staging version of the application on which this kind of testing can occur if your budget allows for it. The next best alternative is to force "real" failures in the production environment (eg. by shutting down certain vital systems) under more controlled conditions and in windows least likely to cause your business and your clients problems if the failover routines do not work as expected.
As a programmer or designer making distributed systems, you will make some decisions which will be manifested in code or strange looking cluster components which will not make sense to an engineer used to thinking about the problems in terms of single-site mentality. The surest way to make sure your successors end up repeating all your painful learning experiences is to document none of your hard-won but somewhat strange-looking decisions.
To continue with this series check out Distributed Systems Design — Part 2.
The following are some of the more helpful documents consulted when writing this paper:
Have you read the news?
Rumor has it this cloud thing is taking off.