Distributed Systems Design — Part 1/4, posted on April 9th, reviewed common terminology, Brewer’s/CAP Theorem, and basic concepts in distributed system design. Distributed Systems Design — Part 2/4 covered general guidelines for distributed programming. Distributed Systems Design — Part 3/4 discussed partitioning, fail-over, and fail-back events.
This section discusses several pieces of software commonly used within site-local clusters, their applicability in a distributed system, and their CAP-theorem predispositions.
A MySQL master-slave relationship, like all simple master-slave relationships, can have only one authoritative "master" at a single site. MySQL's standard replication follows an eventual-consistency model: the database remains available during a partition event, but the data between master and slave will temporarily be out of sync until the partition is corrected and the slave catches up to the master.
A MySQL master-master setup is capable of running in active-active mode, with a couple of caveats described below.
In any case, as part of the health check of a given site, you should monitor the status of the replication relationship. Any site whose master either becomes disconnected (outside of a partition event) or falls too far behind the other master(s) in the relationship should generally return a 'fail' for database health checks.
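To make this concrete, here is a minimal health-check sketch, assuming the PyMySQL driver; the hostname, credentials, and lag threshold are placeholders you would tune for your own deployment.

```python
# Minimal replication health check for a MySQL replica (or the "other"
# master in a master-master pair).  Assumes the PyMySQL driver; the lag
# threshold is a hypothetical value for illustration only.
import pymysql

MAX_LAG_SECONDS = 30  # hypothetical threshold; tune for your application

def replication_healthy(host, user, password):
    conn = pymysql.connect(host=host, user=user, password=password)
    try:
        with conn.cursor(pymysql.cursors.DictCursor) as cur:
            cur.execute("SHOW SLAVE STATUS")
            status = cur.fetchone()
            if status is None:
                return False  # replication is not configured at all
            if status["Slave_IO_Running"] != "Yes" or status["Slave_SQL_Running"] != "Yes":
                return False  # one of the replication threads has stopped
            lag = status["Seconds_Behind_Master"]
            return lag is not None and lag <= MAX_LAG_SECONDS
    finally:
        conn.close()
```

A site-level health check would fold the result of something like this into the pass/fail answer it returns to the load balancer or DNS layer.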
The active-active pattern for distributed systems works so long as any sensitive updates to the database only ever happen on one master in one site at a time.
At the time of this writing, PostgreSQL's built-in replication models are not capable of running in master-master mode. There are some technologies, such as proxy solutions, that can make a series of PostgreSQL servers act as if they were in a master-master relationship, but we suspect they do not react well to partitioning at this time.
PostgreSQL 9's standard replication model (i.e., streaming WAL updates) follows the eventual-consistency model, much like the MySQL master-slave relationship. PostgreSQL 9 is also capable of running in a synchronous replication mode, which favors consistency over availability. This is usually not desirable for most applications, as a partition means update queries will hang indefinitely (i.e., become unavailable).
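For the streaming-replication case, a lag probe on a standby might look roughly like the following sketch, assuming PostgreSQL 9.1 or later (for pg_last_xact_replay_timestamp) and the psycopg2 driver; the DSN and threshold are placeholders.

```python
# Rough replication-lag probe for a PostgreSQL streaming-replication
# standby.  Assumes PostgreSQL 9.1+ and the psycopg2 driver; the DSN and
# threshold are placeholders for illustration only.
import psycopg2

MAX_LAG_SECONDS = 30  # hypothetical threshold

def standby_healthy(dsn):
    conn = psycopg2.connect(dsn)
    try:
        cur = conn.cursor()
        cur.execute("SELECT pg_is_in_recovery()")
        if not cur.fetchone()[0]:
            return True  # this node is the master, not a standby
        # Seconds since the last transaction was replayed from the WAL stream.
        # Note this also grows on an idle master, so treat it as a rough signal.
        cur.execute(
            "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())"
        )
        lag = cur.fetchone()[0]
        return lag is not None and lag <= MAX_LAG_SECONDS
    finally:
        conn.close()
```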
The active-active pattern for distributed systems works with PostgreSQL only insofar as all insert/update queries happen on one server in one site at a time. Network latency to remote sites becomes a serious factor with this model.
Memcache has no built-in concept of replication. Any kind of data synchronization between sites must happen in scripts or in-application functionality you write. The application developer has complete control over the consistency versus availability question when partitions happen. If consistency is required, we recommend flushing the memcache on the fail-back event in any case.
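A fail-back flush can be as simple as the sketch below, assuming the python-memcached client; the server addresses are placeholders.

```python
# Flush every site-local memcache instance as part of the fail-back
# procedure so that no stale data survives the transition.  Assumes the
# python-memcached client; the server addresses are placeholders.
import memcache

SITE_LOCAL_MEMCACHE_SERVERS = ["10.0.0.11:11211", "10.0.0.12:11211"]

def flush_site_cache():
    mc = memcache.Client(SITE_LOCAL_MEMCACHE_SERVERS)
    mc.flush_all()        # invalidate everything this site has cached
    mc.disconnect_all()
```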
As none of Apache, nginx, Passenger, or Unicorn inherently has any concept of state built in, "consistency" is not a problem. These technologies are perfectly safe to use without having to worry about fail-over and fail-back transitions.
Having said this, it is extremely unlikely that your application code, or the external modules used with Apache or nginx, are entirely free of state. While these pieces of software in their simplest form don't need state, your application is almost certainly using state to do what it needs to do. The concern is not so much what these pieces of software do as what they touch.
DRBD enforces a synchronous update mode until the other node in the DRBD pair is thought to be down. At this point, the surviving node assumes it is the only survivor and goes into "master" mode, if it wasn't already there.
In light of a distributed network that experiences partitions, and without some other form of intervention, the DRBD servers will go into "split-brain" mode, which is nearly impossible to resolve without simply dumping one half of the cluster or the other. Locally, we use direct crossover cables to minimize the chances of unexpectedly entering split-brain mode. This does not work over a 2,000-mile network.
The other major consideration with DRBD is that it requires a very large amount of network bandwidth to synchronize the servers in a pair.
For these reasons, an inter-site DRBD relationship in a distributed system is not recommended under any circumstances. Instead, we recommend evaluating other solutions capable of eventual consistency, such as Amazon S3, rsync between site-local DRBD services, application code that pushes updates to both sites asynchronously, Hadoop, and Riak.
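As one example of the asynchronous-push approach, the sketch below periodically rsyncs a site-local, DRBD-backed data directory to the peer site. The paths, host, interval, and bandwidth limit are placeholders, and a real deployment would add locking, retries, and alerting.

```python
# Bare-bones eventual-consistency push: periodically rsync the local,
# DRBD-backed data directory to the remote site.  Paths, host, interval,
# and the bandwidth cap are placeholders for illustration only.
import subprocess
import time

LOCAL_DATA_DIR = "/srv/data/"                    # hypothetical local path
REMOTE_TARGET = "site-b.example.com:/srv/data/"  # hypothetical remote target
SYNC_INTERVAL_SECONDS = 300

def push_loop():
    while True:
        # --bwlimit keeps a large resync from saturating the inter-site link.
        subprocess.call(
            ["rsync", "-az", "--delete", "--bwlimit=5000",
             LOCAL_DATA_DIR, REMOTE_TARGET]
        )
        time.sleep(SYNC_INTERVAL_SECONDS)
```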
Since HA-NFS solutions use DRBD disks in the underlying technology stack, the same restrictions apply to them as to DRBD itself.
CDNs and other caching technologies are great solutions to use in all cases where strict consistency is not necessary and when serving slightly stale data is not a showstopper for your application. The nature of any cache is to favor availability over consistency. It is generally a good idea to flush caches as part of the fail-back process.
Redis' replication model resembles a simple master-slave relationship. You will more than likely treat a Redis store exactly as you would a PostgreSQL master-slave relationship.
The problem is that Redis is generally used for its speed, so the added latency of the active-active pattern usually becomes a showstopper for the single-master Redis approach. If you need to run using this model, consider using asynchronous processing and a caching layer, which, yes, basically defeats the purpose of using Redis in the first place.
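To make the trade-off concrete, here is a rough sketch assuming the redis-py client: reads are served by a site-local slave, while writes go to the single authoritative master, wherever it currently lives. Hostnames are placeholders.

```python
# Single-master Redis in an active-active deployment: reads hit the
# site-local slave, writes go to the one authoritative master.  Assumes
# the redis-py client; hostnames are placeholders.
import redis

local_slave = redis.Redis(host="redis.local.example.com", port=6379)
remote_master = redis.Redis(host="redis.master.example.com", port=6379)

def get_value(key):
    # Site-local and fast, but only eventually consistent with the master.
    return local_slave.get(key)

def set_value(key, value):
    # Pays the cross-site latency penalty; consider pushing this onto an
    # asynchronous work queue if the caller cannot afford to wait.
    remote_master.set(key, value)
```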
Since most of these types of systems were designed to work in a distributed system, they usually have the ability to adapt well to being used in… a distributed system. Said another way, in going through the work to make your application work around some of the bizarre restrictions and behaviors of these systems, you're necessarily going to make your application more tolerant of the eventual consistency model.
The important thing to note about these technologies is that they are almost universally written with the idea that availability is far more important than consistency. They all follow the "eventual consistency" model. As such, if strict consistency is required for any aspect of your application, you should not use one of these technologies for that particular data.
The exception is those eventually consistent stores that allow for a "strictly consistent" read or write. These operations are generally far slower, favor consistency over availability, and will hang or return an error when a partition exists.
Beyond this, it is very important that you understand the specific foibles and restrictions of these technologies, especially when an entire data center of cluster nodes is unexpectedly partitioned from the rest of the distributed system. Pay very close attention to how these technologies handle the fail-back transition, and remember that cross-continent bandwidth is a precious commodity.
Mongo has some very interesting capabilities when it comes to data replication and distribution. In general there are two modes of operation.
(Legacy) master-slave mode acts like any other database in master-slave mode and the same restrictions apply as apply to PostgreSQL master-slave or Redis master-slave. If your data set is massive, as is almost always the case for Mongo, beware the enormous bandwidth hit when resynchronizing as part of the fail-back event.
The newer, recommended form of Mongo replication involves using "replica sets." This allows for automatic sharding, scaling, failure detection and correction, redundant storage of data, and other features. It even has the ability to be data-center aware, so no data center is completely cut off from data during a partition event. In reality, this is still a form of master-slave replication, meaning all writes to the database still need to pass through an elected master server. This master can live in either data center when there is no partition. Again, because of latency realities in a distributed system, this can lead to inconsistent performance.
Due to the way election happens in a Mongo replica set, writes can only happen when quorum can be reached. This reveals an implied business decision of consistency over availability, meaning that for writes, at least, this is not an eventually-consistent model. Having said this, it appears from the Mongo documentation that the developers have largely solved many of the complexities that arise in a replica set when a fail-over or fail-back event happens. (We have no operational experience here.) In any case, you still need to beware of the massive bandwidth hit that can occur on the fail-back event.
Reads with Mongo are "eventually consistent" unless you specify strong consistency for that particular read.
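As an illustration with the PyMongo driver (hosts, replica-set name, database, and collection names are placeholders), a read that cannot tolerate stale data can be pinned to the elected primary, while everything else can be spread across secondaries:

```python
# Choosing consistency per read against a Mongo replica set.  Assumes the
# PyMongo driver; connection details and names are placeholders.
from pymongo import MongoClient, ReadPreference

client = MongoClient(
    "mongodb://mongo-a.example.com,mongo-b.example.com/?replicaSet=rs0"
)
db = client["appdb"]

# Reads that can tolerate slightly stale data may be served by secondaries.
relaxed_users = db.get_collection(
    "users", read_preference=ReadPreference.SECONDARY_PREFERRED
)

# Reads that must reflect the latest writes are pinned to the elected
# primary, favoring consistency; they fail if no primary can be elected.
strict_users = db.get_collection(
    "users", read_preference=ReadPreference.PRIMARY
)

recent = relaxed_users.find_one({"name": "alice"})
authoritative = strict_users.find_one({"name": "alice"})
```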
The one other major caveat we feel we need to mention is we have never actually seen a successful replica set implementation. This is not to say it does not work well, but given Mongo's history of losing entire data sets upon server crash, our recommendation is to ensure you are backing up your data often if you care to preserve it.
There is nothing inherent in the Rails framework that makes it unsuitable for use in a distributed system, but you should be aware of some trade-offs where the framework's designers made CAP-theorem decisions without distributed systems in mind; these can cause major headaches if they are not understood.
The biggest of these trade-offs is probably that, in an effort to remain agnostic about the database back-end flavor, the Rails developers chose to give application designers a way to validate the uniqueness of certain fields outside the context of the database itself. It is important to note that insistence on uniqueness implies a business decision for consistency over availability. If your back-end is eventually consistent, this check may not return trustworthy results.
You need to be very careful around such conventions if you are making an active-active distributed system.
Having read this far, you understand that the main purpose of this paper is to give you the background you actually need in order to start thinking about how to make your single-site application work well in a distributed environment. Once you feel comfortable with all of the above concepts, and assuming you are familiar enough with your own code base to understand the implications of the former in light of the latter, we recommend the following strategy with regard to where you should start:
As far as that plan is concerned, we believe the above patterns suggest a logical progression. As such we suggest the following broad outline:
If going from pattern zero to fully manual fail-over:
If going from pattern zero to active + reduced functionality standby:
If going from active + reduced functionality standby to active + full-functionality standby:
If going from active + full functionality standby to active + active:
We hope you have found this series on Distributed Systems Design interesting and informative. All further reading resources can be found at the end of Distributed Systems Design Part One.