Distributed Systems Design - Part 4/4


Blue Box is proud to present this series on understanding the basics of distributed systems design.

Distributed Systems Design — Part 1/4, posted on April 9th, reviewed common terminology, Brewer’s/CAP Theorem, and basic concepts in distributed system design. Distributed Systems Design — Part 2/4 covered general guidelines for distributed programming. Distributed Systems Design — Part 3/4 discussed partitioning, fail-over, and fail-back events.

Specific technology readiness for use in distributed systems

This section of the document discusses several commonly used pieces of software within site-local clusters and their applicability in a distributed system, as well as their CAP theorem-related predispositions.

MySQL

A MySQL master-slave relationship, like all simple master-slave relationships, is capable of having only one authoritative "master" at a single site. MySQL's standard replication procedure follows an eventual-consistency model, meaning the database remains available during a partition event, but the data between master and slave will temporarily be out of sync until the partition is resolved and the slave is able to catch up to the master.

A MySQL master-master setup is actually capable of running in active-active mode, with a few caveats:

  • You must have auto_increment_increment and auto_increment_offset set correctly (see the configuration sketch after this list) or replication will almost certainly break quickly and unrecoverably.
  • You must examine your schema for any unique non-auto-increment indexes. If uniqueness of the field is absolutely necessary, that implies a business decision favoring consistency over availability. This means you can only do inserts/updates to this field on one authoritative master server across your whole distributed system at a time.
  • Be careful with "update" queries. The last update wins, so it is possible for data to get out of sync between sites even under normal running conditions, due simply to normal network latency. In general, updates should only be run against a single authoritative master in the distributed system at a time.
  • Despite all of the above, it is still entirely possible to get the masters out of sync in subtle ways if arbitrary updates/inserts are being run on both masters at the same time. The rule of thumb should be to run write-type queries against a single master in the distributed network at a time, making exceptions only when you can be absolutely certain that data inconsistencies will not adversely affect your application.
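
For example, with two masters the usual arrangement is to have both servers step auto-increment values by 2 while starting from different offsets, so the two can never hand out the same ID. The sketch below applies the settings at runtime over a connection (hostnames and credentials are hypothetical); in practice you would also put the same values in each server's my.cnf so they survive a restart.

    require "mysql2"  # gem install mysql2

    # Hypothetical connection details for the two masters.
    site_a = Mysql2::Client.new(host: "db-site-a.example.com", username: "root", password: "secret")
    site_b = Mysql2::Client.new(host: "db-site-b.example.com", username: "root", password: "secret")

    # Both masters step auto-increment values by 2 but start at different
    # offsets, so the IDs they generate can never collide.
    site_a.query("SET GLOBAL auto_increment_increment = 2")
    site_a.query("SET GLOBAL auto_increment_offset = 1")   # site A generates 1, 3, 5, ...
    site_b.query("SET GLOBAL auto_increment_increment = 2")
    site_b.query("SET GLOBAL auto_increment_offset = 2")   # site B generates 2, 4, 6, ...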

In any case, as part of the health check of a given site, you should make sure to monitor the status of the replication relationship. Any site whose master either gets disconnected (when a partition isn't happening) or falls too far behind the other master(s) in the relationship should generally return a 'fail' to health checks of the database.
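
As a rough illustration, such a health check might look like the following sketch (the hostname, credentials, and the 120-second lag threshold are assumptions; adjust them to your environment):

    require "mysql2"  # gem install mysql2

    MAX_LAG_SECONDS = 120

    client = Mysql2::Client.new(host: "127.0.0.1", username: "monitor", password: "secret")
    status = client.query("SHOW SLAVE STATUS").first

    healthy =
      status &&
      status["Slave_IO_Running"] == "Yes" &&
      status["Slave_SQL_Running"] == "Yes" &&
      !status["Seconds_Behind_Master"].nil? &&
      status["Seconds_Behind_Master"].to_i <= MAX_LAG_SECONDS

    # Exit non-zero so the load balancer or monitoring system marks this site's
    # database as 'fail' and traffic can be directed elsewhere.
    exit(healthy ? 0 : 1)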

The active-active pattern for distributed systems works so long as any sensitive updates to the database only ever happen on one master in one site at a time.

PostgreSQL

At the time of this writing, PostgreSQL's built-in replication models are not capable of running in master-master mode. There are some technologies, like proxy solutions, which are capable of making a series of PostgreSQL servers act as if they are in a master-master relationship, but we suspect they do not react well to partitioning at this time.

PostgreSQL 9's standard replication model (i.e., streaming WAL updates) follows the eventual-consistency model much like the MySQL master-slave relationship does. PostgreSQL 9 is also capable of running in a synchronous replication mode, which favors consistency over availability. This is usually not desirable, as a partition means update queries will hang indefinitely (i.e., become unavailable).
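
Monitoring the standby's replay lag is the PostgreSQL equivalent of the MySQL replication check above. A minimal sketch, assuming a streaming-replication standby on PostgreSQL 9.1 or later (the hostname and credentials are hypothetical):

    require "pg"  # gem install pg

    standby = PG.connect(host: "pg-standby.example.com", dbname: "postgres", user: "monitor")

    # pg_last_xact_replay_timestamp() is the commit time of the last transaction
    # replayed from the master's WAL stream.
    row = standby.exec(<<~SQL).first
      SELECT pg_is_in_recovery() AS in_recovery,
             EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()) AS lag_seconds
    SQL

    puts "standby: #{row['in_recovery']}, replay lag: #{row['lag_seconds']} seconds"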

The active-active pattern for distributed systems works with PostgreSQL only insofar as all insert/update queries happen on one server in one site at a time. Network latency to remote sites becomes a serious factor with this model.

Memcache

Memcache has no built-in concept of replication. Any kind of data synchronization between sites must happen in scripts or in-application functionality you write. The application developer has complete control over the consistency-versus-availability question when partitions happen. In any case, if consistency is required, we recommend flushing memcache as part of the fail-back event.
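
Flushing is a one-liner with a typical Ruby memcache client; a sketch using the dalli gem (the hostname is an assumption):

    require "dalli"  # gem install dalli

    # Fail-back step: drop every cached value at the site that is rejoining,
    # so nothing stale from before the partition survives.
    cache = Dalli::Client.new("memcache-site-b.example.com:11211")
    cache.flush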

Apache / Nginx / Passenger / Unicorn

As none of Apache, nginx, Passenger, or Unicorn inherently has any concept of state built in, "consistency" is not a problem. These technologies are perfectly safe to use without having to worry about fail-over and fail-back transitions.

Having said this, it is extremely unlikely that your application code, or the external modules you use with Apache or nginx, avoids state information entirely. While these pieces of software in their simplest form don't need state, your application is probably using state to do what it needs to do. The concern is not so much what these pieces of software do as what they touch.

DRBD / HA-NFS

DRBD enforces a synchronous update mode until the other node in the DRBD pair is thought to be down. At this point, the surviving node assumes it is the only survivor and goes into "master" mode, if it wasn't already there.

In a distributed network that experiences partitions, without some other form of intervention the DRBD servers will go into "split-brain" mode, which is nearly impossible to resolve without simply discarding one half of the cluster or the other. Locally, we use direct crossover cables to minimize the chances of unexpectedly entering split-brain mode. This does not work over a 2000-mile network.

The other major consideration with DRBD is that it requires a very large amount of network bandwidth to keep the servers in the pair synchronized.

For these reasons, an inter-site DRBD relationship in a distributed system is not recommended under any circumstances. Instead, we recommend evaluating other solutions which are capable of doing eventual consistency. These include Amazon S3, rsync between site-local DRBD services, application code that pushes updates to both sites asynchronously, Hadoop, and Riak.
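
As one illustration of the last approach, here is a minimal sketch of pushing uploaded assets to both sites asynchronously (the site URLs, local path, and store_asset helper are hypothetical; a real implementation would persist and retry its queues):

    require "net/http"

    REMOTE_SITES = ["https://site-a.example.com/assets", "https://site-b.example.com/assets"]

    # One queue and one worker per site: a slow or partitioned site backs up
    # only its own queue and never blocks the request that accepted the upload.
    QUEUES = REMOTE_SITES.map { |base| [base, Queue.new] }.to_h

    QUEUES.each do |base, queue|
      Thread.new do
        loop do
          name, data = queue.pop
          begin
            Net::HTTP.post(URI("#{base}/#{name}"), data)
          rescue StandardError => e
            warn "push to #{base} failed: #{e.message}"  # real code would re-queue and retry
          end
        end
      end
    end

    def store_asset(name, data)
      File.write("/srv/assets/#{name}", data)      # the local write succeeds immediately
      QUEUES.each_value { |q| q << [name, data] }  # remote sites converge eventually
    end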

Since HA-NFS solutions utilize DRBD disks in the underlying technology stack, the same restrictions apply to them as to DRBD.

CDNs and other caching technologies

CDNs and other caching technologies are great solutions to use in all cases where strict consistency is not necessary and when serving slightly stale data is not a showstopper for your application. The nature of any cache is to favor availability over consistency. It is generally a good idea to flush caches as part of the fail-back process.

Redis

Redis' replication model resembles a simple master-slave relationship. You will more than likely treat a Redis store exactly as you would a PostgreSQL master-slave relationship.

The problem is that Redis is generally used for its speed, so the added latency of the active-active pattern generally becomes a showstopper for the single-master Redis approach. If you need to run in this model, consider using asynchronous processing and a caching layer, which, yes, basically defeats the purpose of using Redis in the first place.
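
If you do keep a single authoritative Redis master with site-local slaves, the slave's replication link belongs in the same site health check discussed for MySQL. A sketch using the redis gem (the hostname is an assumption):

    require "redis"  # gem install redis

    redis = Redis.new(host: "redis-site-b.example.com", port: 6379)
    info  = redis.info("replication")

    # A site-local slave is healthy only while its link to the remote master is up.
    healthy = info["role"] == "slave" && info["master_link_status"] == "up"
    exit(healthy ? 0 : 1)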

Hadoop, Riak, and other "eventually consistent" cloud technologies

Since most of these types of systems were designed to work in a distributed system, they usually have the ability to adapt well to being used in… a distributed system. Said another way, in going through the work to make your application work around some of the bizarre restrictions and behaviors of these systems, you're necessarily going to make your application more tolerant of the eventual consistency model.

The important thing to note about these technologies is that they are almost universally written with the idea that availability is far more important than consistency. They all follow the "eventual consistency" model. As such, if strict consistency is required for any aspect of your application, you should not use one of these technologies for that particular data.

The exception is those eventually consistent models which allow for a "strictly consistent" read or write. These operations are generally far slower and favor consistency over availability: a strictly consistent operation will hang or return an error when a partition exists.

Beyond this, it is very important you understand the specific foibles and restrictions of these technologies, especially if an entire data center of cluster nodes is unexpectedly partitioned from the rest of the distributed system. Pay very close attention to how these technologies handle the fail-back transition and remember cross-continent bandwidth is a precious commodity.

Mongo

Mongo has some very interesting capabilities when it comes to data replication and distribution. In general there are two modes of operation.

(Legacy) master-slave mode acts like any other database in master-slave mode, and the same restrictions apply as to PostgreSQL or Redis master-slave. If your data set is massive, as is almost always the case for Mongo, beware the enormous bandwidth hit when resynchronizing as part of the fail-back event.

The newer, recommended form of Mongo replication involves using "replica sets." This allows for automatic sharding, scaling, failure detection and correction, redundant storage of data, and other features. It even has the ability to be data-center aware, so no data center is completely cut off from data during a partition event. In reality, this is still a form of master-slave replication, meaning all writes to the database still need to pass through an elected master server. This master can live in either data center when there is no partition. Again, because of latency realities in a distributed system, this can lead to inconsistent performance.

Due to the way election happens in a Mongo replica set, writes can only happen when quorum can be reached. This reveals the implied business decision of consistency over availability, meaning for writes, at least, this is not an eventually-consistent model. Having said this, it appears from the Mongo documentation that the developers of this software have largely solved many of the complexities that need to happen in a Mongo replica set when a fail-over or fail-back event happens. (We have no operational experience here.) In any case, you still need to beware of the massive bandwidth hit that can occur when the fail-back event happens.

Reads with Mongo are "eventually consistent" when you allow them to be served by secondaries; if a particular read requires strong consistency, direct it to the elected master.
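
A minimal sketch of how these choices surface in the Ruby mongo driver (the hostnames, database name, and replica-set name are assumptions):

    require "mongo"  # gem install mongo

    client = Mongo::Client.new(
      ["mongo-site-a.example.com:27017", "mongo-site-b.example.com:27017"],
      database:    "app",
      replica_set: "rs0",
      write:       { w: :majority },               # writes require quorum: consistency over availability
      read:        { mode: :secondary_preferred }  # reads accept possibly stale local data: availability
    )

    client[:orders].insert_one(sku: "abc-123", qty: 1)   # blocks or errors if a partition breaks quorum
    order = client[:orders].find(sku: "abc-123").first   # may be stale if served by a lagging secondary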

The one other major caveat we feel we need to mention is that we have never actually seen a successful replica set implementation. This is not to say it does not work well, but given Mongo's history of losing entire data sets upon server crash, our recommendation is to ensure you back up your data often if you care to preserve it.

Rails

There is nothing inherent in the Rails framework which makes it unsuitable for use in a distributed system, but you should be aware of some trade-offs where the framework's designers made CAP-theorem decisions without distributed systems in mind; these can cause major headaches if they are not understood.

The biggest of these trade-offs is probably that, in an effort to keep Rails agnostic about the flavor of database back-end, the Rails developers chose to give application designers a way to validate the uniqueness of certain fields outside the context of the database itself. It is important to note that insistence on uniqueness implies a business decision for consistency over availability. If your back-end is eventually consistent, this check cannot be trusted to return accurate results.
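
The convention in question looks like the following sketch (the model and column are hypothetical). The validation issues a SELECT before the INSERT, so in an eventually consistent or multi-master setup two sites can both pass the check and then insert the same value:

    # Application-level uniqueness: the consistency decision is made outside the database.
    class User < ActiveRecord::Base
      validates :email, uniqueness: true
    end

    # If uniqueness truly matters, back it with a database-level unique index as
    # well, and accept that such writes belong on a single authoritative master.
    class AddEmailIndexToUsers < ActiveRecord::Migration
      def change
        add_index :users, :email, unique: true
      end
    end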

You need to be very careful around such conventions if you are making an active-active distributed system.

Where to actually get started

Having read this far, you understand the main purpose of this paper is to give you the background you actually need in order to start thinking about how to make your single-site application work well in a distributed environment. Once you feel comfortable with all of the above concepts, and assuming you are familiar enough with your own code base to understand the implications of the former in light of the latter, we recommend the following strategy with regard to where you should start:

  • Make sure you understand all the concepts discussed in this document.
  • Audit the components in your application in light of the concepts discussed in this document to evaluate their readiness to operate in a distributed environment.
  • Consider the cost and effort involved in getting your application into any of the above distributed system patterns.
  • Decide which pattern you want to successfully achieve given the above.
  • Develop a plan for getting there.

As far as that plan is concerned, we believe the above patterns suggest a logical progression. As such we suggest the following broad outline:

If going from pattern zero to fully manual fail-over:

  • See below for detailed information on things to try and techniques to follow, depending on which automated active + standby pattern you hope to most resemble with your manual process.
  • Carefully consider all the steps you would need to follow to transition your application to another datacenter and back again. Develop these into scripts or checklists to follow.
  • Test your fail-over and fail-back plans.

If going from pattern zero to active + reduced functionality standby:

  • Try running your site with all data-storing components in read-only mode.
  • Make the necessary code-changes such that the important features of your site which can operate in a read-only state actually do so.
  • Develop graceful workarounds / failure modes for those bits which necessarily cannot work in read-only mode.

If going from active + reduced functionality standby to active + full-functionality standby:

  • Develop the fail-over plan in the context of your application's needs. Again, failure at the local site means it should voluntarily fence itself off and refuse to become authoritative without administrator action.
  • Develop a simple check a site can run to determine whether it is the "authoritative" site (see the sketch after this list). This check should consider the status of the site-availability service, the health of the other site, and the health of the local site when returning the status of local authority.
  • Alter the sections of code that write to or update your data stores so they either perform the write or operate in read-only mode (i.e., fall back to the workaround / graceful failure) depending on the state of local authority.
  • Start introducing asynchronous processing where you can.
  • Develop the fail-back plan. Know what will need synchronization after a partition is resolved, as well as the administrator process for verifying the status of the reconnected site. Document this and make checklists.
  • Test the fail-over. Test the fail-back. Again, fail-over should be completely automatic. Fail-back should require some administrator action.
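
A rough sketch of such an authority check (the health endpoints, arbiter URL, and marker file are all assumptions; fail-over flips the answer automatically via the arbiter, while the marker file is only ever set or cleared by an administrator during fail-back):

    require "net/http"

    SITE_NAME    = "site-a"
    LOCAL_HEALTH = URI("http://localhost/health")
    PEER_HEALTH  = URI("https://site-b.example.com/health")
    ARBITER_URL  = URI("https://arbiter.example.com/active_site")

    def healthy?(uri)
      Net::HTTP.get_response(uri).is_a?(Net::HTTPSuccess)
    rescue StandardError
      false
    end

    def authoritative_site?
      return false unless healthy?(LOCAL_HEALTH)       # an unhealthy site fences itself off
      active = Net::HTTP.get(ARBITER_URL).strip rescue nil
      return active == SITE_NAME unless active.nil?    # normal case: the arbiter decides
      # Partition case: the arbiter is unreachable. Keep authority only if this
      # site already held it; never grab it automatically from the other site.
      File.exist?("/var/run/site-authoritative") && !healthy?(PEER_HEALTH)
    end

Write paths can then branch on authoritative_site? and otherwise fall back to the read-only or graceful-failure behavior developed earlier.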

If going from active + full functionality standby to active + active:

  • Make everything requiring consistency operate asynchronously if you can. Otherwise you need to sacrifice availability of these components when partitioned and without quorum.
  • Automate as much of your fail-back plan as you can. If it cannot be completely automated, know that you will need on-call administrators to handle the fail-backs.

We hope you have found this series on Distributed Systems Design interesting and informative. All further reading resources can be found at the end of Distributed Systems Design Part One.
