Distributed Systems Design - Part 2/4

Blue Box is proud to present this series on understanding the basics of distributed systems design.

Distributed Systems Design — Part 1/4, posted on April 9th, reviewed common terminology, Brewer’s/CAP Theorem, and basic concepts in distributed system design.

Some useful distributed design patterns

Now that we have gone through the fundamental concepts and familiarized you with the key differences to think about when programming and designing for a distributed system, you are probably wondering how to take an application that works well in a single cluster in a single datacenter and make it work well in a distributed fashion.  This section of the document is aimed at giving you some pointers to get started.

Note that there are a number of common design patterns used here (and not every pattern is covered below).  Which one is right for you will depend largely on business decisions revolving around budget, willingness to alter code, and the perceived cost of downtime or reduced functionality during a partition.

Running under optimal conditions

You will have no hope of getting your application to work in a failover or partitioned environment if you cannot get it to work when the network is healthy.  As such, we will discuss achieving this before we talk about handling partition events.

We highly recommend ensuring any given site-specific cluster which makes up part of your distributed application is fully redundant, with no single points of failure. A local failover due to a failing piece of hardware is far easier to deal with than a complete failover to a remote site caused by a single point of failure on the local network.  The most dangerous times for a distributed application are during the fail-over and fail-back stages, so minimizing the number of times you actually have to do this is an excellent idea.

There are a few standard patterns which get used here, and a couple variations of these patterns that should be discussed.  Please note we will devote the majority of the discussion below toward patterns where fail-over events are automatic.

Manual fail-over patterns

Pattern zero

Pattern zero is not actually a distributed pattern at all.  Rather, it is the term we are giving to the minimal cluster design we recommend you achieve at a single site before you can realistically start to contemplate creating a distributed application.  The key characteristics of pattern zero are that all or almost all single points of failure within the cluster design have been addressed, and that regular off-site backups are being made of all of the data you would need to set up a similar cluster elsewhere.

It should be noted that unless you are making regular off-site backups of your key data, you are taking a business-continuity risk.  If a major data-destroying disaster occurs at your data center without off-site backups, there will be no way to restore the lost data.  If you are dependent on your application data to do business (as most are), this could spell doom for your business.

This pattern's characteristics are:

  • External DNS service not necessary
  • All or almost all single points of failure have been eliminated in the single-site cluster.
  • Off-site backups of all important data are being kept.

Fully-manual fail-over pattern

In a fully-manual fail-over pattern, enough of an application footprint is kept in standby at a remote data center to resume some semblance of normal business operations in a relatively short amount of time.  However, all the steps for transitioning to the standby data center are managed manually.  This is the most basic system design which can be considered distributed.  Except for the fact that fail-over and fail-back happen only with administrator intervention, this pattern can otherwise closely resemble either the active + reduced functionality standby or active + full-functionality standby patterns discussed below.

If your goal is to turn a single-site (pattern zero) application into a distributed application with automated fail-over, designing the system first for fully-manual fail-over can be a logical step in the process.

In coming up with your design, we highly recommend you consider each task which needs to happen for a successful fail-over or fail-back event and develop detailed checklists or scripts for administrators to follow in case a manual fail-over becomes necessary. These scripts also become your guide for which tasks you need to automate, if you want to ultimately achieve some degree of an automatic fail-over pattern.
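
As a concrete (and deliberately simplified) illustration, a checklist like this is often captured as a runbook script so that every step is explicit, ordered, and confirmed by an operator. The step names and helper scripts below are hypothetical placeholders, not a prescription:

    # fail_over_runbook.py -- illustrative sketch only; the step names and helper
    # scripts are hypothetical placeholders for your own fail-over procedure.
    import subprocess
    import sys

    RUNBOOK = [
        ("Confirm with on-call staff that the primary site is down", ["./confirm_primary_down.sh"]),
        ("Promote the standby database to read/write",               ["./promote_standby_db.sh"]),
        ("Start application services at the standby site",           ["./start_standby_app.sh"]),
        ("Point DNS at the standby site",                            ["./update_dns_to_standby.sh"]),
    ]

    def confirm(prompt: str) -> bool:
        return input(f"{prompt} [y/N]: ").strip().lower() == "y"

    def main() -> None:
        for description, command in RUNBOOK:
            print(f"\nSTEP: {description}")
            if not confirm("Run this step now?"):
                print("Aborting fail-over; no further steps executed.")
                sys.exit(1)
            result = subprocess.run(command)
            if result.returncode != 0:
                print("Step failed; stop and investigate before continuing.")
                sys.exit(result.returncode)
        print("\nFail-over runbook completed.")

    if __name__ == "__main__":
        main()

The same list of steps, once scripted, doubles as your inventory of tasks to automate if you later move toward one of the automatic fail-over patterns.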

This pattern's characteristics are:

  • Requires DNS service which can be administered when the primary site is down.
  • Detailed checklists or scripts are used in lieu of automated processes for fail-over and fail-back events.
  • At a minimum, off-site backups of all vital data are necessary.
  • Otherwise, this pattern highly resembles one of the two active + standby patterns detailed below.

Automatic fail-over patterns

Active + reduced functionality standby

In the active + reduced functionality standby pattern, an external DNS service is used which has the capability of doing health checks against the full-production site. It will automatically send traffic to the standby site if the health checks detect a failure in the production site.  Under normal running conditions, all traffic gets sent to just the primary / active / full-production site.  The standby site itself can range from as simple as a single service which displays a "Site is down right now" message to clients, all the way up to something closely resembling production with only the most sensitive features of the site disabled.
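
The health check itself is often nothing more than a small HTTP endpoint inside the production site that the external DNS service polls. Here is a minimal, hedged sketch; the port and the crude TCP-level database check are assumptions, and any real DNS provider's health-check configuration will differ:

    # healthcheck.py -- hypothetical endpoint an external DNS service could poll.
    # Returns 200 only if the local database answers; otherwise 503, which the
    # DNS provider would treat as "site down" and shift traffic to the standby.
    from http.server import BaseHTTPRequestHandler, HTTPServer
    import socket

    def database_is_up(host: str = "127.0.0.1", port: int = 5432, timeout: float = 1.0) -> bool:
        """Crude TCP-level check against the local database; replace with a real query."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            status = 200 if database_is_up() else 503
            self.send_response(status)
            self.end_headers()
            self.wfile.write(b"OK\n" if status == 200 else b"FAIL\n")

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()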

In general, with this pattern, we recommend concentrating on making the standby site do as much as possible without requiring state information from the production site.  In other words, the standby-site runs in "read only" mode.  This way, failing over to the standby site and failing back can happen without much intelligence needing to be built in around synchronization. Most of the thinking / logic that has already gone into designing an application which functions in a non-distributed way does not need to change since the standby site is essentially stateless.
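
One common way to achieve that "read only" behavior is a site-wide flag set in the standby site's configuration, with every write path checking it. A minimal sketch, assuming a hypothetical READ_ONLY setting and a save_order() write path:

    # read_only_mode.py -- minimal sketch of gating writes at the standby site.
    # READ_ONLY would be True in the standby site's configuration and False at
    # the full-production site; the function and exception names are hypothetical.
    READ_ONLY = True  # standby site configuration

    class SiteReadOnlyError(Exception):
        """Raised when a write is attempted while the site is in read-only mode."""

    def save_order(order: dict) -> None:
        if READ_ONLY:
            # Show the user a friendly "temporarily unavailable" message instead
            # of accepting a write that cannot be synchronized back later.
            raise SiteReadOnlyError("Ordering is temporarily unavailable.")
        write_to_database(order)

    def write_to_database(order: dict) -> None:
        ...  # the real persistence logic of the production application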

In a nutshell, this pattern's characteristics are:

  • This is usually the least expensive option to implement from a software engineering perspective.
  • Requires external DNS service capable of doing site health checks.
  • Simplest automated pattern to implement when coming from single-site clustered application.
  • Standby site often does not need as much hardware as full-production site in order to work correctly.
  • If standby site implements some, but not all, production site functionality, testing your code becomes more difficult.
  • Fail-overs and fail-backs are relatively simple.
  • Generally this is the least expensive of the distributed architecture patterns from a raw infrastructure cost perspective.

Active + full-functionality standby

In the active + full-functionality standby pattern, an external DNS service is used which has the capability of doing health checks against the active site. It will automatically send traffic to the standby site if health checks fail against the active site.  Under normal running conditions, all traffic gets sent to just the active site.  The difference between this pattern and the one above is the standby site in this case synchronizes state information with the active site so that functionality is not reduced when fail-overs happen.

In this pattern, significantly more intelligence needs to be built around the fail-over and fail-back functionality.  Most of the work in getting the fail-over to happen will be in making sure the right data is being synchronized in the first place.  If this happens correctly, then fail-overs tend to be pretty simple.  Otherwise, this pattern also allows software engineers to mostly reuse the code and thinking that went into single-site application programming, because the normal situation is that only one site is active at a time and, therefore, less state data needs to be constantly or immediately shared between sites.
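
As one illustration of that synchronization work: if the shared state lives in PostgreSQL, the standby site would typically run a streaming replica, and you would want to know how far it lags behind the active site before trusting it for a fail-over. A rough sketch, assuming the psycopg2 driver and a connection to the standby replica (the threshold and connection details are placeholders):

    # replication_lag_check.py -- hedged sketch; assumes a PostgreSQL streaming
    # replica at the standby site and the psycopg2 driver. Connection details
    # and the acceptable-lag threshold are placeholders.
    import psycopg2

    MAX_ACCEPTABLE_LAG_SECONDS = 30  # arbitrary threshold for this example

    def standby_lag_seconds() -> float:
        """Ask the standby replica how far it is behind the active site."""
        conn = psycopg2.connect(host="standby-db.example.com", dbname="app", user="monitor")
        try:
            with conn.cursor() as cur:
                cur.execute(
                    "SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()), 0)"
                )
                return float(cur.fetchone()[0])
        finally:
            conn.close()

    if __name__ == "__main__":
        lag = standby_lag_seconds()
        if lag > MAX_ACCEPTABLE_LAG_SECONDS:
            print(f"WARNING: standby is {lag:.0f}s behind; failing over now would lose recent writes.")
        else:
            print(f"Standby lag is {lag:.0f}s; within acceptable bounds.")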

The fail-back process can be fairly involved depending on the specific technologies used in the application in the first place.  We generally do not recommend fail-backs be an automated event in this environment.

This pattern's characteristics are:

  • This pattern's software engineering expense lies in designing the synchronization, but it otherwise does not differ much from the active + reduced functionality standby pattern.
  • Requires external DNS service capable of doing site health checks.
  • Requires synchronization of some state data (e.g., database, shared filesystem, etc.) between sites.
  • Standby site needs as much hardware as active site in order to avoid reduced functionality or performance during partition.
  • Same code runs in both sites.
  • Fail-overs tend to be simple.  Fail-backs can be difficult and should generally be done manually.
  • This tends to be the most expensive of the distributed architecture patterns from a raw infrastructure cost perspective, as half of your hardware will essentially be sitting idle all the time.

Active + Active

In the active + active pattern, production site traffic is sent to all sites making up the distributed application at all times.  It is the most difficult pattern to implement when coming from a legacy application written for a single-site environment.  It is also the one most capable of dealing with network instability.  This is the pattern all the big guys out there use, including Google, eBay, and Amazon. It is often the one most clients would like to achieve with their distributed application.

As we can send traffic to many sites at the same time with this pattern, it opens up a number of other interesting options with regard to Global Server Load Balancing (GSLB).  As with the other patterns discussed here, an external DNS service capable of at least doing health checks is required.  With this pattern we may be able to use more advanced features of an external DNS service as well.

With the active + active pattern, a lot more work generally needs to be done within the application to get it to behave correctly in this environment.  Specifically, work needs to be done, or at least evaluations made, around every point in the application where state data is touched.
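
To make that concrete, here is a deliberately simplified sketch of one such evaluation: a user-profile update that is written locally at whichever site received the request and shipped asynchronously to the other site, with a last-write-wins rule to reconcile conflicting updates. The record shape and replication transport are assumptions, and last-write-wins is only one of several possible conflict-resolution strategies:

    # lww_profile_update.py -- simplified sketch of reconciling the same record
    # updated at two active sites. The record shape, timestamps, and transport
    # are assumptions; last-write-wins is just one possible conflict strategy.
    import time

    def local_update(record: dict, changes: dict, site: str) -> dict:
        """Apply a change locally and stamp it so the other site can order updates."""
        updated = {**record, **changes, "updated_at": time.time(), "updated_by_site": site}
        queue_for_replication(updated)  # asynchronous shipment to the other site
        return updated

    def merge_remote(local: dict, remote: dict) -> dict:
        """When the other site's copy arrives, keep whichever update is newer."""
        return remote if remote["updated_at"] > local["updated_at"] else local

    def queue_for_replication(record: dict) -> None:
        ...  # e.g., publish to a message queue consumed by the other site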

This pattern's characteristics are:

  • This is by far the most expensive pattern to implement from a software-engineering perspective.
  • Requires external DNS service capable of doing site health checks.  May be able to utilize other GSLB features of external DNS service.
  • Requires synchronization or other intelligence around handling all state data between sites.
  • Active / active sites need to be approximately equal size and have enough redundant capacity to take the full load in case of a complete site outage (think at least N+1 redundancy here.)
  • Same code runs at all sites.
  • Fail-overs tend to be simple.  Fail-backs tend to still be very complicated, but should be automated as much as possible.
  • Tends to be fairly expensive from a raw infrastructure perspective. With two sites, 50% of server capacity will be idle.  With three sites, 33% of server capacity will be idle.

What about auto-scaling the standby site?

If your failover strategy includes spinning up the standby site on the fly during a partition event, please note we generally do not recommend attempting to auto-scale the standby site, for the following reasons:

  • Testing these kinds of fail-overs tends to be difficult and inconclusive unless you're already in the habit of regularly performing these tests.
  • During an actual partition, the fewer moving parts you have and the fewer topology changes you have to make, the better.
  • Certain technologies, such as database servers, simply do not lend themselves to doing this well.
  • Spinning up and configuring new servers takes a non-trivial amount of time on most cloud service providers today. This is generally measured in minutes, and it would need to come down to seconds to be viable.
  • The failure of a major site can result in a rush on capacity in other data centers.  This means spinning up new servers will take even longer or may fail entirely, if the datacenter runs out of capacity.  The cloud provides the illusion of infinite capacity but no cloud actually can provide infinite capacity.  When the partition happens, it is far better for your servers to already be spun up.

Making your application capable of going active + active

When considering how to take your single-site application to an active + active distributed pattern, you need to concentrate on two key areas: stateful/shared data and cache usage.

Any time you touch cache, stateful, or shared data in your application, you need to make the CAP theorem business decision around that interaction.

Ideally, you want all aspects of your application to be able to operate with an eventual-consistency model, at least to a limited degree (because the effect of latency acts like a partition).  Where consistency of a given piece of data from one node of the cluster to another is not absolutely necessary, you want your code to be able to treat the data as if it can be at least partially inconsistent.

To help you make the distinction of what you are already doing in your code around the CAP theorem decisions, consider this:

  • Anywhere in your code where you use cache, you have already made the business decision that availability is more important than consistency, at least for a little while, for that data.
  • Anywhere in your code where you specifically cannot use cache, or must go to a "master" server (e.g., if MySQL slaves are used in your cluster), you have effectively made the business decision that consistency is more important than availability.
  • In most applications, "read" operations tend to not need strict atomic consistency and, therefore, can work with the eventual consistency model.
  • In most applications, many "write" operations are able to operate with the eventual consistency model if asynchronous processing is used. For most applications, though, there will almost always be a few "write" operations that require strict consistency.

Those pieces where the second point above applies need to be evaluated very carefully to see whether consistency truly is required or whether asynchronous processing can be used instead.
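
To make those implicit decisions visible, it can help to see the three cases side by side. The functions below are an illustrative sketch only; the in-memory cache and queue are stand-ins for whatever your application actually uses:

    # cap_decisions.py -- illustrative sketch; the cache, queue, and database
    # calls are in-memory stand-ins for your application's real services.
    import queue

    _cache: dict = {}             # stand-in for memcached / redis
    _async_queue = queue.Queue()  # stand-in for a real message queue

    def get_product_listing(product_id):
        """Read path: cache first. Availability has been chosen over consistency --
        a slightly stale listing is acceptable for a while."""
        if product_id in _cache:
            return _cache[product_id]
        listing = {"id": product_id, "name": "example"}  # stand-in for a replica read
        _cache[product_id] = listing
        return listing

    def record_page_view(product_id):
        """Write that tolerates eventual consistency: queue it asynchronously and
        let it reach the authoritative store whenever it can."""
        _async_queue.put(("page_view", product_id))

    def charge_credit_card(order):
        """Write that requires strict consistency: it must go to the single
        authoritative 'master' store; availability has been traded away here."""
        # In a real application this is a synchronous write to the master
        # database, which may live at another site (see the PostgreSQL example
        # later in this post).
        ...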

We suggest approaching these problems as follows:

  • Go through your application with a fine-toothed comb looking for any instances where an external service (i.e., one not on "localhost") is used.  Make a list of all of these, as well as the apparent decision about whether you are implicitly choosing availability or consistency with each interaction.
  • If any of these instances can be eliminated without making your application suffer, eliminate them.
  • For those services that are not absolutely necessary for main site functionality, alter the code around them to allow the external service to fail without bringing the whole application down.
  • Use asynchronous processing wherever you are able and where it makes sense to do so.
  • For those services where consistency is chosen over availability, evaluate whether consistency is absolutely necessary.  Try to use asynchronous processing where you can.
  • If you are not able to eliminate all instances in the application where consistency is 100% necessary, then you must write code to detect when a partition is happening and gracefully disable those features (a sketch of this kind of gating follows this list).  We will discuss how to reliably detect partitioning shortly.
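
As a hedged sketch of that gating, assume a partition_detected() check exists (how to implement it reliably is discussed later in this series); the feature and helper names here are hypothetical:

    # partition_gating.py -- sketch of disabling consistency-dependent features
    # during a partition. partition_detected() is assumed to exist (detection is
    # covered later); the feature and helper names are hypothetical.

    def partition_detected() -> bool:
        """Placeholder: in practice this is driven by health checks / quorum logic."""
        return False

    def can_place_order() -> bool:
        # Placing an order requires strictly consistent writes, so it must be
        # disabled while the sites cannot agree on shared state.
        return not partition_detected()

    def handle_checkout(request):
        if not can_place_order():
            return render_error("Checkout is temporarily unavailable. Please try again soon.")
        return process_order(request)

    def render_error(message: str):
        return {"status": 503, "body": message}

    def process_order(request):
        ...  # normal, consistency-requiring order processing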

When you just can’t sacrifice consistency

When you can’t sacrifice consistency, active + active can still work as long as partitions are not occurring.  For example, if your "master" PostgreSQL server lives in Seattle, then under normal running conditions your application servers in Virginia must connect to that "master" PostgreSQL server in Seattle (over an SSH or VPN tunnel) in order to perform these sensitive operations.  Try to minimize the occurrence of this kind of activity, as the latency of the coast-to-coast link will already add a lot to the response time of your Virginia application servers.
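
A minimal sketch of that routing decision, assuming the psycopg2 driver and placeholder hostnames (the Seattle master is reached through the tunnel's local endpoint):

    # write_routing.py -- sketch of keeping reads local and sending the few
    # consistency-critical writes to the remote "master". Hostnames, ports, and
    # credentials are placeholders; the master connection is assumed to go over
    # an SSH or VPN tunnel.
    import psycopg2

    LOCAL_REPLICA_DSN = "host=db-replica.virginia.example.com dbname=app user=app"
    REMOTE_MASTER_DSN = "host=127.0.0.1 port=6543 dbname=app user=app"  # tunnel to Seattle

    def read_connection():
        """Ordinary reads stay on the local replica to keep latency low."""
        return psycopg2.connect(LOCAL_REPLICA_DSN)

    def consistent_write_connection():
        """The few strictly consistent writes cross the tunnel to the Seattle master.
        Keep these rare: every call pays the coast-to-coast round trip."""
        return psycopg2.connect(REMOTE_MASTER_DSN)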

When partitions happen, you must be prepared to sacrifice availability of those site features where consistency is absolutely necessary.

*All references are provided at the end of Distributed Systems Design — Part 1/4.
