Brewer's CAP Theorem says that in a distributed system you can
guarantee at most two of Consistency, Availability and Partition
Tolerance. So there's a trade-off: when you are designing your system
you will have to decide which two of these three properties you want to
always maintain, and which one you are prepared to drop. However,
there appears to be a bit of confusion about what it would really
mean to drop partition tolerance. Is that even possible? (For example, see You Can't Sacrifice Partition Tolerance by Coda Hale.)
In fact you can "trade off" partition tolerance and build a system that guarantees both consistency and availability. This is
exactly the design decision that was made by the engineers who built
traditional telephone networks. However, this decision wasn't a "trade
off" in the usual sense, where you get to save money on one thing and
spend it on something else — instead they had to spend quite a lot more money to build the unusually reliable nodes and
communication links that let them drop partition tolerance as a whole-system
property.
To see how this works, I'll first give a (very brief) explanation of
what the CAP Theorem says, in order to pave the way for a
(fairly brief) explanation of the techniques you can use to build
reliable sub-systems. If you haven't come across the CAP Theorem before, I think the nicest introduction is Brewer's CAP Theorem by Julian Browne. (Some of what I say here is also based on Perspectives on the CAP Theorem by Seth Gilbert and Nancy Lynch, which is a more detailed overview; it's also worth reading CAP Twelve Years Later: How the "Rules" Have Changed, which is a retrospective and commentary by Eric Brewer himself.)
Consistency: This is close to the distributed systems idea of "safety". A system is "safe" if it never says anything wrong, that is, if every response sent to any client of the system is always correct. In systems of any complexity, this often amounts to saying that every request must appear to all clients to have been executed atomically (in a single instant) by some single central node.
Availability: This is close to the distributed systems idea of "liveness". A system is "live" if every request from every client eventually receives a response. In practice there's a lower limit on how quickly a response can arrive (the light-speed delay across whatever fraction of the distributed system is needed to produce it) and an upper limit, after which clients or people get bored, decide that the system has failed, and take remedial action.
Partition Tolerance: We say that a system has partition tolerance if it behaves as intended by its designers in the face of arbitrary message delay or message loss in the underlying communications system. Usually this will be because of a "network partition", where some link or node fails, but in best-effort networks it can also be caused by congestion, for example by packets being dropped at an overloaded port. A major practical problem with partition tolerance is that very often the different parts of a distributed system will disagree about whether or not there currently is a partition.
The CAP Theorem says that you can build a distributed system with any
two of these three properties. Traditional "ACID" SQL databases choose to
drop availability: they delay responses to clients until they are
certain to be consistent. More avant-garde "BASE" NoSQL systems choose
to drop consistency: under pressure they give fast but possibly
out-of-date responses and patch things up later. And old-fashioned
telephone networks drop partition tolerance: they use nodes and
communication links which are so reliable that the distributed system
(almost) never sees message loss or arbitrary delay. But how do you do
that?
The usual pragmatic solution in any situation where a component might
fail is to replicate that component. For example, a plane with two
engines should be able to reach its destination if one of them
fails. In our case things are slightly more interesting: if we replicate nodes and have them check each other, what's to stop a failed node from wrongly accusing a correct node of having failed?
A bad node could shut down a good node! To get over this problem,
we build nodes into fail-stop pairs with a separate checker:
This sub-system works roughly like this: a request comes into the
checker on one (or both) of the links on the right.
The checker forwards this request to both node1 and
node2. These nodes are exact duplicates, and work separately to produce
what should be identical responses. The checker makes sure that these
responses match, in which case it sends that response back to the
client. But if the responses don't match, the checker stops and
refuses to communicate on any of its ports. Making this sub-system run
again is a maintenance action. (When this architecture is
used in a railway-signalling system, the checker might physically blow
a fuse to make certain that it won't work again without maintenance.)
In addition to
making sure that the results from node1 and node2 match, the checker
also sets a time limit for their responses, and similarly fail-stops
if this limit is exceeded. (So at the lowest level, the checker
enforces availability or "liveness" as well as consistency between
node1 and node2.)
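To make this behaviour concrete, here is a minimal sketch of such a checker in Python. The names (`Checker`, `node.handle()`) and the threading details are my own illustration rather than anything from a real switch, but the logic is the one just described: forward the request to both nodes, enforce a deadline, compare the responses, and fail-stop on any discrepancy.

```python
import concurrent.futures
import time


class CheckerFailed(Exception):
    """Raised for every request once the checker has fail-stopped."""


class Checker:
    """Fail-stop checker: forwards each request to two replica nodes and
    returns a response only if both nodes agree before a deadline."""

    def __init__(self, node1, node2, timeout_s=0.1):
        self.node1, self.node2 = node1, node2
        self.timeout_s = timeout_s
        self.failed = False
        self._pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)

    def handle(self, request):
        if self.failed:
            raise CheckerFailed("sub-system has fail-stopped")
        f1 = self._pool.submit(self.node1.handle, request)
        f2 = self._pool.submit(self.node2.handle, request)
        deadline = time.monotonic() + self.timeout_s
        try:
            r1 = f1.result(timeout=self.timeout_s)                      # liveness
            r2 = f2.result(timeout=max(0.0, deadline - time.monotonic()))
        except concurrent.futures.TimeoutError:
            self._fail_stop("a node missed its deadline")
        if r1 != r2:                                                    # safety
            self._fail_stop("the two responses disagree")
        return r1

    def _fail_stop(self, reason):
        # After this, the checker refuses to communicate on any port.
        # In hardware the stop would be irreversible until a maintenance
        # action; here it is just a flag.
        self.failed = True
        raise CheckerFailed(reason)
```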
There are a lot of subtleties here. To defend against software
errors, it is preferable to have two different implementations of the
system in node1 and node2. To defend against hardware failure, it is
preferable to have different hardware in node1 and node2, or at least
to have a different encoding for data on the interfaces to the two
nodes. (In this case the checker ensures that responses are
equivalent, rather than bit-identical.) Each node may also run
"routining" code in
the background which continually checks on the consistency of its
internal data, to guard against bit-errors in memory. If a node finds
such an error it logs this problem and then simply stops itself. (The
checker will then cleanly fail the whole sub-system when
it subsequently doesn't get a response to a request.)
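The routining idea can be sketched in the same hypothetical style. This version assumes read-mostly tables whose baseline checksums stay valid, which is a simplification; the important part is that the node logs the problem and then simply stops itself, leaving the clean failure of the pair to the checker.

```python
import hashlib
import logging
import threading
import time


class RoutinedNode:
    """A replica node that 'routines' its internal tables in the background:
    it recomputes their checksums and stops itself on the first mismatch."""

    def __init__(self, tables):
        # Assumes read-mostly tables, so the baseline checksums stay valid.
        self.tables = tables
        self.baseline = {name: self._digest(data) for name, data in tables.items()}
        self.stopped = False
        threading.Thread(target=self._routine, daemon=True).start()

    @staticmethod
    def _digest(data):
        return hashlib.sha256(data).hexdigest()

    def _routine(self):
        while not self.stopped:
            for name, data in self.tables.items():
                if self._digest(data) != self.baseline[name]:
                    logging.error("routining: corruption detected in %r", name)
                    self.stopped = True      # stop answering; the checker will
                    return                   # then cleanly fail the whole pair
            time.sleep(1.0)

    def handle(self, request):
        if self.stopped:
            raise RuntimeError("node has stopped itself")
        return request  # placeholder for the node's real request processing
```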
And what happens if the checker fails? It should be possible to build
a checker which is more reliable than a node, because it is usually
much simpler than a node. However, it's going to fail sooner or later.
If it completely stops, that's actually
ok, but it might fail in some more insidious way, and not detect a
subsequent failure in node1 or node2. Depending on how paranoid you
are feeling, you might therefore double-up the checker, so that both
responses get passed through the first checker and checked again by the
second checker. (Or more ingeniously, you might apply these same
techniques recursively inside the checker.)
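A doubled-up checker amounts to two independent comparison stages in series: the pair of responses is released only if both stages agree, so an insidious fault in one stage cannot silently let a mismatch through. A sketch, again with invented names:

```python
class StageFailed(Exception):
    pass


class CheckStage:
    """One independent comparison stage; in a real system it would be built
    from its own hardware so it shares no failure modes with the other."""

    def __init__(self, name):
        self.name = name
        self.failed = False

    def compare(self, r1, r2):
        if self.failed or r1 != r2:
            self.failed = True
            raise StageFailed(self.name)
        return r1


def doubled_check(stage_a, stage_b, r1, r2):
    # Both node responses pass through the first stage and are then
    # checked again by the second before anything reaches the client.
    stage_a.compare(r1, r2)
    return stage_b.compare(r1, r2)
```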
With this architecture, we solve one of our most tricky problems:
reliably detecting partitions of one node. We can combine two of these
fail-stop pairs into a master-slave combination continually
exchanging I-am-still-alive signals, usually in the form of the data
updates needed to keep the slave's data-structures in sync with the
master.
If the master fails it will stop cleanly, the slave will
notice this after a short, predictable delay and will take over from
the master. (An alternative architecture which is sometimes used at
the lowest level is to
have three nodes rather than two and to arrange for a two-out-of-three
majority vote at the checker. This requires a more complex and
therefore more error-prone checker, but has the advantage that when one node fails, recovery is immediate and the remaining two good nodes
can continue as a fail-stop pair.)
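Here is a rough sketch of the slave's side of that arrangement, in which the stream of state updates doubles as the I-am-still-alive signal and a fixed silence limit gives the short, predictable take-over delay. The constant and the method names are illustrative, not taken from any real system.

```python
import threading
import time


class Slave:
    """Hot-standby slave: applies the master's updates and takes over
    if the updates (which double as I-am-still-alive signals) stop."""

    SILENCE_LIMIT_S = 0.5          # the predictable take-over delay

    def __init__(self):
        self.state = {}
        self.last_heard = time.monotonic()
        self.is_master = False
        threading.Thread(target=self._watchdog, daemon=True).start()

    def apply_update(self, key, value):
        # Called for every update the master sends; keeps our data
        # structures in sync and resets the silence timer.
        self.state[key] = value
        self.last_heard = time.monotonic()

    def _watchdog(self):
        while not self.is_master:
            if time.monotonic() - self.last_heard > self.SILENCE_LIMIT_S:
                # The master is a fail-stop pair, so silence means it has
                # stopped cleanly rather than gone haywire: safe to take over.
                self.is_master = True
            time.sleep(0.05)
```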
In this way we can
build resilient processing nodes and we can use the same techniques to
build resilient
communications nodes which forward data from one link to another. And
then by a combination of
link replication, forward-error-correction, bandwidth reservation and
automatic fail-over from one link to another we can ensure that
failures on a few links cannot impede traffic in the communication
network for more than a short
period. (It is customary to "over-dimension" these systems so that
they have considerably more capacity than their predicted peak load.)
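At the link level, the simplest form of this replication is to send every frame down two independently routed links and deliver whichever copy arrives first, discarding duplicates by sequence number. A sketch, assuming a hypothetical link object with a send() method:

```python
class ReplicatedLink:
    """Transmits every frame over two independently routed links and
    delivers the first copy to arrive, so one link failure is invisible."""

    def __init__(self, link_a, link_b):
        self.links = (link_a, link_b)
        self.next_seq = 0
        self.delivered = set()   # a real implementation would use a window

    def send(self, payload):
        # Transmit side: replicate the frame across both paths.
        seq, self.next_seq = self.next_seq, self.next_seq + 1
        for link in self.links:
            link.send((seq, payload))

    def deliver(self, seq, payload):
        # Receive side: called for every frame arriving on either link;
        # a duplicate sequence number is silently discarded.
        if seq in self.delivered:
            return None
        self.delivered.add(seq)
        return payload
```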
If this architecture is taken to the extremes seen in
traditional telephone switches, it's even possible to completely decommission and rebuild a switch while it continues to carry its rated traffic.
So you can "trade off" partition tolerance, but it's actually
rather costly. It's rather curious that by making the sub-systems more
fragile, we can make the whole system more robust. There's an almost
biological feel to this — it's a bit like apoptosis or
"programmed cell-death", where a cell detects that it's going a bit
wrong and rather than turn into a cancer it commits
suicide very cleanly. It's also rather curious that the properties we
enforce at
the lowest level of the checker are consistency and availability
— exactly the properties that we want to have at the top level.
In practice, as noted by Gilbert, Lynch and Brewer in the papers I mentioned earlier, real systems never trade off all of consistency, availability or partition tolerance. Instead, we
compromise a little on one to gain a little more of another, and we
make different compromises at different times or for different
purposes. But if you see a system which appears to be getting more
than its fair share of both consistency and availability, look a
little closer: it must be built on low-level resilience, using the sort
of techniques I've described here.