Sunday 22 September 2013

Putting the 'P' in CAP

Brewer's CAP Theorem says that in a distributed system you can guarantee at most two of Consistency, Availability and Partition Tolerance. So there's a trade-off: when you are designing your system you will have to decide which two of these three properties you want to always maintain, and which one you are prepared to drop. However, there appears to be a bit of confusion about what it would really mean to drop partition tolerance. Is that even possible? (For example, see You Can't Sacrifice Partition Tolerance by Coda Hale.)

In fact you can "trade off" partition tolerance and build a system that guarantees both consistency and availability. This is exactly the design decision that was made by the engineers who built traditional telephone networks. However, this decision wasn't a "trade off" in the usual sense, where you get to save money on one thing and spend it on something else — instead they had to spend quite a lot more money to build the unusually reliable nodes and communication links that let them drop partition tolerance as a whole-system property.

To see how this works, I'll first give a (very brief) explanation of what the CAP Theorem says, in order to pave the way for a (fairly brief) explanation of the techniques you can use to build reliable sub-systems. If you haven't come across the CAP Theorem before, I think the nicest introduction is Brewer's CAP Theorem by Julian Browne. (Some of what I say here is also based on Perspectives on the CAP Theorem by Seth Gilbert and Nancy Lynch, which is a more detailed overview; it's also worth reading CAP Twelve Years Later: How the "Rules" Have Changed, which is a retrospective and commentary by Eric Brewer himself.)

Consistency This is close to the distributed systems idea of "safety". A system is "safe" if it never says anything wrong: every response sent to any client of the system is correct. In systems of any complexity, this often amounts to saying that every request must appear to all clients to have been executed atomically (in a single instant) by some single central node.

Availability This is close to the distributed systems idea of "liveness". A system is "live" if every request from every client eventually receives a response. In practice there's a lower limit on how quick a response could be (light-speed delay across whatever fraction of the distributed system is necessary for that response) and there's an upper limit, after which clients (or the people using them) get bored, decide that the system has failed and take remedial action.

Partition Tolerance We say that a system has partition tolerance if it behaves as intended by its designers in the face of arbitrary message delay or message loss in the underlying communications system. Usually this will be because of a "network partition" where some link or node fails, but in best-effort networks it can also be caused by congestion, for example packets being dropped at an overloaded port. A major practical problem with partition tolerance is that very often the different parts of a distributed system will disagree about whether or not there currently is a partition.

The CAP Theorem says that you can build a distributed system with any two of these three properties. Traditional "ACID" SQL databases choose to drop availability: they delay responses to clients until they are certain to be consistent. More avant-garde "BASE" NoSQL systems choose to drop consistency: under pressure they give fast but possibly out-of-date responses and patch things up later. And old-fashioned telephone networks drop partition tolerance: they use nodes and communication links which are so reliable that the distributed system (almost) never sees message loss or arbitrary delay. But how do you do that?

The usual pragmatic solution in any situation where a component might fail is to replicate that component. For example, a plane with two engines should be able to reach its destination if one of them fails. In our case things are slightly more interesting because if we replicate nodes, and these check each other, what's to stop a failed node wrongly accusing a correct node of having failed? A bad node could shut down a good node! To get over this problem, we build nodes into fail-stop pairs with a separate checker:

This sub-system works roughly like this: a request comes into the checker on one (or both) of the links on the right. The checker forwards this request to both node1 and node2. These nodes are exact duplicates, and work separately to produce what should be identical responses. The checker makes sure that these responses match, in which case it sends that response back to the client. But if the responses don't match, the checker stops and refuses to communicate on any of its ports. Making this sub-system run again is a maintenance action. (When this architecture is used in a railway-signalling system, the checker might physically blow a fuse to make certain that it won't work again without maintenance.) In addition to making sure that the results from node1 and node2 match, the checker also sets a time limit for their responses, and similarly fail-stops if this limit is exceeded. (So at the lowest level, the checker enforces availability or "liveness" as well as consistency between node1 and node2.)
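As a rough sketch (in Python, with illustrative names throughout; a real checker would drive both nodes in parallel, typically in hardware), the checker's behaviour might look like this:

```python
import time

class FailStopError(Exception):
    """Raised once the checker has latched into its stopped state."""

class Checker:
    """Sketch of a fail-stop pair: two replica nodes behind a checker that
    compares their responses and stops cleanly on any mismatch or timeout."""

    def __init__(self, node1, node2, deadline=1.0):
        self.nodes = (node1, node2)
        self.deadline = deadline  # seconds allowed for both responses
        self.stopped = False      # once set, only maintenance can clear it

    def request(self, req):
        if self.stopped:
            raise FailStopError("fail-stopped; maintenance action required")
        start = time.monotonic()
        # Calling the nodes in turn keeps the sketch simple; a real checker
        # would forward the request to both nodes simultaneously.
        responses = [node.handle(req) for node in self.nodes]
        if time.monotonic() - start > self.deadline:
            self.stopped = True  # a node missed its deadline: stop cleanly
            raise FailStopError("response deadline exceeded")
        if responses[0] != responses[1]:
            self.stopped = True  # replicas disagree: stop rather than guess
            raise FailStopError("replica responses differ")
        return responses[0]
```

The important property is that the `stopped` flag latches: once the pair has disagreed or stalled once, the checker refuses all further traffic until someone intervenes.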

There are a lot of subtleties here. To defend against software errors, it is preferable to have two different implementations of the system in node1 and node2. To defend against hardware failure, it is preferable to have different hardware in node1 and node2, or at least to have a different encoding for data on the interfaces to the two nodes. (In this case the checker ensures that responses are equivalent, rather than bit-identical.) Each node may also run "routining" code in the background which continually checks on the consistency of its internal data, to guard against bit-errors in memory. If a node finds such an error it logs this problem and then simply stops itself. (The checker will then cleanly fail the whole sub-system when it subsequently doesn't get a response to a request.)
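The "routining" idea can be sketched like this (an illustrative Python toy, not any particular system's mechanism: the node keeps a checksum alongside its state, re-verifies it in the background, and stops itself on any mismatch, so the checker above it sees a clean absence of response rather than a wrong answer):

```python
import zlib

class SelfCheckingNode:
    """Node with a background 'routining' check: a CRC kept in step with
    the state detects memory bit-errors, and the node log-and-stops rather
    than serve a response computed from corrupt data."""

    def __init__(self, state: bytes):
        self.state = bytearray(state)
        self.crc = zlib.crc32(self.state)
        self.stopped = False

    def update(self, offset: int, value: int):
        """A legitimate update keeps the checksum in step with the data."""
        self.state[offset] = value
        self.crc = zlib.crc32(self.state)

    def routine_check(self) -> bool:
        """Run periodically in the background; False once the node has
        stopped itself, after which it never responds again."""
        if zlib.crc32(self.state) != self.crc:
            self.stopped = True  # corruption found: stop, don't answer
        return not self.stopped
```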

And what happens if the checker fails? It should be possible to build a checker which is more reliable than a node, because it is usually much simpler than a node. However, it's going to fail sooner or later. If it completely stops, that's actually ok, but it might fail in some more insidious way, and not detect a subsequent failure in node1 or node2. Depending on how paranoid you are feeling, you might therefore double-up the checker, so that both responses get passed through the first checker and checked again by the second checker. (Or more ingeniously, you might apply these same techniques recursively inside the checker.)

With this architecture, we solve one of our most tricky problems: reliably detecting partitions of one node. We can combine two of these fail-stop pairs into a master-slave combination continually exchanging I-am-still-alive signals, usually in the form of the data updates needed to keep the slave's data-structures in sync with the master. If the master fails it will stop cleanly, the slave will notice this after a short, predictable delay and will take over from the master. (An alternative architecture which is sometimes used at the lowest level is to have three nodes rather than two and to arrange for a two-out-of-three majority vote at the checker. This requires a more complex and therefore more error-prone checker, but has the advantage that when one node fails, recovery is immediate and the remaining two good nodes can continue as a fail-stop pair.)
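The two-out-of-three voter is small enough to sketch directly (Python again, with an illustrative function name): given the three nodes' responses, it returns the agreed value together with the index of the dissenting node, if any, so that node can be dropped and the remaining pair continue as a fail-stop pair. With no majority at all, the voter itself fail-stops.

```python
def majority_vote(responses):
    """Two-out-of-three vote over exactly three responses.
    Returns (agreed_value, index_of_dissenter), where the dissenter
    index is None if all three nodes agree. Raises (the fail-stop)
    when no two responses match."""
    a, b, c = responses
    if a == b:
        return a, (None if b == c else 2)  # c is the odd one out (or none)
    if a == c:
        return a, 1  # b dissents
    if b == c:
        return b, 0  # a dissents
    raise RuntimeError("no majority: fail-stop")
```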

In this way we can build resilient processing nodes and we can use the same techniques to build resilient communications nodes which forward data from one link to another. And then by a combination of link replication, forward-error-correction, bandwidth reservation and automatic fail-over from one link to another we can ensure that failures on a few links cannot impede traffic in the communication network for more than a short period. (It is customary to "over-dimension" these systems so that they have considerably more capacity than their predicted peak load.) If this architecture is taken to the extremes seen in traditional telephone switches, it's even possible to completely decommission and rebuild a switch while all the time it carries its rated traffic.
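As a toy illustration of the forward-error-correction idea (real links use far stronger codes; this is just the simplest possible scheme, with illustrative names): sending one XOR parity packet per group of equal-length data packets lets the receiver rebuild any single lost packet without waiting for a retransmission.

```python
def xor(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length packets."""
    return bytes(x ^ y for x, y in zip(a, b))

def add_parity(packets):
    """Append one parity packet: the XOR of all the data packets."""
    parity = packets[0]
    for p in packets[1:]:
        parity = xor(parity, p)
    return packets + [parity]

def recover(received):
    """received: the k+1 packets with at most one replaced by None (lost).
    Since the parity is the XOR of all data packets, XOR-ing together
    everything that did arrive reconstructs the missing packet."""
    if None not in received:
        return received[:-1]
    lost = received.index(None)
    rebuilt = None
    for p in received:
        if p is not None:
            rebuilt = p if rebuilt is None else xor(rebuilt, p)
    full = list(received)
    full[lost] = rebuilt
    return full[:-1]  # hand back just the data packets
```

Combined with sending the same group down a second, independent link, a single lost packet (or a short glitch on one link) never becomes visible as message loss to the layer above.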

So you can "trade off" partition tolerance, but it's actually rather costly. It's rather curious that by making the sub-systems more fragile, we can make the whole system more robust. There's an almost biological feel to this — it's a bit like apoptosis or "programmed cell-death", where a cell detects that it's going a bit wrong and rather than turn into a cancer it commits suicide very cleanly. It's also rather curious that the properties we enforce at the lowest level of the checker are consistency and availability — exactly the properties that we want to have at the top level.

In practice, as noted by Gilbert, Lynch and Brewer in the papers I mentioned earlier, in real systems we never trade off all of consistency, availability or partition tolerance. In practice, we compromise a little on one to gain a little more of another, and we make different compromises at different times or for different purposes. But if you see a system which appears to be getting more than its fair share of both consistency and availability, look a little closer: it must be based on low-level resilience using the sort of techniques I've described here.


  1. Bullcrap. It's not pick any two out of three. The theorem is that GIVEN that partitions can and will exist and there is nothing you can do about this, you cannot have both consistency and availability. In the above example, the checker refusing to provide an answer is a failure of Availability. The system is consistent, but there are cases where it is not available. The above example is also not resistant to partitions. It is true that partitions can occur due to network congestion or misconfiguration. However, it is more likely that a partition can result from a node being overloaded and unable to properly function. It doesn't just fail: it continues running and indeed thinks all is well. Meanwhile, the rest of the system sees that node as failed. The duration of time during which the overloaded node thinks that it is part of the system and the system considers it as failed may be short or long, but it is a partition. How the system chooses to handle this fact is up to the system. But the system does not get to "choose" whether it will experience partitions or not, nor is there any way to prevent partitions.

    1. Thanks for your comment.

      As to what the CAP theorem states, I think we can rely on Brewer to get it right. In "CAP Twelve Years Later: How the Rules Have Changed" he says: "The CAP theorem states that any networked shared-data system can have at most two of three desirable properties: consistency (C) equivalent to having a single up-to-date copy of the data; high availability (A) of that data (for updates); and tolerance to network partitions (P)."

      Brewer does go on to say the "2 of 3" is misleading, because it oversimplifies the tensions between the three properties. (But I think Brewer has chosen his words carefully: misleading is not the same as false.) Partition tolerance is essentially tolerance to arbitrary message loss. As Brewer notes: "The general belief is that for wide-area systems, designers cannot forfeit P". (Again, note the choice of words: general belief is not the same as truth.)

      The designers of traditional telecom systems tried hard to build systems which did not suffer from arbitrary message loss. They did this by building systems which were replicated, fragile and fail-stop at the lowest level, so that they could be resilient at the next level up, and that way the system running on top of all that could (largely) ignore the problem of message loss. I think it's interesting to present these techniques (invented a long time before the CAP theorem) as an attempt to "forfeit P" in order to have both "C" and "A".