[CentOS] two 2-node clusters or one 4-node cluster?

Fri Jul 6 16:17:16 UTC 2018
Warren Young <warren at etr-usa.com>

On Jul 5, 2018, at 11:10 AM, Digimer <lists at alteeve.ca> wrote:
> 
> There is no effective benefit to 3+ nodes

That depends on your requirements and definition of terms.

For one very useful (and thus popular) definition of “success” in distributed computing, you need at least four nodes, and that only allows you to catch a single failed node at a time:

    http://lamport.azurewebsites.net/pubs/reaching.pdf

See section 4 for the mathematical proof.  This level of redundancy is necessary to achieve what is called Byzantine fault tolerance, which applies when we cannot trust every node implicitly and thus must achieve consistency by consensus and cross-checks.

A less stringent success criterion assumes a less malign environment and allows for a minimum of three nodes to prevent system failure in the face of a single fault at a time:

    http://lamport.azurewebsites.net/pubs/lower-bound.pdf

With your suggested 2-node setup, you only get replication, which is not the same thing as distributed fault tolerance:

    http://paxos.systems/how.html

To see the distinction, consider what happens if one node in a mirrored pair goes down, and the surviving node is then somehow corrupted before the downed node returns: the returning node will be corrupted as soon as it resynchronizes, because it simply mirrors the corruption.

The n >= 2f+1 criterion prevents this problem in the face of normal hardware or software failure, with a minimum of 3 nodes required to reliably detect and cope with a single failure at a time.

The more stringent n >= 3m+1 criterion prevents this problem in the face of nodes that may be actively hostile, with 4 nodes being required to reliably catch a single traitorous node.
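
As a rough sketch of the arithmetic behind the two criteria, the following Python computes the minimum cluster size for a given number of faults, and the number of faults a given cluster size can absorb. The function names are made up for illustration and do not come from the papers. The Byzantine bound is written in the form the Byzantine Generals paper states it, n >= 3m+1.

```python
def min_nodes_crash(f: int) -> int:
    """Smallest cluster that tolerates f ordinary (non-malicious) faults: n >= 2f+1."""
    return 2 * f + 1


def min_nodes_byzantine(m: int) -> int:
    """Smallest cluster that tolerates m traitorous nodes: n >= 3m+1."""
    return 3 * m + 1


def max_crash_faults(n: int) -> int:
    """Largest f satisfying n >= 2f+1."""
    return (n - 1) // 2


def max_byzantine_faults(n: int) -> int:
    """Largest m satisfying n >= 3m+1."""
    return (n - 1) // 3


if __name__ == "__main__":
    # Columns: cluster size, tolerated ordinary faults, tolerated traitors
    for n in (2, 3, 4):
        print(n, max_crash_faults(n), max_byzantine_faults(n))
```

For n = 2 both functions return 0, which is the point of this thread: a 2-node pair tolerates no faults under either criterion.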

That terminology comes from one of the most important papers in distributed computing, “The Byzantine Generals Problem,” co-authored by Leslie Lamport, who was also involved in all of the above work:

    https://www.microsoft.com/en-us/research/uploads/prod/2016/12/The-Byzantine-Generals-Problem.pdf

And who is Leslie Lamport?  He is the 2013 Turing Award winner, which is as close to the Nobel Prize as you can get with work in pure computer science:

    https://en.wikipedia.org/wiki/Leslie_Lamport

So, if you want to argue with the papers above, you’d better bring some pretty good arguments. :)

If his current affiliation with Microsoft bothers you, realize that he did all of this work prior to joining Microsoft in 2001.

Also, he’s the “La” in LaTeX. :)

> (quorum is arguably helpful, but proper stonith, which you need anyway, makes it mostly a moot point).

STONITH is orthogonal to the concepts expressed in the CAP theorem:

   https://en.wikipedia.org/wiki/CAP_theorem

It is mathematically impossible to escape the restrictions of the CAP theorem.  I’ve seen people try, but it inevitably amounts to Humpty Dumpty logic: "When I use a word," Humpty Dumpty said, in rather a scornful tone, "it means just what I choose it to mean — neither more nor less."  You can win as many arguments as you like if you get to redefine the foundational terms upon which the argument is based to suit your needs at different stages of the argument.

With that understanding, we can say that a 2-node setup results in one of the following consequences:


OPTION 1: Give up Consistency (C)
---------------------------------
If you give up consistency, you get an AP system:

- While one node in a mirrored pair is down, the other is up, giving you availability (A), assuming all clients can normally see both nodes.

- Since you give up on consistency (C), you can put the two nodes in different data centers to gain partition tolerance (P) over the data paths within and between those data centers. This only gets you availability as well if both data centers can be seen by all clients.

In other words, you can tolerate either a loss of a single node or a partition between them, but not both at the same time, and while either condition applies, you cannot guarantee consistency in query replies.

This mode of operation is sometimes called “eventual consistency,” meaning that it’s expected that there will be periods of time where multiple nodes are online but they don’t all respond to identical queries with the same data.


OPTION 2: Require Consistency
-----------------------------
In order to get consistency, a 2-node system behaves like it is a bare-minimum quorum in a faux 3-node cluster, with the third node always MIA.  As soon as one of the two “remaining” nodes goes down, you have two bad choices:

1. Continue to treat it as a cluster.  Since you have no second node to check your transactions against to maintain consistency, the cluster must stop write operations until the downed node is restored.  Read-only operations can continue, if we’re willing to accept the risk of the remaining system somehow becoming corrupted without going down entirely.

2. Split the cluster so that the remaining node is a standalone instance, which means you have no distributed system any more, so the CAP theorem no longer applies.  You might continue to have C-as-in-ACID, if your software stack is ACID-compliant in the first place, but you lose C-as-in-CAP until the second system comes back up and is restored to full operation.

We can have a CA cluster in one of the two senses above.  Only the first is a true cluster, though, and it’s not useful in many applications, since it blocks writes.

CP has no useful meaning in this situation, since a network partition will split the cluster, so that you don’t actually have partition tolerance.  This is why the minimum number of nodes is 3: so that a quorum remains on one side of the partition!
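
The quorum argument can be sketched in a few lines of Python. This assumes a simple strict-majority quorum rule; real cluster managers offer tie-breakers and other quorum policies that this ignores. The function names are hypothetical.

```python
def has_quorum(visible: int, cluster_size: int) -> bool:
    """True if `visible` nodes form a strict majority of the whole cluster."""
    return visible > cluster_size // 2


def sides_with_quorum(side_a: int, side_b: int) -> list:
    """Given the node counts on each side of a partition, name the sides
    that still hold a strict majority of the full cluster."""
    n = side_a + side_b
    sides = (("A", side_a), ("B", side_b))
    return [name for name, size in sides if has_quorum(size, n)]


if __name__ == "__main__":
    print(sides_with_quorum(1, 1))  # 2-node cluster split down the middle
    print(sides_with_quorum(2, 1))  # 3-node cluster split 2|1
    print(sides_with_quorum(2, 2))  # 4-node cluster split 2|2
```

Under this rule a 1|1 split leaves neither side with quorum, while a 2|1 split leaves exactly one side able to continue. An even 2|2 split of a 4-node cluster also leaves nobody with quorum.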


Regardless of which path you pick above, you lose something relative to an n >= 2f+1 or an n >= 3m+1 cluster.  TANSTAAFL.

> Lastly, with 2x 2-node, you could lose two nodes
> (one per cluster) and still be operational. If you lose 2 nodes of a
> four node cluster, you're offline.

I don’t see how both statements can be true under the same CAP design restrictions, as laid out above.  I think if you analyze the configuration of your different scenarios, you’ll find that they’re not in the same CAP regime.

Lacking more information, I’m guessing this 2x2 configuration you’re thinking of is AP, and your 4-min-3 configuration is CP.

A 4-node AP configuration could lose 2 nodes and keep running, as long as all clients can see at least one of the remaining nodes, since you’ve given up on continual consistency.
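
To make the comparison concrete, here is a small Python sketch that enumerates every way of losing 2 of 4 nodes and counts how many of those cases leave the service running, under each regime. It rests on assumptions of my own: "CP" is modeled as strict-majority quorum, "AP" as "any one live node can answer," and "operational" as every cluster in the layout still being up. The node names and function names are hypothetical.

```python
from itertools import combinations


def cp_up(alive: set, cluster: tuple) -> bool:
    """CP regime (assumed): cluster runs only with a strict majority alive."""
    live = sum(1 for node in cluster if node in alive)
    return live > len(cluster) // 2


def ap_up(alive: set, cluster: tuple) -> bool:
    """AP regime (assumed): cluster runs if any node is alive."""
    return any(node in alive for node in cluster)


def count_survivable(clusters: list, is_up) -> int:
    """Count the 2-node-loss scenarios in which every cluster stays up."""
    all_nodes = [node for cluster in clusters for node in cluster]
    survived = 0
    for lost in combinations(all_nodes, 2):
        alive = set(all_nodes) - set(lost)
        if all(is_up(alive, cluster) for cluster in clusters):
            survived += 1
    return survived


ONE_BY_FOUR = [("a1", "a2", "b1", "b2")]
TWO_BY_TWO = [("a1", "a2"), ("b1", "b2")]

if __name__ == "__main__":
    for label, layout in (("1x4", ONE_BY_FOUR), ("2x2", TWO_BY_TWO)):
        for regime, is_up in (("CP", cp_up), ("AP", ap_up)):
            print(label, regime, count_survivable(layout, is_up), "of 6")
```

Under these assumptions, the 2x2 layout survives a double failure only in the AP regime, and only when the failures land one per cluster; the 4-node cluster survives every double failure in the AP regime and none in the CP regime. The layouts only look different when they are compared across regimes.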