Hello, I'm planning migration of current two clusters based on CentOS 6.x with Cman/Rgmanager going to CentOS 7.x and Corosync/Pacemaker.
As the clusters and their services are on the same subnet, and there no particular security concerns differentiating them, I'm also evaluating the option to transform the two clusters into a unique 4-node one during the upgrade.
Currently I'm testing a virtual 4-node CentOS 7.4 cluster inside oVirt 4.2 and things seem to behave well.
Before going further in deep with tests and so on, I'd like to check with the community about how many CentOS 7.x clusters composed by more than two nodes are in place and what are the feedbacks on them in terms of incremented latency/communication, ecc scaling out.
Also general feedback related to CentOS 6 and scalability of cluster nodes number is welcome.
Thanks in advance, Gianluca
On 2018-07-05 11:27 AM, Gianluca Cecchi wrote:
Hello, I'm planning migration of current two clusters based on CentOS 6.x with Cman/Rgmanager going to CentOS 7.x and Corosync/Pacemaker.
As the clusters and their services are on the same subnet, and there no particular security concerns differentiating them, I'm also evaluating the option to transform the two clusters into a unique 4-node one during the upgrade.
Currently I'm testing a virtual 4-node CentOS 7.4 cluster inside oVirt 4.2 and things seem to behave well.
Before going further in deep with tests and so on, I'd like to check with the community about how many CentOS 7.x clusters composed by more than two nodes are in place and what are the feedbacks on them in terms of incremented latency/communication, ecc scaling out.
Also general feedback related to CentOS 6 and scalability of cluster nodes number is welcome.
Thanks in advance, Gianluca
I always prioritize simplicity and isolation, so I vote fore 2x 2-nodes. There is no effective benefit to 3+ nodes (quorum is arguably helpful, but proper stonith, which you need anyway, makes it mostly a moot point).
Keep in mind; If your services are critical enough to justify an HA cluster, they're probably important enough that adding the complexity/overhead of larger clusters doesn't offset any hardware efficiency savings. Lastly, with 2x 2-node, you could lose two nodes (one per cluster) and still be operational. If you lose 2 nodes of a four node cluster, you're offline.
On Jul 5, 2018, at 11:10 AM, Digimer lists@alteeve.ca wrote:
There is no effective benefit to 3+ nodes
That depends on your requirements and definition of terms.
For one very useful (and thus popular) definition of “success” in distributed computing, you need at least four nodes, and that only allows you to catch a single failed node at a time:
http://lamport.azurewebsites.net/pubs/reaching.pdf
See section 4 for the mathematical proof. This level of redundancy is necessary to achieve what is called Byzantine fault tolerance, which is where we cannot trust all of the nodes implicitly, and thus must achieve consistency by consensus and cross-checks.
A less stringent success criterion assumes a less malign environment and allows for a minimum of three nodes to prevent system failure in the face of a single fault at a time:
http://lamport.azurewebsites.net/pubs/lower-bound.pdf
With your suggested 2-node setup, you only get replication, which is not the same thing as distributed fault tolerance:
To see the distinction, consider what happens if one node in a mirrored pair goes down and then the one remaining up is somehow corrupted before the downed node comes back up: the second will be corrupted as soon as it comes back up because it mirrors the corruption.
The n >= 2f+1 criterion prevents this problem in the face of normal hardware or software failure, with a minimum of 3 nodes required to reliably detect and cope with a single failure at a time.
The more stringent n > 2m+1 criterion prevents this problem in the face of nodes that may be actively hostile, with 4 nodes being required to reliably catch a single traitorous node.
That terminology comes from one of the most important papers in distributed computing, “The Byzantine Generals Problem,” co-authored by Leslie Lamport, who was also involved in all of the above work:
https://www.microsoft.com/en-us/research/uploads/prod/2016/12/The-Byzantine-...
And who is Leslie Lamport? He is the 2013 Turing Award winner, which is as close to the Nobel Prize as you can get with work in pure computer science:
https://en.wikipedia.org/wiki/Leslie_Lamport
So, if you want to argue with the papers above, you’d better bring some pretty good arguments. :)
If his current affiliation with Microsoft bothers you, realize that he did all of this work prior to joining Microsoft in 2001.
Also, he’s also the “La” in LaTeX. :)
(quorum is arguably helpful, but proper stonith, which you need anyway, makes it mostly a moot point).
STONITH is orthogonal to the concepts expressed in the CAP theorem:
https://en.wikipedia.org/wiki/CAP_theorem
It is mathematically impossible to escape the restrictions of the CAP theorem. I’ve seen people try, but it inevitably amounts to Humpty Dumpty logic: "When I use a word," Humpty Dumpty said, in rather a scornful tone, "it means just what I choose it to mean — neither more nor less.” You can win as many arguments as you like if you get to redefine the foundational terms upon which the argument is based to suit your needs at different stages of the argument.
With that understanding, we can say that a 2-node setup results in one of the following consequences:
OPTION 1: Give up Consistency (C) --------------------------------- If you give up consistency, you get an AP system:
- While one node in a mirrored pair is down, the other is up giving you availability (A), assuming all clients can normally see both nodes.
- Since you give up on consistency (C), you can put the two nodes in different data centers to gain partition tolerance (P) over the data paths within and between those data centers. This only gets you availability as well if both data centers can be seen by all clients.
In other words, you can tolerate either a loss of a single node or a partition between them, but not both at the same time, and while either condition applies, you cannot guarantee consistency in query replies.
This mode of operation is sometimes called “eventual consistency,” meaning that it’s expected that there will be periods of time where multiple nodes are online but they don’t all respond to identical queries with the same data.
OPTION 2: Require Consistency ----------------------------- In order to get consistency, a 2-node system behaves like it is a bare-minimum quorum in a faux 3-node cluster, with the third node always MIA. As soon as one of the two “remaining” nodes goes down, you have two bad choices:
1. Continue to treat it as a cluster. Since you have no second node to check your transactions against to maintain consistency, the cluster must stop write operations until the downed node is restored. Read-only operations can continue, if we’re willing to accept the risk of the remaining system somehow becoming corrupted without going down entirely.
2. Split the cluster so that the remaining node is a standalone instance, which means you have no distributed system any more, so the CAP theorem no longer applies. You might continue to have C-as-in-ACID, if your software stack is ACID-compliant in the first place, but you lose C-as-in-CAP until the second system comes back up and is restored to full operation.
We can have a CA cluster in one of the two senses above. Only the first is a true cluster, though, and it’s not useful in many applications, since it blocks writes.
CP has no useful meaning in this situation, since a network partition will split the cluster, so that you don’t actually have partition tolerance. This is why the minimum number of nodes is 3: so that a quorum remains on one side of the partition!
Regardless of which path you pick above, you lose something relative to an n >= 2f+1 or an n > 2m+1 cluster. TANSTAAFL.
Lastly, with 2x 2-node, you could lose two nodes (one per cluster) and still be operational. If you lose 2 nodes of a four node cluster, you're offline.
I don’t see how both statements can be true under the same CAP design restrictions, as laid out above. I think if you analyze the configuration of your different scenarios, you’ll find that they’re not in the same CAP regime.
Lacking more information, I’m guessing this 2x2 configuration you’re thinking of is AP, and your 4-min-3 configuration is CP.
A 4-node AP configuration could lose 2 nodes and keep running, as long as all clients can see at least one of the remaining nodes, since you’ve given up on continual consistency.
On Thu, Jul 5, 2018 at 7:10 PM, Digimer lists@alteeve.ca wrote:
First of all thanks for all your answers, all useful in a way or another. I have yet to dig sufficiently deep in Warren considerations, but I will do it, I promise! Very interesting arguments The concerns of Alexander are true in an ideal world, but when your role is to be an IT Consultant and you are not responsible for the budget and for the department, is not so easy to convince about emerging concepts that due to their nature are not so rock solid and accepted (yet). In my work lifetime I had the fortune to be on both the sides of the IT chair and so I think I'm able to see all the points. Eg in 2004 I was the IT Manager of a small company (without responsibility of the budget, I had to convince my CEO at that time; company revenue about 50 million euros) and I did migrate the physical environment to VMware and a Dell CX300 SAN, but it was not so easy, believe in me. I left the company at end of 2007 and the same untouched 3-years old environment ran for other 4 years without any modification or problems. And bare on me, at least in Italy in 2004 it wasn't a so common environment to setup for production.
I always prioritize simplicity and isolation, so I vote fore 2x 2-nodes.
There is no effective benefit to 3+ nodes (quorum is arguably helpful, but proper stonith, which you need anyway, makes it mostly a moot point).
In this particular scenario I run several Oracle RDBMS instances. They are
currently distributed as 3 big ones on the first cluster and other 7 smaller on the other one. With chance to grow up. So in my case I think I can spread better in my opinion the load and have better high availability.
Keep in mind; If your services are critical enough to justify an HA cluster, they're probably important enough that adding the complexity/overhead of larger clusters doesn't offset any hardware efficiency savings.
Probably true for old RHCS stack, based on Cman/Rgmanager. But from various tests it seems Corosync/Pacemaker is much more smooth in managing more than 2 nodes' clusters
Lastly, with 2x 2-node, you could lose two nodes (one per cluster) and still be operational. If you lose 2 nodes of a four node cluster, you're offline.
This is true with default configuration, but you can configure Auto Tie Breaker (ATB) as you can see with "man votequorum", or an example web page here: https://www.systutorials.com/docs/linux/man/5-votequorum/
I just tested and verified it on my virtual 4-nodes based on CentOS 7.4, where I have:
- modified corosync.conf on all nodes - pcs cluster stop --all - pcs cluster start --all - wait a few minutes for resources to start - shutdown cl3 and cl4
and this is the situation at the end, without downtime and with cluster quorate
[root@cl1 ~]# pcs status Cluster name: clorarhv1 Stack: corosync Current DC: intracl2 (version 1.1.16-12.el7_4.8-94ff4df) - partition with quorum Last updated: Sat Jul 7 15:25:47 2018 Last change: Thu Jul 5 18:09:52 2018 by root via crm_resource on intracl2
4 nodes configured 15 resources configured
Online: [ intracl1 intracl2 ] OFFLINE: [ intracl3 intracl4 ]
Full list of resources:
Resource Group: DB1 LV_DB1_APPL (ocf::heartbeat:LVM): Started intracl1 DB1_APPL (ocf::heartbeat:Filesystem): Started intracl1 LV_DB1_CTRL (ocf::heartbeat:LVM): Started intracl1 LV_DB1_DATA (ocf::heartbeat:LVM): Started intracl1 LV_DB1_RDOF (ocf::heartbeat:LVM): Started intracl1 LV_DB1_REDO (ocf::heartbeat:LVM): Started intracl1 LV_DB1_TEMP (ocf::heartbeat:LVM): Started intracl1 DB1_CTRL (ocf::heartbeat:Filesystem): Started intracl1 DB1_DATA (ocf::heartbeat:Filesystem): Started intracl1 DB1_RDOF (ocf::heartbeat:Filesystem): Started intracl1 DB1_REDO (ocf::heartbeat:Filesystem): Started intracl1 DB1_TEMP (ocf::heartbeat:Filesystem): Started intracl1 VIP_DB1 (ocf::heartbeat:IPaddr2): Started intracl1 oracledb_DB1 (ocf::heartbeat:oracle): Started intracl1 oralsnr_DB1 (ocf::heartbeat:oralsnr): Started intracl1
Daemon Status: corosync: active/enabled pacemaker: active/enabled pcsd: active/enabled [root@cl1 ~]#
This 4 node scenario seems also suitable to my needs because every one of the current 2-nodes clusters is a stretched one, with one node in site A and the other in site B. The future scenario will see 2 nodes of the new cluster in site A and 2 nodes in site B, so that a failure of a site will compromise 2 nodes, but with the setting above I can provide all the 10 RDBMS services spread between two nodes but allowing me to decide where to put them and not force to only a single node.
BTW: there is also last_man_standing option I can set so that I can also tolerate the loss ot site B and while not yet resolved, the loss of one of the two surviving nodes in site A (in this case possibly I will disable some less critical services or tolerate degraded performances)
In this case the configuration in my case would be (not tested yet):
quorum { provider: corosync_votequorum expected_votes: 4 last_man_standing: 1 auto_tie_breaker: 1 }
Note that it is applied only on node loss and not when a node leaves the cluster in a clean state, for which there is the "allow_downscale" option that seems not fully supported at this moment.
Cheers, Gianluca
On Jul 7, 2018, at 7:54 AM, Gianluca Cecchi gianluca.cecchi@gmail.com wrote:
I have yet to dig sufficiently deep in Warren considerations, but I will do it, I promise! Very interesting arguments
Until then, I offer this tl;dr:
A 4-member cluster has distinct advantages over a 2x2 cluster of the same machines *if* the software running that cluster knows how to take the full available advantage of that configuration.
That “if” is key. By it, I’m telling you I don’t know whether the clustering software you want to run uses a proper distributed consensus algorithm or is something simpler, like a blind mirroring system. If the latter, then the choice of configuration is purely an operational consideration, something only you can decide.
Am 05.07.2018 um 17:27 schrieb Gianluca Cecchi:
Hello, I'm planning migration of current two clusters based on CentOS 6.x with Cman/Rgmanager going to CentOS 7.x and Corosync/Pacemaker.
As the clusters and their services are on the same subnet, and there no particular security concerns differentiating them, I'm also evaluating the option to transform the two clusters into a unique 4-node one during the upgrade.
Currently I'm testing a virtual 4-node CentOS 7.4 cluster inside oVirt 4.2 and things seem to behave well.
Before going further in deep with tests and so on, I'd like to check with the community about how many CentOS 7.x clusters composed by more than two nodes are in place and what are the feedbacks on them in terms of incremented latency/communication, ecc scaling out.
Also general feedback related to CentOS 6 and scalability of cluster nodes number is welcome.
Thanks in advance, Gianluca
From my point of view such classical cluster setups are so 2000s. Outdated by modern infrastructure concepts you see implemented in Kubernetes, OpenShift or cloud solutions in general. It's commonly summarized in the phrase "pets versus cattle". You don't want clusters to be treated as pets. Has always been difficult to maintain.
Obviously I don't know what you run on your old cluster and whether you can migrate to a modern setup instead of replicating it on a current major release. You didn't give us details.
Alexander
On Jul 5, 2018, at 1:07 PM, Alexander Dalloz ad+lists@uni-x.org wrote:
classical cluster setups are so 2000s. Outdated by modern infrastructure concepts you see implemented in Kubernetes, OpenShift or cloud solutions in general. It's commonly summarized in the phrase "pets versus cattle". You don't want clusters to be treated as pets.
That depends on how expensive it is to grow the herd, to extend your metaphor.
For example, if you have a big DBMS, it’s probably much, much faster to maintain a live spare that you can fail over to instantly than it is to spin up a new VM with OpenKuberStack® and clone the complete DBMS over the Internet. At 1 Gbit/sec, every 10 TB costs you about a day of replication time!
The herd-of-cattle model and follow-ons like “serverless” assume you can spin a new one up in a second or so. That ain’t always possible.