[CentOS] Question about clustering

Tue Jun 17 14:32:50 UTC 2014
Digimer <lists at alteeve.ca>

On 17/06/14 10:23 AM, Denniston, Todd A CIV NAVSURFWARCENDIV Crane wrote:
>> -----Original Message-----
>> From: Digimer [mailto:lists at alteeve.ca]
>> Sent: Monday, June 16, 2014 3:20 PM
>> To: CentOS mailing list
>> Subject: Re: [CentOS] Question about clustering
>>
>> On 16/06/14 02:55 PM, m.roth at 5-cent.us wrote:
> <SNIP>
>>> One can also set the cluster nodes to failover, and when the failed node
>>> comes up, to *not* try to take back the services, leaving it in a state
>>> for you to fix it.
>>>
>>>           mark, first work on h/a clusters 1997-2001
>>
>> Failover and recovery are secondary to fencing. The surviving node(s)
>> can't begin recovery until the lost node is in a known state. To make an
>> assumption about the node's state (for example, assuming that losing
>> access to the node is enough to conclude that it is off) is to risk a
>> split-brain. Even something as relatively "minor" as a floating IP can
>> potentially cause ARP problems.
>>
>> Cheers
>
> Having operated a file serving cluster for a few years (~2001-2006) without ANY fencing device, I can tell you that it causes split-brain in the admins too, i.e., I AGREE.

To which I'd offer the analogy that in the 18 years I've driven a car,
I've never needed my seat belt or airbags. I still put my seat belt on
every time I go anywhere, though, and I won't buy a car without airbags. ;)

> Earlier, Alessandro Baggi wrote:
>> Is there a way to do fencing without hardware, using only software?
> To which Digimer answered: No. <SNIP info about fence device independence>
>
> However, there is an *Almost* software only fence.

If your goal is high availability, there is a strong argument that
"almost" isn't enough.

>   Unfortunately for me, I learned about (or at least understood) the stonith devices late in the above system's life. I expect even meatware stonith[1] could have saved me considerable pain five or six times.

Manual fencing was dropped as a supported fence method in RHEL 6 because
it was too prone to human error. When an HA cluster is hung and an
admin who might not have touched the cluster in months has users and
managers yelling at them, mistakes with potentially massive consequences
happen.

Manual fencing is just not safe.

> Understand that I am not recommending meatware stonith as a good operational stonith device (see [2] for how much subtle understanding the meat has to have), but it would be much better than NO operational stonith device.

Bingo on the meat, but I disagree that "no stonith" is an option at all.
A cluster must have fencing.
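To make the "fence first, recover second" ordering concrete, here is a minimal Python sketch. It assumes a hypothetical fence callback that returns True only when a fence device confirms the lost node is off; it is not the API of fenced, stonithd or any real fence agent, and the names Peer, NodeState and recover_services are illustrative only. A node that stops answering stays UNKNOWN, and recovery is allowed only once it is FENCED.

```python
from enum import Enum, auto
from typing import Callable


class NodeState(Enum):
    ONLINE = auto()   # heartbeats are arriving
    UNKNOWN = auto()  # heartbeats stopped; the node may still be running
    FENCED = auto()   # fencing was confirmed; the node is known to be off


class Peer:
    def __init__(self, name: str, fence: Callable[[str], bool]) -> None:
        self.name = name
        self.state = NodeState.ONLINE
        # Hypothetical fence callback: True only when the fence device
        # confirms the node was powered off or cut off from shared resources.
        self._fence = fence

    def heartbeat_lost(self) -> None:
        # Losing contact says nothing about whether the node is off.
        self.state = NodeState.UNKNOWN

    def try_fence(self) -> bool:
        if self._fence(self.name):
            self.state = NodeState.FENCED
            return True
        return False  # stay UNKNOWN; never guess that the node is off


def may_recover(peer: Peer) -> bool:
    """Taking over IPs, storage or services is only safe once the
    lost peer is in a known-off state."""
    return peer.state is NodeState.FENCED


def recover_services(peer: Peer, max_attempts: int = 3) -> bool:
    """Return True if recovery may proceed. A real cluster would keep
    retrying and stay blocked; this sketch stops after max_attempts and
    reports that recovery must not proceed."""
    peer.heartbeat_lost()
    for _ in range(max_attempts):
        if peer.try_fence():
            break
    return may_recover(peer)


if __name__ == "__main__":
    broken = Peer("node2", fence=lambda name: False)
    print(recover_services(broken))   # False: node2 is UNKNOWN, recovery blocked
    working = Peer("node2", fence=lambda name: True)
    print(recover_services(working))  # True: node2 is FENCED
```

The key design choice is that a timeout only ever moves a node to UNKNOWN, never to "off"; the only path to FENCED is a confirmed fence action, which is exactly the assumption the thread warns against skipping.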

> [1] http://clusterlabs.org/doc/crm_fencing.html#_meatware
> [2] http://oss.clusterlabs.org/pipermail/pacemaker/2011-June/010693.html
>
> Even when this disclaimer is not here:
> I am not a contracting officer. I do not have authority to make or modify the terms of any contract.

Cheers

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?