[CentOS] Question about clustering

Mon Jun 16 19:19:59 UTC 2014
Digimer <lists at alteeve.ca>

On 16/06/14 02:55 PM, m.roth at 5-cent.us wrote:
> Digimer wrote:
>> On 16/06/14 02:19 PM, John R Pierce wrote:
>>> On 6/16/2014 10:55 AM, Digimer wrote:
>>>> The main downside to fabric fencing is that the failed node will have
>>>> no chance of recovering without human intervention. Further, it places
>>>> the onus on the admin to not simply unfence the node without first
>>>> doing proper cleanup/recovery. For these reasons, I always recommend
>>>> power fencing (IPMI, PDUs, etc).
>>>
>>> how does power fencing change your first 2 statements in any fashion ?
>>> as I see it, it would make manual recovery even harder, as you couldn't
>>> even power up the failed system without first disconnecting it from the
>>> network
>>>
>>> When I have used network fencing, I've left the admin ports live, that
>>> way, the operator can access the system console to find out WHY it is
>>> fubar, and put it in a proper state for recovery. Of course, this
>>> implies you have several LAN connections, which is always a good idea
>>> for a clustered system anyways.
>>
>> Most power fencing methods are set to "reboot", which is "off -> verify
>> -> try to boot", with the "try to boot" part not affecting success of
>> the overall fence call. In my experience (dozens of clusters going back
>> to 2009), this has always left the nodes booted, save for cases where
>> the node itself had totally failed. I also do not start the cluster on
>> boot in most cases, so the node is there and waiting for an admin to
>> login, in a clean state (no concept of cluster state in memory, thanks
>> to the reboot).
>>
>> If you're curious, this is how I build my clusters. It also goes into
>> detail on the fencing topology and rationale:
>>
>> https://alteeve.ca/w/AN!Cluster_Tutorial_2
>
> One can also set the cluster nodes to failover, and when the failed node
> comes up, to *not* try to take back the services, leaving it in a state
> for you to fix it.
>
>          mark, first work on h/a clusters 1997-2001

Failover and recovery are secondary to fencing. The surviving node(s)
can't begin recovery until the lost node is in a known state. Assuming
the node's state (for example, treating an unreachable node as powered
off) risks a split-brain. Even something as relatively "minor" as a
floating IP can potentially cause problems, for example with ARP if
both nodes still claim the address.
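
To make the ordering concrete, here is a minimal Python sketch. It is
not how any real fence agent or cluster manager is implemented; every
function and parameter name in it is hypothetical. It only illustrates
two points from this thread: a "reboot" fence succeeds once power-off
is verified (the boot attempt doesn't count), and recovery is gated on
the fence result, never on whether the node is reachable.

```python
from typing import Callable


def fence_reboot(
    power_off: Callable[[], None],
    verify_off: Callable[[], bool],
    power_on: Callable[[], None],
) -> bool:
    """Hypothetical 'reboot' fence action: off -> verify -> try to boot."""
    power_off()
    if not verify_off():
        # Power-off could not be confirmed, so the node's state is
        # still unknown and the fence call must fail.
        return False
    try:
        power_on()
    except Exception:
        # A failed boot attempt does not change the fence result;
        # the node is known to have been powered off.
        pass
    return True


def handle_lost_node(
    node_reachable: bool,
    fence: Callable[[], bool],
    recover: Callable[[], None],
) -> bool:
    """Start recovery only after the lost node has been fenced."""
    # node_reachable is deliberately ignored: an unreachable node is
    # not necessarily off, so reachability proves nothing.
    if not fence():
        return False  # block recovery rather than risk a split-brain
    recover()
    return True
```

If the fence fails, handle_lost_node() returns False and recovery
never runs; the surviving node waits rather than guessing.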

Cheers

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?