On 16/06/14 02:55 PM, m.roth at 5-cent.us wrote:
> Digimer wrote:
>> On 16/06/14 02:19 PM, John R Pierce wrote:
>>> On 6/16/2014 10:55 AM, Digimer wrote:
>>>> The main downside to fabric fencing is that the failed node will have
>>>> no chance of recovering without human intervention. Further, it places
>>>> the onus on the admin to not simply unfence the node without first doing
>>>> proper cleanup/recovery. For these reasons, I always recommend power
>>>> fencing (IPMI, PDUs, etc).
>>>
>>> how does power fencing change your first 2 statements in any fashion ?
>>> as I see it, it would make manual recovery even harder, as you couldn't
>>> even power up the failed system without first disconnecting it from the
>>> network
>>>
>>> When I have used network fencing, I've left the admin ports live, that
>>> way, the operator can access the system console to find out WHY it is
>>> fubar, and put it in a proper state for recovery. of course, this
>>> implies you have several LAN connections, which is always a good idea
>>> for a clustered system anyways.
>>
>> Most power fencing methods are set to "reboot", which is "off -> verify
>> -> try to boot", with the "try to boot" part not affecting success of
>> the overall fence call. In my experience (dozens of clusters going back
>> to 2009), this has always left the nodes booted, save for cases where
>> the node itself had totally failed. I also do not start the cluster on
>> boot in most cases, so the node is there and waiting for an admin to
>> login, in a clean state (no concept of cluster state in memory, thanks
>> to the reboot).
>>
>> If you're curious, this is how I build my clusters. It also goes into
>> length on the fencing topology and rationale:
>>
>> https://alteeve.ca/w/AN!Cluster_Tutorial_2
>
> One can also set the cluster nodes to failover, and when the failed node
> comes up, to *not* try to take back the services, leaving it in a state
> for you to fix it.
>
> mark, first work on h/a clusters 1997-2001

Failover and recovery are secondary to fencing. The surviving node(s)
can't begin recovery until the lost node is in a known state. To make an
assumption about the node's state (by, for example, assuming that no
access to the node is sufficient to determine it is off) is to risk a
split-brain. Even something as relatively "minor" as a floating IP can
potentially cause problems with ARP, for example.

Cheers

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
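
For anyone who wants the ordering spelled out, here is a minimal Python sketch of the "off -> verify -> try to boot" fence call and of recovery waiting on it. Everything in it is hypothetical: SimulatedNode, fence_reboot and recover are illustrative names, not part of any real fence agent or cluster stack, and the simulated node stands in for whatever an IPMI BMC or switched PDU would report.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SimulatedNode:
    """Hypothetical stand-in for a node behind an IPMI BMC or switched PDU."""
    powered: bool = True
    off_works: bool = True    # does the power-off command actually take effect?
    boot_works: bool = True   # does the node come back when powered on?

    def power_off(self) -> None:
        if self.off_works:
            self.powered = False

    def is_off(self) -> bool:
        return not self.powered

    def power_on(self) -> None:
        if self.boot_works:
            self.powered = True


def fence_reboot(node: SimulatedNode) -> bool:
    """Off -> verify -> try to boot. Only the verified 'off' decides success."""
    node.power_off()
    if not node.is_off():
        return False          # node state is unknown, so the fence failed
    node.power_on()           # best effort; the outcome is deliberately ignored
    return True


def recover(node: SimulatedNode, services: List[str]) -> List[str]:
    """Return the services a survivor may take over (none if fencing failed)."""
    if not fence_reboot(node):
        return []             # never assume the lost node is off
    return list(services)


if __name__ == "__main__":
    dead = SimulatedNode(boot_works=False)
    print(fence_reboot(dead))                            # fence succeeds, node stays off
    print(recover(SimulatedNode(off_works=False), ["floating-ip"]))  # nothing recovered
```

The point of the sketch is the return path: a node that fails to boot still counts as fenced, while a node that cannot be confirmed off blocks recovery entirely, even for something as small as a floating IP.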