Digimer wrote:
On 16/06/14 02:19 PM, John R Pierce wrote:
On 6/16/2014 10:55 AM, Digimer wrote:
The main downside to fabric fencing is that the failed node will have no chance of recovering without human intervention. Further, it places the onus on the admin to not simply unfence the node without first doing proper cleanup/recovery. For these reasons, I always recommend power fencing (IPMI, PDUs, etc).
How does power fencing change your first 2 statements in any fashion? As I see it, it would make manual recovery even harder, as you couldn't even power up the failed system without first disconnecting it from the network.
When I have used network fencing, I've left the admin ports live. That way, the operator can access the system console to find out WHY it is fubar, and put it in a proper state for recovery. Of course, this implies you have several LAN connections, which is always a good idea for a clustered system anyways.
Most power fencing methods are set to "reboot", which is "off -> verify -> try to boot", with the "try to boot" part not affecting the success of the overall fence call. In my experience (dozens of clusters going back to 2009), this has always left the nodes booted, save for cases where the node itself had totally failed. I also do not start the cluster on boot in most cases, so the node is there and waiting for an admin to log in, in a clean state (no concept of cluster state in memory, thanks to the reboot).
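For readers who want the "off -> verify -> try to boot" sequence spelled out, here is a rough Python sketch. It is not how any real fence agent is implemented; FakeSwitch and fence_reboot are hypothetical names made up for illustration, standing in for an IPMI BMC or a switched PDU outlet. The only point it shows is that the fence result depends on the confirmed power-off, not on whether the node boots afterwards.

```python
class FakeSwitch:
    """Hypothetical stand-in for an IPMI BMC or a switched PDU outlet."""

    def __init__(self, boots=True):
        self.powered = True
        self.boots = boots  # whether the node will power back on

    def off(self):
        self.powered = False

    def is_off(self):
        return not self.powered

    def on(self):
        if not self.boots:
            raise RuntimeError("node did not power on")
        self.powered = True


def fence_reboot(switch):
    """Return True once the node is confirmed off; the boot is best-effort."""
    switch.off()
    if not switch.is_off():
        # Power-off could not be verified, so the fence call fails.
        return False
    try:
        switch.on()
    except Exception:
        # A failed boot attempt does not change the fence result.
        pass
    return True


if __name__ == "__main__":
    print(fence_reboot(FakeSwitch()))             # healthy node
    print(fence_reboot(FakeSwitch(boots=False)))  # node that stays down
```

In both cases the fence call reports success, because the node was verified off before the boot attempt. A totally failed node simply stays off.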
If you're curious, this is how I build my clusters. It also covers the fencing topology and rationale at length:
One can also set the cluster nodes to fail over and, when the failed node comes back up, to *not* try to take back the services, leaving it in a state for you to fix.
mark, first work on h/a clusters 1997-2001
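The "fail over, but do not take the services back" behaviour mentioned above can be sketched as a small placement rule. This is only an illustration in Python; choose_owner and its parameters are hypothetical names, not the API of any cluster manager. Real cluster stacks usually expose the same idea as a no-failback or stickiness setting.

```python
def choose_owner(current_owner, preferred, online, no_failback=True):
    """Pick which node should run a service after a membership change.

    current_owner: node running the service now (may have failed)
    preferred:     node the service would normally live on
    online:        set of nodes currently in the cluster
    no_failback:   if True, a recovered preferred node does not
                   pull the service back automatically
    """
    if current_owner in online:
        # The service is running somewhere healthy.
        if no_failback or preferred not in online:
            return current_owner
        return preferred
    # The owner is gone, so the service has to move.
    if preferred in online:
        return preferred
    # Fall back to any surviving node, chosen deterministically.
    return min(online) if online else None


if __name__ == "__main__":
    nodes = {"node1", "node2"}
    # node1 failed earlier, node2 took over, node1 is now back:
    print(choose_owner("node2", "node1", nodes))
    print(choose_owner("node2", "node1", nodes, no_failback=False))
```

With no_failback set, the service stays on node2 after node1 returns, so node1 sits idle until an admin has checked it over and moves the service back by hand.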