On 22/06/16 02:12 PM, Chris Adams wrote:
Once upon a time, Digimer <lists@alteeve.ca> said:
The cluster software and any hosted services aren't running. It's not that they think they're wrong, they just have no existing state so they won't try to touch anything without first ensuring it is safe to do so.
Well, I was being short; what I meant was, in HA, if you aren't known to be right, you are wrong, and you do nothing.
Ah, yes, exactly right.
SCSI reservations, and anything else that blocks access, are technically OK. However, I stand by the recommendation to power cycle lost nodes. It's by far the safest (and easiest) approach. I know it goes against the grain for sysadmins to yank power, but in an HA setup, nodes should be disposable and replaceable. The nodes are not important; the hosted services are.
One advantage SCSI reservations have is that if you can access the storage, you can lock out everybody else. It doesn't require access to a switch, management card, etc. (which may have problems of their own). If you can access the storage, you own it; if you can't, you don't. Putting a lock directly on the actual shared resource can be the safest path (if you can't access it, you can't screw it up).
I agree that rebooting a failed node is also good, just pointing out that putting the lock directly on the shared resource is also good.
The SCSI reservation protects shared storage only, which is my main concern. A lot of folks think that fencing is only needed for storage, when it is needed for all HA'ed services. If you know what you're doing though, particularly if you combine it with watchdog-based fencing like fence_sanlock, you can be in good shape (if very, very slow fencing times are OK for you).
In the end though, I personally always use IPMI as the primary fence method, with a pair of switched PDUs as my backup method. Brutal, simple and highly effective. :P
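
For anyone who wants to see the "primary method, then backup method" ordering spelled out, here is a rough Python sketch. It is only an illustration, not how any real fence agent or cluster manager is implemented: fence_node, all_outlets_off, ipmi_unreachable and pdu_ok are hypothetical names made up for this example, and the stand-in actions just return a result instead of talking to real hardware. It also assumes the PDU method only counts as a success when every outlet feeding the node has been cut.

```python
from typing import Callable, Sequence, Tuple

# A fence method is a (name, action) pair; the action returns True only
# when it has confirmed that the target node can no longer run services.
FenceMethod = Tuple[str, Callable[[str], bool]]


def fence_node(node: str, methods: Sequence[FenceMethod]) -> str:
    """Try each method in order; return the name of the first one that succeeds."""
    for name, action in methods:
        try:
            if action(node):
                return name
        except Exception:
            # A device that errors out (for example, an unreachable BMC)
            # is treated as a failed method, so move on to the next one.
            continue
    # Nothing confirmed the node is off, so recovery must not proceed.
    raise RuntimeError(f"all fence methods failed for {node}")


def all_outlets_off(outlet_actions: Sequence[Callable[[str], bool]]) -> Callable[[str], bool]:
    """Build an action that succeeds only if every PDU outlet was cut (assumption)."""
    def action(node: str) -> bool:
        results = [cut(node) for cut in outlet_actions]
        return all(results)
    return action


# Hypothetical stand-ins for real devices.
def ipmi_unreachable(node: str) -> bool:
    raise ConnectionError("BMC did not answer")


def pdu_ok(node: str) -> bool:
    return True


if __name__ == "__main__":
    methods = [
        ("ipmi", ipmi_unreachable),                   # primary
        ("pdu", all_outlets_off([pdu_ok, pdu_ok])),   # backup: both PDUs
    ]
    print(fence_node("node2", methods))  # prints: pdu
```

The part that matters is the last line of fence_node: if no method can confirm the node is off, the caller gets an error rather than a guess, which is the same "if you aren't known to be right, you do nothing" rule discussed earlier in the thread.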