[CentOS] KVM HA

Wed Jun 22 06:27:57 UTC 2016

On 22/06/16 02:03 AM, Indunil Jayasooriya wrote:
> On Wed, Jun 22, 2016 at 11:08 AM, Barak Korren <bkorren at redhat.com> wrote:
> 
>>>
>>> My question is: Is this even possible? All the documentation for HA that
>>> I've found appears to not do this. Am I missing something?
>>
>> You can use oVirt for that (www.ovirt.org).
>>
> 
> When an UNCLEAN SHUTDOWN happens, or ifdown eth0 is run on node1, can oVirt
> migrate VMs from node1 to node2?
> 
> In that case, is power management such as iLO needed?

I can't speak to ovirt (it's more of a cloud platform than an HA one),
but in HA in general, this is how it works...

Say node1 is hosting vm-A. Node1 stops responding for some reason (maybe
it's hung, maybe it's running but lost its network, maybe it's a pile of
flaming rubble, you don't know). Within a moment, the other cluster node(s)
will declare it lost and initiate fencing.

Typically "fencing" means "shut the target off over IPMI" (iRMC, iLO,
DRAC, RSA, etc). However, let's assume that the node lost all power
(we've seen this with voltage regulators failing on the mainboard,
shorted cable harnesses, etc). In that case, the IPMI BMC will fail as
well, so this method of fencing will fail.

The cluster can't assume that "no response from fence device A == dead
node". All you know is that you still don't know what state the peer is
in. To make an assumption and boot vm-A now would be potentially
disastrous. So instead, the cluster blocks and retries the fencing
indefinitely, leaving things hung. The logic being that, as bad as it is
to hang, it is better than risking a split-brain/corruption.
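
To make the blocking behaviour concrete, here is a rough Python sketch.
It is an illustration only, not pacemaker's actual code; the function
name fence_until_confirmed, the try_fence callable and the max_attempts
limit are all made up for the example.

```python
import itertools
from typing import Callable, Optional


def fence_until_confirmed(try_fence: Callable[[], bool],
                          max_attempts: Optional[int] = None) -> int:
    """Call try_fence() until it reports success; return the attempt count.

    max_attempts=None retries forever, which is the blocking behaviour
    described above. A limit exists only so a demo or test can terminate.
    """
    if max_attempts is None:
        attempts = itertools.count(1)
    else:
        attempts = range(1, max_attempts + 1)
    for attempt in attempts:
        if try_fence():
            return attempt
    # Out of attempts: the peer's state is still unknown.
    raise TimeoutError("peer state still unknown; do not recover its VMs")


# Simulated fence device (hypothetical): fails twice, then confirms
# that the node is off.
responses = iter([False, False, True])
attempts_needed = fence_until_confirmed(lambda: next(responses))
print(f"fence confirmed after {attempts_needed} attempts; safe to boot vm-A")
```

The point is that recovery of vm-A only ever happens after a confirmed
fence. A failed attempt never counts as "node is dead". A real cluster
would presumably also wait between attempts, which the sketch leaves out.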

What we do to mitigate this, and pacemaker supports this just fine, is
add a second layer for fencing. We do this with a pair of switched PDUs.
So say that node1's first PSU is plugged into PDU 1, Outlet 1 and its
second PSU is plugged into PDU 2, Outlet 1. Now, instead of blocking
after IPMI fails, the cluster moves on and turns off the power to those
two outlets. Being that the PDUs are totally external, they should be
up. So in this case, now we can say "yes, node1 is gone" and safely boot
vm-A on node2.
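
The two-layer ordering can be sketched the same way in Python. Again,
this is only an illustration under assumptions: fence_node and the three
device functions are hypothetical stand-ins, not a real fence agent API.
A level counts as successful only when every device in it confirms,
which is why both PDU outlets sit in the same level.

```python
from typing import Callable, List

FenceAction = Callable[[], bool]


def fence_node(levels: List[List[FenceAction]]) -> int:
    """Try each fence level in order.

    Returns the 1-based number of the first level in which every device
    confirmed success, or 0 if no level succeeded. all() stops at the
    first failing device, so the rest of that level is not attempted.
    """
    for number, devices in enumerate(levels, start=1):
        if all(device() for device in devices):
            return number
    return 0


# Hypothetical devices for node1, simulating a total power loss:
# the BMC is dead, but the external PDUs still answer.
def ipmi_off() -> bool:
    return False


def pdu1_outlet1_off() -> bool:
    return True


def pdu2_outlet1_off() -> bool:
    return True


levels = [
    [ipmi_off],                            # level 1: IPMI
    [pdu1_outlet1_off, pdu2_outlet1_off],  # level 2: both PSU feeds
]

level = fence_node(levels)
if level:
    print(f"node1 fenced at level {level}; safe to boot vm-A on node2")
else:
    print("fencing unconfirmed; vm-A stays down")
```

If only one of the two outlets were cut, the node could still be running
on its other PSU, so a single PDU success must not be treated as a fence.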

Make sense?

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?