core: fix network faults handling and fencing flow
This patch fixes network exception handling and fencing flow logic.
Problems in current code:

    1. Hard fencing happened too fast because we waited on the number of
attempts <or> the grace period, whichever was reached first. Since the
number of attempts is configured to a value of "2", the effective grace
period was ~20 seconds.

    2. VdsManager::isHostInGracePeriod was called periodically from both
VdsManager::handleNetworkException and
SshSoftFencingCommand::checkIfHostBecomeUp, which made the logic
complex and kept it from working as expected.
While we have to handle the network exception grace period when the host
is switched to 'connecting' state according to its load (number of
running VMs and SPM status), in the soft-fencing flow the host is
already in non-responding status, another host has already taken the SPM
role, and all its running VMs are set to 'unknown' status. So we should
not consider the host load at all, and a fixed grace period
(configurable, 1 min) is enough to restart the vdsmd service on the host
and get it up and running.
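
To make problem 1 concrete, here is a minimal standalone sketch of the two grace-period predicates. It is not the engine code: the class name, method names, and the 60-second timeout are assumptions chosen for illustration, and the "old" predicate follows the behaviour described above (grace ends when either limit is reached).

```java
// Hypothetical illustration only; names and numbers are not taken from ovirt-engine.
class GracePeriodSketch {

    // Behaviour described in problem 1: the host leaves the grace period
    // as soon as EITHER the attempts barrier or the timeout is exceeded.
    static boolean inGraceOld(int attempts, int barrier, long lastUpdate, long timeout, long now) {
        return attempts <= barrier && lastUpdate + timeout >= now;
    }

    // Behaviour after the patch: the host leaves the grace period
    // only when BOTH the attempts barrier and the timeout are exceeded.
    static boolean inGraceNew(int attempts, int barrier, long lastUpdate, long timeout, long now) {
        return attempts <= barrier || lastUpdate + timeout >= now;
    }

    public static void main(String[] args) {
        int barrier = 2;        // attempts barrier, "2" as in the commit message
        long lastUpdate = 0L;   // last successful communication (ms)
        long timeout = 60_000L; // assumed timeout to fence (ms)

        // Three failed attempts, 20 seconds after the last successful update.
        System.out.println(inGraceOld(3, barrier, lastUpdate, timeout, 20_000L)); // false
        System.out.println(inGraceNew(3, barrier, lastUpdate, timeout, 20_000L)); // true
        // Three failed attempts, 61 seconds after the last successful update.
        System.out.println(inGraceNew(3, barrier, lastUpdate, timeout, 61_000L)); // false
    }
}
```

With the "either" rule, a few quick failed attempts end the grace period long before the timeout; with the "both" rule, the timeout is always honored.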

Solution was tested with host as SPM with running VMs (some are HA),
with a non SPM host running VMs and with a regular host.

Results:

1. Both grace periods are honored: the initial one between connecting
and non-responding, and the one between soft-fencing and hard-fencing.

2. Code is more readable and straightforward

Signed-off-by: Eli Mesika <emesika@redhat.com>
Bug-Url: https://bugzilla.redhat.com/2071468
emesika authored and mwperina committed Apr 25, 2022
1 parent b1579af commit 292e637
Showing 2 changed files with 18 additions and 23 deletions.
@@ -119,13 +119,23 @@ private boolean checkIfHostBecomeUp() {
         VdsManager vdsManager = getResourceManager().getVdsManager(getVdsId());
         long sleepInterval = TimeUnit.SECONDS.toMillis(
                 Config.<Long> getValue(ConfigValues.VdsRefreshRate));
-        while (vdsManager.isHostInGracePeriod(true)) {
+        // Number of VMs running and SPM role are not relevant to grace time
+        // calculation here, since VMs are marked with unknown status at that point
+        // and another host will be selected for the SPM role.
+        long graceTime = TimeUnit.SECONDS.toMillis(
+                Config.<Integer>getValue(ConfigValues.TimeoutToResetVdsInSeconds));
+        long passedTime=0;
+        log.info("Waiting to host {} {} seconds to become up after soft fencing execution",
+                vdsManager.getCopyVds().getHostName(),
+                graceTime/1000);
+        while (passedTime <= graceTime) {
             if (vdsManager.getCopyVds().getStatus() == VDSStatus.Up) {
                 // host became Up during grace period
                 return true;
             }
             // wait until next host monitoring attempt
             ThreadUtils.sleep(sleepInterval);
+            passedTime += sleepInterval;
         }
         return false;
     }
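
The fixed-grace wait in this hunk can be sketched as a standalone loop. This is only an illustration under assumptions: `FixedGracePoll`, `waitUntilUp`, and the `BooleanSupplier` probe are hypothetical names standing in for the engine's status check, and plain `Thread.sleep` replaces the engine's `ThreadUtils.sleep`.

```java
import java.util.function.BooleanSupplier;

// Hypothetical illustration only; not the ovirt-engine implementation.
class FixedGracePoll {

    /**
     * Polls isUp every sleepMillis until it reports true or the accumulated
     * sleep time exceeds graceMillis. Returns true if the probe succeeded.
     */
    static boolean waitUntilUp(BooleanSupplier isUp, long graceMillis, long sleepMillis) {
        long passedMillis = 0;
        while (passedMillis <= graceMillis) {
            if (isUp.getAsBoolean()) {
                return true; // became up within the grace period
            }
            try {
                Thread.sleep(sleepMillis); // wait until the next probe
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
            passedMillis += sleepMillis;
        }
        return false; // grace period exhausted
    }

    public static void main(String[] args) {
        int[] polls = {0};
        // The probe reports "up" on the third poll.
        boolean up = waitUntilUp(() -> ++polls[0] >= 3, 100, 10);
        System.out.println(up + " after " + polls[0] + " polls"); // true after 3 polls
    }
}
```

As in the hunk, elapsed time is tracked as the sum of sleep intervals rather than wall-clock time, so the wait depends only on the configured grace period and the polling interval, not on host load.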
@@ -904,7 +904,7 @@ public void handleNetworkException(VDSNetworkException ex) {
         }
         if (cachedVds.getStatus() != VDSStatus.Down) {
             unrespondedAttempts.incrementAndGet();
-            if (isHostInGracePeriod(false)) {
+            if (isHostInGracePeriod()) {
                 if (cachedVds.getStatus() != VDSStatus.Connecting
                         && cachedVds.getStatus() != VDSStatus.PreparingForMaintenance
                         && cachedVds.getStatus() != VDSStatus.NonResponsive) {
@@ -963,31 +963,16 @@ private void restartVmsWithLeaseIfNeeded(List<VM> vms) {
     }
 
     /**
-     * Checks if host is in grace period from last successful communication to fencing attempt
+     * Checks if host is in grace period from last successful communication
      *
-     * @param sshSoftFencingExecuted
-     *            if SSH Soft Fencing was already executed we need to raise default timeout to determine if SSH Soft
-     *            Fencing was successful and host became Up
      * @return <code>true</code> if host is still in grace period, otherwise <code>false</code>
      */
-    public boolean isHostInGracePeriod(boolean sshSoftFencingExecuted) {
-        long timeoutToFence = calcTimeoutToFence(cachedVds.getVmCount(), cachedVds.getSpmStatus());
+    public boolean isHostInGracePeriod() {
         int unrespondedAttemptsBarrier = Config.<Integer>getValue(ConfigValues.VDSAttemptsToResetCount);
-
-        if (sshSoftFencingExecuted) {
-            // SSH Soft Fencing has already been executed, increase timeout to see if host is OK
-            timeoutToFence = timeoutToFence * 2;
-            unrespondedAttemptsBarrier = unrespondedAttemptsBarrier * 2;
-        }
-        // return when either attempts reached or timeout passed, the sooner takes
-        if (unrespondedAttempts.get() > unrespondedAttemptsBarrier) {
-            // too many unresponded attempts
-            return false;
-        } else if ((lastUpdate + timeoutToFence) < System.currentTimeMillis()) {
-            // timeout since last successful communication attempt passed
-            return false;
-        }
-        return true;
+        long timeToFence = calcTimeoutToFence(cachedVds.getVmCount(),
+                cachedVds.getSpmStatus());
+        return unrespondedAttempts.get() <= unrespondedAttemptsBarrier || lastUpdate + timeToFence >=
+                System.currentTimeMillis();
     }
 
     private void logHostFailToRespond(VDSNetworkException ex) {