Uncordon the node during failed updates

Today we cordon the node before writing updates to it. If a file write fails (e.g. a directory cannot be created), the update fails but the node stays cordoned. This causes a deadlock, because the node annotation for the desired config is no longer updated. With this rollback in place, deleting the erroneous MachineConfig lets us auto-recover from failed writes, as we already do for failed reconciliation. The side effect is that the node flips between Ready and Ready,Unschedulable, since each node event triggers another update attempt that goes through the full process.

Signed-off-by: Yu Qi Zhang <jerzhang@redhat.com>
openshift · Mar 19, 2020 · 8ee8efc
1 parent 14b5472
Showing 1 changed file with 9 additions and 0 deletions.
diff --git a/pkg/daemon/update.go b/pkg/daemon/update.go
@@ -279,6 +279,15 @@ func (dn *Daemon) update(oldConfig, newConfig *mcfgv1.MachineConfig) (retErr err
 		return err
 	}
 
+	defer func() {
+		if retErr != nil {
+			if err := drain.RunCordonOrUncordon(dn.drainer, dn.node, false); err != nil {
+				retErr = errors.Wrapf(retErr, "error rolling back cordon on the node: %v", err)
+				return
+			}
+		}
+	}()
+
 	// update files on disk that need updating
 	if err := dn.updateFiles(oldConfig, newConfig); err != nil {
 		return err