Uncordon the node during failed updates
Today we cordon the node before we write updates to the node. This
means that if a file write fails (e.g. failed to create a directory),
we fail the update but the node stays cordoned. This will cause
deadlocks as the node annotation for desired config will no longer
be updated.

With the rollback added, if you delete the erroneous machineconfig
in question, we will be able to auto-recover from failed writes,
like we do for failed reconciliation. The side effect of this is
that the node will flip between Ready and Ready,Unschedulable,
since each time we receive a node event we will attempt to update
again and go through the full process.

Signed-off-by: Yu Qi Zhang <jerzhang@redhat.com>
yuqi-zhang committed Mar 19, 2020
1 parent 14b5472 commit 8ee8efc
Showing 1 changed file with 9 additions and 0 deletions.
9 changes: 9 additions & 0 deletions pkg/daemon/update.go
@@ -279,6 +279,15 @@ func (dn *Daemon) update(oldConfig, newConfig *mcfgv1.MachineConfig) (retErr err
		return err
	}

	defer func() {
		if retErr != nil {
			if err := drain.RunCordonOrUncordon(dn.drainer, dn.node, false); err != nil {
				retErr = errors.Wrapf(retErr, "error rolling back cordon on the node: %v", err)
				return
			}
		}
	}()

	// update files on disk that need updating
	if err := dn.updateFiles(oldConfig, newConfig); err != nil {
		return err