Skip to content

Commit

Permalink
🐛 Fix a race condition between leader election and recorder
Browse files Browse the repository at this point in the history
This change introduces better syncronization between the leader election
code and the event recorder. Running tests with -race flag, we often saw
a panic on a closed channel, the channel was the one that the event
recorder was using internally.

After digging more through the code, it seems that we weren't properly
waiting for leader election code to stop completely, but instead we were
only calling the cancel() function asking the leader election to stop.

With this change, during a shutdown, we now wait for leader election to
finish up any internal task before we return and close an internal
channel. Only after leader election signals that the channel has been
closed, and  Run(...) has properly returned, we return execution to the
stop procedure, where the event recorder is then stopped.

Signed-off-by: Vince Prignano <vincepri@vmware.com>
  • Loading branch information
vincepri committed Feb 10, 2021
1 parent a763c9a commit fa1ed34
Show file tree
Hide file tree
Showing 2 changed files with 15 additions and 1 deletion.
15 changes: 14 additions & 1 deletion pkg/manager/internal.go
Original file line number Diff line number Diff line change
Expand Up @@ -117,6 +117,10 @@ type controllerManager struct {
// it must be deferred until after gracefulShutdown is done.
leaderElectionCancel context.CancelFunc

// leaderElectionStopped is an internal channel used to signal the stopping procedure that the
// LeaderElection.Run(...) function has returned and the shutdown can proceed.
leaderElectionStopped chan struct{}

// stop procedure engaged. In other words, we should not add anything else to the manager
stopProcedureEngaged bool

Expand Down Expand Up @@ -549,7 +553,12 @@ func (cm *controllerManager) waitForRunnableToEnd(shutdownCancel context.CancelF
// Cancel leader election only after we waited. It will os.Exit() the app for safety.
defer func() {
if cm.leaderElectionCancel != nil {
// After asking the context to be cancelled, make sure
// we wait for the leader stopped channel to be closed, otherwise
// we might encounter race conditions between this code
// and the event recorder, which is used within leader election code.
cm.leaderElectionCancel()
<-cm.leaderElectionStopped
}
}()

Expand Down Expand Up @@ -652,7 +661,11 @@ func (cm *controllerManager) startLeaderElection() (err error) {
}

// Start the leader elector process
go l.Run(ctx)
go func() {
l.Run(ctx)
<-ctx.Done()
close(cm.leaderElectionStopped)
}()
return nil
}

Expand Down
1 change: 1 addition & 0 deletions pkg/manager/manager.go
Original file line number Diff line number Diff line change
Expand Up @@ -350,6 +350,7 @@ func New(config *rest.Config, options Options) (Manager, error) {
livenessEndpointName: options.LivenessEndpointName,
gracefulShutdownTimeout: *options.GracefulShutdownTimeout,
internalProceduresStop: make(chan struct{}),
leaderElectionStopped: make(chan struct{}),
}, nil
}

Expand Down

0 comments on commit fa1ed34

Please sign in to comment.