🐛 Fix a race condition between leader election and recorder #1379

vincepri · 2021-02-09T17:55:06Z

This change introduces better synchronization between the leader election
code and the event recorder. When running tests with the -race flag, we
often saw a panic on a closed channel; the channel was the one that the
event recorder uses internally.

After digging further through the code, it seems that we weren't properly
waiting for the leader election code to stop completely; instead, we were
only calling the cancel() function to ask the leader election to stop.

With this change, during a shutdown, we now wait for leader election to
finish up any internal task before we return and close an internal
channel. Only after leader election signals that the channel has been
closed, and Run(...) has properly returned, do we return execution to the
stop procedure, where the event recorder is then stopped.

Signed-off-by: Vince Prignano <vincepri@vmware.com>
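
For illustration, here is a minimal sketch of the shutdown ordering described above. It is not the controller-runtime code: the `recorder` type, `shutdown`, and `demo` are hypothetical stand-ins, and the only point is that the recorder is stopped after the election goroutine has signalled that it is done, not merely after cancel() was called.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// recorder is a hypothetical stand-in for the event recorder. Recording
// after Stop returns an error here; in the real bug the equivalent
// situation was a panic on a closed channel.
type recorder struct {
	mu      sync.Mutex
	stopped bool
	events  []string
}

func (r *recorder) Event(msg string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.stopped {
		return errors.New("recorder already stopped")
	}
	r.events = append(r.events, msg)
	return nil
}

func (r *recorder) Stop() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.stopped = true
}

// shutdown asks leader election to stop, waits until it has actually
// stopped, and only then stops the recorder.
func shutdown(cancel context.CancelFunc, electionStopped <-chan struct{}, rec *recorder) {
	cancel()
	<-electionStopped // without this wait, Stop could race with a final Event
	rec.Stop()
}

// demo runs a fake election loop that records one last event while
// stopping, and returns the recorded events and the result of that call.
func demo() ([]string, error) {
	ctx, cancel := context.WithCancel(context.Background())
	electionStopped := make(chan struct{})
	rec := &recorder{}

	var lastErr error
	go func() {
		defer close(electionStopped) // signal only after the final event
		<-ctx.Done()
		lastErr = rec.Event("stopped leading")
	}()

	shutdown(cancel, electionStopped, rec)
	return rec.events, lastErr
}

func main() {
	events, err := demo()
	fmt.Println(events, err)
}
```

Dropping the `<-electionStopped` line in this sketch reintroduces the ordering problem: the final event may or may not arrive before the recorder is stopped.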


vincepri · 2021-02-09T17:55:13Z

/milestone v0.9.0

k8s-ci-robot · 2021-02-09T17:55:14Z

@vincepri: The provided milestone is not valid for this repository. Milestones in this repository: [1.0.0, Next, v0.5.x, v0.6.x, v0.7.x, v0.8.x, v0.9.x]

Use /milestone clear to clear the milestone.

In response to this:

/milestone v0.9.0


vincepri · 2021-02-09T17:55:17Z

/milestone v0.9.x

vincepri · 2021-02-09T17:55:30Z

/assign @alvaroaleman @christopherhein @DirectXMan12

christopherhein · 2021-02-09T22:45:07Z

Nice, this looks good to me, I will wait for others to chime in.

alvaroaleman · 2021-02-10T02:58:37Z

pkg/manager/internal.go

 			cm.leaderElectionCancel()
+			<-cm.leaderElectionStopped


One concern (probably more about the unconditional leaderElectionCancel() call than your change):
If we attempt to shut down gracefully and the Runnables do not end in time, they will end up running after we are no longer the elected leader (since we now block until that happens).

Maybe cancel the leader election only if the rest of the func has a nil error or an error that is not context.Canceled?

You mean checking the return err and only if it's not nil, cancel leader election? The error comes from shutdownCtx, if we don't cancel leader election, I'd assume that it'll timeout after a bit and the main.go should exit, right?

Trying to double check the thinking, I'd also like to capture in a comment around the err check

Only if it's nil, cancel leader election. If it's not nil, that means something is still running, and it's unsafe to cancel leader election because that might result in the thing still running after we are no longer the leader.

This should result in the binary exiting, which exits all pending goroutines, and our leader lease will just time out.

Signed-off-by: Vince Prignano <vincepri@vmware.com>

vincepri · 2021-02-10T19:30:31Z

@alvaroaleman ptal

alvaroaleman

/lgtm

k8s-ci-robot · 2021-02-10T19:33:49Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alvaroaleman, vincepri

Needs approval from an approver in each of these files:

~~OWNERS~~ [alvaroaleman,vincepri]

k8s-ci-robot added the cncf-cla: yes label Feb 9, 2021

k8s-ci-robot requested review from droot and mengqiy February 9, 2021 17:55

k8s-ci-robot added this to the v0.9.x milestone Feb 9, 2021

k8s-ci-robot added the approved and size/S (10-29 changed lines, ignoring generated files) labels Feb 9, 2021

k8s-ci-robot assigned alvaroaleman, christopherhein and DirectXMan12 Feb 9, 2021

alvaroaleman reviewed Feb 10, 2021

Only cancel leader election if the runnables have shutdown

c326e7a

Signed-off-by: Vince Prignano <vincepri@vmware.com>

alvaroaleman approved these changes Feb 10, 2021

k8s-ci-robot added the lgtm label Feb 10, 2021

k8s-ci-robot merged commit 16bf3ad into kubernetes-sigs:master Feb 10, 2021

vincepri mentioned this pull request Feb 10, 2021

🐛 [relase-0.8] Fix a race condition between leader election and recorder #1381

Merged
