Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Leader election failure to restart #783

Merged
merged 11 commits into from
Jun 28, 2024

Conversation

exextatic
Copy link
Contributor

@exextatic exextatic commented Jun 26, 2024

Fixes #677.

This pull request fixes the following behavioral issues noted when testing a leader-aware operator with transient network issues:

  • In LeaderElectionBackgroundService, if elector.RunUntilLeadershipLostAsync() throws, the exception is not observed in the library and no further attempts to become the leader occur. The library now logs any unexpected exceptions and tries to become the leader again.
  • A leader could not stop and then subsequently start being a leader once more due to cancellation token sources not being recreated. The library now disposes and recreates the cancellation token sources as required.
  • LeaderAwareResourceWatcher<TEntity>.StoppedLeading would erroneously pass a cancelled cancellation token to ResourceWatcher<TEntity>. The library now passes the IHostApplicationLifetime.ApplicationStopped token to the ResourceWatcher<TEntity> - we can assume that ApplicationStopped is a good indication that the stop should no longer be graceful.

Due to the nature of these issues I've only been able to add test coverage in the form of unit tests.
I've tested these changes in practice using clumsy and an AKS cluster.

@buehler
Copy link
Owner

buehler commented Jun 27, 2024

Thank you @exextatic for your contribution!

It's fine to have unit tests, testing network loss is pretty hard to setup, as such I trust your message that you tested it ;-)

The linter is complaining about some white spaces though. Can you run dotnet format and push the commit again? I'd love to merge these changes.

@exextatic
Copy link
Contributor Author

Thank you @exextatic for your contribution!

It's fine to have unit tests, testing network loss is pretty hard to setup, as such I trust your message that you tested it ;-)

The linter is complaining about some white spaces though. Can you run dotnet format and push the commit again? I'd love to merge these changes.

Thank you for looking at the changes @buehler.
I've amended the whitespace formatting issue in my latest commit.

@buehler buehler enabled auto-merge (squash) June 28, 2024 08:22
@buehler buehler merged commit e0523ca into buehler:main Jun 28, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[bug]: LeaderAwareResourceWatcher does not regain leadership after network issues
2 participants