Revert database on joining node if cluster join fails #12811

Merged
merged 4 commits into from
Feb 3, 2024

@masnax masnax commented Feb 2, 2024

Closes #12624

Joining dqlite happens in the Join function under lxd/cluster/membership.go, where we lock the global database, create raft node entries in the local database, and reload the gateway. If an issue on the dqlite end prevents us from joining the cluster, we error out and return, but the database is still locked and the gateway still behaves as though the node has joined the cluster. This leaves the node in an unrecoverable state, as it will perpetually wait for connections to a cluster that it's not a part of.

To fix that, this adds some revert hooks that ensure the node returns to a state where we can tell it to re-join the cluster with a new join token. In particular, the revert hooks clear out the raft nodes in the local database, refresh the gateway again to pick this up, and unlock the global database.

The previous error message didn't exist anymore. A good example of why
not to use hard-coded error messages :)
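As an illustration of the point about hard-coded error messages, the sketch below contrasts matching an error by its text with matching it by identity. Everything here is an assumption made for the example: `errJoinFailed`, `joinCluster`, and the message strings are hypothetical and are not taken from this PR or from LXD.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errJoinFailed is a hypothetical sentinel error used only in this sketch.
var errJoinFailed = errors.New("failed to join cluster")

// joinCluster is a hypothetical function that always fails, wrapping the
// sentinel with %w so callers can still identify it.
func joinCluster(address string) error {
	return fmt.Errorf("%w: no response from %q", errJoinFailed, address)
}

// matchesByText is the brittle approach: it silently stops matching as soon
// as the wording of the message changes.
func matchesByText(err error) bool {
	return strings.Contains(err.Error(), "Failed to join the cluster")
}

// matchesByIdentity keeps working regardless of how the message is worded.
func matchesByIdentity(err error) bool {
	return errors.Is(err, errJoinFailed)
}

func main() {
	err := joinCluster("10.0.0.1:8443")
	fmt.Println("text match:", matchesByText(err))
	fmt.Println("identity match:", matchesByIdentity(err))
}
```

The text-based check here looks for an older wording that the function no longer produces, which is the kind of drift the commit message describes.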

Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>

masnax commented Feb 2, 2024

Looks like the linter didn't like this one too much.

I've fixed most of the issues, but I had to add an exception for deep-exit, which limits calls to os.Exit() to the main and init functions. When we remove a node from a cluster, it tries to re-exec itself, but we just exit if we detect that LXD is using systemd socket activation. Not sure if there's a cleaner way to reconcile this.

@tomponline tomponline left a comment

Would it be possible to add a test for this?

@tomponline tomponline merged commit ab1a421 into canonical:main Feb 3, 2024
26 checks passed
Successfully merging this pull request may close these issues.

Clean up after cluster join errors