Revert database on joining node if cluster join fails #12811
Merged
Closes #12624
Joining dqlite happens in the `Join` function under `lxd/cluster/membership.go`, where we lock the global database, create raft node entries in the local database, and reload the gateway. If an issue on the dqlite end prevents us from joining the cluster, we error out and return, but the database is still locked and the gateway still behaves as though the node has joined the cluster. This leaves the node in an unrecoverable state, as it will perpetually wait for connections to a cluster that it is not a part of.

To fix that, this adds some revert hooks that ensure the node returns to a state where we can tell it to re-join the cluster with a new join token. In particular, the revert hooks clear out the raft nodes in the local database, refresh the gateway again to pick this up, and unlock the global database.