
Fix LXD lock-up on concurrent cluster joins. #12571

Merged 1 commit on Dec 5, 2023

Conversation

@masnax (Contributor) commented Nov 29, 2023

When lots of nodes are joining the dqlite cluster concurrently, there's a good chance that some of them will reach out to the leader in the middle of a role change. go-dqlite then fails to add the node to the cluster, and LXD doesn't recover cleanly from this: LXD on the joining node locks up entirely, and remains unrecoverable even after a restart.

go-dqlite errors out here because the underlying request to dqlite reports a SQLITE_BUSY error with the message "a configuration change is already in progress". We can infer from this error (and from go-dqlite's own handling of cluster joins) that we should keep trying to join the cluster until the join either succeeds or fails with some other error.

To that end, this PR adds a clusterBusyError sentinel error that is checked whenever an attempt to join the cluster fails. We keep retrying every second, for up to one minute, with the existing context, until the join succeeds or fails with a more terminal error.
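
As an illustrative sketch of that loop (joinCluster stands in for LXD's actual join call; only the sentinel check and the one-second/one-minute cadence come from this PR):

// joinWithRetry retries the join every second for up to one minute
// while dqlite keeps reporting a concurrent configuration change.
// Illustrative only; not the actual LXD implementation.
func joinWithRetry(ctx context.Context, joinCluster func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, time.Minute)
	defer cancel()

	for {
		err := joinCluster(ctx)
		if err == nil {
			return nil
		}

		// Any error other than the sentinel is treated as terminal.
		if err.Error() != clusterBusyError.Error() {
			return err
		}

		select {
		case <-ctx.Done():
			// One minute elapsed without a successful join.
			return ctx.Err()
		case <-time.After(time.Second):
			// dqlite was mid role-change; try again.
		}
	}
}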

@masnax (Contributor Author) commented Nov 29, 2023

This also means that we'll need to look at recovering from cluster join failures without needing to nuke the joiner node.

@tomponline (Member) commented Nov 29, 2023

> This also means that we'll need to look at recovering from cluster join failures without needing to nuke the joiner node.

Could you clarify whether you mean that because of this PR we need to add support for recovering from cluster join failures, or that in addition to this PR we need to look into that?

// clusterBusyError is returned by dqlite when attempting to join a cluster at the same time as a role change.
// This error tells us we can retry and probably join the cluster, or fail with some other error.
// The underlying error code is sqlite3.ErrBusy.
var clusterBusyError = fmt.Errorf("a configuration change is already in progress (5)")
Member:

Is this a go-dqlite or underlying dqlite error? If so, we should use type detection (possibly with dqlite error code matching) rather than string matching.

There's an example of this in LXD already; I'll dig it out for you.

Contributor Author:

It's from dqlite.

Contributor Author:

Well, if it helps, it's a protocol.ErrRequest from go-dqlite, but the error message and code are from dqlite.

Member:

Something like this:

// Insert a new Storage Bucket record.
result, err := tx.tx.Exec(`
INSERT INTO storage_buckets
(storage_pool_id, node_id, name, description, project_id)
VALUES (?, ?, ?, ?, (SELECT id FROM projects WHERE name = ?))
`, poolID, nodeID, info.Name, info.Description, projectName)
if err != nil {
	var dqliteErr dqliteDriver.Error
	// Detect SQLITE_CONSTRAINT_UNIQUE (2067) errors.
	if errors.As(err, &dqliteErr) && dqliteErr.Code == 2067 {
		return api.StatusErrorf(http.StatusConflict, "A bucket for that name already exists")
	}

	return err
}

From https://github.com/canonical/lxd/blob/main/lxd/db/storage_buckets.go#L304-L312

Contributor Author:

Your hint led me to find:

// IsRetriableError returns true if the given error might be transient and the
// interaction can be safely retried.
func IsRetriableError(err error) bool {
	var dErr *driver.Error
	if errors.As(err, &dErr) && dErr.Code == driver.ErrBusy {
		return true
	}
	if errors.Is(err, sqlite3.ErrLocked) || errors.Is(err, sqlite3.ErrBusy) {
		return true
	}
	// Unwrap errors one at a time.
	for ; err != nil; err = errors.Unwrap(err) {
		if strings.Contains(err.Error(), "database is locked") {
			return true
		}
		if strings.Contains(err.Error(), "cannot start a transaction within a transaction") {
			return true
		}
		if strings.Contains(err.Error(), "bad connection") {
			return true
		}
		if strings.Contains(err.Error(), "checkpoint in progress") {
			return true
		}
	}
	return false
}
which looks like it handles several retriable errors from dqlite, including the one we want in this case, ErrBusy.
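
If that helper fits, the join attempt could rely on it instead of a hard-coded message; a rough sketch, assuming IsRetriableError is reachable from the join path (e.g. via LXD's query package) and join is a hypothetical stand-in for the join call:

err := join(ctx)
if err != nil && query.IsRetriableError(err) {
	// dqlite reported a transient condition (e.g. ErrBusy during a
	// role change), so wait a second and attempt the join again.
}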

@tomponline (Member) commented Dec 4, 2023:

Not till tests run :)

Member:

I wonder if @MathieuBordere or @cole-miller could help with accessing this error.

Contributor Author:

Reverting this to use the hard-coded error message for now.

Contributor:

Sorry for the delay in responding. Could we make this more convenient for LXD by making sure a driver.Error is returned here?
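
If go-dqlite did surface this as a driver.Error, the LXD side could then match on the error code rather than the message, mirroring the IsRetriableError pattern above; a hypothetical sketch:

var dErr *driver.Error
if errors.As(err, &dErr) && dErr.Code == driver.ErrBusy {
	// dqlite was mid role-change (SQLITE_BUSY); the join can be retried.
}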

@tomponline (Member) left a comment:

Good spot, thanks!

@masnax (Contributor Author) commented Nov 29, 2023

> This also means that we'll need to look at recovering from cluster join failures without needing to nuke the joiner node.

> Could you clarify whether you mean that because of this PR we need to add support for recovering from cluster join failures, or that in addition to this PR we need to look into that?

The latter. Basically what I learned is that if LXD errors out at this point, before or after this PR, the node is left broken. The rest of the cluster should be fine, but a node that failed to join should revert back to a state where we can ask it to join again, and currently this is not the case.

@tomponline (Member) commented:

> The latter. Basically what I learned is that if LXD errors out at this point, before or after this PR, the node is left broken. The rest of the cluster should be fine, but a node that failed to join should revert back to a state where we can ask it to join again, and currently this is not the case.

Thanks, please can you open a GH issue for this.

@tomponline (Member) left a comment:

Let's detect the error code directly rather than matching on a string.

@masnax force-pushed the fix-dqlite branch 2 times, most recently from bd6fb14 to 3e78cd9 on December 4, 2023 at 21:38
@tomponline previously approved these changes Dec 4, 2023
@masnax marked this pull request as draft on December 4, 2023 at 21:46
@masnax marked this pull request as ready for review on December 4, 2023 at 22:17
@tomponline (Member) commented:
@masnax please rebase

When lots of nodes are joining the dqlite cluster concurrently, there's
a good chance that some of these nodes will reach out to the leader in
the middle of a role change. Dqlite reports that a role change has
occurred with a SQLITE_BUSY error, from which we can infer that we
should keep trying to join the cluster until it either succeeds or we
receive some other error.

Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>