Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix LXD lock-up on concurrent cluster joins. #12571

Merged
merged 1 commit into from
Dec 5, 2023
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
29 changes: 26 additions & 3 deletions lxd/cluster/membership.go
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,11 @@ import (
"github.com/canonical/lxd/shared/version"
)

// clusterBusyError is returned by dqlite if attempting attempting to join a cluster at the same time as a role-change.
// This error tells us we can retry and probably join the cluster or fail due to something else.
// The error code here is SQLITE_BUSY.
var clusterBusyError = fmt.Errorf("a configuration change is already in progress (5)")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a go-dqlite or underlying dqlite error?
If so we sglukd use type detection (possibly with dqlite error code matching) rather than string matching.

There's an example of this in lxd already I'll dig out for you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's from dqlite.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, if it helps its a protocol.ErrRequest from go-dqlite but the error message and code are from dqlite

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like this:

// Insert a new Storage Bucket record.
		result, err := tx.tx.Exec(`
		INSERT INTO storage_buckets
		(storage_pool_id, node_id, name, description, project_id)
		VALUES (?, ?, ?, ?, (SELECT id FROM projects WHERE name = ?))
		`, poolID, nodeID, info.Name, info.Description, projectName)
		if err != nil {
			var dqliteErr dqliteDriver.Error
			// Detect SQLITE_CONSTRAINT_UNIQUE (2067) errors.
			if errors.As(err, &dqliteErr) && dqliteErr.Code == 2067 {
				return api.StatusErrorf(http.StatusConflict, "A bucket for that name already exists")
			}

			return err
		}

From https://github.com/canonical/lxd/blob/main/lxd/db/storage_buckets.go#L304-L312

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your hint led me to find

// IsRetriableError returns true if the given error might be transient and the
// interaction can be safely retried.
func IsRetriableError(err error) bool {
var dErr *driver.Error
if errors.As(err, &dErr) && dErr.Code == driver.ErrBusy {
return true
}
if errors.Is(err, sqlite3.ErrLocked) || errors.Is(err, sqlite3.ErrBusy) {
return true
}
// Unwrap errors one at a time.
for ; err != nil; err = errors.Unwrap(err) {
if strings.Contains(err.Error(), "database is locked") {
return true
}
if strings.Contains(err.Error(), "cannot start a transaction within a transaction") {
return true
}
if strings.Contains(err.Error(), "bad connection") {
return true
}
if strings.Contains(err.Error(), "checkpoint in progress") {
return true
}
}
return false
}
which looks like it handles several retriable errors from dqlite, including the one we want in this caseErrBusy.

Copy link
Member

@tomponline tomponline Dec 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not till tests run :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if @MathieuBordere or @cole-miller could help with accessing this error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverting this to use the hard-coded error message for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay in responding. Could we make this more convenient for LXD by making sure a driver.Error is returned here?


// Bootstrap turns a non-clustered LXD instance into the first (and leader)
// node of a new LXD cluster.
//
Expand Down Expand Up @@ -431,9 +436,27 @@ func Join(state *state.State, gateway *Gateway, networkCert *shared.CertInfo, se
logger.Info("Adding node to cluster", logger.Ctx{"id": info.ID, "local": info.Address, "role": info.Role})
ctx, cancel = context.WithTimeout(context.Background(), time.Minute)
defer cancel()
err = client.Add(ctx, info.NodeInfo)
if err != nil {
return fmt.Errorf("Failed to join cluster: %w", err)

// Repeatedly try to join in case the cluster is busy with a role-change.
joined := false
for !joined {
select {
case <-ctx.Done():
return fmt.Errorf("Failed to join cluster: %w", ctx.Err())
default:
err = client.Add(ctx, info.NodeInfo)
if err != nil && err.Error() == clusterBusyError.Error() {
// If the cluster is busy with a role change, sleep a second and then keep trying to join.
time.Sleep(1 * time.Second)
continue
}

if err != nil {
return fmt.Errorf("Failed to join cluster: %w", err)
}

joined = true
}
}

// Make sure we can actually connect to the cluster database through
Expand Down
Loading