
Fix LXD lock-up on concurrent cluster joins. #12571

Merged 1 commit on Dec 5, 2023

Conversation

@masnax (Contributor) commented Nov 29, 2023

When lots of nodes are joining the dqlite cluster concurrently, there's a good chance that some of them will reach out to the leader in the middle of a role change. go-dqlite then fails to add the node to the cluster, and LXD doesn't recover cleanly from this: LXD on the joining node locks up entirely, and remains unrecoverable even after a restart.

go-dqlite errors out here because the underlying request to dqlite reports a SQLITE_BUSY error with the message "a configuration change is already in progress". We can infer from this error (and from go-dqlite's own handling of cluster joins) that we should keep trying to join the cluster until the join either succeeds or fails with some other error.

To that end, this PR adds a clusterBusyError sentinel error that is checked whenever an attempt to join the cluster fails. We keep retrying every second, for up to one minute, with the existing context, until the join succeeds or fails with a more terminal error.
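
As an illustrative sketch of that loop (joinCluster stands in for LXD's actual join call; only the sentinel check and the one-second/one-minute cadence come from this PR):

// joinWithRetry retries the join every second for up to one minute
// while dqlite keeps reporting a concurrent configuration change.
// Illustrative only; not the actual LXD implementation.
func joinWithRetry(ctx context.Context, joinCluster func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, time.Minute)
	defer cancel()

	for {
		err := joinCluster(ctx)
		if err == nil {
			return nil
		}

		// Any error other than the sentinel is treated as terminal.
		if err.Error() != clusterBusyError.Error() {
			return err
		}

		select {
		case <-ctx.Done():
			// One minute elapsed without a successful join.
			return ctx.Err()
		case <-time.After(time.Second):
			// dqlite was mid role-change; try again.
		}
	}
}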

@masnax (Contributor Author) commented Nov 29, 2023

This also means that we'll need to look at recovering from cluster join failures without needing to nuke the joiner node.

@tomponline (Member) commented Nov 29, 2023

> This also means that we'll need to look at recovering from cluster join failures without needing to nuke the joiner node.

Could you clarify whether you mean that because of this PR we need to add support for recovering from cluster join failures, or that in addition to this PR we need to look into that?

// clusterBusyError is returned by dqlite when attempting to join a cluster at the same time as a role change.
// This error tells us we can retry and probably join the cluster, or fail with some other error.
// The underlying error code is sqlite3.ErrBusy.
var clusterBusyError = fmt.Errorf("a configuration change is already in progress (5)")
Member:

Is this a go-dqlite or underlying dqlite error? If so, we should use type detection (possibly with dqlite error code matching) rather than string matching.

There's an example of this in LXD already; I'll dig it out for you.

Contributor Author:

It's from dqlite.

Contributor Author:

Well, if it helps, it's a protocol.ErrRequest from go-dqlite, but the error message and code are from dqlite.

Member:

Something like this:

// Insert a new Storage Bucket record.
result, err := tx.tx.Exec(`
INSERT INTO storage_buckets
(storage_pool_id, node_id, name, description, project_id)
VALUES (?, ?, ?, ?, (SELECT id FROM projects WHERE name = ?))
`, poolID, nodeID, info.Name, info.Description, projectName)
if err != nil {
	var dqliteErr dqliteDriver.Error
	// Detect SQLITE_CONSTRAINT_UNIQUE (2067) errors.
	if errors.As(err, &dqliteErr) && dqliteErr.Code == 2067 {
		return api.StatusErrorf(http.StatusConflict, "A bucket for that name already exists")
	}

	return err
}

From https://github.com/canonical/lxd/blob/main/lxd/db/storage_buckets.go#L304-L312

Contributor Author:

Your hint led me to find:

// IsRetriableError returns true if the given error might be transient and the
// interaction can be safely retried.
func IsRetriableError(err error) bool {
	var dErr *driver.Error
	if errors.As(err, &dErr) && dErr.Code == driver.ErrBusy {
		return true
	}
	if errors.Is(err, sqlite3.ErrLocked) || errors.Is(err, sqlite3.ErrBusy) {
		return true
	}
	// Unwrap errors one at a time.
	for ; err != nil; err = errors.Unwrap(err) {
		if strings.Contains(err.Error(), "database is locked") {
			return true
		}
		if strings.Contains(err.Error(), "cannot start a transaction within a transaction") {
			return true
		}
		if strings.Contains(err.Error(), "bad connection") {
			return true
		}
		if strings.Contains(err.Error(), "checkpoint in progress") {
			return true
		}
	}
	return false
}
which looks like it handles several retriable errors from dqlite, including the one we want in this case, ErrBusy.
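
If that helper fits, the join attempt could rely on it instead of a hard-coded message; a rough sketch, assuming IsRetriableError is reachable from the join path (e.g. via LXD's query package) and join is a hypothetical stand-in for the join call:

err := join(ctx)
if err != nil && query.IsRetriableError(err) {
	// dqlite reported a transient condition (e.g. ErrBusy during a
	// role change), so wait a second and attempt the join again.
}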

@tomponline (Member) commented Dec 4, 2023:

Not till tests run :)

Member:

I wonder if @MathieuBordere or @cole-miller could help with accessing this error.

Contributor Author:

Reverting this to use the hard-coded error message for now.

Contributor:

Sorry for the delay in responding. Could we make this more convenient for LXD by making sure a driver.Error is returned here?
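
If go-dqlite did surface this as a driver.Error, the LXD side could then match on the error code rather than the message, mirroring the IsRetriableError pattern above; a hypothetical sketch:

var dErr *driver.Error
if errors.As(err, &dErr) && dErr.Code == driver.ErrBusy {
	// dqlite was mid role-change (SQLITE_BUSY); the join can be retried.
}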

@tomponline (Member) left a comment:

Good spot, thanks!

@masnax (Contributor Author) commented Nov 29, 2023

> This also means that we'll need to look at recovering from cluster join failures without needing to nuke the joiner node.

> Could you clarify whether you mean that because of this PR we need to add support for recovering from cluster join failures, or that in addition to this PR we need to look into that?

The latter. Basically what I learned is that if LXD errors out at this point, before or after this PR, the node is left broken. The rest of the cluster should be fine, but a node that failed to join should revert back to a state where we can ask it to join again, and currently this is not the case.

@tomponline (Member) commented:

> The latter. Basically what I learned is that if LXD errors out at this point, before or after this PR, the node is left broken. The rest of the cluster should be fine, but a node that failed to join should revert back to a state where we can ask it to join again, and currently this is not the case.

Thanks, please can you open a GH issue for this.

@tomponline (Member) left a comment:

Let's detect the error code directly rather than matching on a string.

@masnax force-pushed the fix-dqlite branch 2 times, most recently from bd6fb14 to 3e78cd9 on December 4, 2023 at 21:38
@tomponline previously approved these changes Dec 4, 2023
@masnax marked this pull request as draft on December 4, 2023 at 21:46
@masnax marked this pull request as ready for review on December 4, 2023 at 22:17
@tomponline (Member) commented:
@masnax please rebase

When lots of nodes are joining the dqlite cluster concurrently, there's
a good chance that some of these nodes will reach out to the leader in
the middle of a role change. Dqlite reports that a role change has
occurred with a SQLITE_BUSY error, from which we can infer that we
should keep trying to join the cluster until it either succeeds or we
receive some other error.

Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>