-
Notifications
You must be signed in to change notification settings - Fork 930
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix LXD lock-up on concurrent cluster joins. #12571
Conversation
This also means that we'll need to look at recovering from cluster join failures without needing to nuke the joiner node. |
Could you clarify whether you mean that because of this PR it means we need to add support for recovering from cluster join failure, or rather that in addition to this PR we need to look into that? |
// clusterBusyError is returned by dqlite if attempting attempting to join a cluster at the same time as a role-change. | ||
// This error tells us we can retry and probably join the cluster or fail due to something else. | ||
// The error code here is sqlite3.ErrBusy. | ||
var clusterBusyError = fmt.Errorf("a configuration change is already in progress (5)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a go-dqlite or underlying dqlite error?
If so we sglukd use type detection (possibly with dqlite error code matching) rather than string matching.
There's an example of this in lxd already I'll dig out for you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's from dqlite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, if it helps its a protocol.ErrRequest
from go-dqlite
but the error message and code are from dqlite
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like this:
// Insert a new Storage Bucket record.
result, err := tx.tx.Exec(`
INSERT INTO storage_buckets
(storage_pool_id, node_id, name, description, project_id)
VALUES (?, ?, ?, ?, (SELECT id FROM projects WHERE name = ?))
`, poolID, nodeID, info.Name, info.Description, projectName)
if err != nil {
var dqliteErr dqliteDriver.Error
// Detect SQLITE_CONSTRAINT_UNIQUE (2067) errors.
if errors.As(err, &dqliteErr) && dqliteErr.Code == 2067 {
return api.StatusErrorf(http.StatusConflict, "A bucket for that name already exists")
}
return err
}
From https://github.com/canonical/lxd/blob/main/lxd/db/storage_buckets.go#L304-L312
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your hint led me to find
Lines 55 to 88 in fdc3dc6
// IsRetriableError returns true if the given error might be transient and the | |
// interaction can be safely retried. | |
func IsRetriableError(err error) bool { | |
var dErr *driver.Error | |
if errors.As(err, &dErr) && dErr.Code == driver.ErrBusy { | |
return true | |
} | |
if errors.Is(err, sqlite3.ErrLocked) || errors.Is(err, sqlite3.ErrBusy) { | |
return true | |
} | |
// Unwrap errors one at a time. | |
for ; err != nil; err = errors.Unwrap(err) { | |
if strings.Contains(err.Error(), "database is locked") { | |
return true | |
} | |
if strings.Contains(err.Error(), "cannot start a transaction within a transaction") { | |
return true | |
} | |
if strings.Contains(err.Error(), "bad connection") { | |
return true | |
} | |
if strings.Contains(err.Error(), "checkpoint in progress") { | |
return true | |
} | |
} | |
return false | |
} |
ErrBusy
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not till tests run :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if @MathieuBordere or @cole-miller could help with accessing this error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok so the issue is that packages from https://github.com/canonical/go-dqlite/blob/4dfb00574f2633c7153e4d8e622016510448f4c0/driver/driver.go#L812 will be cast as a driver.Error
but in this case we're dealing with an error from a client function here https://github.com/canonical/go-dqlite/blob/4dfb00574f2633c7153e4d8e622016510448f4c0/client/client.go#L171 which is left as the internal protocol.ErrRequest
type defined here https://github.com/canonical/go-dqlite/blob/4dfb00574f2633c7153e4d8e622016510448f4c0/internal/protocol/errors.go#L18
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverting this to use the hard-coded error message for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay in responding. Could we make this more convenient for LXD by making sure a driver.Error
is returned here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good spot thanks!
The latter. Basically what I learned is that if LXD errors out at this point, before or after this PR, the node is left broken. The rest of the cluster should be fine, but a node that failed to join should revert back to a state where we can ask it to join again, and currently this is not the case. |
Thanks, please can you open a GH issue for this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets detect the error code directly rather than using a string.
bd6fb14
to
3e78cd9
Compare
@masnax please rebase |
When lots of nodes are joining the dqlite cluster concurrently, there's a good chance that some of these nodes will reach out to the leader in the middle of a role-change. Dqlite reports that a role change has occcurred with a SQLITE_BUSY error from which we can infer that we should keep trying to join the cluster until it either succeeds or we receive some other error. Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
When lots of nodes are joining the dqlite cluster concurrently, there's a good chance that some of these nodes will reach out to the leader in the middle of a role-change.
go-dqlite
will fail to add the node to the cluster, and LXD doesn't recover cleanly from this, resulting in LXD on the joining node basically fully locking up, and being unrecoverable on a restart as well.go-dqlite errors out here because the underlying request to dqlite is reporting a
SQLITE_BUSY
error with the messagea configuration change is already in progress
. We can infer from this error (and alsogo-dqlite
's own handling of cluster joins) that we should keep trying to join the cluster until it either succeeds or we receive some other error.To that end, this PR adds a
clusterBusyError
sentinel error that is checked when an attempt to join the cluster fails. We will keep retrying every 1 second for 1 minute with the existing context until the join succeeds, or errors out with a more terminal error.