-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make connection killing resilient to MySQL hangs #14500
Conversation
Signed-off-by: Arthur Schreiber <arthurschreiber@github.com>
Signed-off-by: Arthur Schreiber <arthurschreiber@github.com>
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
Hello! 👋 This Pull Request is now handled by arewefastyet. The current HEAD and future commits will be benchmarked. You can find the performance comparison on the arewefastyet website. |
resultChan := make(chan *sqltypes.Result, 1) | ||
errChan := make(chan error, 1) | ||
|
||
if done != nil { | ||
close(done) | ||
wg.Wait() | ||
startTime := time.Now() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
making channel on runtime for every call seems expensive to me.
I have added Benchmark me
label to it as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is no worse than the current code, which already creates a channel and a waitgroup. The current code, however, is very wasteful. So I think we can improve this to make this even less costly! Let me try...
@arthurschreiber: can you apply 996e652 to this PR? It significantly reduces all allocations we perform and should make us faster than the previous code, while still being correct |
@vmg Thanks! I'm on it. 🙇♂️ |
@vmg Just for my understanding, the improvements come from not copying the current query string, and by creating one channel instead of two? |
Yeah, the current string is obviously unrelated to your PR, I just noticed that we were using The changes in the actual connection path boil down to always creating a single channel instead of two. I also noticed we were fetching the current time twice (one for the metrics and one for the potential There are ways to make this zero allocation but I don't think we should attempt that for now. |
Signed-off-by: Vicent Marti <vmg@strn.cat>
I worked on tests to fully cover this code, but the tests are using |
This PR is being marked as stale because it has been open for 30 days with no activity. To rectify, you may do any of the following:
If no action is taken within 7 days, this PR will be closed. |
This PR was closed because it has been stale for 7 days with no activity. |
Signed-off-by: Daniel Joos <danieljoos@github.com>
38b8018
to
ee52f63
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Thanks for adding the tests!
I think the error in CI is related to the changes here, but looking at the test, it may just not be properly written/testing the right thing. Can you give it a look @arthurschreiber? |
@vmg I kicked the tests off again. I could not see any relationship between the failures and the changes in this PR - did I miss something? 🤔 |
Oh, there is a different set of CI statuses on the commit and on the PR? Weird. I'm taking a look. |
Ok, so from what I understand the failures in the |
…/fix-conn-kill-hang Signed-off-by: Arthur Schreiber <arthurschreiber@github.com>
Signed-off-by: Arthur Schreiber <arthurschreiber@github.com>
I think I figured it out. The test case had an implicit dependency on the "incorrect" behaviour of I fixed it by making all 3 queries / transactions wait explicitly for the |
Looks great! I figured the test was making wrong assumptions. Thanks again @arthurschreiber |
To be backported to v19? |
Signed-off-by: Arthur Schreiber <arthurschreiber@github.com> Signed-off-by: Vicent Marti <vmg@strn.cat> Signed-off-by: Daniel Joos <danieljoos@github.com> Co-authored-by: Vicent Marti <vmg@strn.cat> Co-authored-by: Daniel Joos <danieljoos@github.com>
Sorry we missed this. The release is going out today. In future the way to get an auto-backport is to add a backport label before the PR is merged. I realize you don't have the permissions to do that, but @arthurschreiber does. |
We can probs just add the label to my 2nd PR, if it's wanted in 18.0 or 17.0. I don't personally care to see this backported earlier than 19.0. ;) |
Signed-off-by: Arthur Schreiber <arthurschreiber@github.com> Signed-off-by: Vicent Marti <vmg@strn.cat> Signed-off-by: Daniel Joos <danieljoos@github.com> Co-authored-by: Vicent Marti <vmg@strn.cat> Co-authored-by: Daniel Joos <danieljoos@github.com>
Make connection killing resilient to MySQL hangs (vitessio#14500)
Mh. I hoped adding backport labels will open backport PRs, but it doesn't look like that's working. :/ I'll open the backport PRs manually then. |
I don't think this was ever backported to v19 either? |
So I'm sure sure the vitess team will accept a v17 or v18 backport. My v19 backport was rejected: #15411. |
Description
This fixes some issues around
*Conn.Kill
that can be encountered when MySQL "hangs" (e.g. either by sending it aSIGSTOP
signal or when MySQL is writing a core dump file after a crash).Kill
has no context / timeout support at all. This means that when MySQL hangs, and some other code tries to send theKILL <id>
statement to MySQL (e.g. due tosetDeadline
or because the code tries to callKill
directly), theKill
function is just going to hang as well.I added a new function
KillWithContext
that allows passing in aContext
object and correctly honors cancellation. I replacedsetDeadline
with direct calls toKillWithContext
, with a 5 second timeout. I don't think a 5 second timeout actually makes any sense - butsetDeadline
intended to use a timeout of2*elapsed + 5*time.Second
(which never actually worked). BecauseKILL <id>
on MySQL is supposed to return instantly, I think a much lower timeout would be sufficient.In case the context passed to
KillWithContext
is canceled / timed out, there's nothing we can do - the client side connection on which we want to kill the executing statement was already closed anyway, and sending theKILL <id>
statement is more or less best effort. So we just close the kill connection and return.TODO
Kill
and either switch them over to pass aContext
to higher level methods likeExec
, or switch them to useKillWithContext
instead.Related Issue(s)
Checklist
Deployment Notes