execinfra: add retries to outbox DialNoBreaker attempts #52624
Conversation
Do you know why the first connection attempt fails and the 2nd succeeds? If this is happening after n1 restarted, I'm thinking maybe it's about races with the circuit breaker on n2 that still thinks n1 is down? We've talked in the past about using process ids as epochs for purposes of connection establishment - so something like n1 communicating its new process id to n2, which would then reset the circuit breaker, or generally reset all its connections.
What I'm saying is #44101
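(For illustration only, a minimal Go sketch of the process-id-as-epoch idea mentioned above; the names here — breakerRegistry, observeEpoch — are hypothetical and do not exist in the codebase. The point is just that a changed remote process epoch would trigger a breaker reset.)

```go
// Purely illustrative sketch of the "process id as epoch" idea; none of
// these names are part of the actual codebase.
package main

import (
	"fmt"
	"sync"
)

// breakerRegistry tracks the last-seen process epoch per node so that the
// node's circuit breaker can be reset whenever the epoch changes (i.e. the
// remote node restarted with a new process).
type breakerRegistry struct {
	mu     sync.Mutex
	epochs map[string]int64 // nodeID -> last seen process epoch
}

func newBreakerRegistry() *breakerRegistry {
	return &breakerRegistry{epochs: make(map[string]int64)}
}

// observeEpoch returns true when the breaker for nodeID should be reset
// because the remote process reported a new epoch.
func (r *breakerRegistry) observeEpoch(nodeID string, epoch int64) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if prev, ok := r.epochs[nodeID]; ok && prev == epoch {
		return false
	}
	r.epochs[nodeID] = epoch
	return true
}

func main() {
	r := newBreakerRegistry()
	fmt.Println(r.observeEpoch("n1", 100)) // true: first contact, reset breaker
	fmt.Println(r.observeEpoch("n1", 100)) // false: same process, keep breaker state
	fmt.Println(r.observeEpoch("n1", 101)) // true: n1 restarted, reset breaker
}
```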
No, but I see a |
The "no breaker" part of |
You're right, there's still some breaker business in |
LGTM but I'll defer to Andrei.
Reviewed 7 of 7 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)
I think this PR is good to have regardless, since we shouldn't be making only one dial attempt for such a critical part of query execution.
I don't think I like this PR, but I don't feel very strongly so feel free to go on without me.
Whether or not I like this PR taken in isolation I think comes down, for me, to whether it helps badness that can be observed under nice restarts (SIGQUIT) versus brutal ones (SIGKILL). The test in #50000 does roachprod stop, which is SIGKILL (by default). I don't think I buy that "we shouldn't be making only one dial attempt for such a critical part of query execution", since we generally fail DistSQL queries on network errors; there's generally no tolerance for them, so why should there be any during flow setup? One possible answer is that flow setup catches errors from a time window before the query, whereas connection errors at other times during the query catch failures that actually happened during the query (so the former case would be much more likely because it covers more time). That's where the SIGQUIT/SIGKILL (drain/no drain) distinction comes in: if these errors happen even when we drain nodes (I hope they wouldn't), then I'd buy it.
But what I really want is to not take this in isolation and instead see if we can address #44101 more systematically. I'm afraid that papering over it with this patch (if you agree with that framing) would make it less likely that a better fix ever happens.
I'm aware that I myself have failed to try to solve #44101 when I implemented #44102 (which disables distribution for startup migrations), but I sleep at night telling myself that that patch has more arguments going for it to stay in place, regardless of a fix for #44101, than this one does. It could be a case of authorship bias, though :)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)
I think we might as well do it in this case because we have a "free" 10-second window in which the server on the other side will be waiting for a connection attempt anyway, so why not take advantage of it if it means fewer query failures? I don't think the fact that we're not resilient to network failures during other parts of query setup should mean that we refuse to add retries in this case. I would also argue that this network failure is a lot more annoying than a network failure during another part of query setup. Sure, a fix to #44101 could address this, but why not be defensive?
I agree that we need a systematic fix for #44101, but it's been seven months since that issue was created and I'm not sure what the priority of a fix would be even without this patch; I don't think it's high. It seems that a fix is already unlikely, so I prefer making the situation better now rather than waiting around for that to happen.
bors r=yuzefovich
Build succeeded: |
This connection attempt is retried since failure to connect to an inbox results
in a query error. In the past, we have seen cases where a gateway node, n1,
would send a flow request to n2, but n2 would be unable to connect back to n1.
Retrying these connection attempts alleviates these flakes and has no impact
on the end user, since the receiver at the other end will hang for
SettingFlowStreamTimeout waiting for a successful connection attempt.
Release note (sql change): "no inbound stream connection" errors should happen
less frequently due to the addition of connection retries.
Fixes #50000
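For reference, a minimal Go sketch of the retry shape described above, under the assumption of a generic dial function; dialWithRetries, the attempt count, and the backoff values are hypothetical placeholders rather than the actual outbox code. The key property is that a handful of quick retries stays well within the window during which the receiver waits for the inbound connection.

```go
// A minimal sketch of the retry idea, not the actual outbox code: the dial
// function, attempt count, and backoff below are hypothetical placeholders.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// dialWithRetries retries dialFn a bounded number of times with a short
// backoff between attempts, so a transient connection failure (e.g. right
// after a node restart) does not immediately fail the query.
func dialWithRetries(
	ctx context.Context, dialFn func(context.Context) error, attempts int, backoff time.Duration,
) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = dialFn(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
	}
	return fmt.Errorf("exhausted %d dial attempts: %w", attempts, err)
}

func main() {
	// Simulate a dial that fails once and then succeeds.
	calls := 0
	dial := func(context.Context) error {
		calls++
		if calls == 1 {
			return errors.New("connection refused")
		}
		return nil
	}
	err := dialWithRetries(context.Background(), dial, 3, 100*time.Millisecond)
	fmt.Println("dial result:", err, "after", calls, "attempts")
}
```

With, say, 3 attempts and a sub-second backoff, the total retry budget remains far below the 10-second flow-stream timeout mentioned in the discussion, which is why the retries are effectively "free" from the receiver's point of view.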