execinfra: add retries to outbox DialNoBreaker attempts #52624

Merged (1 commit) on Aug 12, 2020
Conversation

@asubiotto (Contributor)

This connection attempt is now retried, since failure to connect to an inbox results
in a query error. In the past, we have seen cases where a gateway node, n1,
would send a flow request to n2, but n2 would be unable to connect back to n1.
Retrying these connection attempts alleviates those flakes and has no impact
on the end user, since the receiver at the other end hangs for
SettingFlowStreamTimeout waiting for a successful connection attempt.

Release note (sql change): "no inbound stream connection" errors should happen
less frequently due to the addition of connection retries.

Fixes #50000 (roachtest: acceptance/version-upgrade failed because distsql is unexpectedly used).
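
As a rough illustration of what the commit message describes, here is a minimal sketch of wrapping a single outbox dial in a bounded retry loop. This is not the PR's actual code: the package, the dialNoBreaker placeholder, connectOutbound, and the attempt/backoff constants are all assumptions made for this example.

```go
package outboxretry

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// dialNoBreaker stands in for the real nodedialer call; it is a placeholder
// for this sketch only and always fails here.
func dialNoBreaker(ctx context.Context, addr string) (string, error) {
	return "", errors.New("connection refused")
}

// connectOutbound retries the dial a few times with a short pause between
// attempts. Failure to connect to the inbox fails the whole query, so a
// couple of extra attempts are cheap: the receiver is already willing to
// wait up to the flow stream timeout for the outbox to show up.
func connectOutbound(ctx context.Context, addr string) (string, error) {
	const maxAttempts = 3
	const backoff = 100 * time.Millisecond

	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		var conn string
		if conn, err = dialNoBreaker(ctx, addr); err == nil {
			return conn, nil
		}
		if attempt == maxAttempts {
			break // no point sleeping after the final attempt
		}
		select {
		case <-ctx.Done():
			return "", ctx.Err()
		case <-time.After(backoff):
		}
	}
	return "", fmt.Errorf("failed to dial after %d attempts: %w", maxAttempts, err)
}
```

A few quick attempts with a short pause is enough here because, as the commit message notes, the receiver is already prepared to wait for the stream to show up, so the extra attempts do not add user-visible latency on the failure path.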

@cockroach-teamcity (Member)

This change is Reviewable

@andreimatei (Contributor)

Do you know why the first connection attempt fails and the 2nd succeeds? If this is happening after n1 restarted, I'm thinking maybe it's about a race with the circuit breaker on n2 that still thinks n1 is down? We've talked in the past about using process ids as epochs for the purposes of connection establishment - so something like n1 communicating its new process id to n2, which would then reset the circuit breaker, or generally reset all its connections.

@andreimatei (Contributor)

What I'm saying is #44101

@asubiotto (Contributor, Author)

Do you know why the first connection attempt fails and the 2nd succeeds?

No, but I see a BreakerReset event a couple hundred milliseconds after the first connection attempt. The thing is that we don't use circuit breakers (we use DialNoBreaker), so any connection issue is probably (I think) a gRPC thing.

@andreimatei (Contributor)

The "no breaker" part of DialNoBreaker I think affects what happens if there's no connection open (or opening?) already, but if there is one then that connection is returned. My money is still on some sort of a race that would benefit from us knowing that the prior connection is no good because it was going to the wrong process (i.e. we started opening it before we found out about the new process).

@asubiotto (Contributor, Author)

You're right, there's still some breaker business in Dialer.dial. I also think we're probably incorrectly reusing some invalid connection. I think this PR is good to have regardless, since we shouldn't be making only one dial attempt for such a critical part of query execution.

@yuzefovich (Member) left a comment


LGTM but I'll defer to Andrei.

Reviewed 7 of 7 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)

@andreimatei (Contributor) left a comment


I think this PR is good to have regardless, since we shouldn't be making only one dial attempt for such a critical part of query execution.

I don't think I like this PR, but I don't feel very strongly, so feel free to go on without me.

Whether or not I like this PR taken in isolation comes down, for me, to whether it helps with badness that can be observed under graceful restarts (SIGQUIT) versus brutal ones (SIGKILL). The test in #50000 does roachprod stop, which is SIGKILL (by default). I don't think I buy that "we shouldn't be making only one dial attempt for such a critical part of query execution", since we generally fail DistSQL queries on network errors; there's generally no tolerance for them, so why should there be any during flow setup? One possible answer is that flow setup catches errors from a time window before the query, whereas connection errors at other times catch failures that actually happened during the query (so the former case would be much more likely because it covers more time). That's where the SIGQUIT/SIGKILL (drain/no drain) distinction comes in: if these errors happen even when we drain nodes (I hope they wouldn't), then I'd buy it.

But what I really want is to not take this in isolation and instead see if we can address #44101 more systematically. I'm afraid that papering over it with this patch (if you agree with that framing) would make it more unlikely for a better fix to happen.
I'm aware that I myself have failed to try to solve #44101 when I implemented #44102 (which disables distribution for startup migrations), but I sleep at night telling myself that that patch has more arguments in favor of staying in place regardless of a fix for #44101 than this one does. It could be a case of authorship bias, though :)

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)

@asubiotto (Contributor, Author)

I don't think I buy that "we shouldn't be making only one dial attempt for such a critical part of query execution", since we generally fail DistSQL queries on network errors; there's general no tolerance for them, so why should there be any during flow setup?

I think we might as well do it in this case because we have a "free" 10-second window in which the server on the other side will be waiting for a connection attempt anyway, so why not take advantage of it if it means fewer query failures? I don't think the fact that we're not resilient to network failures during other parts of query setup should mean that we refuse to add retries in this case. I would also argue that this network failure is a lot more annoying than a network failure during another part of query setup. Sure, a fix to #44101 could address this, but why not be defensive?
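
A sketch of the "free window" argument: cap the total time spent retrying at the same timeout the receiver is already waiting out, so retries cannot make a failing query any slower than a single attempt would. The constant, its 10-second value, and the function names below are assumptions for illustration only, not the real cluster-setting plumbing.

```go
package retrybudget

import (
	"context"
	"time"
)

// flowStreamTimeout mirrors the receiver-side wait (SettingFlowStreamTimeout
// in the discussion). The 10s value here is just the default mentioned in
// the thread, not read from the real cluster setting.
const flowStreamTimeout = 10 * time.Second

// dialWithBudget retries dial until it succeeds or the budget runs out. The
// budget is the same window the receiver already spends blocked waiting for
// the stream, so retrying does not delay the user-visible failure.
func dialWithBudget(ctx context.Context, dial func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, flowStreamTimeout)
	defer cancel()

	backoff := 50 * time.Millisecond
	for {
		err := dial(ctx)
		if err == nil {
			return nil
		}
		if ctx.Err() != nil {
			// Budget exhausted: give up with the last dial error.
			return err
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		backoff *= 2 // back off exponentially, still bounded by the deadline
	}
}
```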

But what I really want is to not take this in isolation and instead see if we can address #44101 more systematically. I'm afraid that papering over it with this patch (if you agree with that framing) would make it more unlikely for a better fix to happen.

I agree that we need a systematic fix for #44101, but it's been seven months since that issue was created, and I'm not sure what the priority of a fix is even without this patch (I don't think it's high). A fix already seems unlikely to happen soon, so I prefer making the situation better now rather than waiting around for it.

@asubiotto (Contributor, Author)

bors r=yuzefovich

@craig (craig bot) commented Aug 12, 2020

Build succeeded:
