execinfra: add retries to outbox DialNoBreaker attempts #52624
Conversation
Do you know why the first connection attempt fails and the 2nd succeeds? If this is happening after n1 restarted, I'm thinking maybe it's about races with the circuit breaker on n2 that still thinks n1 is down? We've talked in the past about using process ids as epochs for purposes of connection establishment - so something like n1 communicating its new process id to n2, which would then reset the circuit breaker, or generally reset all its connections.
What I'm saying is #44101
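(For illustration only, a minimal Go sketch of the process-id-as-epoch idea mentioned above; the names here — breakerRegistry, observeEpoch — are hypothetical and do not exist in the codebase. The point is just that a changed remote process epoch would trigger a breaker reset.)

```go
// Purely illustrative sketch of the "process id as epoch" idea; none of
// these names are part of the actual codebase.
package main

import (
	"fmt"
	"sync"
)

// breakerRegistry tracks the last-seen process epoch per node so that the
// node's circuit breaker can be reset whenever the epoch changes (i.e. the
// remote node restarted with a new process).
type breakerRegistry struct {
	mu     sync.Mutex
	epochs map[string]int64 // nodeID -> last seen process epoch
}

func newBreakerRegistry() *breakerRegistry {
	return &breakerRegistry{epochs: make(map[string]int64)}
}

// observeEpoch returns true when the breaker for nodeID should be reset
// because the remote process reported a new epoch.
func (r *breakerRegistry) observeEpoch(nodeID string, epoch int64) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if prev, ok := r.epochs[nodeID]; ok && prev == epoch {
		return false
	}
	r.epochs[nodeID] = epoch
	return true
}

func main() {
	r := newBreakerRegistry()
	fmt.Println(r.observeEpoch("n1", 100)) // true: first contact, reset breaker
	fmt.Println(r.observeEpoch("n1", 100)) // false: same process, keep breaker state
	fmt.Println(r.observeEpoch("n1", 101)) // true: n1 restarted, reset breaker
}
```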
No, but I see a |
The "no breaker" part of |
You're right, there's still some breaker business in |
LGTM but I'll defer to Andrei.
Reviewed 7 of 7 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)
I think this PR is good to have regardless, since we shouldn't be making only one dial attempt for such a critical part of query execution.
I don't think I like this PR, but I don't feel very strongly so feel free to go on without me.
Whether or not I like this PR taken in isolation I think comes down, for me, to whether it helps badness that can be observed under nice restarts (SIGQUIT) versus brutal ones (SIGKILL). The test in #50000 does roachprod stop, which is SIGKILL (by default). I don't think I buy that "we shouldn't be making only one dial attempt for such a critical part of query execution", since we generally fail DistSQL queries on network errors; there's generally no tolerance for them, so why should there be any during flow setup? One possible answer is that flow setup catches errors from a time window before the query, whereas connection errors at other times during the query catch failures that actually happened during the query (so the former case would be much more likely because it covers more time). That's where the SIGQUIT/SIGKILL (drain/no drain) distinction comes in: if these errors happen even when we drain nodes (I hope they wouldn't), then I'd buy it.
But what I really want is to not take this in isolation and instead see if we can address #44101 more systematically. I'm afraid that papering over it with this patch (if you agree with that framing) would make it less likely that a better fix ever happens.
I'm aware that I myself have failed to try to solve #44101 when I implemented #44102 (which disables distribution for startup migrations), but I sleep at night telling myself that that patch has more arguments going for it to stay in place, regardless of a fix for #44101, than this one does. It could be a case of authorship bias, though :)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)
I think we might as well do it in this case because we have a "free" 10-second window in which the server on the other side will be waiting for a connection attempt anyway, so why not take advantage of it if it means fewer query failures? I don't think the fact that we're not resilient to network failures during other parts of query setup should mean that we refuse to add retries in this case. I would also argue that this network failure is a lot more annoying than a network failure during another part of query setup. Sure, a fix to #44101 could address this, but why not be defensive?
I agree that we need a systematic fix for #44101, but it's been seven months since that issue was created and I'm not sure what the priority of a fix would be even without this patch; I don't think it's high. It seems that a fix is already unlikely, so I prefer making the situation better now rather than waiting around for that to happen.
bors r=yuzefovich
Build succeeded: |
This connection attempt is retried since failure to connect to an inbox results
in a query error. In the past, we have seen cases where a gateway node, n1,
would send a flow request to n2, but n2 would be unable to connect back to n1.
Retrying these connection attempts alleviates these flakes and has no impact
on the end user, since the receiver at the other end will hang for
SettingFlowStreamTimeout waiting for a successful connection attempt.
Release note (sql change): "no inbound stream connection" errors should happen
less frequently due to the addition of connection retries.
Fixes #50000
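For reference, a minimal Go sketch of the retry shape described above, under the assumption of a generic dial function; dialWithRetries, the attempt count, and the backoff values are hypothetical placeholders rather than the actual outbox code. The key property is that a handful of quick retries stays well within the window during which the receiver waits for the inbound connection.

```go
// A minimal sketch of the retry idea, not the actual outbox code: the dial
// function, attempt count, and backoff below are hypothetical placeholders.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// dialWithRetries retries dialFn a bounded number of times with a short
// backoff between attempts, so a transient connection failure (e.g. right
// after a node restart) does not immediately fail the query.
func dialWithRetries(
	ctx context.Context, dialFn func(context.Context) error, attempts int, backoff time.Duration,
) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = dialFn(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
	}
	return fmt.Errorf("exhausted %d dial attempts: %w", attempts, err)
}

func main() {
	// Simulate a dial that fails once and then succeeds.
	calls := 0
	dial := func(context.Context) error {
		calls++
		if calls == 1 {
			return errors.New("connection refused")
		}
		return nil
	}
	err := dialWithRetries(context.Background(), dial, 3, 100*time.Millisecond)
	fmt.Println("dial result:", err, "after", calls, "attempts")
}
```

With, say, 3 attempts and a sub-second backoff, the total retry budget remains far below the 10-second flow-stream timeout mentioned in the discussion, which is why the retries are effectively "free" from the receiver's point of view.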