Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rpc: FlowStream error to recently restarted node #87634

Open
tbg opened this issue Sep 8, 2022 · 4 comments
Open

rpc: FlowStream error to recently restarted node #87634

tbg opened this issue Sep 8, 2022 · 4 comments
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team

Comments

@tbg
Copy link
Member

tbg commented Sep 8, 2022

Describe the problem

Extracted from an acceptance/version-upgrade failure1.
We're likely deflaking this test by inserting a sleep but should investigate at some point.

To Reproduce

Potentially by stressing version-upgrade at the SHA of 1

Expected behavior

When a node has signaled readiness, it ought to be able to serve the FlowStream RPC. It's unclear if the problem is in the "other node's" RPC layer (is it holding on and using a connection to the peer from before the restart, that hasn't yet been torn down?) or whether the node is somehow serving requests before it is truly ready to do so, resulting in transient errors.

Artifacts attached.

debug.zip
artifacts.zip

Note: putting this into Replication team because the KV area nominally owns RPC and I started looking into this.

Jira issue: CRDB-19450

Epic CRDB-39899

Footnotes

  1. https://github.com/cockroachdb/cockroach/issues/87104#issuecomment-1240906865 2

@tbg tbg added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv-replication labels Sep 8, 2022
@blathers-crl
Copy link

blathers-crl bot commented Sep 8, 2022

cc @cockroachdb/replication

@tbg
Copy link
Member Author

tbg commented Sep 8, 2022

I think it would be helpful to reproduce this with the env var GRPC_GO_LOG_SEVERITY_LEVEL=info (or changing this1 to info).

But I think this has to do with how the connection pool works. Not that it should work that way, but I think somehow what happens is that we have a connection in the pool (on n1, the not-restarted-node in the artifacts) that still looks "healthy" even though grpc's first dial has already hit an error (guessing: connection refused). The conn is then handed out here and survives the error check:

conn, err := execinfra.GetConnForOutbox(ctx, dialer, sqlInstanceID, connectionTimeout)
if err != nil {
log.Warningf(
ctx,
"Outbox Dial connection error, distributed query will fail: %+v",
err,
)
return err
}

but when we actually try to make a FlowStream client and connect, we prompt a redial (or something like that) and run into the onlyOnceDialer (which is something that we use internally to make sure that gRPC always lets our rpc wrappers do a new connection; this has to do with version checks etc) here:

client := execinfrapb.NewDistSQLClient(conn)
// We use the flow context for the RPC so that when outbox context is
// canceled in case of a graceful shutdown, the gRPC stream keeps on
// running. If, however, the flow context is canceled, then the
// termination of the whole query is ungraceful, so we're ok with the
// gRPC stream being ungracefully shutdown too.
stream, err = client.FlowStream(flowCtx)
if err != nil {
log.Warningf(
ctx,
"Outbox FlowStream connection error, distributed query will fail: %+v",
err,
)
return err
}

This doesn't all obviously check out, since there is a health check when pulling the connection from the pool,

// Check to see if the connection is in the transient failure state. This can
// happen if the connection already existed, but a recent heartbeat has
// failed and we haven't yet torn down the connection.
if err := grpcutil.ConnectionReady(conn); err != nil {
err = errors.Wrapf(err, "failed to check for ready connection to n%d at %v", nodeID, addr)
if breaker != nil {
breaker.Fail(err)
}
return nil, err
}

and I don't know why we're getting past it.

It might be helpful to not try to reproduce this the full version-upgrade test, but simply a regular test where we repeatedly drain and instantly restart a node, then run a distributed query on it.

Or maybe even better, add some ad-hoc code to a build that verifies an invariant that we want: that once the restart of the down node is acked, we immediately no longer see any of these connection errors on the other nodes. (Needs some tricky to make it work: probably best to run in local mode, and to touch a magic file in the other nodes' store dirs right after the down node acked the restart, and check for that marker file when seeing the error on the other nodes).

Btw, I think this may have gotten more frequent because now we gracefully drain nodes in acceptance/version-upgrade. This means that there is no temporary unavailability in the cluster which could've masked these kinds of failures. I think we can paper over this for now with a sleep; don't think this is a new regression.

Footnotes

  1. https://github.com/cockroachdb/cockroach/blob/1af66356e77f516ede7cd56a3d7946873a590ef7/pkg/cli/start.go#L437

yuzefovich added a commit to yuzefovich/cockroach that referenced this issue Sep 9, 2022
We have seen cases where a transient error could occur when a
newly-upgraded node serves as a gateway for a distributed query due
to remote nodes not being able to dial back to the gateway for some
reason (investigation of it is tracked in cockroachdb#87634). For now, we're
papering over these flakes by 4 second sleep.

Release note: None
craig bot pushed a commit that referenced this issue Sep 12, 2022
87645: ui: fix txn insight query bug, align summary card, remove contended keys in details page r=ericharmeling a=ericharmeling

This commit fixes a small bug on the transaction insight details page
that was incorrectly mapping the waiting transaction statement
fingerprints to the blocking transaction statements. The commit also
aligns the summary cards in the details page. The commit also removes
the contended key from the details page while we look for a more user-
friendly format to display row contention.

Before:

![image](https://user-images.githubusercontent.com/27286675/189216476-8211d598-5d4e-4255-846f-82c785764016.png)


After:

![image](https://user-images.githubusercontent.com/27286675/189216006-f01edeb6-ab2f-42ac-9978-6fce85b9a79a.png)

Fixes #87838.

Release note: None
Release justification: bug fix

87715: roachtest: add 4s of sleep after restart when upgrading nodes r=yuzefovich a=yuzefovich

We have seen cases where a transient error could occur when a newly-upgraded node serves as a gateway for a distributed query due to remote nodes not being able to dial back to the gateway for some reason (investigation of it is tracked in #87634). For now, we're papering over these flakes by 4 second sleep.

Addresses: #87104.

Release note: None

87840: roachtest: do not generate division ops in costfuzz and unoptimized tests r=mgartner a=mgartner

The division (`/`) and floor division (`//`) operators were making costfuzz and unoptimized-query-oracle tests flaky. This commit disables generation of these operators as a temporary mitigation for these flakes.

Informs #86790

Release note: None

87854: kvcoord: reliably handle stuck watcher error r=erikgrinaker a=tbg

Front-ports parts of #87253.

When a rangefeed gets stuck, and the server is local, the server might
notice the cancellation before the client, and may send a cancellation
error back in a rangefeed event.

We now handle this the same as the other case (where the stream client
errors out due to the cancellation).

This also checks in the test from
#87253 (which is on
release-22.1).

Fixes #87370.

No release note since this will be backported to release-22.2
Release note: None


Co-authored-by: Eric Harmeling <eric.harmeling@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
blathers-crl bot pushed a commit that referenced this issue Sep 12, 2022
We have seen cases where a transient error could occur when a
newly-upgraded node serves as a gateway for a distributed query due
to remote nodes not being able to dial back to the gateway for some
reason (investigation of it is tracked in #87634). For now, we're
papering over these flakes by 4 second sleep.

Release note: None
yuzefovich added a commit that referenced this issue Sep 15, 2022
We have seen cases where a transient error could occur when a
newly-upgraded node serves as a gateway for a distributed query due
to remote nodes not being able to dial back to the gateway for some
reason (investigation of it is tracked in #87634). For now, we're
papering over these flakes by 4 second sleep.

Release note: None
@erikgrinaker erikgrinaker added T-kv KV Team and removed T-kv-replication labels Sep 28, 2022
@andreimatei
Copy link
Contributor

Is this the same as #44101 ?

@tbg
Copy link
Member Author

tbg commented Dec 20, 2022

Looks like it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team
Projects
None yet
Development

No branches or pull requests

3 participants