rpc: FlowStream error to recently restarted node #87634

tbg · 2022-09-08T18:23:43Z

Describe the problem

Extracted from an acceptance/version-upgrade failure¹.
We're likely deflaking this test by inserting a sleep but should investigate at some point.

To Reproduce

Potentially by stressing version-upgrade at the SHA of ¹

Expected behavior

When a node has signaled readiness, it ought to be able to serve the FlowStream RPC. It's unclear if the problem is in the "other node's" RPC layer (is it holding on and using a connection to the peer from before the restart, that hasn't yet been torn down?) or whether the node is somehow serving requests before it is truly ready to do so, resulting in transient errors.

Artifacts attached.

debug.zip
artifacts.zip

Note: putting this into Replication team because the KV area nominally owns RPC and I started looking into this.

Jira issue: CRDB-19450

Epic CRDB-39899

https://github.com/cockroachdb/cockroach/issues/87104#issuecomment-1240906865 ↩ ↩²

blathers-crl · 2022-09-08T18:23:45Z

cc @cockroachdb/replication

tbg · 2022-09-08T18:48:46Z

I think it would be helpful to reproduce this with the env var GRPC_GO_LOG_SEVERITY_LEVEL=info (or changing this¹ to info).

But I think this has to do with how the connection pool works. Not that it should work that way, but I think somehow what happens is that we have a connection in the pool (on n1, the not-restarted-node in the artifacts) that still looks "healthy" even though grpc's first dial has already hit an error (guessing: connection refused). The conn is then handed out here and survives the error check:

cockroach/pkg/sql/colflow/colrpc/outbox.go

Lines 176 to 184 in 2675c7c

    
           conn, err := execinfra.GetConnForOutbox(ctx, dialer, sqlInstanceID, connectionTimeout) 
        
           if err != nil { 
        
           	log.Warningf( 
        
           		ctx, 
        
           		"Outbox Dial connection error, distributed query will fail: %+v", 
        
           		err, 
        
           	) 
        
           	return err 
        
           }

but when we actually try to make a FlowStream client and connect, we prompt a redial (or something like that) and run into the onlyOnceDialer (which is something that we use internally to make sure that gRPC always lets our rpc wrappers do a new connection; this has to do with version checks etc) here:

cockroach/pkg/sql/colflow/colrpc/outbox.go

Lines 186 to 200 in 2675c7c

    
           client := execinfrapb.NewDistSQLClient(conn) 
        
           // We use the flow context for the RPC so that when outbox context is 
        
           // canceled in case of a graceful shutdown, the gRPC stream keeps on 
        
           // running. If, however, the flow context is canceled, then the 
        
           // termination of the whole query is ungraceful, so we're ok with the 
        
           // gRPC stream being ungracefully shutdown too. 
        
           stream, err = client.FlowStream(flowCtx) 
        
           if err != nil { 
        
           	log.Warningf( 
        
           		ctx, 
        
           		"Outbox FlowStream connection error, distributed query will fail: %+v", 
        
           		err, 
        
           	) 
        
           	return err 
        
           }

This doesn't all obviously check out, since there is a health check when pulling the connection from the pool,

cockroach/pkg/rpc/nodedialer/nodedialer.go

Lines 207 to 216 in 50c1196

    
           // Check to see if the connection is in the transient failure state. This can 
        
           // happen if the connection already existed, but a recent heartbeat has 
        
           // failed and we haven't yet torn down the connection. 
        
           if err := grpcutil.ConnectionReady(conn); err != nil { 
        
           	err = errors.Wrapf(err, "failed to check for ready connection to n%d at %v", nodeID, addr) 
        
           	if breaker != nil { 
        
           		breaker.Fail(err) 
        
           	} 
        
           	return nil, err 
        
           }

and I don't know why we're getting past it.

It might be helpful to not try to reproduce this the full version-upgrade test, but simply a regular test where we repeatedly drain and instantly restart a node, then run a distributed query on it.

Or maybe even better, add some ad-hoc code to a build that verifies an invariant that we want: that once the restart of the down node is acked, we immediately no longer see any of these connection errors on the other nodes. (Needs some tricky to make it work: probably best to run in local mode, and to touch a magic file in the other nodes' store dirs right after the down node acked the restart, and check for that marker file when seeing the error on the other nodes).

Btw, I think this may have gotten more frequent because now we gracefully drain nodes in acceptance/version-upgrade. This means that there is no temporary unavailability in the cluster which could've masked these kinds of failures. I think we can paper over this for now with a sleep; don't think this is a new regression.

https://github.com/cockroachdb/cockroach/blob/1af66356e77f516ede7cd56a3d7946873a590ef7/pkg/cli/start.go#L437 ↩

We have seen cases where a transient error could occur when a newly-upgraded node serves as a gateway for a distributed query due to remote nodes not being able to dial back to the gateway for some reason (investigation of it is tracked in cockroachdb#87634). For now, we're papering over these flakes by 4 second sleep. Release note: None

87645: ui: fix txn insight query bug, align summary card, remove contended keys in details page r=ericharmeling a=ericharmeling This commit fixes a small bug on the transaction insight details page that was incorrectly mapping the waiting transaction statement fingerprints to the blocking transaction statements. The commit also aligns the summary cards in the details page. The commit also removes the contended key from the details page while we look for a more user- friendly format to display row contention. Before: ![image](https://user-images.githubusercontent.com/27286675/189216476-8211d598-5d4e-4255-846f-82c785764016.png) After: ![image](https://user-images.githubusercontent.com/27286675/189216006-f01edeb6-ab2f-42ac-9978-6fce85b9a79a.png) Fixes #87838. Release note: None Release justification: bug fix 87715: roachtest: add 4s of sleep after restart when upgrading nodes r=yuzefovich a=yuzefovich We have seen cases where a transient error could occur when a newly-upgraded node serves as a gateway for a distributed query due to remote nodes not being able to dial back to the gateway for some reason (investigation of it is tracked in #87634). For now, we're papering over these flakes by 4 second sleep. Addresses: #87104. Release note: None 87840: roachtest: do not generate division ops in costfuzz and unoptimized tests r=mgartner a=mgartner The division (`/`) and floor division (`//`) operators were making costfuzz and unoptimized-query-oracle tests flaky. This commit disables generation of these operators as a temporary mitigation for these flakes. Informs #86790 Release note: None 87854: kvcoord: reliably handle stuck watcher error r=erikgrinaker a=tbg Front-ports parts of #87253. When a rangefeed gets stuck, and the server is local, the server might notice the cancellation before the client, and may send a cancellation error back in a rangefeed event. We now handle this the same as the other case (where the stream client errors out due to the cancellation). This also checks in the test from #87253 (which is on release-22.1). Fixes #87370. No release note since this will be backported to release-22.2 Release note: None Co-authored-by: Eric Harmeling <eric.harmeling@cockroachlabs.com> Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com> Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com> Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>

We have seen cases where a transient error could occur when a newly-upgraded node serves as a gateway for a distributed query due to remote nodes not being able to dial back to the gateway for some reason (investigation of it is tracked in #87634). For now, we're papering over these flakes by 4 second sleep. Release note: None

andreimatei · 2022-12-08T17:10:52Z

Is this the same as #44101 ?

tbg · 2022-12-20T10:06:20Z

Looks like it.

tbg added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv-replication labels Sep 8, 2022

tbg mentioned this issue Sep 8, 2022

roachtest: acceptance/version-upgrade is flaky #87104

Closed

yuzefovich mentioned this issue Sep 9, 2022

roachtest: add 4s of sleep after restart when upgrading nodes #87715

Merged

blathers-crl bot mentioned this issue Sep 12, 2022

release-22.2: roachtest: add 4s of sleep after restart when upgrading nodes #87882

Merged

erikgrinaker added T-kv KV Team and removed T-kv-replication labels Sep 28, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rpc: FlowStream error to recently restarted node #87634

rpc: FlowStream error to recently restarted node #87634

tbg commented Sep 8, 2022 •

edited by exalate-issue-sync bot

Loading

blathers-crl bot commented Sep 8, 2022

tbg commented Sep 8, 2022

andreimatei commented Dec 8, 2022

tbg commented Dec 20, 2022

rpc: FlowStream error to recently restarted node #87634

rpc: FlowStream error to recently restarted node #87634

Comments

tbg commented Sep 8, 2022 • edited by exalate-issue-sync bot Loading

Footnotes

blathers-crl bot commented Sep 8, 2022

tbg commented Sep 8, 2022

Footnotes

andreimatei commented Dec 8, 2022

tbg commented Dec 20, 2022

tbg commented Sep 8, 2022 •

edited by exalate-issue-sync bot

Loading