rpc: FlowStream error to recently restarted node #87634
cc @cockroachdb/replication
I think it would be helpful to reproduce this with the env var. But I think this has to do with how the connection pool works. Not that it should work that way, but I think somehow what happens is that we have a connection in the pool (on n1, the not-restarted node in the artifacts) that still looks "healthy" even though grpc's first dial has already hit an error (guessing: connection refused). The conn is then handed out here and survives the error check:

cockroach/pkg/sql/colflow/colrpc/outbox.go, lines 176 to 184 (at 2675c7c)
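To make the stale-health scenario concrete, here is a minimal, hypothetical Go sketch (not the actual CockroachDB pool or nodedialer code) of a pool that hands out a cached gRPC connection based on a channel-state snapshot, which can already be outdated by the time the caller uses it:

```go
// Hypothetical sketch of a connection pool whose health signal can go stale.
// Names (connPool, get) are illustrative, not the real CockroachDB types.
package main

import (
	"fmt"
	"sync"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
	"google.golang.org/grpc/credentials/insecure"
)

type connPool struct {
	mu    sync.Mutex
	conns map[int]*grpc.ClientConn
}

// get returns a cached connection if its channel state still looks usable.
// The state check is only a snapshot: the channel can fail right after this
// returns, which is the window the comment above describes.
func (p *connPool) get(nodeID int, addr string) (*grpc.ClientConn, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if c, ok := p.conns[nodeID]; ok {
		if s := c.GetState(); s != connectivity.Shutdown {
			return c, nil // handed out even if the peer just restarted
		}
	}
	// grpc.Dial is lazy by default, so this succeeds even if nothing is
	// listening on addr yet; the failure only shows up on a later RPC.
	c, err := grpc.Dial(addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return nil, err
	}
	if p.conns == nil {
		p.conns = map[int]*grpc.ClientConn{}
	}
	p.conns[nodeID] = c
	return c, nil
}

func main() {
	var p connPool
	c, err := p.get(1, "localhost:26257") // illustrative address
	fmt.Println(c != nil, err)
}
```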
But when we actually try to make a FlowStream client and connect, we prompt a redial (or something like that) and run into the error here:

cockroach/pkg/sql/colflow/colrpc/outbox.go, lines 186 to 200 (at 2675c7c)
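For illustration, a self-contained sketch of how gRPC's default fail-fast behavior surfaces a dial failure only when a stream is actually opened; the stock health `Watch` stream stands in for `FlowStream`, and the address is made up:

```go
// Hedged sketch: with gRPC's default fail-fast behavior, a lazily-dialed
// channel surfaces the "connection refused" only when an RPC/stream is
// opened, not when the conn object was handed out.
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Dialing is lazy: this succeeds even though nothing listens on the port.
	conn, err := grpc.Dial("localhost:1", grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Opening the stream forces a connection attempt; the error only becomes
	// visible here, analogous to the FlowStream call in outbox.go.
	stream, err := healthpb.NewHealthClient(conn).Watch(ctx, &healthpb.HealthCheckRequest{})
	if err != nil {
		fmt.Println("stream open error:", err)
		return
	}
	_, err = stream.Recv()
	fmt.Println("recv error:", err)
}
```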
This doesn't all obviously check out, since there is a health check when pulling the connection from the pool, and I don't know why we're getting past it:

cockroach/pkg/rpc/nodedialer/nodedialer.go, lines 207 to 216 (at 50c1196)
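One hedged way to picture how such a check could be passed: a passive channel-state check is only a snapshot and is not atomic with the subsequent RPC, whereas an active round trip would notice the dead peer. This is an illustrative sketch, not the real nodedialer logic:

```go
// Hedged sketch, not the real nodedialer check: a passive state snapshot vs.
// an active round trip to the peer's health service.
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// passiveOK mirrors a "look at the channel state" style check; it can return
// true right up until the moment the peer's listener disappears.
func passiveOK(conn *grpc.ClientConn) bool {
	s := conn.GetState()
	return s == connectivity.Ready || s == connectivity.Idle
}

// activeOK forces a round trip, so a freshly dead peer is detected.
func activeOK(ctx context.Context, conn *grpc.ClientConn) error {
	ctx, cancel := context.WithTimeout(ctx, time.Second)
	defer cancel()
	_, err := healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{})
	return err
}

func main() {
	conn, err := grpc.Dial("localhost:1", grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	fmt.Println("passive:", passiveOK(conn))                      // likely true (Idle)
	fmt.Println("active :", activeOK(context.Background(), conn)) // connection refused
}
```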
It might be helpful not to try to reproduce this via the full version-upgrade test, but with a plain test that repeatedly drains and instantly restarts a node and then runs a distributed query on it. Or, maybe even better, add some ad-hoc code to a build that verifies an invariant we want: once the restart of the down node is acked, we should immediately stop seeing any of these connection errors on the other nodes. (This needs some trickery to make it work: probably best to run in local mode, touch a magic marker file in the other nodes' store dirs right after the down node acks the restart, and check for that marker file when the error is seen on the other nodes; a rough sketch of this marker-file check follows below.)

Btw, I think this may have gotten more frequent because we now gracefully drain nodes in acceptance/version-upgrade. This means there is no temporary unavailability in the cluster which could have masked these kinds of failures.

I think we can paper over this for now with a sleep; I don't think this is a new regression.
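A rough, self-contained sketch of the marker-file invariant check described above; the marker name, paths, and error-string matching are assumptions for illustration only:

```go
// Hedged sketch of the ad-hoc invariant check: once the restarted node has
// acked readiness, a marker file is dropped in each surviving node's store
// dir; if a connection error is later observed while the marker exists, the
// invariant is violated.
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

const markerName = "restart-acked-marker" // hypothetical marker file name

// markRestartAcked is called on each surviving node right after the restarted
// node reports readiness.
func markRestartAcked(storeDir string) error {
	return os.WriteFile(filepath.Join(storeDir, markerName), []byte("ok"), 0644)
}

// checkInvariant is called where the connection error is observed. If the
// marker is present, the error happened *after* the restart was acked, which
// is exactly what should no longer occur.
func checkInvariant(storeDir string, rpcErr error) error {
	if _, statErr := os.Stat(filepath.Join(storeDir, markerName)); statErr == nil &&
		strings.Contains(rpcErr.Error(), "connection refused") {
		return fmt.Errorf("invariant violated: %v seen after restart was acked", rpcErr)
	}
	return nil
}

func main() {
	dir, _ := os.MkdirTemp("", "store")
	_ = markRestartAcked(dir)
	fmt.Println(checkInvariant(dir, errors.New("rpc error: connection refused")))
}
```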
We have seen cases where a transient error could occur when a newly-upgraded node serves as a gateway for a distributed query, due to remote nodes not being able to dial back to the gateway for some reason (the investigation is tracked in cockroachdb#87634). For now, we're papering over these flakes with a 4-second sleep. Release note: None
87645: ui: fix txn insight query bug, align summary card, remove contended keys in details page r=ericharmeling a=ericharmeling

This commit fixes a small bug on the transaction insight details page that was incorrectly mapping the waiting transaction statement fingerprints to the blocking transaction statements. The commit also aligns the summary cards in the details page and removes the contended key from the details page while we look for a more user-friendly format to display row contention.

Before: ![image](https://user-images.githubusercontent.com/27286675/189216476-8211d598-5d4e-4255-846f-82c785764016.png)
After: ![image](https://user-images.githubusercontent.com/27286675/189216006-f01edeb6-ab2f-42ac-9978-6fce85b9a79a.png)

Fixes #87838. Release note: None Release justification: bug fix

87715: roachtest: add 4s of sleep after restart when upgrading nodes r=yuzefovich a=yuzefovich

We have seen cases where a transient error could occur when a newly-upgraded node serves as a gateway for a distributed query, due to remote nodes not being able to dial back to the gateway for some reason (the investigation is tracked in #87634). For now, we're papering over these flakes with a 4-second sleep. Addresses: #87104. Release note: None

87840: roachtest: do not generate division ops in costfuzz and unoptimized tests r=mgartner a=mgartner

The division (`/`) and floor division (`//`) operators were making costfuzz and unoptimized-query-oracle tests flaky. This commit disables generation of these operators as a temporary mitigation for these flakes. Informs #86790. Release note: None

87854: kvcoord: reliably handle stuck watcher error r=erikgrinaker a=tbg

Front-ports parts of #87253. When a rangefeed gets stuck and the server is local, the server might notice the cancellation before the client and may send a cancellation error back in a rangefeed event. We now handle this the same as the other case (where the stream client errors out due to the cancellation). This also checks in the test from #87253 (which is on release-22.1). Fixes #87370. No release note since this will be backported to release-22.2. Release note: None

Co-authored-by: Eric Harmeling <eric.harmeling@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
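For context, a self-contained sketch of the shape of the workaround from #87715: after a restarted node accepts SQL connections again, wait a few extra seconds before using it as a gateway, so other nodes' cached connections have a chance to notice the restart. The connection URL, driver, and timings are illustrative, not the actual roachtest code:

```go
// Hedged sketch of the "sleep after restart" paper-over, not roachtest code.
package main

import (
	"database/sql"
	"fmt"
	"time"

	_ "github.com/lib/pq" // illustrative choice of Postgres-wire driver
)

// waitForSQL polls until the node accepts SQL connections or the timeout hits.
func waitForSQL(url string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		db, err := sql.Open("postgres", url)
		if err == nil {
			if pingErr := db.Ping(); pingErr == nil {
				db.Close()
				return nil
			}
			db.Close()
		}
		time.Sleep(500 * time.Millisecond)
	}
	return fmt.Errorf("node did not become ready within %s", timeout)
}

func main() {
	url := "postgresql://root@localhost:26257?sslmode=disable" // illustrative
	if err := waitForSQL(url, time.Minute); err != nil {
		panic(err)
	}
	// The paper-over: give remote nodes a moment to drop stale connections
	// before routing distributed queries through this gateway.
	time.Sleep(4 * time.Second)
	fmt.Println("proceeding with distributed queries")
}
```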
Is this the same as #44101?
Looks like it.
Describe the problem
Extracted from an acceptance/version-upgrade failure.[^1]
We're likely deflaking this test by inserting a sleep, but we should investigate at some point.
To Reproduce
Potentially by stressing version-upgrade at the SHA of the failure.[^1]
Expected behavior
When a node has signaled readiness, it ought to be able to serve the FlowStream RPC. It's unclear whether the problem is in the other node's RPC layer (is it holding on to, and using, a connection to the peer from before the restart that hasn't yet been torn down?) or whether the node is somehow serving requests before it is truly ready to do so, resulting in transient errors.
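As a purely illustrative probe of the first hypothesis (a stale cached connection on the other node), one could ask gRPC to wait for the channel to become ready again rather than failing fast; the stock health `Check` call stands in for `FlowStream`, and this is not a proposed fix to the CockroachDB RPC layer:

```go
// Hedged sketch: WaitForReady makes the RPC queue until the channel has
// reconnected to the restarted peer, instead of failing fast with
// "connection refused" on a transport that died with the old process.
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	conn, err := grpc.Dial("localhost:26257", // illustrative address
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// With WaitForReady, a transient post-restart window shows up as a short
	// delay rather than an immediate error.
	_, err = healthpb.NewHealthClient(conn).Check(ctx,
		&healthpb.HealthCheckRequest{}, grpc.WaitForReady(true))
	fmt.Println("result:", err)
}
```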
Artifacts attached.
debug.zip
artifacts.zip
Note: putting this into Replication team because the KV area nominally owns RPC and I started looking into this.
Jira issue: CRDB-19450
Epic: CRDB-39899
Footnotes

[^1]: https://github.com/cockroachdb/cockroach/issues/87104#issuecomment-1240906865