rpc: nodes fail to connect to peer even after the peer is up #44101
44102: sql: don't distribute migration queries r=andreimatei a=andreimatei

This patch inhibits DistSQL distribution for the queries that the migrations run. This was prompted by #44101, which is causing a distributed query done soon after a node startup to sometimes fail.

I've considered more bluntly disabling distribution for any query for a short period of time after the node starts up, but I went with the more targeted change to migrations because I think it's a bad idea for migrations to use query distribution even outside of #44101 - distributed queries are more fragile than local execution in general (for example, because of DistSender retries). And migrations can't tolerate any flakiness.

Fixes #43957
Fixes #44005
Touches #44101

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
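The patch flips the distribution decision inside the migration code itself, but the effect can be approximated from a client: CockroachDB exposes a public `distsql` session variable that forces local execution. A minimal sketch in Go of that session-level analogue (the connection string and the sample query are assumptions for illustration, not part of the patch):

```go
package main

import (
	"context"
	"database/sql"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol.
)

func main() {
	ctx := context.Background()

	// Assumed local cluster; adjust the connection string as needed.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Pin a single connection so the session variable applies to the
	// queries that follow (database/sql pools connections otherwise).
	conn, err := db.Conn(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Force local (non-distributed) execution for this session. This is
	// the public analogue of what the patch does internally for migrations.
	if _, err := conn.ExecContext(ctx, "SET distsql = off"); err != nil {
		log.Fatal(err)
	}

	// Queries on this session now plan locally, sidestepping the
	// startup-time distribution flakiness described in #44101.
	var n int
	if err := conn.QueryRowContext(ctx,
		"SELECT count(*) FROM system.namespace").Scan(&n); err != nil {
		log.Fatal(err)
	}
	log.Printf("%d namespace entries", n)
}
```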
This is a cool issue and would be good to fix at some point. I'm taking it out of my queue, as I haven't thought about it in years.
We could add code here (`pkg/server/server.go`, lines 251 to 261 at commit `2dc2da8`) that resets the breaker for the originator of the PingRequest:

```diff
diff --git a/pkg/server/server.go b/pkg/server/server.go
index 3f24182238..3ed5616a67 100644
--- a/pkg/server/server.go
+++ b/pkg/server/server.go
@@ -233,6 +233,13 @@ func NewServer(cfg Config, stopper *stop.Stopper) (*Server, error) {
 	nodeTombStorage, checkPingFor := getPingCheckDecommissionFn(engines)
 
+	// This will be assigned once we've instantiated the nodeDialer.
+	//
+	// NB: not a fan of this pattern, but the alternative is a larger
+	// rearrangement. Would need to double-check that this won't
+	// be invoked before it's populated.
+	var resetNodeDialerBreakerTo func(nodeID roachpb.NodeID)
+
 	rpcCtxOpts := rpc.ContextOptions{
 		TenantID: roachpb.SystemTenantID,
 		NodeID:   cfg.IDContainer,
@@ -257,7 +264,11 @@ func NewServer(cfg Config, stopper *stop.Stopper) (*Server, error) {
 			// Incoming ping will reject requests with codes.PermissionDenied to
 			// signal remote node that it is not considered valid anymore and
 			// operations should fail immediately.
-			return checkPingFor(ctx, req.OriginNodeID, codes.PermissionDenied)
+			if err := checkPingFor(ctx, req.OriginNodeID, codes.PermissionDenied); err != nil {
+				return err
+			}
+			resetNodeDialerBreakerTo(req.OriginNodeID)
+			return nil
 		},
 	}
 	if knobs := cfg.TestingKnobs.Server; knobs != nil {
@@ -322,6 +333,11 @@ func NewServer(cfg Config, stopper *stop.Stopper) (*Server, error) {
 	nodeDialer := nodedialer.NewWithOpt(rpcContext, gossip.AddressResolver(g),
 		nodedialer.DialerOpt{TestingKnobs: dialerKnobs})
 
+	resetNodeDialerBreakerTo = func(nodeID roachpb.NodeID) {
+		for class := rpc.ConnectionClass(0); int(class) < rpc.NumConnectionClasses; class++ {
+			nodeDialer.GetCircuitBreaker(nodeID, class).Reset()
+		}
+	}
 	runtimeSampler := status.NewRuntimeStatSampler(ctx, clock)
 	registry.AddMetricStruct(runtimeSampler)
```
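The essence of the diff, stripped of CockroachDB's types: a valid incoming ping is proof that the origin node is alive and can reach us, so any dial breakers we have tripped for it are stale and can be reset. A self-contained sketch of that pattern (the `Breaker` and `Dialer` types here are toy stand-ins, not the real `rpc`/`nodedialer` APIs):

```go
package main

import (
	"fmt"
	"sync"
)

type NodeID int

// Breaker is a toy stand-in for a dial circuit breaker.
type Breaker struct {
	mu      sync.Mutex
	tripped bool
}

func (b *Breaker) Trip()  { b.mu.Lock(); b.tripped = true; b.mu.Unlock() }
func (b *Breaker) Reset() { b.mu.Lock(); b.tripped = false; b.mu.Unlock() }
func (b *Breaker) OK() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return !b.tripped
}

// Dialer holds one breaker per node; the real code keeps one per
// (node, connection class) pair, collapsed to one here for brevity.
type Dialer struct {
	mu       sync.Mutex
	breakers map[NodeID]*Breaker
}

func (d *Dialer) breakerFor(id NodeID) *Breaker {
	d.mu.Lock()
	defer d.mu.Unlock()
	b, ok := d.breakers[id]
	if !ok {
		b = &Breaker{}
		d.breakers[id] = b
	}
	return b
}

// OnIncomingPing mirrors the diff: a valid ping from origin is direct
// evidence the peer is reachable again, so reset its breaker.
func (d *Dialer) OnIncomingPing(origin NodeID) {
	d.breakerFor(origin).Reset()
}

func main() {
	d := &Dialer{breakers: map[NodeID]*Breaker{}}
	d.breakerFor(1).Trip()            // a dial to n1 failed while it was down
	fmt.Println(d.breakerFor(1).OK()) // false: subsequent dials fail fast
	d.OnIncomingPing(1)               // n1 is back up and pings us
	fmt.Println(d.breakerFor(1).OK()) // true: dials may proceed again
}
```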
This would deal with the "causally related" situations, such as the one described in #87104 (comment).
We have marked this issue as stale because it has been inactive for an extended period.
I believe I've run into the following race: node 1 goes down, a dial attempt from node 2 fails, and the error is cached behind the connection's `conn.initOnce`. Node 1 then comes back up, but a second dial attempt from node 2 still observes the stale error from the first attempt.

This is quite unfortunate, particularly in situations where dial attempt 2 is causally related to node 1 coming back, for example by node 1 having sent a `SetupFlow` RPC, which asks node 2 to connect back to it. Node 2 failing to connect back is very unfortunate for the respective SQL query, which will wait in vain for a long time.

A suggested fix is to have node 2 consider network availability signals (incoming connections, or successful heartbeats) and make sure that no error from a dial attempt from before such a signal is propagated to any dial attempt from after the signal.
cc @ajwerner @bdarnell
Epic: CRDB-8500
Jira issue: CRDB-5255