rpc: nodes fail to connect to peer even after the peer is up #44101
44102: sql: don't distribute migration queries r=andreimatei a=andreimatei

This patch inhibits DistSQL distribution for the queries that the migrations run. This was prompted by #44101, which is causing a distributed query done soon after a node startup to sometimes fail.

I've considered more bluntly disabling distribution for any query for a short period of time after the node starts up, but I went with the more targeted change to migrations because I think it's a bad idea for migrations to use query distribution even outside of #44101 - distributed queries are more fragile than local execution in general (for example, because of DistSender retries). And migrations can't tolerate any flakiness.

Fixes #43957
Fixes #44005
Touches #44101

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
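The patch flips the distribution decision inside the migration code itself, but the effect can be approximated from a client: CockroachDB exposes a public `distsql` session variable that forces local execution. A minimal sketch in Go of that session-level analogue (the connection string and the sample query are assumptions for illustration, not part of the patch):

```go
package main

import (
	"context"
	"database/sql"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol.
)

func main() {
	ctx := context.Background()

	// Assumed local cluster; adjust the connection string as needed.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Pin a single connection so the session variable applies to the
	// queries that follow (database/sql pools connections otherwise).
	conn, err := db.Conn(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Force local (non-distributed) execution for this session. This is
	// the public analogue of what the patch does internally for migrations.
	if _, err := conn.ExecContext(ctx, "SET distsql = off"); err != nil {
		log.Fatal(err)
	}

	// Queries on this session now plan locally, sidestepping the
	// startup-time distribution flakiness described in #44101.
	var n int
	if err := conn.QueryRowContext(ctx,
		"SELECT count(*) FROM system.namespace").Scan(&n); err != nil {
		log.Fatal(err)
	}
	log.Printf("%d namespace entries", n)
}
```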
This is a cool issue and would be good to fix at some point. I'm taking it out of my queue, as I haven't thought about it in years.
We could add code here (`pkg/server/server.go`, lines 251 to 261 at commit `2dc2da8`) that resets the breaker for the originator of the PingRequest:

```diff
diff --git a/pkg/server/server.go b/pkg/server/server.go
index 3f24182238..3ed5616a67 100644
--- a/pkg/server/server.go
+++ b/pkg/server/server.go
@@ -233,6 +233,13 @@ func NewServer(cfg Config, stopper *stop.Stopper) (*Server, error) {
 	nodeTombStorage, checkPingFor := getPingCheckDecommissionFn(engines)
 
+	// This will be assigned once we've instantiated the nodeDialer.
+	//
+	// NB: not a fan of this pattern, but the alternative is a larger
+	// rearrangement. Would need to double-check that this won't
+	// be invoked before it's populated.
+	var resetNodeDialerBreakerTo func(nodeID roachpb.NodeID)
+
 	rpcCtxOpts := rpc.ContextOptions{
 		TenantID: roachpb.SystemTenantID,
 		NodeID:   cfg.IDContainer,
@@ -257,7 +264,11 @@ func NewServer(cfg Config, stopper *stop.Stopper) (*Server, error) {
 			// Incoming ping will reject requests with codes.PermissionDenied to
 			// signal remote node that it is not considered valid anymore and
 			// operations should fail immediately.
-			return checkPingFor(ctx, req.OriginNodeID, codes.PermissionDenied)
+			if err := checkPingFor(ctx, req.OriginNodeID, codes.PermissionDenied); err != nil {
+				return err
+			}
+			resetNodeDialerBreakerTo(req.OriginNodeID)
+			return nil
 		},
 	}
 	if knobs := cfg.TestingKnobs.Server; knobs != nil {
@@ -322,6 +333,11 @@ func NewServer(cfg Config, stopper *stop.Stopper) (*Server, error) {
 	nodeDialer := nodedialer.NewWithOpt(rpcContext, gossip.AddressResolver(g),
 		nodedialer.DialerOpt{TestingKnobs: dialerKnobs})
 
+	resetNodeDialerBreakerTo = func(nodeID roachpb.NodeID) {
+		for class := rpc.ConnectionClass(0); int(class) < rpc.NumConnectionClasses; class++ {
+			nodeDialer.GetCircuitBreaker(nodeID, class).Reset()
+		}
+	}
 	runtimeSampler := status.NewRuntimeStatSampler(ctx, clock)
 	registry.AddMetricStruct(runtimeSampler)
```
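The essence of the diff, stripped of CockroachDB's types: a valid incoming ping is proof that the origin node is alive and can reach us, so any dial breakers we have tripped for it are stale and can be reset. A self-contained sketch of that pattern (the `Breaker` and `Dialer` types here are toy stand-ins, not the real `rpc`/`nodedialer` APIs):

```go
package main

import (
	"fmt"
	"sync"
)

type NodeID int

// Breaker is a toy stand-in for a dial circuit breaker.
type Breaker struct {
	mu      sync.Mutex
	tripped bool
}

func (b *Breaker) Trip()  { b.mu.Lock(); b.tripped = true; b.mu.Unlock() }
func (b *Breaker) Reset() { b.mu.Lock(); b.tripped = false; b.mu.Unlock() }
func (b *Breaker) OK() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return !b.tripped
}

// Dialer holds one breaker per node; the real code keeps one per
// (node, connection class) pair, collapsed to one here for brevity.
type Dialer struct {
	mu       sync.Mutex
	breakers map[NodeID]*Breaker
}

func (d *Dialer) breakerFor(id NodeID) *Breaker {
	d.mu.Lock()
	defer d.mu.Unlock()
	b, ok := d.breakers[id]
	if !ok {
		b = &Breaker{}
		d.breakers[id] = b
	}
	return b
}

// OnIncomingPing mirrors the diff: a valid ping from origin is direct
// evidence the peer is reachable again, so reset its breaker.
func (d *Dialer) OnIncomingPing(origin NodeID) {
	d.breakerFor(origin).Reset()
}

func main() {
	d := &Dialer{breakers: map[NodeID]*Breaker{}}
	d.breakerFor(1).Trip()            // a dial to n1 failed while it was down
	fmt.Println(d.breakerFor(1).OK()) // false: subsequent dials fail fast
	d.OnIncomingPing(1)               // n1 is back up and pings us
	fmt.Println(d.breakerFor(1).OK()) // true: dials may proceed again
}
```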
This would deal with the "causally related" situations, such as the one described in #87104 (comment).
We have marked this issue as stale because it has been inactive for an extended period.
I believe I've run into the following race: node 1 goes down, a dial attempt from node 2 fails, and the error is cached behind the connection's `conn.initOnce`. Node 1 then comes back up, but a second dial attempt from node 2 still observes the stale error from the first attempt.

This is quite unfortunate, particularly in situations where dial attempt 2 is causally related to node 1 coming back, for example by node 1 having sent a `SetupFlow` RPC, which asks node 2 to connect back to it. Node 2 failing to connect back is very unfortunate for the respective SQL query, which will wait in vain for a long time.

A suggested fix is to have node 2 consider network availability signals (incoming connections, or successful heartbeats) and make sure that no error from a dial attempt from before such a signal is propagated to any dial attempt from after the signal.
cc @ajwerner @bdarnell
Epic: CRDB-8500
Jira issue: CRDB-5255