
storage: more proactively replicaGC replicas with stuck commands #26952

Closed
tbg opened this issue Jun 25, 2018 · 2 comments
Labels
A-kv-replication: Relating to Raft, consensus, and coordination.
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
S-2-temp-unavailability: Temp crashes or other availability problems. Can be worked around or resolved by restarting.
Milestone
2.1

Comments


tbg commented Jun 25, 2018

A user reported privately that a node they restarted spent several minutes with heartbeats timing out due to a stale local replica of the liveness range.

There are two ways to think about this problem:

  1. it takes too long until the stale replica of the liveness range is replicaGC'ed.
  2. DistSender holds on to the stale cache entry for too long, likely because the error it gets back from the RPC is a context cancellation (an error it deliberately does not evict the cache entry for, since a cancellation can bubble down from the SQL client connection); see the sketch after this list.
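To make the second point concrete, here is a minimal, self-contained sketch of that eviction decision, assuming made-up names (`shouldEvict`, `cacheEntry`) rather than DistSender's real cache code:

```go
// Hypothetical sketch of the cache-eviction decision in point 2; shouldEvict
// and cacheEntry are made-up names, not DistSender's actual API.
package main

import (
	"context"
	"errors"
	"fmt"
)

type cacheEntry struct{ rangeID int }

// shouldEvict reports whether a failed RPC should drop the cached routing
// entry for a range. Context cancellation is deliberately excluded: it can
// simply mean the SQL client went away, not that the entry is stale -- which
// is why a stale liveness-range entry can survive for so long.
func shouldEvict(rpcErr error) bool {
	return !errors.Is(rpcErr, context.Canceled) &&
		!errors.Is(rpcErr, context.DeadlineExceeded)
}

func main() {
	entry := cacheEntry{rangeID: 1051}
	for _, err := range []error{context.DeadlineExceeded, errors.New("range not found")} {
		fmt.Printf("r%d: error=%q, evict=%v\n", entry.rangeID, err, shouldEvict(err))
	}
}
```

Not evicting on cancellation avoids spurious cache churn when clients give up, but it is also exactly what lets a stale routing entry outlive the replica it points at.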

I think something we ought to explore is timing out commands at the proposal level. If a proposal spends more than (say) 10s waiting for a response from Raft, it checks the Raft group status. If the replica is neither follower nor leader, the proposal receives an ambiguous result and the replica is suggested for a high-priority replicaGC check (overriding any recent check).
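A minimal sketch of that proposal-level timeout, where `waitForProposal` and `suggestGC` are hypothetical stand-ins for the real proposal and replicaGC-queue plumbing (the etcd/raft types are real; everything else is illustrative):

```go
// Hypothetical sketch of a proposal-level timeout; waitForProposal and
// suggestGC are made up, while raft.Status/StateLeader/StateFollower are
// etcd/raft's real types.
package main

import (
	"errors"
	"fmt"
	"time"

	"go.etcd.io/etcd/raft/v3"
)

// errAmbiguous stands in for an AmbiguousResultError: the command may still
// apply later, so the caller must not assume it failed cleanly.
var errAmbiguous = errors.New("result is ambiguous: proposal stuck")

// waitForProposal waits for the proposal's result. If nothing arrives within
// the timeout and the Raft group says this replica is neither leader nor
// follower, it asks for a high-priority replicaGC check and returns an
// ambiguous result.
func waitForProposal(done <-chan error, timeout time.Duration,
	status func() *raft.Status, suggestGC func()) error {
	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		if st := status(); st == nil ||
			(st.RaftState != raft.StateLeader && st.RaftState != raft.StateFollower) {
			// Likely a stale replica that was removed from the range.
			suggestGC()
		}
		return errAmbiguous
	}
}

func main() {
	done := make(chan error) // never fires: simulates a stuck command
	err := waitForProposal(done, 100*time.Millisecond, // issue suggests ~10s
		func() *raft.Status { return nil },
		func() { fmt.Println("suggested replica for high-priority replicaGC") })
	fmt.Println(err)
}
```

The log excerpt below from the affected node shows the stale liveness replica (r1051) lingering until a peer suggestion finally triggers its removal: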

W180622 20:33:29.463217 197 storage/node_liveness.go:438  [n7,hb] failed node liveness heartbeat: context deadline exceeded
W180622 20:33:29.529018 198 sql/jobs/registry.go:300  canceling all jobs due to liveness failure
W180622 20:33:30.529649 198 sql/jobs/registry.go:300  canceling all jobs due to liveness failure
I180622 20:33:31.006571 1153 storage/store.go:3514  [n7,s7,r6792/3:/System/tsd/cr.node.sql.mem.a…] added to replica GC queue (peer suggestion)
I180622 20:33:31.006606 2016 storage/store.go:3514  [n7,s7,r6792/3:/System/tsd/cr.node.sql.mem.a…] added to replica GC queue (peer suggestion)
I180622 20:33:31.008290 31717 storage/store.go:2385  [replicaGC,n7,s7,r6792/3:/System/tsd/cr.node.sql.mem.a…] removing replica
I180622 20:33:31.010715 31717 storage/replica.go:836  [replicaGC,n7,s7,r6792/3:/System/tsd/cr.node.sql.mem.a…] removed 5655 (5644+11) keys in 2ms [clear=1ms commit=1ms]
W180622 20:33:31.529866 198 sql/jobs/registry.go:300  canceling all jobs due to liveness failure
I180622 20:33:32.006228 1153 storage/store.go:3514  [n7,s7,r1051/24:/System/NodeLiveness{-Max}] added to replica GC queue (peer suggestion)
I180622 20:33:32.006268 2016 storage/store.go:3514  [n7,s7,r7379/4:/System/tsd/cr.node.timeserie…] added to replica GC queue (peer suggestion)
I180622 20:33:32.006300 2016 storage/store.go:3514  [n7,s7,r1051/24:/System/NodeLiveness{-Max}] added to replica GC queue (peer suggestion)
I180622 20:33:32.006360 182 storage/store.go:3514  [n7,s7,r7379/4:/System/tsd/cr.node.timeserie…] added to replica GC queue (peer suggestion)
W180622 20:33:32.006719 341 storage/replica.go:5648  [n7,s7,r1051/24:/System/NodeLiveness{-Max}] could not acquire lease for range gossip: r1051 was not found
W180622 20:33:32.006747 341 storage/store.go:1457  [n7,s7,r1051/24:/System/NodeLiveness{-Max}] could not gossip node liveness: r1051 was not found
I180622 20:33:32.007643 31662 storage/store.go:2385  [replicaGC,n7,s7,r1051/24:/System/NodeLiveness{-Max}] removing replica
W180622 20:33:32.067389 197 storage/node_liveness.go:501  [n7,hb] slow heartbeat took 2.6s
I180622 20:33:32.122423 31662 storage/replica.go:836  [replicaGC,n7,s7,r1051/24:/System/NodeLiveness{-Max}] removed 17 (8+9) keys in 115ms [clear=82ms commit=33ms]
W180622 20:33:32.123176 341 storage/replica.go:5648  [n7,s7,r1051/24:/System/NodeLiveness{-Max}] could not acquire lease for range gossip: r1051 was not found
W180622 20:33:32.123244 341 storage/store.go:1457  [n7,s7,r1051/24:/System/NodeLiveness{-Max}] could not gossip node liveness: r1051 was not found
tbg added the A-kv-replication (Relating to Raft, consensus, and coordination.) label Jun 25, 2018
tbg changed the title from "storage: minutes of failed heartbeats due to stale replica of liveness range" to "storage: more proactively replicaGC replicas with stuck commands" Jun 25, 2018
tbg added the C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.) label Jul 22, 2018
tbg added this to the 2.1 milestone Jul 22, 2018
tbg added the S-2-temp-unavailability (Temp crashes or other availability problems. Can be worked around or resolved by restarting.) label Aug 21, 2018

tbg commented Sep 25, 2018

Possible connection to #24068.


tbg commented Oct 11, 2018

I think I actually fixed this in #31250.

tbg closed this as completed Oct 11, 2018