
storage: more proactively replicaGC replicas with stuck commands #26952

Closed
tbg opened this issue Jun 25, 2018 · 2 comments
Labels
A-kv-replication: Relating to Raft, consensus, and coordination.
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
S-2-temp-unavailability: Temp crashes or other availability problems. Can be worked around or resolved by restarting.
Milestone
2.1

Comments


tbg commented Jun 25, 2018

A user reported privately that a node they restarted spent several minutes with heartbeats timing out due to a stale local replica of the liveness range.

There are two ways to think about this problem:

  1. it takes too long until the stale replica of the liveness range is replicaGC'ed.
  2. DistSender holds on to the stale cache entry for too long, likely because the error it gets back from the RPC is a context cancellation (an error it deliberately does not evict the cache entry for, since a cancellation can bubble down from the SQL client connection); see the sketch after this list.
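To make the second point concrete, here is a minimal, self-contained sketch of that eviction decision, assuming made-up names (`shouldEvict`, `cacheEntry`) rather than DistSender's real cache code:

```go
// Hypothetical sketch of the cache-eviction decision in point 2; shouldEvict
// and cacheEntry are made-up names, not DistSender's actual API.
package main

import (
	"context"
	"errors"
	"fmt"
)

type cacheEntry struct{ rangeID int }

// shouldEvict reports whether a failed RPC should drop the cached routing
// entry for a range. Context cancellation is deliberately excluded: it can
// simply mean the SQL client went away, not that the entry is stale -- which
// is why a stale liveness-range entry can survive for so long.
func shouldEvict(rpcErr error) bool {
	return !errors.Is(rpcErr, context.Canceled) &&
		!errors.Is(rpcErr, context.DeadlineExceeded)
}

func main() {
	entry := cacheEntry{rangeID: 1051}
	for _, err := range []error{context.DeadlineExceeded, errors.New("range not found")} {
		fmt.Printf("r%d: error=%q, evict=%v\n", entry.rangeID, err, shouldEvict(err))
	}
}
```

Not evicting on cancellation avoids spurious cache churn when clients give up, but it is also exactly what lets a stale routing entry outlive the replica it points at.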

I think something we ought to explore is timing out commands at the proposal level. If a proposal spends more than (say) 10s waiting for a response from Raft, it checks the Raft group status. If the replica is neither follower nor leader, the proposal receives an ambiguous result and the replica is suggested for a high-priority replicaGC check (overriding any recent check).
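A minimal sketch of that proposal-level timeout, where `waitForProposal` and `suggestGC` are hypothetical stand-ins for the real proposal and replicaGC-queue plumbing (the etcd/raft types are real; everything else is illustrative):

```go
// Hypothetical sketch of a proposal-level timeout; waitForProposal and
// suggestGC are made up, while raft.Status/StateLeader/StateFollower are
// etcd/raft's real types.
package main

import (
	"errors"
	"fmt"
	"time"

	"go.etcd.io/etcd/raft/v3"
)

// errAmbiguous stands in for an AmbiguousResultError: the command may still
// apply later, so the caller must not assume it failed cleanly.
var errAmbiguous = errors.New("result is ambiguous: proposal stuck")

// waitForProposal waits for the proposal's result. If nothing arrives within
// the timeout and the Raft group says this replica is neither leader nor
// follower, it asks for a high-priority replicaGC check and returns an
// ambiguous result.
func waitForProposal(done <-chan error, timeout time.Duration,
	status func() *raft.Status, suggestGC func()) error {
	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		if st := status(); st == nil ||
			(st.RaftState != raft.StateLeader && st.RaftState != raft.StateFollower) {
			// Likely a stale replica that was removed from the range.
			suggestGC()
		}
		return errAmbiguous
	}
}

func main() {
	done := make(chan error) // never fires: simulates a stuck command
	err := waitForProposal(done, 100*time.Millisecond, // issue suggests ~10s
		func() *raft.Status { return nil },
		func() { fmt.Println("suggested replica for high-priority replicaGC") })
	fmt.Println(err)
}
```

The log excerpt below from the affected node shows the stale liveness replica (r1051) lingering until a peer suggestion finally triggers its removal: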

W180622 20:33:29.463217 197 storage/node_liveness.go:438  [n7,hb] failed node liveness heartbeat: context deadline exceeded
W180622 20:33:29.529018 198 sql/jobs/registry.go:300  canceling all jobs due to liveness failure
W180622 20:33:30.529649 198 sql/jobs/registry.go:300  canceling all jobs due to liveness failure
I180622 20:33:31.006571 1153 storage/store.go:3514  [n7,s7,r6792/3:/System/tsd/cr.node.sql.mem.a…] added to replica GC queue (peer suggestion)
I180622 20:33:31.006606 2016 storage/store.go:3514  [n7,s7,r6792/3:/System/tsd/cr.node.sql.mem.a…] added to replica GC queue (peer suggestion)
I180622 20:33:31.008290 31717 storage/store.go:2385  [replicaGC,n7,s7,r6792/3:/System/tsd/cr.node.sql.mem.a…] removing replica
I180622 20:33:31.010715 31717 storage/replica.go:836  [replicaGC,n7,s7,r6792/3:/System/tsd/cr.node.sql.mem.a…] removed 5655 (5644+11) keys in 2ms [clear=1ms commit=1ms]
W180622 20:33:31.529866 198 sql/jobs/registry.go:300  canceling all jobs due to liveness failure
I180622 20:33:32.006228 1153 storage/store.go:3514  [n7,s7,r1051/24:/System/NodeLiveness{-Max}] added to replica GC queue (peer suggestion)
I180622 20:33:32.006268 2016 storage/store.go:3514  [n7,s7,r7379/4:/System/tsd/cr.node.timeserie…] added to replica GC queue (peer suggestion)
I180622 20:33:32.006300 2016 storage/store.go:3514  [n7,s7,r1051/24:/System/NodeLiveness{-Max}] added to replica GC queue (peer suggestion)
I180622 20:33:32.006360 182 storage/store.go:3514  [n7,s7,r7379/4:/System/tsd/cr.node.timeserie…] added to replica GC queue (peer suggestion)
W180622 20:33:32.006719 341 storage/replica.go:5648  [n7,s7,r1051/24:/System/NodeLiveness{-Max}] could not acquire lease for range gossip: r1051 was not found
W180622 20:33:32.006747 341 storage/store.go:1457  [n7,s7,r1051/24:/System/NodeLiveness{-Max}] could not gossip node liveness: r1051 was not found
I180622 20:33:32.007643 31662 storage/store.go:2385  [replicaGC,n7,s7,r1051/24:/System/NodeLiveness{-Max}] removing replica
W180622 20:33:32.067389 197 storage/node_liveness.go:501  [n7,hb] slow heartbeat took 2.6s
I180622 20:33:32.122423 31662 storage/replica.go:836  [replicaGC,n7,s7,r1051/24:/System/NodeLiveness{-Max}] removed 17 (8+9) keys in 115ms [clear=82ms commit=33ms]
W180622 20:33:32.123176 341 storage/replica.go:5648  [n7,s7,r1051/24:/System/NodeLiveness{-Max}] could not acquire lease for range gossip: r1051 was not found
W180622 20:33:32.123244 341 storage/store.go:1457  [n7,s7,r1051/24:/System/NodeLiveness{-Max}] could not gossip node liveness: r1051 was not found
tbg added the A-kv-replication (Relating to Raft, consensus, and coordination.) label Jun 25, 2018
tbg changed the title from "storage: minutes of failed heartbeats due to stale replica of liveness range" to "storage: more proactively replicaGC replicas with stuck commands" Jun 25, 2018
tbg added the C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.) label Jul 22, 2018
tbg added this to the 2.1 milestone Jul 22, 2018
tbg added the S-2-temp-unavailability (Temp crashes or other availability problems. Can be worked around or resolved by restarting.) label Aug 21, 2018

tbg commented Sep 25, 2018

Possible connection to #24068.


tbg commented Oct 11, 2018

I think I actually fixed this in #31250.

tbg closed this as completed Oct 11, 2018