storage: more proactively replicaGC replicas with stuck commands #26952
Labels
A-kv-replication
Relating to Raft, consensus, and coordination.
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
S-2-temp-unavailability
Temp crashes or other availability problems. Can be worked around or resolved by restarting.
A user reported privately that a node they restarted spent several minutes with heartbeats timing out due to a stale local replica of the liveness range.
There are two ways to think about this problem:
I think something we ought to explore is timing out commands at the proposal level. If a proposal spends more than (say) 10s waiting for a response from Raft, it checks the Raft group status. If the replica's Raft state is neither follower nor leader, the proposal receives an ambiguous result and the replica is suggested for a high-priority ReplicaGC check (overriding any previously scheduled check).