storage: node with stalled disk holds on to leases indefinitely #41691
Comments
Doesn't look like #37906 to me. In that issue, the problems occur after a node restarts; they're not solved by doing more restarts. It's clearly something related to the liveness range; we need to look closer at the nodes that held that range (probably 2, 5, and 7? How were those nodes selected to restart?)
@ricardocrdb: Do you mind grabbing the debug.zip and posting it here?
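For reference, a bundle like the one requested here is typically produced with cockroach debug zip; this is only a minimal sketch, and the host address, certificate directory, and output path below are placeholders rather than this cluster's actual values.
# Minimal sketch of generating the requested debug bundle; host, certs dir,
# and output path are placeholders.
cockroach debug zip ./debug.zip --host=<any-node>:26257 --certs-dir=certs
# or, if the cluster runs in insecure mode:
cockroach debug zip ./debug.zip --host=<any-node>:26257 --insecure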
@irfansharif ha sorry I forgot to post that link. Debug zip is here. @bdarnell I am not sure how or why those particular nodes were selected to be restarted. Taking a look, node 2 does contain ranges 1 and 3 as the leader and range 2 as a follower.
Zendesk ticket #3900 has been linked to this issue.
To confirm (for myself) it's liveness related, the following appears on all nodes.
~14:47:38 is also interesting: there's a sudden drop in replica leaseholders for 10.7.66.8 (n2), corresponding spikes in "raft log behind" and "under replicated ranges", and a drop in "ranges".
~ find debug | grep "/1.json"
debug/nodes/7/ranges/1.json
debug/nodes/1/ranges/1.json
debug/nodes/3/ranges/1.json
debug/nodes/2/ranges/1.json
debug/nodes/5/ranges/1.json
~ find debug | grep "/2.json"
debug/nodes/7/ranges/2.json
debug/nodes/4/ranges/2.json
debug/nodes/3/ranges/2.json
debug/nodes/2/ranges/2.json
debug/nodes/5/ranges/2.json
~ find debug | grep "/3.json"
debug/nodes/1/ranges/3.json
debug/nodes/4/ranges/3.json
debug/nodes/3/ranges/3.json
debug/nodes/2/ranges/3.json
debug/nodes/5/ranges/3.json
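As a rough way to compare each node's view of these ranges using only the debug zip, a loop like the one below can surface any lease/raft-state fields in each copy of the range JSON. The field names inside these files vary by version, so the grep pattern is only a guess for eyeballing, not a reliable parser.
# Sketch: dump any lease/raft-state-looking fields from each node's copy of
# ranges 1-3 in the debug zip. Field names differ across versions, so the
# pattern is a best guess for manual inspection only.
for r in 1 2 3; do
  for f in debug/nodes/*/ranges/${r}.json; do
    echo "== ${f}"
    python -m json.tool "${f}" | grep -iE '"(lease|raft_state|raftState|leader)' || true
  done
done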
@ricardocrdb: Can we get the graphs for the following set of metrics? Around 14:30:00 - 15:30:00 should suffice.
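If pulling graphs from the Admin UI is awkward, the gauges mentioned earlier in the thread (replica leaseholders, raft log behind, under-replicated ranges, range count) are also exposed on each node's Prometheus endpoint. The HTTP port and exact metric names below are assumptions based on a default configuration and may differ on this cluster.
# Spot-check the relevant gauges from each node's Prometheus endpoint
# (default HTTP port 8080 assumed; metric names may vary by version).
for n in 10.7.66.8 <other-node-addresses>; do
  echo "== ${n}"
  curl -s "http://${n}:8080/_status/vars" \
    | grep -E '^(replicas_leaseholders|raftlog_behind|ranges_underreplicated|ranges)[ {]'
done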
The symptoms here seem awfully similar to #32736. Looking at the following:
We see
If you look at
Closing in favor of #41683.
A v19.1.3 cluster with 9 nodes had unavailable ranges for about 15 minutes; the ranges only became available again after the cockroach process was restarted on nodes 2, 5, and 7.
The unavailability appears to correlate with a spike in Raft log entries falling behind and with all leaseholders disappearing:
The logs began reporting liveness failures on all 9 nodes; chronologically, node 1 reported the first one:
W191015 14:50:35.060579 1231 storage/node_liveness.go:463 [n1,hb] failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 4.5s
Shortly afterward, the same messages appeared on the other 8 nodes.
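One way to order these failures across nodes, using only the logs already collected in the debug zip, is a grep-and-sort like the sketch below; the debug/nodes/<id>/logs/ directory layout is an assumption about how the bundle is organized.
# Earliest "failed node liveness heartbeat" message per node, assuming the
# usual debug.zip layout of debug/nodes/<id>/logs/*.log.
for d in debug/nodes/*/; do
  echo "== ${d}"
  grep -rh "failed node liveness heartbeat" "${d}logs/" 2>/dev/null | sort | head -n 1
done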
Additionally, on node 2, system ranges were unavailable:
Nodes 2, 5, and 7 were restarted before the ranges became available again. We need to determine the root cause of this incident.