storage: node with stalled disk holds on to leases indefinitely #41691

Closed
ricardocrdb opened this issue Oct 17, 2019 · 12 comments
@ricardocrdb

A v19.1.3 cluster with 9 nodes had unavailable ranges for roughly 15 minutes; the ranges became available again after the cockroach process was restarted on nodes 2, 5, and 7.

[graph: unavailable ranges]

This unavailability possibly correlates with a spike in Raft logs falling behind and with all leaseholders disappearing:

[graph: Raft log behind]

[graph: leaseholders]

Logs began reporting liveness failures on all 9 nodes; node 1 reported the chronologically first failure:

W191015 14:50:35.060579 1231 storage/node_liveness.go:463 [n1,hb] failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 4.5s

The same messages appeared shortly afterward on the other 8 nodes.
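
One way to confirm which node logged the failure first is to grep across the per-node logs (e.g. as collected in a debug.zip laid out as debug/nodes/<id>/logs/). A rough sketch; sorting on the time field alone only works here because all of these messages fall on the same date:

~ grep -r "failed node liveness heartbeat" debug/nodes/*/logs/ | sort -k2,2 | head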

Additionally, on node 2, system ranges were unavailable:

...
W191015 14:51:28.635437 925 storage/store_rebalancer.go:227  [n2,s2,store-rebalancer] StorePool missing descriptor for local store
W191015 14:51:30.578149 134175812 storage/replica_write.go:199  [n2,s2,r3178/3:/System/tsd/cr.node.liveness.…] have been waiting 60.00s for proposing command Merge [/System/tsd/cr.node.liveness.heartbeatfailures/4/10s/2019-10-15T14:00:00Z,/Min), Merge [/System/tsd/cr.node.liveness.epochincrements/4/10s/2019-10-15T14:00:00Z,/Min), Merge [/System/tsd/cr.node.liveness.heartbeatlatency-max/4/10s/2019-10-15T14:00:00Z,/Min).
This range is likely unavailable.
Please submit this message at

  https://github.com/cockroachdb/cockroach/issues/new/choose

along with

    https://yourhost:8080/#/reports/range/3178

and the following Raft status: {"id":"3","term":89,"vote":"3","commit":6367705,"lead":"3","raftState":"StateLeader","applied":6367705,"progress":{"1":{"match":6367705,"next":6367714,"state":"ProgressStateReplicate"},"2":{"match":6367705,"next":6367714,"state":"ProgressStateReplicate"},"3":{"match":6367713,"next":6367714,"state":"ProgressStateReplicate"}},"leadtransferee":"0"}

Nodes 2, 5, and 7 were restarted before the ranges became available again. We still need to determine the root cause.

@ricardocrdb ricardocrdb added the C-investigation Further steps needed to qualify. C-label will change. label Oct 17, 2019
@tim-o
Contributor

tim-o commented Oct 17, 2019

@tbg @bdarnell is this another instance of #37906 or something new? The drop in ranges makes me think it's a different pattern but I'm not sure if that was also observed in #37906.

@irfansharif irfansharif self-assigned this Oct 17, 2019
@bdarnell
Contributor

Doesn't look like #37906 to me. In that issue, the problems occur after a node restarts; they're not solved by doing more restarts.

It's clearly something related to the liveness range; we need to look closer at the nodes that held that range (probably 2, 5, and 7? How were those nodes selected to restart?)

@irfansharif
Contributor

@ricardocrdb: Do you mind grabbing the debug.zip and posting it here?

@ricardocrdb
Author

@irfansharif ha sorry I forgot to post that link. Debug zip is here.

@bdarnell I am not sure how or why those particular nodes were selected for restart. Taking a look, node 2 does hold ranges 1 and 3 as the Raft leader and range 2 as a follower.

Range 1
{
  "span": {
    "start_key": "/Min",
    "end_key": "/System/NodeLiveness"
  },
  "raft_state": {
    "replica_id": 4,
    "hard_state": {
      "term": 143,
      "vote": 4,
      "commit": 1681176
    },
    "lead": 4,
    "state": "StateLeader",
...
Range 2
{
  "span": {
    "start_key": "/System/NodeLiveness",
    "end_key": "/System/NodeLivenessMax"
  },
  "raft_state": {
    "replica_id": 4,
    "hard_state": {
      "term": 372,
      "vote": 0,
      "commit": 19611870
    },
    "lead": 6,
    "state": "StateFollower",
...
Range 3
{
  "span": {
    "start_key": "/System/NodeLivenessMax",
    "end_key": "/System/tsd"
  },
  "raft_state": {
    "replica_id": 5,
    "hard_state": {
      "term": 148,
      "vote": 5,
      "commit": 10339455
    },
    "lead": 5,
    "state": "StateLeader",

@RoachietheSupportRoach
Collaborator

Zendesk ticket #3900 has been linked to this issue.

@irfansharif
Contributor

irfansharif commented Oct 17, 2019

To confirm (for myself) that this is liveness related: the following appears on all nodes.

38222:W191015 14:47:38.089734 739 server/node.go:880  [n4,summaries] health alerts detected: {Alerts:[{StoreID:0 Category:METRICS Description:liveness.heartbeatfailures Value:2 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}] XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
38958:W191015 14:50:38.092422 739 server/node.go:880  [n4,summaries] health alerts detected: {Alerts:[{StoreID:0 Category:METRICS Description:liveness.heartbeatfailures Value:1 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}] XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}

~14:47:38 is also interesting: there's a sudden drop in replica leaseholders for 10.7.66.8 (n2), corresponding spikes in "raft log behind" and "under-replicated ranges", and a drop in "ranges".
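
To double-check the "appears on all nodes" part from the debug.zip (a sketch, assuming logs are laid out under debug/nodes/<id>/logs/), a per-file count of the health-alert warnings works:

~ grep -rc "health alerts detected" debug/nodes/*/logs/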

@irfansharif
Contributor

we need to look closer at the nodes that held that range (probably 2, 5, and 7? How were those nodes selected to restart?)

~ find debug | grep "/1.json"    
debug/nodes/7/ranges/1.json
debug/nodes/1/ranges/1.json
debug/nodes/3/ranges/1.json
debug/nodes/2/ranges/1.json
debug/nodes/5/ranges/1.json

~ find debug | grep "/2.json"
debug/nodes/7/ranges/2.json
debug/nodes/4/ranges/2.json
debug/nodes/3/ranges/2.json
debug/nodes/2/ranges/2.json
debug/nodes/5/ranges/2.json

~ find debug | grep "/3.json"
debug/nodes/1/ranges/3.json
debug/nodes/4/ranges/3.json
debug/nodes/3/ranges/3.json
debug/nodes/2/ranges/3.json
debug/nodes/5/ranges/3.json
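
A quick way to see which of these replicas is the Raft leader, reusing the field names from the range excerpts earlier in the thread (a sketch; the exact layout of ranges/<r>.json may differ between versions):

for r in 1 2 3; do
  echo "r$r:"
  for f in debug/nodes/*/ranges/$r.json; do
    printf '  %s ' "$f"
    jq -c '{replica_id: .raft_state.replica_id, lead: .raft_state.lead, state: .raft_state.state}' "$f"
  done
done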

@irfansharif
Contributor

@ricardocrdb: Can we get the graphs for the following set of metrics? Around 14:30:00 - 15:30:00 should suffice.

liveness.heartbeatfailures
liveness.heartbeatsuccesses
liveness.epochincrements
liveness.livenodes
liveness.heartbeatlatency-p99.99

@irfansharif
Contributor

On n2, right before things go belly up, we see this:

W191015 14:50:45.694152 134176681 server/server_engine_health.go:78  [n2] disk stall detected: unable to write to <no-attributes>=/cockroach/cockroach-data within 10s 

The symptoms here seem awfully similar to #32736. Looking at the following:

W191015 14:51:31.311093 134175533 storage/replica_write.go:199  [n2,s2,r2/4:/System/NodeLiveness{-Max}] have been waiting 60.00s for proposing command RequestLease [/System/NodeLiveness,/Min).
This range is likely unavailable.
Please submit this message at

  https://github.com/cockroachdb/cockroach/issues/new/choose

along with

	https://yourhost:8080/#/reports/range/2

and the following Raft status: {"id":"4","term":335,"vote":"4","commit":19609002,"lead":"4","raftState":"StateLeader","applied":19609002,"progress":{"3":{"match":19609002,"next":19609013,"state":"ProgressStateReplicate"},"4":{"match":19609012,"next":19609013,"state":"ProgressStateReplicate"},"5":{"match":19609002,"next":19609013,"state":"ProgressStateReplicate"},"6":{"match":19609002,"next":19609013,"state":"ProgressStateReplicate"},"2":{"match":19609001,"next":19609013,"state":"ProgressStateReplicate"}},"leadtransferee":"0"}

We see n2 is the Raft leader for r2. Looking at debug/nodes/2/ranges/2.json, we see r2 is the node liveness range:

[...]
  "span": {
    "start_key": "/System/NodeLiveness",
    "end_key": "/System/NodeLivenessMax"
  },
[...]

n2 could maintain its own liveness record because the other liveness-range replicas were up and available, so it kept holding all of its epoch-based leases without doing much else.
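
Per the server_engine_health.go warning above, the stall check amounts to attempting a small synced write into the store directory and complaining if it doesn't complete within 10s. A stand-alone approximation that can be run on the affected host (path and threshold taken from that log line; this is not the actual CockroachDB code):

# Probe the store directory; path and 10s threshold come from the warning above.
STORE_DIR=/cockroach/cockroach-data
if ! timeout 10s dd if=/dev/zero of="$STORE_DIR/.diskstall-probe" bs=512 count=1 conv=fsync 2>/dev/null; then
  echo "disk stall suspected: a synced write to $STORE_DIR did not complete within 10s" >&2
fi
# Note: if the disk is fully wedged, the probe itself may hang in the kernel.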

@irfansharif
Contributor

+cc @bdarnell, I see you're looking at #41683.

@irfansharif
Contributor

If you look at problemranges.json, every node save n2 lists out a whole set of ranges under no_lease_range_ids.
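
A rough way to tally that from the report (a sketch: the per-node wrapper key and overall layout of problemranges.json are guesses and may differ; only no_lease_range_ids comes from the file itself):

~ jq '.problems_by_node_id | to_entries | map({node: .key, no_lease: ((.value.no_lease_range_ids // []) | length)})' problemranges.json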

@ricardocrdb ricardocrdb changed the title Nodes became unavailable for a short period of time. Nodes became unavailable for a short period of time until cockroach process was restarted manually Oct 18, 2019
@irfansharif irfansharif changed the title Nodes became unavailable for a short period of time until cockroach process was restarted manually storage: node with stalled disk holds on to leases indefinitely Oct 21, 2019
@irfansharif
Contributor

Closing in favor of #41683.
