storage: node with stalled disk holds on to leases indefinitely #41691

Closed
ricardocrdb opened this issue Oct 17, 2019 · 12 comments
@ricardocrdb

A v19.1.3 cluster with 9 nodes had unavailable ranges for roughly 15 minutes; the ranges became available again after the cockroach process was restarted on nodes 2, 5, and 7.

[graph: unavailable ranges]

This unavailability possibly correlates with a spike in Raft logs falling behind and with all leaseholders disappearing:

[graph: Raft log behind]

[graph: leaseholders]

Logs began reporting liveness failures on all 9 nodes; node 1 reported the chronologically first failure:

W191015 14:50:35.060579 1231 storage/node_liveness.go:463 [n1,hb] failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 4.5s

The same messages appeared shortly afterward on the other 8 nodes.
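
One way to confirm which node logged the failure first is to grep across the per-node logs (e.g. as collected in a debug.zip laid out as debug/nodes/<id>/logs/). A rough sketch; sorting on the time field alone only works here because all of these messages fall on the same date:

~ grep -r "failed node liveness heartbeat" debug/nodes/*/logs/ | sort -k2,2 | head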

Additionally, on node 2, system ranges were unavailable:

...
W191015 14:51:28.635437 925 storage/store_rebalancer.go:227  [n2,s2,store-rebalancer] StorePool missing descriptor for local store
W191015 14:51:30.578149 134175812 storage/replica_write.go:199  [n2,s2,r3178/3:/System/tsd/cr.node.liveness.…] have been waiting 60.00s for proposing command Merge [/System/tsd/cr.node.liveness.heartbeatfailures/4/10s/2019-10-15T14:00:00Z,/Min), Merge [/System/tsd/cr.node.liveness.epochincrements/4/10s/2019-10-15T14:00:00Z,/Min), Merge [/System/tsd/cr.node.liveness.heartbeatlatency-max/4/10s/2019-10-15T14:00:00Z,/Min).
This range is likely unavailable.
Please submit this message at

  https://github.com/cockroachdb/cockroach/issues/new/choose

along with

    https://yourhost:8080/#/reports/range/3178

and the following Raft status: {"id":"3","term":89,"vote":"3","commit":6367705,"lead":"3","raftState":"StateLeader","applied":6367705,"progress":{"1":{"match":6367705,"next":6367714,"state":"ProgressStateReplicate"},"2":{"match":6367705,"next":6367714,"state":"ProgressStateReplicate"},"3":{"match":6367713,"next":6367714,"state":"ProgressStateReplicate"}},"leadtransferee":"0"}

Nodes 2, 5, and 7 were restarted before the ranges became available again. We still need to determine the root cause.

@ricardocrdb ricardocrdb added the C-investigation Further steps needed to qualify. C-label will change. label Oct 17, 2019
@tim-o
Contributor

tim-o commented Oct 17, 2019

@tbg @bdarnell is this another instance of #37906 or something new? The drop in ranges makes me think it's a different pattern but I'm not sure if that was also observed in #37906.

@irfansharif irfansharif self-assigned this Oct 17, 2019
@bdarnell
Contributor

Doesn't look like #37906 to me. In that issue, the problems occur after a node restarts; they're not solved by doing more restarts.

It's clearly something related to the liveness range; we need to look closer at the nodes that held that range (probably 2, 5, and 7? How were those nodes selected to restart?)

@irfansharif
Contributor

@ricardocrdb: Do you mind grabbing the debug.zip and posting it here?

@ricardocrdb
Author

@irfansharif ha sorry I forgot to post that link. Debug zip is here.

@bdarnell I am not sure how or why those particular nodes were selected for restart. Taking a look, node 2 does hold ranges 1 and 3 as the Raft leader and range 2 as a follower.

Range 1
{
  "span": {
    "start_key": "/Min",
    "end_key": "/System/NodeLiveness"
  },
  "raft_state": {
    "replica_id": 4,
    "hard_state": {
      "term": 143,
      "vote": 4,
      "commit": 1681176
    },
    "lead": 4,
    "state": "StateLeader",
...
Range 2
{
  "span": {
    "start_key": "/System/NodeLiveness",
    "end_key": "/System/NodeLivenessMax"
  },
  "raft_state": {
    "replica_id": 4,
    "hard_state": {
      "term": 372,
      "vote": 0,
      "commit": 19611870
    },
    "lead": 6,
    "state": "StateFollower",
...
Range 3
{
  "span": {
    "start_key": "/System/NodeLivenessMax",
    "end_key": "/System/tsd"
  },
  "raft_state": {
    "replica_id": 5,
    "hard_state": {
      "term": 148,
      "vote": 5,
      "commit": 10339455
    },
    "lead": 5,
    "state": "StateLeader",

@RoachietheSupportRoach
Collaborator

Zendesk ticket #3900 has been linked to this issue.

@irfansharif
Contributor

irfansharif commented Oct 17, 2019

To confirm (for myself) that this is liveness related: the following appears on all nodes.

38222:W191015 14:47:38.089734 739 server/node.go:880  [n4,summaries] health alerts detected: {Alerts:[{StoreID:0 Category:METRICS Description:liveness.heartbeatfailures Value:2 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}] XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
38958:W191015 14:50:38.092422 739 server/node.go:880  [n4,summaries] health alerts detected: {Alerts:[{StoreID:0 Category:METRICS Description:liveness.heartbeatfailures Value:1 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}] XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}

~14:47:38 is also interesting: there's a sudden drop in replica leaseholders for 10.7.66.8 (n2), corresponding spikes in "raft log behind" and "under-replicated ranges", and a drop in "ranges".
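
To double-check the "appears on all nodes" part from the debug.zip (a sketch, assuming logs are laid out under debug/nodes/<id>/logs/), a per-file count of the health-alert warnings works:

~ grep -rc "health alerts detected" debug/nodes/*/logs/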

@irfansharif
Contributor

we need to look closer at the nodes that held that range (probably 2, 5, and 7? How were those nodes selected to restart?)

~ find debug | grep "/1.json"    
debug/nodes/7/ranges/1.json
debug/nodes/1/ranges/1.json
debug/nodes/3/ranges/1.json
debug/nodes/2/ranges/1.json
debug/nodes/5/ranges/1.json

~ find debug | grep "/2.json"
debug/nodes/7/ranges/2.json
debug/nodes/4/ranges/2.json
debug/nodes/3/ranges/2.json
debug/nodes/2/ranges/2.json
debug/nodes/5/ranges/2.json

~ find debug | grep "/3.json"
debug/nodes/1/ranges/3.json
debug/nodes/4/ranges/3.json
debug/nodes/3/ranges/3.json
debug/nodes/2/ranges/3.json
debug/nodes/5/ranges/3.json
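
A quick way to see which of these replicas is the Raft leader, reusing the field names from the range excerpts earlier in the thread (a sketch; the exact layout of ranges/<r>.json may differ between versions):

for r in 1 2 3; do
  echo "r$r:"
  for f in debug/nodes/*/ranges/$r.json; do
    printf '  %s ' "$f"
    jq -c '{replica_id: .raft_state.replica_id, lead: .raft_state.lead, state: .raft_state.state}' "$f"
  done
done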

@irfansharif
Contributor

@ricardocrdb: Can we get the graphs for the following set of metrics? Around 14:30:00 - 15:30:00 should suffice.

liveness.heartbeatfailures
liveness.heartbeatsuccesses
liveness.epochincrements
liveness.livenodes
liveness.heartbeatlatency-p99.99

@irfansharif
Contributor

On n2, right before things go belly up, we see this:

W191015 14:50:45.694152 134176681 server/server_engine_health.go:78  [n2] disk stall detected: unable to write to <no-attributes>=/cockroach/cockroach-data within 10s 

The symptoms here seem awfully similar to #32736. Looking at the following:

W191015 14:51:31.311093 134175533 storage/replica_write.go:199  [n2,s2,r2/4:/System/NodeLiveness{-Max}] have been waiting 60.00s for proposing command RequestLease [/System/NodeLiveness,/Min).
This range is likely unavailable.
Please submit this message at

  https://github.com/cockroachdb/cockroach/issues/new/choose

along with

	https://yourhost:8080/#/reports/range/2

and the following Raft status: {"id":"4","term":335,"vote":"4","commit":19609002,"lead":"4","raftState":"StateLeader","applied":19609002,"progress":{"3":{"match":19609002,"next":19609013,"state":"ProgressStateReplicate"},"4":{"match":19609012,"next":19609013,"state":"ProgressStateReplicate"},"5":{"match":19609002,"next":19609013,"state":"ProgressStateReplicate"},"6":{"match":19609002,"next":19609013,"state":"ProgressStateReplicate"},"2":{"match":19609001,"next":19609013,"state":"ProgressStateReplicate"}},"leadtransferee":"0"}

We see n2 is the Raft leader for r2. Looking at debug/nodes/2/ranges/2.json, we see r2 is the node liveness range:

[...]
  "span": {
    "start_key": "/System/NodeLiveness",
    "end_key": "/System/NodeLivenessMax"
  },
[...]

n2 could maintain its own liveness record because the other liveness-range replicas were up and available, so it kept holding all of its epoch-based leases without doing much else.
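
Per the server_engine_health.go warning above, the stall check amounts to attempting a small synced write into the store directory and complaining if it doesn't complete within 10s. A stand-alone approximation that can be run on the affected host (path and threshold taken from that log line; this is not the actual CockroachDB code):

# Probe the store directory; path and 10s threshold come from the warning above.
STORE_DIR=/cockroach/cockroach-data
if ! timeout 10s dd if=/dev/zero of="$STORE_DIR/.diskstall-probe" bs=512 count=1 conv=fsync 2>/dev/null; then
  echo "disk stall suspected: a synced write to $STORE_DIR did not complete within 10s" >&2
fi
# Note: if the disk is fully wedged, the probe itself may hang in the kernel.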

@irfansharif
Contributor

+cc @bdarnell, I see you're looking at #41683.

@irfansharif
Contributor

If you look at problemranges.json, every node save n2 lists out a whole set of ranges under no_lease_range_ids.
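
A rough way to tally that from the report (a sketch: the per-node wrapper key and overall layout of problemranges.json are guesses and may differ; only no_lease_range_ids comes from the file itself):

~ jq '.problems_by_node_id | to_entries | map({node: .key, no_lease: ((.value.no_lease_range_ids // []) | length)})' problemranges.json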

@ricardocrdb ricardocrdb changed the title Nodes became unavailable for a short period of time. Nodes became unavailable for a short period of time until cockroach process was restarted manually Oct 18, 2019
@irfansharif irfansharif changed the title Nodes became unavailable for a short period of time until cockroach process was restarted manually storage: node with stalled disk holds on to leases indefinitely Oct 21, 2019
@irfansharif
Contributor

Closing in favor of #41683.
