kvserver: count draining nodes as live when computing quorum #68652
Conversation
Force-pushed from dfa244d to 5665822
Reviewed 5 of 5 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)
This mostly looks good to me, though I'm a little unclear on why we want draining nodes to be considered live for the purposes of computing quorum. Is my understanding correct that once a node has finished draining, it will terminate and become non-live? If so, then why are we finding ourselves in states where we have 3 nodes that are draining and also have a node performing a decommission and we need these draining nodes to contribute to quorum? If the nodes completed draining and terminated, then they certainly wouldn't contribute to quorum, so aren't the expectations here racy? Put differently, if a draining node can at any moment finish draining and terminate, then isn't a draining node as good as dead?
Reviewed 5 of 5 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @aayushshah15)
pkg/kv/kvserver/store_pool.go, line 706 at r1 (raw file):
// No-op.
case storeStatusSuspect, storeStatusDraining:
	if includeSuspectAndDrainingStores {
Did any unit tests pick up this change? If not, are we missing some test coverage?
pkg/kv/kvserver/store_rebalancer.go, line 429 at r2 (raw file):
preferred := sr.rq.allocator.preferredLeaseholders(zone, candidates)
// Filter both the list of preferred stores as well as the list of all
Do we have a way to unit test this change in behavior? Ideally in something targeted like TestChooseLeaseToTransfer.
pkg/kv/kvserver/store_rebalancer.go, line 434 at r2 (raw file):
const includeSuspectAndDrainingStores = false
preferred, _ = sr.rq.allocator.storePool.liveAndDeadReplicas(preferred, includeSuspectAndDrainingStores)
candidates, _ = sr.rq.allocator.storePool.liveAndDeadReplicas(candidates, includeSuspectAndDrainingStores)
Were we previously able to transfer the lease to dead nodes as well? I don't see what was preventing that.
The nodes won't terminate once the draining process completes; they stay in this state of not accepting new leases/replicas or new SQL connections until they're either restarted or terminated explicitly. One of our customers had a workflow where, if they had to remove, say, 10 nodes from the cluster, they would first mark all these 10 nodes as draining and then gradually decommission them (this was to avoid moving replicas to nodes that are soon to be decommissioned anyway). One of the main reasons this decommissioning would get stuck is because the allocator would incorrectly think that a bunch of ranges were unavailable (and thus would not take any action on them). You're right about the missing unit test coverage; I'll update the PR with some.
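To make the intent concrete, here's a minimal sketch of the quorum check this change affects. The names and types below are illustrative placeholders, not the actual allocator code:

```go
// Illustrative sketch only: not the actual allocator code.
package quorumsketch

// storeStatus is a simplified stand-in for the store pool's status enum.
type storeStatus int

const (
	statusDead storeStatus = iota
	statusDraining
	statusAvailable
)

// canAchieveQuorum reports whether a strict majority of a range's replicas
// sit on stores we are willing to treat as live. Counting draining stores as
// live is the behavioral change: without it, a range with a majority of its
// replicas on draining stores looks unavailable to the allocator.
func canAchieveQuorum(replicaStoreStatuses []storeStatus) bool {
	live := 0
	for _, s := range replicaStoreStatuses {
		if s == statusAvailable || s == statusDraining { // draining counts as live
			live++
		}
	}
	return live*2 > len(replicaStoreStatuses)
}
```

The only behavioral change is the draining branch in that condition: without it, a range with a majority of its replicas on draining stores looks unavailable and the allocator refuses to act on it.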
Got it, thanks for the explanation. In that case, I agree with the changes that this PR is making.
Force-pushed from 1de5f62 to 58494cf
I really like this PR a lot. The explanation you provide above is such a good use case, and it deserves to be in the release note so the docs team can pick it up once this change is in.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @aayushshah15, @erikgrinaker, and @nvanbenschoten)
pkg/kv/kvserver/replica_command.go, line 2047 at r4 (raw file):
Otherwise, they are excluded from the returned slices.
Is it true that replicas can be excluded from both results? So, if this flag is not passed, the caller needs to consider three categories of replicas - live, dead, neither, with the latter not being explicitly represented? This seems like a weird contract to me. Would any caller complain if the suspect/draining replicas were included in the deadReplicas slice?
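For concreteness, here is a self-contained sketch of the contract in question. The helper and type names are hypothetical stand-ins for the real liveAndDeadReplicas and its inputs:

```go
// Self-contained illustration only; names are simplified stand-ins for the
// real liveAndDeadReplicas and storeStatus values.
package contractsketch

type storeStatus int

const (
	statusDead storeStatus = iota
	statusSuspect
	statusDraining
	statusAvailable
)

type replicaDesc struct {
	storeID int
	status  storeStatus
}

// liveAndDead mirrors the contract under discussion: suspect/draining
// replicas are returned as live only if the flag is set; otherwise they
// appear in neither slice, forming an implicit third category.
func liveAndDead(repls []replicaDesc, includeSuspectAndDraining bool) (live, dead []replicaDesc) {
	for _, r := range repls {
		switch r.status {
		case statusAvailable:
			live = append(live, r)
		case statusDead:
			dead = append(dead, r)
		case statusSuspect, statusDraining:
			if includeSuspectAndDraining {
				live = append(live, r)
			}
			// Flag unset: the replica is dropped from both slices.
		}
	}
	return live, dead
}
```

With the flag unset, the suspect/draining replicas fall through into neither slice, which is the implicit third category being questioned here.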
pkg/kv/kvserver/store_pool.go, line 239 at r4 (raw file):
storeStatusAvailable
// The store is decommissioning.
storeStatusDecommissioning
maybe take the opportunity to explain the priority between this and dead/suspect/draining.
pkg/kv/kvserver/store_pool.go, line 240 at r4 (raw file):
// The store is decommissioning.
storeStatusDecommissioning
// The store failed its liveness heartbeat recently and is considered
maybe take the opportunity to explain that stores move from dead to suspicious.
pkg/kv/kvserver/store_pool.go, line 833 at r4 (raw file):
// the required attributes and their associated stats. The storeList is filtered
// according to the provided storeFilter. It also returns the total number of
// alive and throttled stores.
The comment on returning the number of throttled stores seems obsolete. Perhaps take the opportunity to explain the last retval; it seems like a weird return value because it doesn't include store ids.
Force-pushed from 58494cf to e97935e
Force-pushed from e97935e to c63875d
The explanation you provide above is such a good use case
Even with this patch, we cannot actually support draining >= RF nodes at the same time where RF is the lowest replication factor of some zone in the cluster. This is because decommissioning any of these draining nodes will require lease transfers away from the decommissioning node (since the leaseholders on the decommissioning node cannot remove themselves). These lease transfers need to succeed, so every range needs to have at least one replica on a non-draining node.
This patch does let the user simultaneously drain RF-1 nodes from their cluster though. I've amended the release note to reflect this.
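As a rough illustration of that constraint (hypothetical names, not code from this PR): decommissioning a node only works if each range whose lease it holds has another, non-draining replica that can take the lease.

```go
// Rough illustration of the constraint; not code from this PR.
package drainsketch

// replicaInfo is a simplified placeholder for a replica descriptor plus the
// draining state of the node it lives on.
type replicaInfo struct {
	nodeID   int
	draining bool
}

// hasLeaseTransferTarget reports whether a range whose lease is on the
// decommissioning node has some other, non-draining replica that could
// receive the lease. If every other replica is on a draining node, the
// decommission stalls on the lease transfer.
func hasLeaseTransferTarget(replicas []replicaInfo, decommissioningNodeID int) bool {
	for _, r := range replicas {
		if r.nodeID != decommissioningNodeID && !r.draining {
			return true
		}
	}
	return false
}
```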
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @andreimatei, @erikgrinaker, and @nvanbenschoten)
pkg/kv/kvserver/replica_command.go, line 2047 at r4 (raw file):
This seems like a weird contract to me. Would any caller complain if the suspect/draining replicas were included in the deadReplicas slice?
The callers use that deadReplicas slice essentially as the list of replicas that need to be replaced -- which is not what we'd like to do for replicas on draining or suspect nodes. So in this case, it feels like we do need the disambiguation.
pkg/kv/kvserver/store_pool.go, line 706 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Did any unit tests pick up this change? If not, are we missing some test coverage?
Added a short new test for it.
pkg/kv/kvserver/store_pool.go, line 239 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
maybe take the opportunity to explain the priority between this and dead/suspect/draining.
Done.
pkg/kv/kvserver/store_pool.go, line 240 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
maybe take the opportunity to explain that stores move from dead to suspicious.
Done.
pkg/kv/kvserver/store_pool.go, line 833 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
The comment on returning the number of throttled stores seems obsolete. Perhaps take the opportunity to explain the last retval; it seems like a weird return value because it doesn't include store ids.
Fixed.
pkg/kv/kvserver/store_rebalancer.go, line 429 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do we have a way to unit test this change in behavior? Ideally in something targeted like TestChooseLeaseToTransfer.
TestChooseLeaseToTransfer is getting overhauled in #65379. Would you mind if I handled it there?
pkg/kv/kvserver/store_rebalancer.go, line 434 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Were we previously able to transfer the lease to dead nodes as well? I don't see what was preventing that.
Yeah, nothing was preventing that.
Force-pushed from c63875d to a7e323f
One of our customers had a workflow where, if they had to remove, say, 10 nodes from the cluster, they would first mark all these 10 nodes as draining and then gradually decommission them (this was to avoid moving replicas to nodes that are soon to be decommissioned anyway). One of the main reasons this decommissioning would get stuck is because the allocator would incorrectly think that a bunch of ranges were unavailable (and thus would not take any action on them).
Please include this paragraph in the commit message somehow.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 2 stale) (waiting on @aayushshah15, @erikgrinaker, and @nvanbenschoten)
pkg/kv/kvserver/replica_command.go, line 2047 at r4 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
This seems like a weird contract to me. Would any caller complain if the suspect/draining replicas were included in the deadReplicas slice?
The callers use that deadReplicas slice essentially as the list of replicas that need to be replaced -- which is not what we'd like to do for replicas on draining or suspect nodes. So in this case, it feels like we do need the disambiguation.
Ack on the distinction between dead and suspect/draining.
I won't make a big deal out of it, but I'd still encourage figuring out a different format for the returned values. Perhaps get rid of includeSuspectStores and return a replicasWithStatus object with filtering methods that let the caller more explicitly choose what they're looking at.
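For example, one possible shape for that API (purely hypothetical; nothing like this exists in the PR) could look like:

```go
// Purely hypothetical API sketch; nothing like this exists in the PR.
package statussketch

type replicaStatus int

const (
	replicaLive replicaStatus = iota
	replicaDead
	replicaSuspect
	replicaDraining
)

type replicaWithStatus struct {
	storeID int
	status  replicaStatus
}

// replicasWithStatus is one possible return type: a single annotated
// collection that callers filter explicitly.
type replicasWithStatus []replicaWithStatus

// filter returns the replicas whose status satisfies the predicate.
func (rs replicasWithStatus) filter(pred func(replicaStatus) bool) replicasWithStatus {
	var out replicasWithStatus
	for _, r := range rs {
		if pred(r.status) {
			out = append(out, r)
		}
	}
	return out
}
```

A caller computing quorum could then filter for everything except replicaDead, while a caller looking for replicas to replace would filter for replicaDead only, without threading a boolean flag through every call site.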
pkg/kv/kvserver/store_pool.go, line 239 at r13 (raw file):
storeStatusAvailable
// The store is decommissioning. If draining or suspect stores are
// decommissioned, this status takes precedence over `storeStatusDraining`
nit: s/decommissioned/decommissioning
how come the 2nd commit doesn't close cockroachlabs/support#1105?
Force-pushed from a7e323f to 812c74c
how come the 2nd commit doesn't close cockroachlabs/support#1105?
I think it's possible we still have an undiagnosed issue with draining nodes that are failing liveness heartbeats. I think #67714 plus this patch may solve it, but I'd like to test it out with that customer before we close that ticket.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 3 stale) (waiting on @andreimatei, @erikgrinaker, and @nvanbenschoten)
pkg/kv/kvserver/replica_command.go, line 2047 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Ack on the distinction between dead and suspect/draining.
I won't make a big deal out of it, but I'd still encourage figuring out a different format for the returned values. Perhaps get rid of includeSuspectStores and return a replicasWithStatus object with filtering methods that let the caller more explicitly choose what they're looking at.
👍 Agreed. Added a TODO for now since there's a ton of callers for this function and I'd like to keep the patch small.
pkg/kv/kvserver/store_pool.go, line 239 at r13 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
nit: s/decommissioned/decommissioning
done.
Similar to cockroachdb#67714

Draining nodes were considered non-live by the allocator when it made the determination of whether a range could achieve quorum. This meant that, for instance, on a cluster with a replication factor of 5, if we had 3 or more nodes marked draining, we (with a high likelihood) wouldn't be able to decommission nodes from the cluster.

Furthermore, due to the same reason as above, the system also would incorrectly decide to not rebalance ranges that had more than a quorum of replicas on draining nodes.

This patch fixes this problem by considering replicas on draining nodes as live for the purposes of determining whether a range has quorum. This likely fixes a considerable subset of "stuck decommissioning" issues we've seen in the wild.

Follows from https://github.com/cockroachlabs/support/issues/1105

Release note: None
…toreRebalancer

This commit prevents the StoreRebalancer from transferring leases to replicas on draining or suspect nodes. In some cases, we've seen this cause new leases to be pushed to nodes that take too long to drain or that are stuck while draining due to other bugs.

Informs https://github.com/cockroachlabs/support/issues/1105

Release note: None
This commit adds a roachtest that's meant to be a regression test against the hazard addressed by the first commit in this PR. This roachtest is meant to ensure that nodes that are marked "draining" are considered "live" by the allocator when it makes the determination of whether a range can achieve quorum.

Release note: None
Force-pushed from 812c74c to ff2ae38
CI was failing due to a missing release justification. TFTRs! bors r+
Build succeeded.
Similar to #67714
Draining nodes were considered non-live by the allocator when it made the
determination of whether a range could achieve quorum. This meant that, for
instance, on a cluster with a replication factor of 5, if we had 3 or more
nodes marked draining, we (with a high likelihood) wouldn't be able to
decommission any other nodes from the cluster.
Furthermore, due to the same reason as above, the system would also incorrectly
decide to not rebalance ranges that had more than a quorum of replicas on
draining nodes.
This patch fixes this problem by considering replicas on draining nodes as live
for the purposes of determining whether a range has quorum. This likely fixes a
subset of "stuck decommissioning" issues we've seen in the wild.
Follows from cockroachlabs/support#1105
Release justification: bug fix
Release note (bug fix): Previously, draining a quorum of nodes (i.e. >=2 if
replication factor is 3, >=3 if replication factor is 5, etc.) would block the
subsequent decommissioning of any other nodes in the cluster. This patch fixes
this bug. Now, if the lowest replication factor of some zone in the cluster is
RF, operators should be able to safely drain up to RF-1 nodes simultaneously.