kvserver: count draining nodes as live when computing quorum #68652
Conversation
Force-pushed from dfa244d to 5665822
Reviewed 5 of 5 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)
This mostly looks good to me, though I'm a little unclear on why we want draining nodes to be considered live for the purposes of computing quorum. Is my understanding correct that once a node has finished draining, it will terminate and become non-live? If so, then why are we finding ourselves in states where we have 3 nodes that are draining and also have a node performing a decommission and we need these draining nodes to contribute to quorum? If the nodes completed draining and terminated, then they certainly wouldn't contribute to quorum, so aren't the expectations here racy? Put differently, if a draining node can at any moment finish draining and terminate, then isn't a draining node as good as dead?
Reviewed 5 of 5 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @aayushshah15)
pkg/kv/kvserver/store_pool.go, line 706 at r1 (raw file):
// No-op.
case storeStatusSuspect, storeStatusDraining:
	if includeSuspectAndDrainingStores {
Did any unit tests pick up this change? If not, are we missing some test coverage?
pkg/kv/kvserver/store_rebalancer.go, line 429 at r2 (raw file):
preferred := sr.rq.allocator.preferredLeaseholders(zone, candidates)
// Filter both the list of preferred stores as well as the list of all
Do we have a way to unit test this change in behavior? Ideally in something targeted like TestChooseLeaseToTransfer.
pkg/kv/kvserver/store_rebalancer.go, line 434 at r2 (raw file):
const includeSuspectAndDrainingStores = false
preferred, _ = sr.rq.allocator.storePool.liveAndDeadReplicas(preferred, includeSuspectAndDrainingStores)
candidates, _ = sr.rq.allocator.storePool.liveAndDeadReplicas(candidates, includeSuspectAndDrainingStores)
Were we previously able to transfer the lease to dead nodes as well? I don't see what was preventing that.
The nodes won't terminate once the draining process completes; they stay in this state of not accepting new leases/replicas or new SQL connections until they're either restarted or terminated explicitly. One of our customers had a workflow where, if they had to remove, say, 10 nodes from the cluster, they would first mark all these 10 nodes as draining and then gradually decommission them (this was to avoid moving replicas to nodes that are soon to be decommissioned anyway). One of the main reasons this decommissioning would get stuck is because the allocator would incorrectly think that a bunch of ranges were unavailable (and thus would not take any action on them). You're right about the missing unit test coverage; I'll update the PR with some.
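To make the intent concrete, here's a minimal sketch of the quorum check this change affects. The names and types below are illustrative placeholders, not the actual allocator code:

```go
// Illustrative sketch only: not the actual allocator code.
package quorumsketch

// storeStatus is a simplified stand-in for the store pool's status enum.
type storeStatus int

const (
	statusDead storeStatus = iota
	statusDraining
	statusAvailable
)

// canAchieveQuorum reports whether a strict majority of a range's replicas
// sit on stores we are willing to treat as live. Counting draining stores as
// live is the behavioral change: without it, a range with a majority of its
// replicas on draining stores looks unavailable to the allocator.
func canAchieveQuorum(replicaStoreStatuses []storeStatus) bool {
	live := 0
	for _, s := range replicaStoreStatuses {
		if s == statusAvailable || s == statusDraining { // draining counts as live
			live++
		}
	}
	return live*2 > len(replicaStoreStatuses)
}
```

The only behavioral change is the draining branch in that condition: without it, a range with a majority of its replicas on draining stores looks unavailable and the allocator refuses to act on it.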
Got it, thanks for the explanation. In that case, I agree with the changes that this PR is making.
Force-pushed from 1de5f62 to 58494cf
I really like this PR a lot. The explanation you provide above is such a good use case, and it deserves to be in the release note so the docs team can pick it up once this change is in.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @aayushshah15, @erikgrinaker, and @nvanbenschoten)
pkg/kv/kvserver/replica_command.go, line 2047 at r4 (raw file):
Otherwise, they are excluded from the returned slices.
Is it true that replicas can be excluded from both results? So, if this flag is not passed, the caller needs to consider three categories of replicas - live, dead, neither, with the latter not being explicitly represented? This seems like a weird contract to me. Would any caller complain if the suspect/draining replicas were included in the deadReplicas slice?
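For concreteness, here is a self-contained sketch of the contract in question. The helper and type names are hypothetical stand-ins for the real liveAndDeadReplicas and its inputs:

```go
// Self-contained illustration only; names are simplified stand-ins for the
// real liveAndDeadReplicas and storeStatus values.
package contractsketch

type storeStatus int

const (
	statusDead storeStatus = iota
	statusSuspect
	statusDraining
	statusAvailable
)

type replicaDesc struct {
	storeID int
	status  storeStatus
}

// liveAndDead mirrors the contract under discussion: suspect/draining
// replicas are returned as live only if the flag is set; otherwise they
// appear in neither slice, forming an implicit third category.
func liveAndDead(repls []replicaDesc, includeSuspectAndDraining bool) (live, dead []replicaDesc) {
	for _, r := range repls {
		switch r.status {
		case statusAvailable:
			live = append(live, r)
		case statusDead:
			dead = append(dead, r)
		case statusSuspect, statusDraining:
			if includeSuspectAndDraining {
				live = append(live, r)
			}
			// Flag unset: the replica is dropped from both slices.
		}
	}
	return live, dead
}
```

With the flag unset, the suspect/draining replicas fall through into neither slice, which is the implicit third category being questioned here.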
pkg/kv/kvserver/store_pool.go, line 239 at r4 (raw file):
storeStatusAvailable
// The store is decommissioning.
storeStatusDecommissioning
maybe take the opportunity to explain the priority between this and dead/suspect/draining.
pkg/kv/kvserver/store_pool.go, line 240 at r4 (raw file):
// The store is decommissioning.
storeStatusDecommissioning
// The store failed its liveness heartbeat recently and is considered
maybe take the opportunity to explain that stores move from dead to suspicious.
pkg/kv/kvserver/store_pool.go, line 833 at r4 (raw file):
// the required attributes and their associated stats. The storeList is filtered
// according to the provided storeFilter. It also returns the total number of
// alive and throttled stores.
The comment on returning the number of throttled stores seems obsolete. Perhaps take the opportunity to explain the last retval; it seems like a weird return value because it doesn't include store ids.
Force-pushed from 58494cf to e97935e
Force-pushed from e97935e to c63875d
The explanation you provide above is such a good use case
Even with this patch, we cannot actually support draining >= RF nodes at the same time where RF is the lowest replication factor of some zone in the cluster. This is because decommissioning any of these draining nodes will require lease transfers away from the decommissioning node (since the leaseholders on the decommissioning node cannot remove themselves). These lease transfers need to succeed, so every range needs to have at least one replica on a non-draining node.
This patch does let the user simultaneously drain RF-1 nodes from their cluster though. I've amended the release note to reflect this.
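As a rough illustration of that constraint (hypothetical names, not code from this PR): decommissioning a node only works if each range whose lease it holds has another, non-draining replica that can take the lease.

```go
// Rough illustration of the constraint; not code from this PR.
package drainsketch

// replicaInfo is a simplified placeholder for a replica descriptor plus the
// draining state of the node it lives on.
type replicaInfo struct {
	nodeID   int
	draining bool
}

// hasLeaseTransferTarget reports whether a range whose lease is on the
// decommissioning node has some other, non-draining replica that could
// receive the lease. If every other replica is on a draining node, the
// decommission stalls on the lease transfer.
func hasLeaseTransferTarget(replicas []replicaInfo, decommissioningNodeID int) bool {
	for _, r := range replicas {
		if r.nodeID != decommissioningNodeID && !r.draining {
			return true
		}
	}
	return false
}
```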
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @andreimatei, @erikgrinaker, and @nvanbenschoten)
pkg/kv/kvserver/replica_command.go, line 2047 at r4 (raw file):
This seems like a weird contract to me. Would any caller complain if the suspect/draining replicas were included in the deadReplicas slice?
The callers use that deadReplicas slice essentially as the list of replicas that need to be replaced -- which is not what we'd like to do for replicas on draining or suspect nodes. So in this case, it feels like we do need the disambiguation.
pkg/kv/kvserver/store_pool.go, line 706 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Did any unit tests pick up this change? If not, are we missing some test coverage?
Added a short new test for it.
pkg/kv/kvserver/store_pool.go, line 239 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
maybe take the opportunity to explain the priority between this and dead/suspect/draining.
Done.
pkg/kv/kvserver/store_pool.go, line 240 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
maybe take the opportunity to explain that stores move from dead to suspicious.
Done.
pkg/kv/kvserver/store_pool.go, line 833 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
The comment on returning the number of throttled stores seems obsolete. Perhaps take the opportunity to explain the last retval; it seems like a weird return value because it doesn't include store ids.
Fixed.
pkg/kv/kvserver/store_rebalancer.go, line 429 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do we have a way to unit test this change in behavior? Ideally in something targeted like TestChooseLeaseToTransfer.
TestChooseLeaseToTransfer is getting overhauled in #65379. Would you mind if I handled it there?
pkg/kv/kvserver/store_rebalancer.go, line 434 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Were we previously able to transfer the lease to dead nodes as well? I don't see what was preventing that.
Yeah, nothing was preventing that.
Force-pushed from c63875d to a7e323f
One of our customers had a workflow where, if they had to remove, say, 10 nodes from the cluster, they would first mark all these 10 nodes as draining and then gradually decommission them (this was to avoid moving replicas to nodes that are soon to be decommissioned anyway). One of the main reasons this decommissioning would get stuck is because the allocator would incorrectly think that a bunch of ranges were unavailable (and thus would not take any action on them).
Please include this paragraph in the commit message somehow.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 2 stale) (waiting on @aayushshah15, @erikgrinaker, and @nvanbenschoten)
pkg/kv/kvserver/replica_command.go, line 2047 at r4 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
This seems like a weird contract to me. Would any caller complain if the suspect/draining replicas were included in the deadReplicas slice?
The callers use that deadReplicas slice essentially as the list of replicas that need to be replaced -- which is not what we'd like to do for replicas on draining or suspect nodes. So in this case, it feels like we do need the disambiguation.
Ack on the distinction between dead and suspect/draining.
I won't make a big deal out of it, but I'd still encourage figuring out a different format for the returned values. Perhaps get rid of includeSuspectStores and return a replicasWithStatus object with filtering methods that let the caller more explicitly choose what they're looking at.
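For example, one possible shape for that API (purely hypothetical; nothing like this exists in the PR) could look like:

```go
// Purely hypothetical API sketch; nothing like this exists in the PR.
package statussketch

type replicaStatus int

const (
	replicaLive replicaStatus = iota
	replicaDead
	replicaSuspect
	replicaDraining
)

type replicaWithStatus struct {
	storeID int
	status  replicaStatus
}

// replicasWithStatus is one possible return type: a single annotated
// collection that callers filter explicitly.
type replicasWithStatus []replicaWithStatus

// filter returns the replicas whose status satisfies the predicate.
func (rs replicasWithStatus) filter(pred func(replicaStatus) bool) replicasWithStatus {
	var out replicasWithStatus
	for _, r := range rs {
		if pred(r.status) {
			out = append(out, r)
		}
	}
	return out
}
```

A caller computing quorum could then filter for everything except replicaDead, while a caller looking for replicas to replace would filter for replicaDead only, without threading a boolean flag through every call site.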
pkg/kv/kvserver/store_pool.go, line 239 at r13 (raw file):
storeStatusAvailable
// The store is decommissioning. If draining or suspect stores are
// decommissioned, this status takes precedence over `storeStatusDraining`
nit: s/decommissioned/decommissioning
how come the 2nd commit doesn't close cockroachlabs/support#1105?
Force-pushed from a7e323f to 812c74c
how come the 2nd commit doesn't close cockroachlabs/support#1105?
I think it's possible we still have an undiagnosed issue with draining nodes that are failing liveness heartbeats. I think #67714 plus this patch may solve it, but I'd like to test it out with that customer before we close that ticket.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 3 stale) (waiting on @andreimatei, @erikgrinaker, and @nvanbenschoten)
pkg/kv/kvserver/replica_command.go, line 2047 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Ack on the distinction between dead and suspect/draining.
I won't make a big deal out of it, but I'd still encourage figuring out a different format for the returned values. Perhaps get rid of includeSuspectStores and return a replicasWithStatus object with filtering methods that let the caller more explicitly choose what they're looking at.
👍 Agreed. Added a TODO for now since there's a ton of callers for this function and I'd like to keep the patch small.
pkg/kv/kvserver/store_pool.go, line 239 at r13 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
nit: s/decommissioned/decommissioning
done.
Similar to cockroachdb#67714

Draining nodes were considered non-live by the allocator when it made the determination of whether a range could achieve quorum. This meant that, for instance, on a cluster with a replication factor of 5, if we had 3 or more nodes marked draining, we (with a high likelihood) wouldn't be able to decommission nodes from the cluster.

Furthermore, due to the same reason as above, the system also would incorrectly decide to not rebalance ranges that had more than a quorum of replicas on draining nodes.

This patch fixes this problem by considering replicas on draining nodes as live for the purposes of determining whether a range has quorum. This likely fixes a considerable subset of "stuck decommissioning" issues we've seen in the wild.

Follows from https://github.com/cockroachlabs/support/issues/1105

Release note: None
…toreRebalancer

This commit prevents the StoreRebalancer from transferring leases to replicas on draining or suspect nodes. In some cases, we've seen this cause new leases to be pushed to nodes that take too long to drain or that are stuck while draining due to other bugs.

Informs https://github.com/cockroachlabs/support/issues/1105

Release note: None
This commit adds a roachtest that's meant to be a regression test against the hazard addressed by the first commit in this PR. This roachtest is meant to ensure that nodes that are marked "draining" are considered "live" by the allocator when it makes the determination of whether a range can achieve quorum.

Release note: None
Force-pushed from 812c74c to ff2ae38
CI was failing due to a missing release justification. TFTRs! bors r+
Build succeeded.
Similar to #67714
Draining nodes were considered non-live by the allocator when it made the
determination of whether a range could achieve quorum. This meant that, for
instance, on a cluster with a replication factor of 5, if we had 3 or more
nodes marked draining, we (with a high likelihood) wouldn't be able to
decommission any other nodes from the cluster.
Furthermore, due to the same reason as above, the system would also incorrectly
decide to not rebalance ranges that had more than a quorum of replicas on
draining nodes.
This patch fixes this problem by considering replicas on draining nodes as live
for the purposes of determining whether a range has quorum. This likely fixes a
subset of "stuck decommissioning" issues we've seen in the wild.
Follows from cockroachlabs/support#1105
Release justification: bug fix
Release note (bug fix): Previously, draining a quorum of nodes (i.e. >=2 if
replication factor is 3, >=3 if replication factor is 5, etc.) would block the
subsequent decommissioning of any other nodes in the cluster. This patch fixes
this bug. Now, if the lowest replication factor of some zone in the cluster is
RF, operators should be able to safely drain up to RF-1 nodes simultaneously.