storage: allow removing leaseholder in ChangeReplicas #40333

Closed
tbg opened this issue Aug 29, 2019 · 3 comments
Labels
C-bug (code not up to spec/doc; solution expected to change code/behavior), no-issue-activity, T-kv (KV Team), X-stale

Comments

tbg commented Aug 29, 2019

The replicate queue code is littered with decisions to transfer a lease away simply because the leaseholder should be removed. When the replication factor is low or there are constraints on the range it can happen that we can't transfer this lease away until we've added another replica. Adding another replica before removing the leaseholder is precisely what we are trying to avoid in #12768.

TestInitialPartitioning definitely hits this problem (it will fail if rebalancing is always carried out atomically), so it's a good starting point for investigation.

The lease transfer code that the queue uses (centered around allocator.TransferLeaseTarget) is also very hard to reason about, so I'm not even sure whether it sometimes fails to find a target when one exists.
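For concreteness, a minimal Go sketch of the kind of decision involved; the names and signature below are hypothetical placeholders, not allocator.TransferLeaseTarget's actual interface:

```go
package main

import "fmt"

type storeID int

type replica struct {
	store storeID
	dc    string // locality, e.g. "dc1"
}

// transferLeaseTarget is a hypothetical stand-in for the job described above:
// pick a replica other than the current leaseholder that satisfies the range's
// constraints. With a replication factor of one, or with constraints that rule
// everyone else out, there is no target, and today the queue must add a replica
// before it can remove the leaseholder (the add-then-remove pattern #12768 avoids).
func transferLeaseTarget(leaseholder storeID, replicas []replica, requiredDC string) (storeID, bool) {
	for _, r := range replicas {
		if r.store == leaseholder {
			continue
		}
		if requiredDC != "" && r.dc != requiredDC {
			continue
		}
		return r.store, true
	}
	return 0, false
}

func main() {
	rs := []replica{{store: 1, dc: "dc1"}}
	_, ok := transferLeaseTarget(1, rs, "dc1")
	fmt.Println(ok) // false: no target, so the leaseholder cannot simply be removed
}
```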

With all that in mind, it'd be nice if we could just issue any replication change, including one that removes the leaseholder. The reason we don't allow this today is that it would either wedge the range (the lease remains active, but the removed leaseholder no longer serves the range and nobody else can take the lease) or cause potential anomalies (if we allow another node to get the lease without properly invalidating the old one).

The right way to make this work, I think, is to use the fact that the replica change removing the leaseholder is necessarily evaluated and proposed on the leaseholder. If we intercept that request accordingly and make sure that it causes the leaseholder to behave as if a TransferLease had been issued (setting its minProposedTS, etc.), then we could treat a lease whose leaseholder store isn't in the descriptor as invalid (even if the epoch is still active), meaning that any member of the range could obtain the lease immediately after the replica change has gone through.
We'll need to allow leases to be obtained by VOTER_INCOMING replicas, but this is fine.
We could also add a leaseholder hint to the replica change to fix the lease transfer target, which is something the replicate queue (or allocator 2.0) would want to do.
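A minimal sketch of the proposed validity rule, using simplified stand-in types rather than CockroachDB's actual lease and descriptor structures:

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the lease and the range descriptor;
// this is not CockroachDB's real lease or descriptor API.
type storeID int

type lease struct {
	holder      storeID
	epochActive bool // node liveness epoch still active
}

type rangeDescriptor struct {
	stores []storeID // stores with a replica, including VOTER_INCOMING
}

func (d rangeDescriptor) contains(s storeID) bool {
	for _, cur := range d.stores {
		if cur == s {
			return true
		}
	}
	return false
}

// leaseValid sketches the proposed rule: besides the epoch being active, the
// leaseholder's store must still appear in the descriptor. Once the leaseholder
// has removed itself via ChangeReplicas (behaving as if a TransferLease had been
// issued, e.g. bumping minProposedTS), the lease counts as invalid and any
// remaining member, including a VOTER_INCOMING, may acquire a new one
// immediately instead of waiting for the old epoch to expire.
func leaseValid(l lease, desc rangeDescriptor) bool {
	return l.epochActive && desc.contains(l.holder)
}

func main() {
	l := lease{holder: 1, epochActive: true}
	before := rangeDescriptor{stores: []storeID{1, 2, 3}}
	after := rangeDescriptor{stores: []storeID{2, 3}} // store 1 removed itself

	fmt.Println(leaseValid(l, before)) // true
	fmt.Println(leaseValid(l, after))  // false
}
```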

Jira issue: CRDB-5549

tbg added the C-bug label Aug 29, 2019
tbg added a commit to tbg/cockroach that referenced this issue Aug 30, 2019
As of cockroachdb#40284, the replicate queue was issuing swaps (atomic add+remove)
during rebalancing.
TestInitialPartitioning helpfully points out (once you flip atomic
rebalancing on) that when the replication factor is one, there is
no way to perform such an atomic swap because it will necessarily
have to remove the leaseholder.

To work around this restriction (which, by the way, we dislike - see
cockroachdb#40333), fall back to just adding a replica in this case without also
removing one. In the next scanner cycle (which should happen immediately
since we requeue the range) the range will be over-replicated and
hopefully the lease will be transferred over and then the original
leaseholder removed. I'm very doubtful that this all works,
but it is how things worked until cockroachdb#40284, so this PR really just
falls back to the previous behavior in cases where we can't do
better.

Release note: None
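As a hedged illustration of the fallback described above (placeholder names, not the replicate queue's real API), the decision boils down to something like:

```go
package main

import "fmt"

type storeID int

// planRebalance sketches the fallback: when the planned atomic swap would
// remove the current leaseholder (unavoidable at replication factor one),
// drop the removal, only add a replica, and requeue the range so a later
// scanner pass can transfer the lease and remove the old leaseholder.
// All names here are illustrative placeholders.
func planRebalance(leaseholder, removeTarget, addTarget storeID) (adds, removes []storeID, requeue bool) {
	if removeTarget == leaseholder {
		// Pre-cockroachdb#40284 behavior: add only, fix up over-replication later.
		return []storeID{addTarget}, nil, true
	}
	return []storeID{addTarget}, []storeID{removeTarget}, false
}

func main() {
	adds, removes, requeue := planRebalance(1, 1, 2)
	fmt.Println(adds, removes, requeue) // [2] [] true: add now, swap on the next pass
}
```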
tbg added two further commits to tbg/cockroach that referenced this issue Sep 3, 2019 (same commit message as above)
craig bot pushed a commit that referenced this issue Sep 3, 2019
40363: storage: work around can't-swap-leaseholder r=nvanbenschoten a=tbg

(same commit message as above)

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 20, 2019
Prior to this commit, the partitioning tests worked by creating a 3-node cluster
and expressing constraints over the three nodes. They then validated that
the cluster conformed to the constraints by querying data and examining the
trace to determine which node held the data.

This is problematic, for one, because it is susceptible to cockroachdb#40333. In rare
cases we'll down-replicate to the wrong single node (e.g. if the right one
is not live) and we won't ever fix it.

It also doesn't exercise leaseholder preferences.

This PR adds functionality to configure clusters with larger numbers of nodes,
where each expectation in the config can now refer to a leaseholder_preference
rather than a constraint, and we'll allocate the additional nodes to 3
datacenters.

This larger test creates dramatically more data movement and has been useful
when testing cockroachdb#40892.

Release justification: Only touches testing and is useful for testing a
release blocker.

Release note: None
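A rough sketch of what such a per-expectation configuration could look like; the type, field names, and generated statements below are illustrative assumptions, not the actual test harness code:

```go
package main

import "fmt"

// expectation sketches one entry in the extended partitioning test config:
// it can target either a replication constraint or a leaseholder preference,
// with the extra nodes spread across three datacenters.
type expectation struct {
	partition             string
	constraint            string // e.g. "+dc=dc1"
	leaseholderPreference string // e.g. "+dc=dc2"; takes precedence if set
}

// zoneConfigSQL renders an expectation as an (illustrative) zone config
// statement of the kind CockroachDB uses to pin replicas or leaseholders.
func zoneConfigSQL(table string, e expectation) string {
	if e.leaseholderPreference != "" {
		return fmt.Sprintf(
			"ALTER PARTITION %s OF TABLE %s CONFIGURE ZONE USING lease_preferences = '[[%s]]'",
			e.partition, table, e.leaseholderPreference)
	}
	return fmt.Sprintf(
		"ALTER PARTITION %s OF TABLE %s CONFIGURE ZONE USING constraints = '[%s]'",
		e.partition, table, e.constraint)
}

func main() {
	fmt.Println(zoneConfigSQL("t", expectation{partition: "p0", leaseholderPreference: "+dc=dc2"}))
	fmt.Println(zoneConfigSQL("t", expectation{partition: "p1", constraint: "+dc=dc1"}))
}
```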
ajwerner added a commit to ajwerner/cockroach that referenced this issue Oct 1, 2019 (same commit message as above, adding: "The PR also adds a flag to control how many of these subtests to run.")
ajwerner added a commit to ajwerner/cockroach that referenced this issue Nov 25, 2019 (same commit message as above)

github-actions bot commented Jun 4, 2021

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
5 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

github-actions bot commented

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

github-actions bot closed this as not planned on Oct 2, 2023
nvanbenschoten (Member) commented

This was addressed by #74077.
