kv: make MVCC garbage collection latch-less #55293

Closed
nvanbenschoten opened this issue Oct 7, 2020 · 3 comments · Fixed by #83213
Labels
A-kv-transactions: Relating to MVCC and the transactional model.
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
T-kv: KV Team

Comments

nvanbenschoten (Member) commented Oct 7, 2020

Relates to #50194.

Garbage collection of MVCC versions is currently excessively disruptive to foreground traffic on the keys being garbage collected. This is because GCRequests acquire write latches over each of the keys being GCed. The intent is for these latches to protect GC requests from interacting poorly with reads near the GC threshold and from interacting poorly with other GC requests on the same key (e.g. to avoid duplicate stats changes).

The problem is that both of these uses of latches are broken. The first use is broken because we acquire write latches at the batch header's timestamp, which is set to time.Now(), so we only serialize with reads in the future and with all other writes [1]. In other words, we're disruptive to everyone except the requests we actually want to serialize with – reads in the past! The second use is redundant because we can already serialize with other GCRequests on a separate key, such as the RangeLastGCKey. So we end up going out of our way to make GC requests disrupt foreground traffic and get nothing in return.
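To make the timestamp interaction concrete, here is a small self-contained sketch of timestamp-aware latch conflicts; the `latch` type and conflict rule below are invented for illustration and are not the real spanlatch package:

```go
package main

import "fmt"

// Simplified model of timestamp-aware latch conflicts: a write latch
// at timestamp T conflicts with all other writes, but only with reads
// at or above T. A GC request latching at time.Now() therefore
// serializes with future reads and every write, yet not with the
// historical reads it actually needs to order itself against.
type latch struct {
	write bool
	ts    int64 // simplified timestamp
}

func conflicts(held, incoming latch) bool {
	switch {
	case !held.write && !incoming.write:
		return false // read-read never conflicts
	case held.write && incoming.write:
		return true // write-write always conflicts
	case held.write: // held write vs incoming read
		return incoming.ts >= held.ts
	default: // held read vs incoming write
		return held.ts >= incoming.ts
	}
}

func main() {
	gc := latch{write: true, ts: 100} // GC latched at "now"
	fmt.Println(conflicts(gc, latch{write: false, ts: 50}))  // false: past read sails through
	fmt.Println(conflicts(gc, latch{write: false, ts: 150})) // true: future read blocks
	fmt.Println(conflicts(gc, latch{write: true, ts: 50}))   // true: every write blocks
}
```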

To address these issues, I propose that we make garbage collection latch-less. GC requests should no longer acquire latches on the keys that they intend to GC. Instead, they should only grab a latch on the RangeLastGCKey. To ensure that other requests properly synchronize with GC requests, they will now need to check the GC threshold after evaluation as well [2]. I've mocked this out in this branch: nvanbenschoten@8486020.
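As a concrete illustration of the proposal, here is a minimal sketch in Go with invented helper types (not CockroachDB's actual declareKeysGC/spanset APIs) of a GC request that declares only a single latch over the range-local key the issue calls the RangeLastGCKey, no matter how many user keys it will collect:

```go
package main

import "fmt"

type span struct{ key string }

type spanSet struct{ writes []span }

func (s *spanSet) addWrite(sp span) { s.writes = append(s.writes, sp) }

// declareKeysForGC declares only the range-local GC threshold key
// (the key string here is invented), regardless of how many user
// keys the request will actually GC. Readers are then expected to
// re-check the GC threshold after evaluation.
func declareKeysForGC(rangeID int, _ []span, latches *spanSet) {
	latches.addWrite(span{key: fmt.Sprintf("/Local/Range/%d/GCThreshold", rangeID)})
}

func main() {
	var latches spanSet
	keysToGC := []span{{"/Table/50/1/a"}, {"/Table/50/1/b"}}
	declareKeysForGC(1, keysToGC, &latches)
	fmt.Println(latches.writes) // only the threshold key, not the user keys
}
```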

[1] I think it's fine for GC requests on old versions of a key to commute with writes to new versions of a key in terms of their stats updates, but we'll need to confirm that.

[2] This could be simplified if we made Engine.NewReadOnly grab an engine snapshot so that evaluation would always work on a stable view of the MVCC keyspace. In that world, we could simply check the GC threshold after calling Engine.NewReadOnly on the read request path.
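A minimal sketch of the read-path ordering footnote [2] has in mind, with invented types: take a stable read-only view of the engine first, then check the GC threshold. If the threshold is still at or below the read timestamp, no later GC run can be visible through the already-open view.

```go
package main

import (
	"errors"
	"fmt"
)

type replica struct {
	gcThreshold int64 // in-memory GC threshold, simplified to an integer timestamp
	data        map[string]string
}

// newReadOnly stands in for an Engine.NewReadOnly that pins a stable
// engine snapshot, as footnote [2] proposes; here it is just a copy.
func (r *replica) newReadOnly() map[string]string {
	snap := make(map[string]string, len(r.data))
	for k, v := range r.data {
		snap[k] = v
	}
	return snap
}

func (r *replica) read(key string, readTS int64) (string, error) {
	snap := r.newReadOnly()     // 1. stable view first
	if readTS < r.gcThreshold { // 2. threshold check second
		return "", errors.New("batch timestamp below replica GC threshold")
	}
	return snap[key], nil // safe: the view predates any GC that could move the threshold
}

func main() {
	r := &replica{gcThreshold: 90, data: map[string]string{"a": "v1"}}
	fmt.Println(r.read("a", 100)) // ok: read above threshold
	fmt.Println(r.read("a", 50))  // error: read below threshold
}
```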

Jira issue: CRDB-3679

@nvanbenschoten nvanbenschoten added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-transactions Relating to MVCC and the transactional model. labels Oct 7, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Oct 13, 2020
This PR cleans up some handling around replica destruction that scared me
when working on cockroachdb#46329 and cockroachdb#55293. Specifically, there were cases during
merges where the destroy status on a replica would not be set until after
that replica's data was destroyed. This was true of merges applied through
entry application, although in those cases, no user data was removed, so it
seems mostly harmless. More concerning is that it was true of merges applied
through snapshot application. This was extra concerning because snapshot
application can delete user data if a subsumed range is only partially
covered (see `clearSubsumedReplicaDiskData`). So we were hypothetically
risking user-visible correctness problems, especially now that we allow
follower reads through on subsumed ranges as of cockroachdb#51594.

This PR patches up these issues and then adds a few extra assertions that
enforce stricter preconditions for the functions at play during replica
destruction. Specifically, we now assert that the destroy status is set to
`destroyReasonRemoved` _before_ calling `preDestroyRaftMuLocked`. We also
assert that if `RemoveOptions.DestroyData` is false when passed to
`removeInitializedReplicaRaftMuLocked`, then the destroy status is also set.

This unification allows us to remove the `ignoreDestroyStatus` flag on
`RemoveOptions`, whose meaning is now exactly the inverse of `DestroyData`.
In hindsight, pushing on why this extra flag was needed and what new
information it was conveying to the function could have potentially caught
these issues a little earlier.

I think we'll want to backport this to v20.2, though probably in the first
patch release because there is some risk here (especially without sufficient
time to bake on master) and we aren't aware of seeing any correctness issues
from the combination of the bug fixed here and cockroachdb#51594.

Release note (bug fix): A hypothesized bug that could allow a follower
read to miss data on a range in the middle of being merged away into its
left-hand neighbor was fixed. The bug seems exceedingly unlikely to have
materialized in practice.
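A hedged sketch, with invented stand-in types (not kvserver's Replica or RemoveOptions), of the precondition the commit above asserts: the destroy status must be set to removed before a replica's data is destroyed, and callers that skip data destruction must have set it already.

```go
package main

import (
	"errors"
	"fmt"
)

type destroyReason int

const (
	destroyReasonAlive destroyReason = iota
	destroyReasonRemoved
)

type replica struct {
	destroyStatus destroyReason
	data          []string
}

type removeOptions struct {
	DestroyData bool
}

func (r *replica) removeInitializedReplica(opts removeOptions) error {
	if !opts.DestroyData && r.destroyStatus != destroyReasonRemoved {
		// Mirrors the new assertion: callers that skip data destruction
		// must have already marked the replica as destroyed.
		return errors.New("caller must set destroy status before skipping data destruction")
	}
	if opts.DestroyData {
		r.destroyStatus = destroyReasonRemoved // set the status first ...
		r.data = nil                           // ... then destroy the data
	}
	return nil
}

func main() {
	r := &replica{data: []string{"k1"}}
	fmt.Println(r.removeInitializedReplica(removeOptions{DestroyData: false})) // error: status not set
	fmt.Println(r.removeInitializedReplica(removeOptions{DestroyData: true}))  // <nil>
}
```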
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Oct 14, 2020
craig bot pushed a commit that referenced this issue Oct 15, 2020
55477: kv: set destroy status before destroying data on subsumed replicas r=nvanbenschoten a=nvanbenschoten

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Nov 12, 2020
andreimatei (Contributor) commented:

> [2] This could be simplified if we made Engine.NewReadOnly grab an engine snapshot so that evaluation would always work on a stable view of the MVCC keyspace. In that world, we could simply check the GC threshold after calling Engine.NewReadOnly on the read request path.

I think this has happened in #58515

andreimatei added a commit to andreimatei/cockroach that referenced this issue Apr 2, 2021
The ReplicatedEvalResult.Timestamp field used to carry the read timestamp of
the request that proposed the respective command. This patch renames it
to WriteTimestamp and sets it accordingly.

The field is used for two purposes: for bumping the clock on followers
such that they don't have values above their clock, and for checking
that we're not writing below the GC threshold. Both of these use cases
actually want the write timestamp, so they were broken. The second use
seems dubious to me (I don't think it's actually needed) but this patch
does not take a position on it beyond adding a TODO.

Beyond fixing the existing uses of this field, putting the write
timestamp in every Raft command has the added benefit that we can use it
to assert below Raft that nobody tries to write below the closed
timestamp. Before this patch, only the leaseholder was able to do this
assertion (because it was the only replica that had access to the write
timestamp) and so the leaseholder had to do it as a non-deterministic
error (and non-deterministic errors are nasty, even when they're
assertion failures). The assertion is changed in the next commit.

In addition, by clarifying the meaning of this field, this patch opens
the door for cockroachdb#62569 to muddy the field back and give it a special
meaning for lease transfers - the lease start time.

Also mentioning cockroachdb#55293 because this patch lightly touches on the
interaction between requests and the GC threshold.

Release note: None
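For illustration, a hedged sketch of the kind of below-Raft assertion this change enables: every replica applying a command can compare the command's write timestamp against the range's closed timestamp deterministically. The types and names below are invented, not the actual apply-side structs.

```go
package main

import "fmt"

type replicatedEvalResult struct {
	WriteTimestamp int64 // carried in every Raft command after the rename
}

type applyState struct {
	closedTimestamp int64
}

// assertAboveClosedTimestamp returns an error instead of panicking so the
// example stays runnable; a real assertion would be fatal. Writes at or
// below the closed timestamp are disallowed (simplified).
func assertAboveClosedTimestamp(cmd replicatedEvalResult, s applyState) error {
	if cmd.WriteTimestamp != 0 && cmd.WriteTimestamp <= s.closedTimestamp {
		return fmt.Errorf("command writing at %d below closed timestamp %d",
			cmd.WriteTimestamp, s.closedTimestamp)
	}
	return nil
}

func main() {
	s := applyState{closedTimestamp: 100}
	fmt.Println(assertAboveClosedTimestamp(replicatedEvalResult{WriteTimestamp: 150}, s)) // <nil>
	fmt.Println(assertAboveClosedTimestamp(replicatedEvalResult{WriteTimestamp: 90}, s))  // error
}
```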
andreimatei added a commit to andreimatei/cockroach that referenced this issue Apr 12, 2021
andreimatei added a commit to andreimatei/cockroach that referenced this issue Apr 16, 2021
andreimatei added a commit to andreimatei/cockroach that referenced this issue Apr 19, 2021
andreimatei added a commit to andreimatei/cockroach that referenced this issue Apr 20, 2021
andreimatei added a commit to andreimatei/cockroach that referenced this issue Apr 21, 2021
andreimatei added a commit to andreimatei/cockroach that referenced this issue Apr 21, 2021
andreimatei added a commit to andreimatei/cockroach that referenced this issue Apr 22, 2021
andreimatei added a commit to andreimatei/cockroach that referenced this issue Apr 22, 2021
@jlinder jlinder added the T-kv KV Team label Jun 16, 2021
craig bot pushed a commit that referenced this issue Jul 1, 2022
83197: kvserver: bump the in-memory GC threshold as a pre-apply side effect r=aayushshah15 a=aayushshah15

This commit changes where the in-memory GC threshold on a Replica is
bumped during the application of a GC request.

Previously, the in-memory GC threshold was bumped as a post-apply side
effect. Additionally, GC requests do not acquire latches at the
timestamp that they are garbage collecting, and thus, readers need to
take additional care to ensure that results aren't served off of a
partial state.

Readers today rely on the invariant that the in-memory GC threshold is
bumped before the actual garbage collection. Today this holds because
the mvccGCQueue issues GC requests in two phases: the first simply
bumps the in-memory GC threshold, and the second performs the actual
garbage collection. If the in-memory GC threshold were bumped in the
pre-apply phase of command application, this quirk wouldn't need to
exist. That's what this commit does.

Relates to #55293

Release note: None
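A minimal sketch, with invented types, of the ordering #83197 describes: the in-memory GC threshold is bumped as a pre-apply side effect, before the garbage versions are actually removed, so readers that check the threshold can never observe a partially GC'ed state through an iterator opened earlier.

```go
package main

import "fmt"

type replicaState struct {
	gcThreshold int64
}

type gcCommand struct {
	newThreshold int64
	keysToClear  []string
}

func applyGC(state *replicaState, cmd gcCommand, engine map[string]string) {
	// Pre-apply side effect: readers that check the threshold from
	// this point on will refuse reads below it, so they cannot observe
	// a partially GC'ed state through a later iterator.
	if cmd.newThreshold > state.gcThreshold {
		state.gcThreshold = cmd.newThreshold
	}

	// Actual application: remove the garbage versions.
	for _, k := range cmd.keysToClear {
		delete(engine, k)
	}
}

func main() {
	state := &replicaState{gcThreshold: 10}
	engine := map[string]string{"a@5": "v1", "a@20": "v2"}
	applyGC(state, gcCommand{newThreshold: 15, keysToClear: []string{"a@5"}}, engine)
	fmt.Println(state.gcThreshold, engine) // 15 map[a@20:v2]
}
```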



83659: kvstreamer: optimize the results buffers a bit r=yuzefovich a=yuzefovich

**kvstreamer: make InOrder-mode structs implement heap logic on their own**

This commit refactors the `inOrderRequestsProvider` as well as the
`inOrderResultsBuffer` to make them maintain the min heap over the
requests and results, respectively, on their own. This allows us to
avoid allocations for `interface{}` objects that occur when using
`heap.Interface`. The code was copied, with minor adjustments, from the
standard library.
```
name                                                  old time/op    new time/op    delta
LookupJoinEqColsAreKeyOrdering/Cockroach-24             8.83ms ± 3%    8.58ms ± 3%   -2.82%  (p=0.002 n=10+10)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24    10.3ms ± 1%    10.1ms ± 2%   -1.86%  (p=0.009 n=10+10)
LookupJoinOrdering/Cockroach-24                         10.7ms ± 5%    10.5ms ± 3%     ~     (p=0.123 n=10+10)
LookupJoinOrdering/MultinodeCockroach-24                14.0ms ± 5%    13.6ms ± 5%   -3.16%  (p=0.043 n=10+10)

name                                                  old alloc/op   new alloc/op   delta
LookupJoinEqColsAreKeyOrdering/Cockroach-24             1.77MB ± 2%    1.59MB ± 1%   -9.94%  (p=0.000 n=10+10)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24    2.57MB ± 3%    2.38MB ± 1%   -7.57%  (p=0.000 n=10+10)
LookupJoinOrdering/Cockroach-24                         1.67MB ± 0%    1.50MB ± 1%  -10.08%  (p=0.000 n=9+9)
LookupJoinOrdering/MultinodeCockroach-24                2.53MB ± 2%    2.34MB ± 2%   -7.78%  (p=0.000 n=10+9)

name                                                  old allocs/op  new allocs/op  delta
LookupJoinEqColsAreKeyOrdering/Cockroach-24              12.5k ± 1%     10.4k ± 1%  -16.54%  (p=0.000 n=9+10)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24     17.6k ± 0%     15.5k ± 0%  -11.64%  (p=0.000 n=8+10)
LookupJoinOrdering/Cockroach-24                          14.4k ± 1%     12.4k ± 1%  -14.17%  (p=0.000 n=9+9)
LookupJoinOrdering/MultinodeCockroach-24                 20.6k ± 1%     18.5k ± 1%  -10.19%  (p=0.000 n=10+9)
```

Addresses: #82159.

Release note: None
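For illustration, a small self-contained sketch of the typed-heap approach described in the first commit above: the heap logic lives on a concrete type rather than going through container/heap's heap.Interface, which boxes every element into an interface{} value. Names are invented; this is not the kvstreamer code.

```go
package main

import "fmt"

type resultHeap struct {
	priorities []int // e.g. position in the original request order
}

func (h *resultHeap) push(p int) {
	h.priorities = append(h.priorities, p)
	// Sift the new element up toward the root.
	for i := len(h.priorities) - 1; i > 0; {
		parent := (i - 1) / 2
		if h.priorities[parent] <= h.priorities[i] {
			break
		}
		h.priorities[parent], h.priorities[i] = h.priorities[i], h.priorities[parent]
		i = parent
	}
}

func (h *resultHeap) pop() int {
	n := len(h.priorities)
	top := h.priorities[0]
	h.priorities[0] = h.priorities[n-1]
	h.priorities = h.priorities[:n-1]
	// Sift the moved element down to restore the heap property.
	for i := 0; ; {
		left, smallest := 2*i+1, i
		if left < n-1 && h.priorities[left] < h.priorities[smallest] {
			smallest = left
		}
		if right := left + 1; right < n-1 && h.priorities[right] < h.priorities[smallest] {
			smallest = right
		}
		if smallest == i {
			break
		}
		h.priorities[i], h.priorities[smallest] = h.priorities[smallest], h.priorities[i]
		i = smallest
	}
	return top
}

func main() {
	var h resultHeap
	for _, p := range []int{3, 1, 2} {
		h.push(p)
	}
	fmt.Println(h.pop(), h.pop(), h.pop()) // 1 2 3
}
```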

**kvstreamer: refactor the OutOfOrder requests provider a bit**

This commit refactors the `requestsProvider` interface, renaming a couple
of methods - `s/firstLocked/nextLocked/` and
`s/removeFirstLocked/removeNextLocked/`. The idea is that for the
InOrder mode "first" == "next", but for the OutOfOrder mode we can pick
an arbitrary request that is not necessarily "first". This commit then
makes the OutOfOrder requests provider pick its last request in the
queue - which should reduce the memory usage in case resume requests are
added later.

Previously, we would slice "next" requests off the front and append to
the end, and it could trigger reallocation of the underlying slice. Now,
we'll be updating only the "tail" of the queue, and since a single
request can result in at most one resume request, the initial slice
provided in `enqueue` will definitely have enough capacity.

Release note: None
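A tiny sketch of the tail-popping change described above, with a plain slice standing in for the requests queue; names are invented. Popping from the back only shrinks and re-grows the same backing array, whereas popping from the front and appending resume requests to the back can force a reallocation.

```go
package main

import "fmt"

type queue struct {
	reqs []string
}

// removeNext pops from the tail (OutOfOrder mode: any request will do).
func (q *queue) removeNext() string {
	r := q.reqs[len(q.reqs)-1]
	q.reqs = q.reqs[:len(q.reqs)-1]
	return r
}

// addResume re-appends a resume request; capacity from the original
// enqueue is always sufficient because each request yields at most
// one resume request.
func (q *queue) addResume(r string) { q.reqs = append(q.reqs, r) }

func main() {
	q := &queue{reqs: []string{"a", "b", "c"}}
	fmt.Println(cap(q.reqs)) // 3
	r := q.removeNext()      // "c"
	q.addResume(r + "-resume")
	fmt.Println(q.reqs, cap(q.reqs)) // [a b c-resume] 3, same backing array
}
```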

**kvstreamer: reduce allocations in the InOrder results buffer**

This commit improves the InOrder results buffer by reusing the same
scratch space for returning the results on `get()` calls as well as by
reducing the size of `inOrderBufferedResult` struct from the 80 bytes
size class to the 64 bytes size class.

It also cleans up a couple of comments and clarifies `GetResults` method
a bit.
```
name                                                  old time/op    new time/op    delta
LookupJoinEqColsAreKeyOrdering/Cockroach-24             8.47ms ± 3%    8.54ms ± 2%    ~     (p=0.182 n=9+10)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24    10.2ms ± 2%    10.1ms ± 2%    ~     (p=0.280 n=10+10)
LookupJoinOrdering/Cockroach-24                         10.4ms ± 3%    10.3ms ± 4%    ~     (p=0.393 n=10+10)
LookupJoinOrdering/MultinodeCockroach-24                13.8ms ± 5%    13.7ms ± 5%    ~     (p=0.739 n=10+10)

name                                                  old alloc/op   new alloc/op   delta
LookupJoinEqColsAreKeyOrdering/Cockroach-24             1.59MB ± 2%    1.49MB ± 1%  -6.20%  (p=0.000 n=10+9)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24    2.36MB ± 2%    2.27MB ± 2%  -3.49%  (p=0.000 n=10+10)
LookupJoinOrdering/Cockroach-24                         1.51MB ± 1%    1.42MB ± 1%  -6.27%  (p=0.000 n=9+10)
LookupJoinOrdering/MultinodeCockroach-24                2.37MB ± 6%    2.26MB ± 2%  -4.77%  (p=0.000 n=10+9)

name                                                  old allocs/op  new allocs/op  delta
LookupJoinEqColsAreKeyOrdering/Cockroach-24              10.4k ± 1%     10.3k ± 1%    ~     (p=0.055 n=8+9)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24     15.5k ± 0%     15.5k ± 1%  -0.34%  (p=0.037 n=10+10)
LookupJoinOrdering/Cockroach-24                          12.4k ± 1%     12.3k ± 1%  -0.98%  (p=0.000 n=9+10)
LookupJoinOrdering/MultinodeCockroach-24                 18.5k ± 1%     18.5k ± 2%    ~     (p=0.743 n=8+9)
```

Addresses: #82160.

Release note: None
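A tiny sketch of the scratch-reuse idea from the last commit above; names are invented and this is not the actual inOrderResultsBuffer. get() returns results in a slice the buffer owns and reuses, so repeated calls don't allocate a fresh slice each time.

```go
package main

import "fmt"

type result struct{ pos int }

type resultsBuffer struct {
	buffered []result
	scratch  []result // reused across get() calls
}

func (b *resultsBuffer) get() []result {
	b.scratch = b.scratch[:0] // keep capacity, drop old contents
	b.scratch = append(b.scratch, b.buffered...)
	b.buffered = b.buffered[:0]
	return b.scratch // valid only until the next get() call
}

func main() {
	b := &resultsBuffer{buffered: []result{{0}, {1}}}
	fmt.Println(b.get()) // [{0} {1}]
	b.buffered = append(b.buffered, result{2})
	fmt.Println(b.get()) // [{2}], reusing the same scratch backing array
}
```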

83705: teamcity: `grep` for pebble version in `go.mod` rather than `DEPS.bzl` r=nicktrav,jbowens a=rickystewart

The `grep` can fail in the case that there is no more recent version of
`pebble` than the one that is vendored in-tree. This fixes that case.

Release note: None

Co-authored-by: Aayush Shah <aayush.shah15@gmail.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Jul 7, 2022
This commit changes GC requests to no longer declare exclusive latches
at their BatchRequest's timestamp. This was already incorrect as
explained in cockroachdb#55293.

>The first use is broken because we acquire write latches at the batch
header's timestamp, which is set to time.Now(), so we're only
serializing with reads in the future and all other writes [1]. So we're
disruptive to everyone except who we want to serialize with – reads in
the past!

This commit makes GC requests only declare a non-mvcc exclusive latch
over the `RangeGCThresholdKey`. This is correct because:
```

// 1. We define "correctness" to be the property that a reader reading at /
// around the GC threshold will either see the correct results or receive an
// error.
// 2. Readers perform their command evaluation over a stable snapshot of the
// storage engine. This means that the reader will not see the effects of a
// subsequent GC run as long as it created a Pebble iterator before the GC
// request.
// 3. A reader checks the in-memory GC threshold of a Replica after it has
// created this snapshot (i.e. after a Pebble iterator has been created).
// 4. If the in-memory GC threshold is above the timestamp of the read, the
// reader receives an error. Otherwise, the reader is guaranteed to see a
// state of the storage engine that hasn't been affected by the GC request [5].
// 5. GC requests bump the in-memory GC threshold of a Replica as a pre-apply
// side effect. This means that if a reader checks the in-memory GC threshold
// after it has created a Pebble iterator, it is impossible for the iterator
// to point to a storage engine state that has been affected by the GC
// request.

```

As a result, GC requests should now be much less disruptive to
foreground traffic since they're no longer redundantly declaring
exclusive latches over global keys.

Release note (performance improvement): MVCC garbage collection should
now be much less disruptive to foreground traffic than before.
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Jul 22, 2022
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Aug 4, 2022
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Aug 4, 2022
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Aug 8, 2022
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Aug 9, 2022
@craig craig bot closed this as completed in 7c18668 Aug 9, 2022