kv: make MVCC garbage collection latch-less #55293

Closed
nvanbenschoten opened this issue Oct 7, 2020 · 3 comments · Fixed by #83213
Labels
A-kv-transactions: Relating to MVCC and the transactional model.
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
T-kv: KV Team

Comments

nvanbenschoten (Member) commented Oct 7, 2020

Relates to #50194.

Garbage collection of MVCC versions is currently excessively disruptive to foreground traffic on the keys being garbage collected. This is because GCRequests acquire write latches over each of the keys being GCed. The intent is for these latches to protect GC requests from interacting poorly with reads near the GC threshold and from interacting poorly with other GC requests on the same key (e.g. to avoid duplicate stats changes).

The problem is that both of these uses of latches are broken. The first use is broken because we acquire write latches at the batch header's timestamp, which is set to time.Now(), so we only serialize with reads in the future and with all other writes [1]. In other words, we're disruptive to everyone except the requests we actually want to serialize with – reads in the past! The second use is redundant because we can already serialize with other GCRequests on a separate key, such as the RangeLastGCKey. So we end up going out of our way to make GC requests disrupt foreground traffic and get nothing in return.
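To make the timestamp interaction concrete, here is a small self-contained sketch of timestamp-aware latch conflicts; the `latch` type and conflict rule below are invented for illustration and are not the real spanlatch package:

```go
package main

import "fmt"

// Simplified model of timestamp-aware latch conflicts: a write latch
// at timestamp T conflicts with all other writes, but only with reads
// at or above T. A GC request latching at time.Now() therefore
// serializes with future reads and every write, yet not with the
// historical reads it actually needs to order itself against.
type latch struct {
	write bool
	ts    int64 // simplified timestamp
}

func conflicts(held, incoming latch) bool {
	switch {
	case !held.write && !incoming.write:
		return false // read-read never conflicts
	case held.write && incoming.write:
		return true // write-write always conflicts
	case held.write: // held write vs incoming read
		return incoming.ts >= held.ts
	default: // held read vs incoming write
		return held.ts >= incoming.ts
	}
}

func main() {
	gc := latch{write: true, ts: 100} // GC latched at "now"
	fmt.Println(conflicts(gc, latch{write: false, ts: 50}))  // false: past read sails through
	fmt.Println(conflicts(gc, latch{write: false, ts: 150})) // true: future read blocks
	fmt.Println(conflicts(gc, latch{write: true, ts: 50}))   // true: every write blocks
}
```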

To address these issues, I propose that we make garbage collection latch-less. GC requests should no longer acquire latches on the keys that they intend to GC. Instead, they should only grab a latch on the RangeLastGCKey. To ensure that other requests properly synchronize with GC requests, they will now need to check the GC threshold after evaluation as well [2]. I've mocked this out in this branch: nvanbenschoten@8486020.
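As a concrete illustration of the proposal, here is a minimal sketch in Go with invented helper types (not CockroachDB's actual declareKeysGC/spanset APIs) of a GC request that declares only a single latch over the range-local key the issue calls the RangeLastGCKey, no matter how many user keys it will collect:

```go
package main

import "fmt"

type span struct{ key string }

type spanSet struct{ writes []span }

func (s *spanSet) addWrite(sp span) { s.writes = append(s.writes, sp) }

// declareKeysForGC declares only the range-local GC threshold key
// (the key string here is invented), regardless of how many user
// keys the request will actually GC. Readers are then expected to
// re-check the GC threshold after evaluation.
func declareKeysForGC(rangeID int, _ []span, latches *spanSet) {
	latches.addWrite(span{key: fmt.Sprintf("/Local/Range/%d/GCThreshold", rangeID)})
}

func main() {
	var latches spanSet
	keysToGC := []span{{"/Table/50/1/a"}, {"/Table/50/1/b"}}
	declareKeysForGC(1, keysToGC, &latches)
	fmt.Println(latches.writes) // only the threshold key, not the user keys
}
```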

[1] I think it's fine for GC requests on old versions of a key to commute with writes to new versions of a key in terms of their stats updates, but we'll need to confirm that.

[2] This could be simplified if we made Engine.NewReadOnly grab an engine snapshot so that evaluation would always work on a stable view of the MVCC keyspace. In that world, we could simply check the GC threshold after calling Engine.NewReadOnly on the read request path.
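A minimal sketch of the read-path ordering footnote [2] has in mind, with invented types: take a stable read-only view of the engine first, then check the GC threshold. If the threshold is still at or below the read timestamp, no later GC run can be visible through the already-open view.

```go
package main

import (
	"errors"
	"fmt"
)

type replica struct {
	gcThreshold int64 // in-memory GC threshold, simplified to an integer timestamp
	data        map[string]string
}

// newReadOnly stands in for an Engine.NewReadOnly that pins a stable
// engine snapshot, as footnote [2] proposes; here it is just a copy.
func (r *replica) newReadOnly() map[string]string {
	snap := make(map[string]string, len(r.data))
	for k, v := range r.data {
		snap[k] = v
	}
	return snap
}

func (r *replica) read(key string, readTS int64) (string, error) {
	snap := r.newReadOnly()     // 1. stable view first
	if readTS < r.gcThreshold { // 2. threshold check second
		return "", errors.New("batch timestamp below replica GC threshold")
	}
	return snap[key], nil // safe: the view predates any GC that could move the threshold
}

func main() {
	r := &replica{gcThreshold: 90, data: map[string]string{"a": "v1"}}
	fmt.Println(r.read("a", 100)) // ok: read above threshold
	fmt.Println(r.read("a", 50))  // error: read below threshold
}
```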

Jira issue: CRDB-3679

@nvanbenschoten nvanbenschoten added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-transactions Relating to MVCC and the transactional model. labels Oct 7, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Oct 13, 2020
This PR cleans up some handling around replica destruction that scared me
when working on cockroachdb#46329 and cockroachdb#55293. Specifically, there were cases during
merges where the destroy status on a replica would not be set until after
that replica's data was destroyed. This was true of merges applied through
entry application, although in those cases, no user data was removed, so it
seems mostly harmless. More concerning is that it was true of merges applied
through snapshot application. This was extra concerning because snapshot
application can delete user data if a subsumed range is only partially
covered (see `clearSubsumedReplicaDiskData`). So we were hypothetically
risking user-visible correctness problems, especially now that we allow
follower reads through on subsumed ranges as of cockroachdb#51594.

This PR patches up these issues and then adds a few extra assertions that
enforce stricter preconditions for the functions at play during replica
destruction. Specifically, we now assert that the destroy status is set to
`destroyReasonRemoved` _before_ calling `preDestroyRaftMuLocked`. We also
assert that if `RemoveOptions.DestroyData` is false when passed to
`removeInitializedReplicaRaftMuLocked`, then the destroy status is also set.

This unification allows us to remove the `ignoreDestroyStatus` flag on
`RemoveOptions`, whose meaning is now exactly the inverse of `DestroyData`.
In hindsight, pushing on why this extra flag was needed and what new
information it was conveying to the function could have potentially caught
these issues a little earlier.

I think we'll want to backport this to v20.2, though probably in the first
patch release because there is some risk here (especially without sufficient
time to bake on master) and we aren't aware of seeing any correctness issues
from the combination of the bug fixed here and cockroachdb#51594.

Release note (bug fix): A hypothesized bug that could allow a follower
read to miss data on a range in the middle of being merged away into its
left-hand neighbor was fixed. The bug seems exceedingly unlikely to have
materialized in practice.
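A hedged sketch, with invented stand-in types (not kvserver's Replica or RemoveOptions), of the precondition the commit above asserts: the destroy status must be set to removed before a replica's data is destroyed, and callers that skip data destruction must have set it already.

```go
package main

import (
	"errors"
	"fmt"
)

type destroyReason int

const (
	destroyReasonAlive destroyReason = iota
	destroyReasonRemoved
)

type replica struct {
	destroyStatus destroyReason
	data          []string
}

type removeOptions struct {
	DestroyData bool
}

func (r *replica) removeInitializedReplica(opts removeOptions) error {
	if !opts.DestroyData && r.destroyStatus != destroyReasonRemoved {
		// Mirrors the new assertion: callers that skip data destruction
		// must have already marked the replica as destroyed.
		return errors.New("caller must set destroy status before skipping data destruction")
	}
	if opts.DestroyData {
		r.destroyStatus = destroyReasonRemoved // set the status first ...
		r.data = nil                           // ... then destroy the data
	}
	return nil
}

func main() {
	r := &replica{data: []string{"k1"}}
	fmt.Println(r.removeInitializedReplica(removeOptions{DestroyData: false})) // error: status not set
	fmt.Println(r.removeInitializedReplica(removeOptions{DestroyData: true}))  // <nil>
}
```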
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Oct 14, 2020
craig bot pushed a commit that referenced this issue Oct 15, 2020
55477: kv: set destroy status before destroying data on subsumed replicas r=nvanbenschoten a=nvanbenschoten

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Nov 12, 2020
andreimatei (Contributor) commented:

> [2] This could be simplified if we made Engine.NewReadOnly grab an engine snapshot so that evaluation would always work on a stable view of the MVCC keyspace. In that world, we could simply check the GC threshold after calling Engine.NewReadOnly on the read request path.

I think this has happened in #58515

andreimatei added a commit to andreimatei/cockroach that referenced this issue Apr 2, 2021
The ReplicatedEvalResult.Timestamp field used to carry the read timestamp of
the request that proposed the respective command. This patch renames it
to WriteTimestamp and sets it accordingly.

The field is used for two purposes: for bumping the clock on followers
such that they don't have values above their clock, and for checking
that we're not writing below the GC threshold. Both of these use cases
actually want the write timestamp, so they were broken. The second use
seems dubious to me (I don't think it's actually needed) but this patch
does not take a position on it beyond adding a TODO.

Beyond fixing the existing uses of this field, putting the write
timestamp in every Raft command has the added benefit that we can use it
to assert below Raft that nobody tries to write below the closed
timestamp. Before this patch, only the leaseholder was able to do this
assertion (because it was the only replica that had access to the write
timestamp) and so the leaseholder had to do it as a non-deterministic
error (and non-deterministic errors are nasty, even when they're
assertion failures). The assertion is changed in the next commit.

In addition, by clarifying the meaning of this field, this patch opens
the door for cockroachdb#62569 to muddy the field back and give it a special
meaning for lease transfers - the lease start time.

Also mentioning cockroachdb#55293 because this patch lightly touches on the
interaction between requests and the GC threshold.

Release note: None
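For illustration, a hedged sketch of the kind of below-Raft assertion this change enables: every replica applying a command can compare the command's write timestamp against the range's closed timestamp deterministically. The types and names below are invented, not the actual apply-side structs.

```go
package main

import "fmt"

type replicatedEvalResult struct {
	WriteTimestamp int64 // carried in every Raft command after the rename
}

type applyState struct {
	closedTimestamp int64
}

// assertAboveClosedTimestamp returns an error instead of panicking so the
// example stays runnable; a real assertion would be fatal. Writes at or
// below the closed timestamp are disallowed (simplified).
func assertAboveClosedTimestamp(cmd replicatedEvalResult, s applyState) error {
	if cmd.WriteTimestamp != 0 && cmd.WriteTimestamp <= s.closedTimestamp {
		return fmt.Errorf("command writing at %d below closed timestamp %d",
			cmd.WriteTimestamp, s.closedTimestamp)
	}
	return nil
}

func main() {
	s := applyState{closedTimestamp: 100}
	fmt.Println(assertAboveClosedTimestamp(replicatedEvalResult{WriteTimestamp: 150}, s)) // <nil>
	fmt.Println(assertAboveClosedTimestamp(replicatedEvalResult{WriteTimestamp: 90}, s))  // error
}
```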
andreimatei added a commit to andreimatei/cockroach that referenced this issue Apr 12, 2021
andreimatei added a commit to andreimatei/cockroach that referenced this issue Apr 16, 2021
andreimatei added a commit to andreimatei/cockroach that referenced this issue Apr 19, 2021
andreimatei added a commit to andreimatei/cockroach that referenced this issue Apr 20, 2021
andreimatei added a commit to andreimatei/cockroach that referenced this issue Apr 21, 2021
andreimatei added a commit to andreimatei/cockroach that referenced this issue Apr 21, 2021
andreimatei added a commit to andreimatei/cockroach that referenced this issue Apr 22, 2021
andreimatei added a commit to andreimatei/cockroach that referenced this issue Apr 22, 2021
@jlinder jlinder added the T-kv KV Team label Jun 16, 2021
craig bot pushed a commit that referenced this issue Jul 1, 2022
83197: kvserver: bump the in-memory GC threshold as a pre-apply side effect r=aayushshah15 a=aayushshah15

This commit changes where the in-memory GC threshold on a Replica is
bumped during the application of a GC request.

Previously, the in-memory GC threshold was bumped as a post-apply side
effect. Additionally, GC requests do not acquire latches at the
timestamp that they are garbage collecting, and thus, readers need to
take additional care to ensure that results aren't served off of a
partial state.

Readers today rely on the invariant that the in-memory GC threshold is
bumped before the actual garbage collection. Today this holds because
the mvccGCQueue issues GC requests in two phases: the first simply
bumps the in-memory GC threshold, and the second performs the actual
garbage collection. If the in-memory GC threshold were bumped in the
pre-apply phase of command application, this quirk wouldn't need to
exist. That's what this commit does.

Relates to #55293

Release note: None
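A minimal sketch, with invented types, of the ordering #83197 describes: the in-memory GC threshold is bumped as a pre-apply side effect, before the garbage versions are actually removed, so readers that check the threshold can never observe a partially GC'ed state through an iterator opened earlier.

```go
package main

import "fmt"

type replicaState struct {
	gcThreshold int64
}

type gcCommand struct {
	newThreshold int64
	keysToClear  []string
}

func applyGC(state *replicaState, cmd gcCommand, engine map[string]string) {
	// Pre-apply side effect: readers that check the threshold from
	// this point on will refuse reads below it, so they cannot observe
	// a partially GC'ed state through a later iterator.
	if cmd.newThreshold > state.gcThreshold {
		state.gcThreshold = cmd.newThreshold
	}

	// Actual application: remove the garbage versions.
	for _, k := range cmd.keysToClear {
		delete(engine, k)
	}
}

func main() {
	state := &replicaState{gcThreshold: 10}
	engine := map[string]string{"a@5": "v1", "a@20": "v2"}
	applyGC(state, gcCommand{newThreshold: 15, keysToClear: []string{"a@5"}}, engine)
	fmt.Println(state.gcThreshold, engine) // 15 map[a@20:v2]
}
```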



83659: kvstreamer: optimize the results buffers a bit r=yuzefovich a=yuzefovich

**kvstreamer: make InOrder-mode structs implement heap logic on their own**

This commit refactors the `inOrderRequestsProvider` as well as the
`inOrderResultsBuffer` to make them maintain the min heap over the
requests and results, respectively, on their own. This allows us to
avoid allocations for `interface{}` objects that occur when using
`heap.Interface`. The code was copied, with minor adjustments, from the
standard library.
```
name                                                  old time/op    new time/op    delta
LookupJoinEqColsAreKeyOrdering/Cockroach-24             8.83ms ± 3%    8.58ms ± 3%   -2.82%  (p=0.002 n=10+10)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24    10.3ms ± 1%    10.1ms ± 2%   -1.86%  (p=0.009 n=10+10)
LookupJoinOrdering/Cockroach-24                         10.7ms ± 5%    10.5ms ± 3%     ~     (p=0.123 n=10+10)
LookupJoinOrdering/MultinodeCockroach-24                14.0ms ± 5%    13.6ms ± 5%   -3.16%  (p=0.043 n=10+10)

name                                                  old alloc/op   new alloc/op   delta
LookupJoinEqColsAreKeyOrdering/Cockroach-24             1.77MB ± 2%    1.59MB ± 1%   -9.94%  (p=0.000 n=10+10)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24    2.57MB ± 3%    2.38MB ± 1%   -7.57%  (p=0.000 n=10+10)
LookupJoinOrdering/Cockroach-24                         1.67MB ± 0%    1.50MB ± 1%  -10.08%  (p=0.000 n=9+9)
LookupJoinOrdering/MultinodeCockroach-24                2.53MB ± 2%    2.34MB ± 2%   -7.78%  (p=0.000 n=10+9)

name                                                  old allocs/op  new allocs/op  delta
LookupJoinEqColsAreKeyOrdering/Cockroach-24              12.5k ± 1%     10.4k ± 1%  -16.54%  (p=0.000 n=9+10)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24     17.6k ± 0%     15.5k ± 0%  -11.64%  (p=0.000 n=8+10)
LookupJoinOrdering/Cockroach-24                          14.4k ± 1%     12.4k ± 1%  -14.17%  (p=0.000 n=9+9)
LookupJoinOrdering/MultinodeCockroach-24                 20.6k ± 1%     18.5k ± 1%  -10.19%  (p=0.000 n=10+9)
```

Addresses: #82159.

Release note: None
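For illustration, a small self-contained sketch of the typed-heap approach described in the first commit above: the heap logic lives on a concrete type rather than going through container/heap's heap.Interface, which boxes every element into an interface{} value. Names are invented; this is not the kvstreamer code.

```go
package main

import "fmt"

type resultHeap struct {
	priorities []int // e.g. position in the original request order
}

func (h *resultHeap) push(p int) {
	h.priorities = append(h.priorities, p)
	// Sift the new element up toward the root.
	for i := len(h.priorities) - 1; i > 0; {
		parent := (i - 1) / 2
		if h.priorities[parent] <= h.priorities[i] {
			break
		}
		h.priorities[parent], h.priorities[i] = h.priorities[i], h.priorities[parent]
		i = parent
	}
}

func (h *resultHeap) pop() int {
	n := len(h.priorities)
	top := h.priorities[0]
	h.priorities[0] = h.priorities[n-1]
	h.priorities = h.priorities[:n-1]
	// Sift the moved element down to restore the heap property.
	for i := 0; ; {
		left, smallest := 2*i+1, i
		if left < n-1 && h.priorities[left] < h.priorities[smallest] {
			smallest = left
		}
		if right := left + 1; right < n-1 && h.priorities[right] < h.priorities[smallest] {
			smallest = right
		}
		if smallest == i {
			break
		}
		h.priorities[i], h.priorities[smallest] = h.priorities[smallest], h.priorities[i]
		i = smallest
	}
	return top
}

func main() {
	var h resultHeap
	for _, p := range []int{3, 1, 2} {
		h.push(p)
	}
	fmt.Println(h.pop(), h.pop(), h.pop()) // 1 2 3
}
```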

**kvstreamer: refactor the OutOfOrder requests provider a bit**

This commit refactors the `requestsProvider` interface, renaming a couple
of methods - `s/firstLocked/nextLocked/` and
`s/removeFirstLocked/removeNextLocked/`. The idea is that for the
InOrder mode "first" == "next", but for the OutOfOrder mode we can pick
an arbitrary request that is not necessarily "first". This commit then
makes the OutOfOrder requests provider pick its last request in the
queue - which should reduce the memory usage in case resume requests are
added later.

Previously, we would slice "next" requests off the front and append to
the end, and it could trigger reallocation of the underlying slice. Now,
we'll be updating only the "tail" of the queue, and since a single
request can result in at most one resume request, the initial slice
provided in `enqueue` will definitely have enough capacity.

Release note: None
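A tiny sketch of the tail-popping change described above, with a plain slice standing in for the requests queue; names are invented. Popping from the back only shrinks and re-grows the same backing array, whereas popping from the front and appending resume requests to the back can force a reallocation.

```go
package main

import "fmt"

type queue struct {
	reqs []string
}

// removeNext pops from the tail (OutOfOrder mode: any request will do).
func (q *queue) removeNext() string {
	r := q.reqs[len(q.reqs)-1]
	q.reqs = q.reqs[:len(q.reqs)-1]
	return r
}

// addResume re-appends a resume request; capacity from the original
// enqueue is always sufficient because each request yields at most
// one resume request.
func (q *queue) addResume(r string) { q.reqs = append(q.reqs, r) }

func main() {
	q := &queue{reqs: []string{"a", "b", "c"}}
	fmt.Println(cap(q.reqs)) // 3
	r := q.removeNext()      // "c"
	q.addResume(r + "-resume")
	fmt.Println(q.reqs, cap(q.reqs)) // [a b c-resume] 3, same backing array
}
```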

**kvstreamer: reduce allocations in the InOrder results buffer**

This commit improves the InOrder results buffer by reusing the same
scratch space for returning the results on `get()` calls as well as by
reducing the size of `inOrderBufferedResult` struct from the 80 bytes
size class to the 64 bytes size class.

It also cleans up a couple of comments and clarifies `GetResults` method
a bit.
```
name                                                  old time/op    new time/op    delta
LookupJoinEqColsAreKeyOrdering/Cockroach-24             8.47ms ± 3%    8.54ms ± 2%    ~     (p=0.182 n=9+10)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24    10.2ms ± 2%    10.1ms ± 2%    ~     (p=0.280 n=10+10)
LookupJoinOrdering/Cockroach-24                         10.4ms ± 3%    10.3ms ± 4%    ~     (p=0.393 n=10+10)
LookupJoinOrdering/MultinodeCockroach-24                13.8ms ± 5%    13.7ms ± 5%    ~     (p=0.739 n=10+10)

name                                                  old alloc/op   new alloc/op   delta
LookupJoinEqColsAreKeyOrdering/Cockroach-24             1.59MB ± 2%    1.49MB ± 1%  -6.20%  (p=0.000 n=10+9)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24    2.36MB ± 2%    2.27MB ± 2%  -3.49%  (p=0.000 n=10+10)
LookupJoinOrdering/Cockroach-24                         1.51MB ± 1%    1.42MB ± 1%  -6.27%  (p=0.000 n=9+10)
LookupJoinOrdering/MultinodeCockroach-24                2.37MB ± 6%    2.26MB ± 2%  -4.77%  (p=0.000 n=10+9)

name                                                  old allocs/op  new allocs/op  delta
LookupJoinEqColsAreKeyOrdering/Cockroach-24              10.4k ± 1%     10.3k ± 1%    ~     (p=0.055 n=8+9)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24     15.5k ± 0%     15.5k ± 1%  -0.34%  (p=0.037 n=10+10)
LookupJoinOrdering/Cockroach-24                          12.4k ± 1%     12.3k ± 1%  -0.98%  (p=0.000 n=9+10)
LookupJoinOrdering/MultinodeCockroach-24                 18.5k ± 1%     18.5k ± 2%    ~     (p=0.743 n=8+9)
```

Addresses: #82160.

Release note: None
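A tiny sketch of the scratch-reuse idea from the last commit above; names are invented and this is not the actual inOrderResultsBuffer. get() returns results in a slice the buffer owns and reuses, so repeated calls don't allocate a fresh slice each time.

```go
package main

import "fmt"

type result struct{ pos int }

type resultsBuffer struct {
	buffered []result
	scratch  []result // reused across get() calls
}

func (b *resultsBuffer) get() []result {
	b.scratch = b.scratch[:0] // keep capacity, drop old contents
	b.scratch = append(b.scratch, b.buffered...)
	b.buffered = b.buffered[:0]
	return b.scratch // valid only until the next get() call
}

func main() {
	b := &resultsBuffer{buffered: []result{{0}, {1}}}
	fmt.Println(b.get()) // [{0} {1}]
	b.buffered = append(b.buffered, result{2})
	fmt.Println(b.get()) // [{2}], reusing the same scratch backing array
}
```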

83705: teamcity: `grep` for pebble version in `go.mod` rather than `DEPS.bzl` r=nicktrav,jbowens a=rickystewart

The `grep` can fail in the case that there is no more recent version of
`pebble` than the one that is vendored in-tree. This fixes that case.

Release note: None

Co-authored-by: Aayush Shah <aayush.shah15@gmail.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Jul 7, 2022
This commit changes GC requests to no longer declare exclusive latches
at their BatchRequest's timestamp. This was already incorrect as
explained in cockroachdb#55293.

>The first use is broken because we acquire write latches at the batch
header's timestamp, which is set to time.Now(), so we're only
serializing with reads in the future and all other writes [1]. So we're
disruptive to everyone except who we want to serialize with – reads in
the past!

This commit makes GC requests only declare a non-mvcc exclusive latch
over the `RangeGCThresholdKey`. This is correct because:
```

// 1. We define "correctness" to be the property that a reader reading at /
// around the GC threshold will either see the correct results or receive an
// error.
// 2. Readers perform their command evaluation over a stable snapshot of the
// storage engine. This means that the reader will not see the effects of a
// subsequent GC run as long as it created a Pebble iterator before the GC
// request.
// 3. A reader checks the in-memory GC threshold of a Replica after it has
// created this snapshot (i.e. after a Pebble iterator has been created).
// 4. If the in-memory GC threshold is above the timestamp of the read, the
// reader receives an error. Otherwise, the reader is guaranteed to see a
// state of the storage engine that hasn't been affected by the GC request [5].
// 5. GC requests bump the in-memory GC threshold of a Replica as a pre-apply
// side effect. This means that if a reader checks the in-memory GC threshold
// after it has created a Pebble iterator, it is impossible for the iterator
// to point to a storage engine state that has been affected by the GC
// request.

```

As a result, GC requests should now be much less disruptive to
foreground traffic since they're no longer redundantly declaring
exclusive latches over global keys.

Release note (performance improvement): MVCC garbage collection should
now be much less disruptive to foreground traffic than before.
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Jul 22, 2022
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Aug 4, 2022
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Aug 4, 2022
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Aug 8, 2022
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Aug 9, 2022
@craig craig bot closed this as completed in 7c18668 Aug 9, 2022