kvserver: don't use ClearRange point deletes with estimated MVCC stats #74674
Conversation
@sumeerbhola Do you have an opinion on this assessment?
Force-pushed from 00b70ba to b621c07.
Even if we take the point delete path, wouldn't we prefer to return a resume span before we run into cmd-too-large?

We limit the point delete path to 512 KB of keys+values, so assuming accurate statistics (which we now ensure) there is no way this can exceed 512 KB. I suppose we could keep a running tally and fall back to the range delete (rather than a resume span) if we find out that the batch actually exceeds the estimate, but do we need to? In that case, I think I'd instead prefer just to recompute stats if they're estimates.
This sounds fine. I am assuming estimated stats are rare.
I am curious about this "which we now ensure" -- can you point me to what changed?
Not that rare. For example, index backfills always use estimated stats across all ranges, since ensuring accurate stats has a significant performance penalty. But it's probably still rare enough that it's unlikely that both 1) the amount of data written is <512 KB, and 2) the number of affected ranges is large enough that the range tombstones now become problematic.
We ensure that we do not take the point deletion path if stats are estimated. We trust that stats are accurate when they are claimed to be, and I believe we have multiple assertions for this during e.g. race builds and tests.
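To make the decision discussed above concrete, here is a minimal Go sketch of the point-delete-vs-range-tombstone choice. It is illustrative, not the actual kvserver code: `mvccStats` is a stand-in for `enginepb.MVCCStats`, and `pointDeleteThreshold` and `usePointDeletes` are assumed names.

```go
package main

import "fmt"

// mvccStats is a stand-in for CockroachDB's enginepb.MVCCStats; only the
// fields this sketch needs are included.
type mvccStats struct {
	ContainsEstimates int64 // >0 when the stats are estimates
	KeyBytes          int64
	ValBytes          int64
}

// pointDeleteThreshold mirrors the 512 KB keys+values limit discussed above.
const pointDeleteThreshold = 512 << 10

// usePointDeletes reports whether a ClearRange-style operation may clear
// the span with individual point deletions instead of a Pebble range
// tombstone: only when the stats are exact and the span is small.
func usePointDeletes(ms mvccStats) bool {
	if ms.ContainsEstimates > 0 {
		// Estimated stats can be wildly off (even negative), so the
		// 512 KB bound cannot be trusted; drop a range tombstone.
		return false
	}
	return ms.KeyBytes+ms.ValBytes <= pointDeleteThreshold
}

func main() {
	fmt.Println(usePointDeletes(mvccStats{KeyBytes: 100 << 10, ValBytes: 200 << 10})) // true
	fmt.Println(usePointDeletes(mvccStats{ContainsEstimates: 1, KeyBytes: 1}))        // false
}
```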
Force-pushed from b621c07 to 6029e98.
Force-pushed from 6029e98 to bd5339f.
`ClearRange` avoids dropping a Pebble range tombstone if the amount of data that's deleted is small (<=512 KB), instead dropping point deletions. It uses MVCC statistics to determine this. However, when clearing an entire range, it will rely on the existing range MVCC stats rather than computing them.

These range statistics can be highly inaccurate -- in some cases so inaccurate that they even become negative. This in turn can cause `ClearRange` to submit a huge write batch, which gets rejected by Raft with `command too large`.

This patch avoids dropping point deletes if the statistics are estimated (which is only the case when clearing an entire range). Alternatively, it could do a full stats recomputation in this case, but entire range deletions seem likely to be large and/or rare enough that dropping a range tombstone is fine.

Resolves #74686.

Release note (bug fix): Fixed a bug where deleting data via schema changes (e.g. when dropping an index or table) could fail with a "command too large" error.
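For contrast, here is a hedged sketch of the alternative the description mentions but rejects: recompute exact stats when the cached ones are estimates, then apply the usual threshold. `computeExactStats` is a hypothetical helper standing in for a full O(range-size) MVCC stats recomputation, which is exactly the cost the patch chooses to avoid.

```go
package main

import "fmt"

// mvccStats is the same stand-in for enginepb.MVCCStats as in the
// earlier sketch.
type mvccStats struct {
	ContainsEstimates  int64
	KeyBytes, ValBytes int64
}

// statsForClear returns stats that are safe to base the point-delete
// decision on. computeExactStats is hypothetical: it stands in for a
// full scan of the range, whose cost is why the patch does NOT do this.
func statsForClear(ms mvccStats, computeExactStats func() mvccStats) mvccStats {
	if ms.ContainsEstimates > 0 {
		return computeExactStats() // O(range size); avoided by the patch
	}
	return ms
}

func main() {
	estimated := mvccStats{ContainsEstimates: 1, KeyBytes: -100} // stats can even go negative
	exact := func() mvccStats { return mvccStats{KeyBytes: 1 << 20, ValBytes: 4 << 20} }
	fmt.Printf("%+v\n", statsForClear(estimated, exact))
}
```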
Force-pushed from bd5339f to 410d60c.
TFTR! bors r=tbg
Build succeeded.
75448: kvserver: use Background() in computeChecksumPostApply goroutine r=erikgrinaker a=tbg

On the leaseholder, `ctx` passed to `computeChecksumPostApply` is that of the proposal. As of #71806, this context is canceled right after the corresponding proposal is signaled (and the client goroutine returns from `sendWithRangeID`). This effectively prevents most consistency checks from succeeding (they previously were not affected by higher-level cancellation because the consistency check is triggered from a local queue that talks directly to the replica, i.e. had something like a minutes-long timeout).

This caused disastrous behavior in the `clearrange` suite of roachtests. That test imports a large table. After the import, most ranges have estimates (due to the ctx cancellation preventing the consistency checks, which as a byproduct trigger stats adjustments) and their stats claim that they are very small.

Before recent PR #74674, `ClearRange` on such ranges would use individual point deletions instead of the much more efficient pebble range deletions, effectively writing a lot of data and running the nodes out of disk. Failures of `clearrange` with #74674 were also observed, but they did not involve out-of-disk situations, so are possibly an alternative failure mode (that may still be related to the newly introduced presence of context cancellation).

Touches #68303.

Release note: None

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
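A minimal, runnable Go illustration of the failure mode and fix described above. `computeChecksum` is a stand-in for the long-running consistency check, not the actual kvserver code; the point is only that a goroutine outliving its caller must not inherit the caller's cancelable context.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// computeChecksum stands in for the long-running consistency check; like
// the real one, it respects context cancellation.
func computeChecksum(ctx context.Context) error {
	select {
	case <-time.After(50 * time.Millisecond): // pretend the check takes a while
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	// The proposal's context is canceled as soon as the proposal is
	// signaled -- long before the checksum computation finishes.
	proposalCtx, cancel := context.WithCancel(context.Background())
	cancel()

	// Buggy: inheriting the proposal context aborts the check.
	fmt.Println(computeChecksum(proposalCtx)) // context canceled

	// Fixed: detach the check from the proposal's lifetime.
	fmt.Println(computeChecksum(context.Background())) // <nil>
}
```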