kvserver: don't use ClearRange point deletes with estimated MVCC stats #74674
Conversation
@sumeerbhola Do you have an opinion on this assessment?
Force-pushed from 00b70ba to b621c07.
Even if we take the point delete path, wouldn't we prefer to return a resume span before we run into cmd-too-large?

We limit the point delete path to 512 KB of keys+values, so assuming accurate statistics (which we now ensure) there is no way this can exceed 512 KB. I suppose we could keep a running tally and fall back to the range delete (rather than a resume span) if we find out that the batch actually exceeds the estimate, but do we need to? In that case, I think I'd instead prefer just to recompute stats if they're estimates.
This sounds fine. I am assuming estimated stats are rare.
I am curious about this "which we now ensure" -- can you point me to what changed?
Not that rare. For example, index backfills always use estimated stats across all ranges, since ensuring accurate stats has a significant performance penalty. But it's probably still rare enough that it's unlikely that both 1) the amount of data written is <512 KB, and 2) the number of affected ranges is large enough that the range tombstones now become problematic.
We ensure that we do not take the point deletion path if stats are estimated. We trust that stats are accurate when they are claimed to be, and I believe we have multiple assertions for this during e.g. race builds and tests.
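To make the decision discussed above concrete, here is a minimal Go sketch of the point-delete-vs-range-tombstone choice. It is illustrative, not the actual kvserver code: `mvccStats` is a stand-in for `enginepb.MVCCStats`, and `pointDeleteThreshold` and `usePointDeletes` are assumed names.

```go
package main

import "fmt"

// mvccStats is a stand-in for CockroachDB's enginepb.MVCCStats; only the
// fields this sketch needs are included.
type mvccStats struct {
	ContainsEstimates int64 // >0 when the stats are estimates
	KeyBytes          int64
	ValBytes          int64
}

// pointDeleteThreshold mirrors the 512 KB keys+values limit discussed above.
const pointDeleteThreshold = 512 << 10

// usePointDeletes reports whether a ClearRange-style operation may clear
// the span with individual point deletions instead of a Pebble range
// tombstone: only when the stats are exact and the span is small.
func usePointDeletes(ms mvccStats) bool {
	if ms.ContainsEstimates > 0 {
		// Estimated stats can be wildly off (even negative), so the
		// 512 KB bound cannot be trusted; drop a range tombstone.
		return false
	}
	return ms.KeyBytes+ms.ValBytes <= pointDeleteThreshold
}

func main() {
	fmt.Println(usePointDeletes(mvccStats{KeyBytes: 100 << 10, ValBytes: 200 << 10})) // true
	fmt.Println(usePointDeletes(mvccStats{ContainsEstimates: 1, KeyBytes: 1}))        // false
}
```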
Force-pushed from b621c07 to 6029e98.
Force-pushed from 6029e98 to bd5339f.
`ClearRange` avoids dropping a Pebble range tombstone if the amount of data that's deleted is small (<=512 KB), instead dropping point deletions. It uses MVCC statistics to determine this. However, when clearing an entire range, it will rely on the existing range MVCC stats rather than computing them.

These range statistics can be highly inaccurate -- in some cases so inaccurate that they even become negative. This in turn can cause `ClearRange` to submit a huge write batch, which gets rejected by Raft with `command too large`.

This patch avoids dropping point deletes if the statistics are estimated (which is only the case when clearing an entire range). Alternatively, it could do a full stats recomputation in this case, but entire range deletions seem likely to be large and/or rare enough that dropping a range tombstone is fine.

Resolves #74686.

Release note (bug fix): Fixed a bug where deleting data via schema changes (e.g. when dropping an index or table) could fail with a "command too large" error.
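For contrast, here is a hedged sketch of the alternative the description mentions but rejects: recompute exact stats when the cached ones are estimates, then apply the usual threshold. `computeExactStats` is a hypothetical helper standing in for a full O(range-size) MVCC stats recomputation, which is exactly the cost the patch chooses to avoid.

```go
package main

import "fmt"

// mvccStats is the same stand-in for enginepb.MVCCStats as in the
// earlier sketch.
type mvccStats struct {
	ContainsEstimates  int64
	KeyBytes, ValBytes int64
}

// statsForClear returns stats that are safe to base the point-delete
// decision on. computeExactStats is hypothetical: it stands in for a
// full scan of the range, whose cost is why the patch does NOT do this.
func statsForClear(ms mvccStats, computeExactStats func() mvccStats) mvccStats {
	if ms.ContainsEstimates > 0 {
		return computeExactStats() // O(range size); avoided by the patch
	}
	return ms
}

func main() {
	estimated := mvccStats{ContainsEstimates: 1, KeyBytes: -100} // stats can even go negative
	exact := func() mvccStats { return mvccStats{KeyBytes: 1 << 20, ValBytes: 4 << 20} }
	fmt.Printf("%+v\n", statsForClear(estimated, exact))
}
```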
Force-pushed from bd5339f to 410d60c.
TFTR! bors r=tbg
Build succeeded.
75448: kvserver: use Background() in computeChecksumPostApply goroutine r=erikgrinaker a=tbg

On the leaseholder, `ctx` passed to `computeChecksumPostApply` is that of the proposal. As of #71806, this context is canceled right after the corresponding proposal is signaled (and the client goroutine returns from `sendWithRangeID`). This effectively prevents most consistency checks from succeeding (they previously were not affected by higher-level cancellation because the consistency check is triggered from a local queue that talks directly to the replica, i.e. had something like a minutes-long timeout).

This caused disastrous behavior in the `clearrange` suite of roachtests. That test imports a large table. After the import, most ranges have estimates (due to the ctx cancellation preventing the consistency checks, which as a byproduct trigger stats adjustments) and their stats claim that they are very small.

Before recent PR #74674, `ClearRange` on such ranges would use individual point deletions instead of the much more efficient pebble range deletions, effectively writing a lot of data and running the nodes out of disk. Failures of `clearrange` with #74674 were also observed, but they did not involve out-of-disk situations, so are possibly an alternative failure mode (that may still be related to the newly introduced presence of context cancellation).

Touches #68303.

Release note: None

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
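A minimal, runnable Go illustration of the failure mode and fix described above. `computeChecksum` is a stand-in for the long-running consistency check, not the actual kvserver code; the point is only that a goroutine outliving its caller must not inherit the caller's cancelable context.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// computeChecksum stands in for the long-running consistency check; like
// the real one, it respects context cancellation.
func computeChecksum(ctx context.Context) error {
	select {
	case <-time.After(50 * time.Millisecond): // pretend the check takes a while
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	// The proposal's context is canceled as soon as the proposal is
	// signaled -- long before the checksum computation finishes.
	proposalCtx, cancel := context.WithCancel(context.Background())
	cancel()

	// Buggy: inheriting the proposal context aborts the check.
	fmt.Println(computeChecksum(proposalCtx)) // context canceled

	// Fixed: detach the check from the proposal's lifetime.
	fmt.Println(computeChecksum(context.Background())) // <nil>
}
```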