Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

*: enables elastic CPU limiter for all users of ExportRequest #96691

Merged
merged 1 commit into from
Feb 25, 2023

Conversation

adityamaru
Copy link
Contributor

@adityamaru adityamaru commented Feb 6, 2023

Previously, there was a strange coupling between the elastic CPU
limiter and the header.TargetBytes DistSender limit set on each
ExportRequest. Even if a request was preempted on exhausting its
allotted CPU tokens, it would only return from kvserver by virtue
of its header.TargetBytes being set to a non-zero value. Out of the
four users of ExportRequest, only backup set this field to a sentinel
value of 1 to limit the number of SSTs we send back in an ExportResponse.
The remaining callers of ExportRequest would not return from the kvserver.
Instead they would evaluate the request from the resume key immediately,
not giving the scheduler a chance to take the goroutine off CPU.

This change breaks this coupling by introducing a resumeInfo object that
indicates whether the resumption was because we were over our CPU limit. If
it was, we return an ExportResponse with our progress so far. This change
shifts the burden of handling pagination to the client. This seems better than
having the server sleep or wait around until its CPU tokens are replenished
as the client would be left wondering why a request is taking so long.

To that effect this change adds pagination support to the other callers of
ExportRequest. Note, we do not set SplitMidKey at these other callsites
yet. Thus, all pagination will happen at key boundaries in the ExportRequest.
A follow-up will add support for SplitMidKey to these callers.

Informs: #96684

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@adityamaru adityamaru force-pushed the paginate-everywhere branch 2 times, most recently from d6a94d1 to d46edea Compare February 9, 2023 14:49
@adityamaru adityamaru changed the title WIP,DNM: add pagination to all ExportRequest callsites *: enables elastic CPU limiter for all users of ExportRequest Feb 9, 2023
@adityamaru adityamaru force-pushed the paginate-everywhere branch 2 times, most recently from d1d4314 to 2033f54 Compare February 9, 2023 16:21
@adityamaru
Copy link
Contributor Author

adityamaru commented Feb 9, 2023

I was playing around with the test coverage tool and we have a pretty good coverage for mvcc.go and cmd_export.go. We hit almost all the lines relevant to export requests bar a few error conditions we never expect to hit.

Screenshot 2023-02-09 at 11 16 46 AM

Screenshot 2023-02-09 at 11 32 04 AM

@adityamaru adityamaru marked this pull request as ready for review February 9, 2023 16:36
@adityamaru adityamaru requested review from a team as code owners February 9, 2023 16:36
@adityamaru adityamaru requested review from a team and sumeerbhola and removed request for a team February 9, 2023 16:36
@adityamaru
Copy link
Contributor Author

friendly ping @stevendanna / @irfansharif , I have a couple of export related diffs queued up behind this so just wanted to put this on your radar. Thanks!

Copy link
Collaborator

@stevendanna stevendanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good to me. I've left some possible suggestions but none of them are blocking in my opinion.

db *kv.DB,
startKey, endKey roachpb.Key,
startTime, endTime hlc.Timestamp,
allRevs chan []VersionedValues,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are only two callers, so I don't feel strongly about this, but you could pass something like a func(batch []VersionedValues) which would give the caller a bit more flexibility.

pkg/sql/catalog/lease/lease.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/batcheval/cmd_export.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/batcheval/cmd_export.go Show resolved Hide resolved
@irfansharif irfansharif removed their request for review February 16, 2023 14:34
@adityamaru adityamaru force-pushed the paginate-everywhere branch 3 times, most recently from dddc968 to c0e5a41 Compare February 23, 2023 22:08
Previously, there was a strange coupling between the elastic CPU
limiter and the `header.TargetBytes` DistSender limit set on each
ExportRequest. Even if a request was preempted on exhausting its
allotted CPU tokens, it would only return from kvserver by virtue
of its `header.TargetBytes` being set to a non-zero value. Out of the
four users of ExportRequest, only backup set this field to a sentinel
value of 1 to limit the number of SSTs we send back in an ExportResponse.
The remaining callers of ExportRequest would not return from the kvserver.
Instead they would evaluate the request from the resume key immediately,
not giving the scheduler a chance to take the goroutine off CPU.

This change breaks this coupling by introducing a `resumeInfo` object that
indicates whether the resumption was because we were over our CPU limit. If
it was, we return an ExportResponse with our progress so far. This change
shifts the burden of handling pagination to the client. This seems better than
having the server sleep or wait around until its CPU tokens are replenished
as the client would be left wondering why a request is taking so long.

To that effect this change adds pagination support to the other callers of
ExportRequest. Note, we do not set `SplitMidKey` at these other callsites
yet. Thus, all pagination will happen at key boundaries in the ExportRequest.
A follow-up will add support for `SplitMidKey` to these callers.

Informs: cockroachdb#96684

Release note: None
@adityamaru
Copy link
Contributor Author

flake is #97320

bors r=stevendanna

@craig
Copy link
Contributor

craig bot commented Feb 25, 2023

Build succeeded:

@craig craig bot merged commit 39c06b5 into cockroachdb:master Feb 25, 2023
@adityamaru adityamaru deleted the paginate-everywhere branch February 25, 2023 23:28
craig bot pushed a commit that referenced this pull request Mar 7, 2023
97916: kv: opt into Elastic CPU limiting resume spans r=adityamaru,arulajmani a=stevendanna

In #96691, we changed ExportRequest such that it returns resume spans that result from the elastic CPU limiter all the way to the caller.

This has at least two problems:

1) In a mixed-version state, the caller might not yet know how to
   handle resume spans. This could result in incomplete responses
   erroneous being used as if they were full responses.

2) The DistSender inspects a request to determine whether it may stop
   early. If it shouldn't be able to stop early, then the request is
   split up, possibly sent in parallel, and all responses are
   combined. The code which combines responses asserts that neither side
   has a resume span.  As a result, we've seen failures such as

        crdb_internal.fingerprint(): combining
        /Tenant/2/Table/106/1/-{8403574544142222370/0-37656332809536692}
        with /Tenant/{2/Table/106/1/436440321206557763/0-3}

   since the change was made.

Here, we add a new request header field to allow callers to indicate whether they are prepared to accept resume spans. Further, we add that new field into the logic in DistSender which decides how to process requests.

The downside here is that crdb_internal.fingerprint won't have its requests sent in parallel.

Release note: None

Fixes #97886,  #97903

Epic: none

Co-authored-by: Steven Danna <danna@cockroachlabs.com>
stevendanna added a commit to stevendanna/cockroach that referenced this pull request Mar 9, 2023
In cockroachdb#96691, we changed the return type of mvccExportToWriter such that
it now indicates when a CPU limit has been reached. As part of that
change, when the CPU limit was reached, we would immedately `return`
rather than `break`ing out of the loop. This introduced a bug, since
the code after the loop that the `break` was taking us to is
important. Namely, we may have previously buffered range keys that we
need to write into our response still. By replacing the break with a
return, these range keys were lost.

The end-user impact of this is that a BACKUP that _ought_ to have
included range keys (such as a backup of a table with a rolled back
IMPORT) would not include those range keys and thus would end up
resurecting deleted keys upon restore.

This PR brings back the `break` and adds a test that covers exporting
with range keys under CPU exhaustion.

This bug only ever existed on master.

Informs cockroachdb#97694

Epic: none

Release note: None
craig bot pushed a commit that referenced this pull request Mar 9, 2023