
kv: opt into Elastic CPU limiting resume spans #97916

Merged (1 commit) on Mar 8, 2023

Conversation

stevendanna
Collaborator

@stevendanna stevendanna commented Mar 2, 2023

In #96691, we changed ExportRequest such that it returns resume spans that result from the elastic CPU limiter all the way to the caller.

This has at least two problems:

  1. In a mixed-version state, the caller might not yet know how to
    handle resume spans. This could result in incomplete responses
    erroneously being used as if they were full responses.

  2. The DistSender inspects a request to determine whether it may stop
    early. If it shouldn't be able to stop early, then the request is
    split up, possibly sent in parallel, and all responses are
    combined. The code which combines responses asserts that neither side
    has a resume span. As a result, we've seen failures such as

     crdb_internal.fingerprint(): combining
     /Tenant/2/Table/106/1/-{8403574544142222370/0-37656332809536692}
     with /Tenant/{2/Table/106/1/436440321206557763/0-3}
    

    since the change was made.

Here, we add a new request header field to allow callers to indicate whether they are prepared to accept resume spans. Further, we add that new field into the logic in DistSender which decides how to process requests.

The downside here is that crdb_internal.fingerprint won't have its requests sent in parallel.

Release note: None

Fixes #97886, #97903

Epic: none
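The DistSender rule the description walks through can be sketched as below. This is a hypothetical, much-simplified model in Go, not actual CockroachDB code: only the field name `ReturnElasticCPUResumeSpans` comes from this PR, and the `BatchHeader` type and `canParallelize` function are invented for illustration.

```go
package main

import "fmt"

// BatchHeader is a hypothetical stand-in for the real batch request
// header; nothing here beyond the ReturnElasticCPUResumeSpans field
// name reflects the actual CockroachDB API.
type BatchHeader struct {
	// ReturnElasticCPUResumeSpans is set by callers that are prepared
	// to receive a response that stops early with a resume span when
	// the elastic CPU limiter preempts evaluation.
	ReturnElasticCPUResumeSpans bool
	// MaxSpanRequestKeys is another reason a request may stop early.
	MaxSpanRequestKeys int64
}

// canParallelize captures the rule described above: a batch may only
// be split up and sent in parallel (with responses combined
// afterwards) if no part of it can stop early with a resume span.
func canParallelize(h BatchHeader) bool {
	if h.ReturnElasticCPUResumeSpans {
		// Evaluation may be preempted by the CPU limiter, so a
		// partial response with a resume span is possible and the
		// combine step would fail.
		return false
	}
	return h.MaxSpanRequestKeys == 0
}

func main() {
	fmt.Println(canParallelize(BatchHeader{}))                                  // true: no early exit possible
	fmt.Println(canParallelize(BatchHeader{ReturnElasticCPUResumeSpans: true})) // false: must run sequentially
}
```

This also illustrates the stated downside: any caller that opts in, such as crdb_internal.fingerprint, gives up parallel dispatch.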

@stevendanna stevendanna requested review from a team as code owners March 2, 2023 13:35
@stevendanna stevendanna requested review from a team, rhu713 and shermanCRL and removed request for a team March 2, 2023 13:35
@cockroach-teamcity
Member

This change is Reviewable

Contributor

@adityamaru adityamaru left a comment


Thanks for this! For posterity noting that this boolean also ensures that in a mixed-version cluster if an older binary client sends a request that is evaluated on a newer binary we will not paginate because of CPUOverLimit.
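That mixed-version guarantee could be sketched like so; a hypothetical simplification, not the real cmd_export.go evaluation path, with invented names throughout.

```go
package main

import "fmt"

// evalState models the two inputs to the gating described above.
type evalState struct {
	callerAcceptsResumeSpans bool // the new header field from this PR
	cpuOverLimit             bool // signal from the elastic CPU limiter
}

// shouldPaginate reports whether evaluation stops early and hands a
// resume span back to the caller. A client on an older binary never
// sets the header field, so it is never given a resume span because
// of CPUOverLimit.
func shouldPaginate(s evalState) bool {
	return s.cpuOverLimit && s.callerAcceptsResumeSpans
}

func main() {
	// Request from an older binary: keep evaluating despite the limiter.
	fmt.Println(shouldPaginate(evalState{cpuOverLimit: true}))
	// Caller that opted in: stop early and return a resume span.
	fmt.Println(shouldPaginate(evalState{cpuOverLimit: true, callerAcceptsResumeSpans: true}))
}
```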

// expects early-termination of requests based on the Elastic CPU
// limiter.
//
// Resume spans returned because of the unerlying request being
Contributor


nit: s/unerlying/underlying/g

Collaborator

@arulajmani arulajmani left a comment


:lgtm:

Reviewed 9 of 10 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @rhu713, @shermanCRL, and @stevendanna)


-- commits line 11 at r1:
"erroneously"


pkg/kv/kvserver/batcheval/cmd_export.go line 257 at r1 (raw file):

						// onwards.
						if !build.IsRelease() {
							return result.Result{}, errors.AssertionFailedf("ExportRequest exited without " +

Unrelated to your change, in the release build case, should we be logging this?


pkg/kv/kvserver/batcheval/cmd_export.go line 313 at r1 (raw file):

					EndKey: args.EndKey,
				}
				// TODO(during review): Do we want to add another resume reason

nit: this looks stale, I assume "RESUME_ELASTIC_CPU_LIMIT" is exactly what this comment is talking about -- should we get rid of this comment?


pkg/sql/sem/builtins/builtins.go line 7557 at r1 (raw file):

					// DistSender so it likely substantially impacts
					// performance.
					ReturnElasticCPUResumeSpans: true,

I may be missing something, but going by the commit message, this function can't handle resume spans. Does that mean we should set this to false?

In cockroachdb#97886, we changed ExportRequest such that it returns resume spans
that result from the elastic CPU limiter all the way to the caller.

This has at least two problems:

1) In a mixed-version state, the caller might not yet know how to
   handle resume spans. This could result in incomplete responses
   erroneously being used as if they were full responses.

2) The DistSender inspects a request to determine whether it may stop
   early. If it shouldn't be able to stop early, then the request is
   split up, possibly sent in parallel, and all responses are
   combined. The code which combines responses asserts that neither side
   has a resume span.  As a result, we've seen failures such as

        crdb_internal.fingerprint(): combining
        /Tenant/2/Table/106/1/-{8403574544142222370/0-37656332809536692}
        with /Tenant/{2/Table/106/1/436440321206557763/0-3}

   since the change was made.

Here, we add a new request header field to allow callers to indicate
whether they are prepared to accept resume spans. Further, we add that
new field into the logic in DistSender which decides how to process
requests. Now, if ReturnElasticCPUResumeSpans is set, the DistSender
will no longer send requests in parallel and knows to expect a
possible early exit.

The downside here is that crdb_internal.fingerprint won't have its
requests sent in parallel.

Release note: None

Epic: none
Collaborator Author

@stevendanna stevendanna left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @adityamaru, @arulajmani, @rhu713, and @shermanCRL)


-- commits line 11 at r1:

Previously, arulajmani (Arul Ajmani) wrote…

"erroneously"

Thanks! Fixed.


pkg/kv/kvpb/api.proto line 2642 at r1 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

nit: s/unerlying/underlying/g

Thanks! Fixed.


pkg/kv/kvserver/batcheval/cmd_export.go line 257 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Unrelated to your change, in the release build case, should we be logging this?

Good idea, now we log a warning.
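The assert-in-dev, warn-in-release pattern being discussed might look roughly like this. All names here are illustrative; only build.IsRelease() and the quoted error text come from the diff above, and the real code returns an AssertionFailedf error rather than this sketch's plain error.

```go
package main

import "fmt"

// checkEarlyExit sketches the pattern: development builds fail loudly
// on an unexpected early exit so bugs are caught in tests, while
// release builds log a warning and keep serving. isRelease stands in
// for build.IsRelease().
func checkEarlyExit(isRelease, unexpectedExit bool) (warned bool, err error) {
	if !unexpectedExit {
		return false, nil
	}
	if !isRelease {
		// Hard assertion failure in development builds.
		return false, fmt.Errorf("ExportRequest exited without finishing write")
	}
	// Release build: warn (e.g. via log.Warningf) instead of erroring.
	return true, nil
}

func main() {
	warned, err := checkEarlyExit(false, true)
	fmt.Println(warned, err) // false ExportRequest exited without finishing write
	warned, err = checkEarlyExit(true, true)
	fmt.Println(warned, err) // true <nil>
}
```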


pkg/kv/kvserver/batcheval/cmd_export.go line 313 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

nit: this looks stale, I assume "RESUME_ELASTIC_CPU_LIMIT" is exactly what this comment is talking about -- should we get rid of this comment?

Removed.


pkg/sql/sem/builtins/builtins.go line 7557 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

I may be missing something, but going by the commit message, this function can't handle resume spans. Does that mean we should set this to false?

I added a bit to the commit message to clarify this. In this case, the caller knew how to handle the resume spans. But DistSender needs to know whether a resume span is possible to properly handle the request. The two lines changed in dist_sender effectively prevent it from sending the requests in parallel and ensure it expects to get a resume span, avoiding the issue that fingerprinting was hitting.

Collaborator Author

@stevendanna stevendanna left a comment


TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @adityamaru, @arulajmani, @rhu713, and @shermanCRL)

@stevendanna
Collaborator Author

The lint failure here seems to be a problem inside one of the linters. It passed on a re-run.

bors r=adityamaru,arulajmani

@craig
Contributor

craig bot commented Mar 8, 2023

Build succeeded:

Development

Successfully merging this pull request may close these issues.

kvserver: fix incorrect handling of ExportRequests that are pre-empted by the CPU limiter