-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kv: opt into Elastic CPU limiting resume spans #97916
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this! For posterity noting that this boolean also ensures that in a mixed-version cluster if an older binary client sends a request that is evaluated on a newer binary we will not paginate because of CPUOverLimit.
pkg/kv/kvpb/api.proto
Outdated
// expects early-termination of requests based on the Elastic CPU | ||
// limiter. | ||
// | ||
// Resume spans returned because of the unerlying request being |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: s/unerlying/underlying/g
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 9 of 10 files at r1, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @rhu713, @shermanCRL, and @stevendanna)
-- commits
line 11 at r1:
"erroneously"
pkg/kv/kvserver/batcheval/cmd_export.go
line 257 at r1 (raw file):
// onwards. if !build.IsRelease() { return result.Result{}, errors.AssertionFailedf("ExportRequest exited without " +
Unrelated to your change, in the release build case, should we be logging this?
pkg/kv/kvserver/batcheval/cmd_export.go
line 313 at r1 (raw file):
EndKey: args.EndKey, } // TODO(during review): Do we want to add another resume reason
nit: this looks stale, I assume "RESUME_ELASTIC_CPU_LIMIT" is exactly what this comment is talking about -- should we get rid of this comment?
pkg/sql/sem/builtins/builtins.go
line 7557 at r1 (raw file):
// DistSender so it likely substantially impacts // performance. ReturnElasticCPUResumeSpans: true,
I may be missing something, but going by the commit message, this function can't handle resume spans. Does that mean we should set this to false?
3edc2d5
to
1320f4c
Compare
In cockroachdb#97886, we changed ExportRequest such that it returns resume spans that result from the elastic CPU limiter all the way to the caller. This has at least two problems: 1) In a mixed-version state, the caller might not yet know how to handle resume spans. This could result in incomplete responses erroneously being used as if they were full responses. 2) The DistSender inspects a request to determine whether it may stop early. If it shouldn't be able to stop early, then the request is split up, possibly sent in parallel, and all responses are combined. The code which combines responses asserts that neither side has a resume span. As a result, we've seen failures such as crdb_internal.fingerprint(): combining /Tenant/2/Table/106/1/-{8403574544142222370/0-37656332809536692} with /Tenant/{2/Table/106/1/436440321206557763/0-3} since the change was made. Here, we add a new request header field to allow callers to indicate whether they are prepared to accept resume spans. Further, we add that new field into the logic in DistSender which decides how to process requests. Now, if ReturnElasticCPUResumeSpans is set, the DistSender will no longer send requests in parallel and knows to expect a possible early exit. The downside here is that crdb_internal.fingerprint won't have its requests sent in parallel. Release note: None Epic: none
1320f4c
to
83a3724
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @adityamaru, @arulajmani, @rhu713, and @shermanCRL)
Previously, arulajmani (Arul Ajmani) wrote…
"erroneously"
Thanks! Fixed.
pkg/kv/kvpb/api.proto
line 2642 at r1 (raw file):
Previously, adityamaru (Aditya Maru) wrote…
nit: s/unerlying/underlying/g
Thanks! Fixed.
pkg/kv/kvserver/batcheval/cmd_export.go
line 257 at r1 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Unrelated to your change, in the release build case, should we be logging this?
Good idea, now we log a warning.
pkg/kv/kvserver/batcheval/cmd_export.go
line 313 at r1 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: this looks stale, I assume "RESUME_ELASTIC_CPU_LIMIT" is exactly what this comment is talking about -- should we get rid of this comment?
Removed.
pkg/sql/sem/builtins/builtins.go
line 7557 at r1 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
I may be missing something, but going by the commit message, this function can't handle resume spans. Does that mean we should set this to false?
I added a bit to the commit message to clarity this. In this case, the caller knew how to handle the resume spans. But, DistSender needs to know whether a resume span is possible to properly handle the request. The two lines changed in dist_sender effectively prevents dist_sender from sending the requests in parallel and ensures it expects to get a resume span, avoiding the issue that fingerprinting was hitting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @adityamaru, @arulajmani, @rhu713, and @shermanCRL)
The lint failure here seems to be a problem inside one of the linters. It passed on a re-run. bors r=adityamaru,arulajmani |
Build succeeded: |
In #96691, we changed ExportRequest such that it returns resume spans that result from the elastic CPU limiter all the way to the caller.
This has at least two problems:
In a mixed-version state, the caller might not yet know how to
handle resume spans. This could result in incomplete responses
erroneous being used as if they were full responses.
The DistSender inspects a request to determine whether it may stop
early. If it shouldn't be able to stop early, then the request is
split up, possibly sent in parallel, and all responses are
combined. The code which combines responses asserts that neither side
has a resume span. As a result, we've seen failures such as
since the change was made.
Here, we add a new request header field to allow callers to indicate whether they are prepared to accept resume spans. Further, we add that new field into the logic in DistSender which decides how to process requests.
The downside here is that crdb_internal.fingerprint won't have its requests sent in parallel.
Release note: None
Fixes #97886, #97903
Epic: none