
kvstreamer: improve avg response estimation for partial responses #103602

Merged
merged 1 commit into from
May 31, 2023

Conversation

yuzefovich
Member

@yuzefovich yuzefovich commented May 18, 2023

This commit fixes the avg response estimator's handling of partial
responses. A "partial response" arises when a single ScanRequest is
evaluated across multiple BatchRequests because the full response is
large enough to exceed the TargetBytes budget we used. For example, if
a ScanRequest needs to fetch 100KiB of data but the Streamer only gives
it a 1KiB TargetBytes budget, then the ScanRequest will be evaluated
across 100 BatchRequests, each containing a single "partial" response
of 1KiB in size.

Previously, the avg response estimator treated these partial responses
independently, so in that example it would arrive at an estimate of
1KiB. This is very suboptimal: the goal of the estimator is to guess
the footprint of the whole response to a single-range request. As a
result, the streamer could keep paginating the ScanResponse in very
small batches (with the estimate never increasing), which would then
lead to long latencies for query evaluation.

This commit fixes the problem by adjusting what the estimator includes
in the denominator of the average computation. In particular, only
"non-resume" responses are now included in that count: when we receive
the first paginated response we increment the counter of responses, but
we don't on any subsequent "resume" responses. This allows us to
"update" the footprint of the big ScanResponse that is being paginated
across multiple BatchRequests. In the example above, the estimate grows
exponentially, so instead of performing 100 BatchRequests we now do
only 7.
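
The growth can be sketched with a toy model (illustrative code, not the actual streamer; `numBatches` and its constants are hypothetical, and the real estimator's details differ slightly, which is why it reaches 100KiB in 7 requests rather than the 8 this naive model produces):

```go
package main

import "fmt"

// numBatches simulates fetching totalKiB of data when each BatchRequest's
// TargetBytes budget is set to the current average response estimate.
// countResume controls whether "resume" responses are counted in the
// denominator (old behavior) or not (new behavior).
func numBatches(totalKiB, initialBudgetKiB int, countResume bool) int {
	fetched, batches, numStarted := 0, 0, 0
	budget := initialBudgetKiB
	for fetched < totalKiB {
		got := budget
		if rem := totalKiB - fetched; rem < got {
			got = rem
		}
		fetched += got
		batches++
		if countResume || numStarted == 0 {
			numStarted++
		}
		// The average estimate drives the next TargetBytes budget. With
		// numStarted pinned at 1, the budget doubles on every batch.
		budget = fetched / numStarted
	}
	return batches
}

func main() {
	fmt.Println(numBatches(100, 1, true))  // old: estimate stuck at 1KiB
	fmt.Println(numBatches(100, 1, false)) // new: estimate doubles each time
}
```

With the old denominator the budget never grows and the model performs 100 batches; with the new one the fetched amounts go 1, 1, 2, 4, 8, 16, 32, 36KiB, for 8 batches total.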

This requires tracking whether we have received a response to
a particular ScanRequest, and in order to optimize the space usage of
such tracking, a simple utility bitmap package is introduced.
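
Such a bitmap can be as simple as a slice of uint64 words with set/test operations. A minimal sketch (the actual utility package's API is not shown in this thread, so these names are illustrative):

```go
package main

import "fmt"

// Bitmap tracks one bit per request, so remembering whether each of N
// ScanRequests has already received a response costs roughly N/8 bytes
// instead of N bools.
type Bitmap struct {
	words []uint64
}

// NewBitmap allocates a bitmap capable of holding n bits, all unset.
func NewBitmap(n int) *Bitmap {
	return &Bitmap{words: make([]uint64, (n+63)/64)}
}

// Set marks bit i as set.
func (b *Bitmap) Set(i int) { b.words[i>>6] |= 1 << (uint(i) & 63) }

// IsSet reports whether bit i is set.
func (b *Bitmap) IsSet(i int) bool { return b.words[i>>6]&(1<<(uint(i)&63)) != 0 }

func main() {
	b := NewBitmap(128)
	b.Set(3) // e.g. the request at position 3 received its first response
	fmt.Println(b.IsSet(3), b.IsSet(4))
}
```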

Impact on TPCH (average over 50 runs of `tpchvec/perf`):
```
Q1:	before: 3.21s	after: 3.23s	 0.79%
Q2:	before: 3.38s	after: 3.16s	 -6.42%
Q3:	before: 2.56s	after: 2.55s	 -0.11%
Q4:	before: 1.76s	after: 1.61s	 -8.41%
Q5:	before: 2.55s	after: 2.47s	 -3.37%
Q6:	before: 4.64s	after: 4.65s	 0.21%
Q7:	before: 5.89s	after: 5.56s	 -5.57%
Q8:	before: 1.09s	after: 1.07s	 -1.33%
Q9:	before: 5.61s	after: 5.55s	 -1.05%
Q10:	before: 2.21s	after: 2.09s	 -5.47%
Q11:	before: 0.97s	after: 0.94s	 -2.64%
Q12:	before: 4.88s	after: 4.43s	 -9.31%
Q13:	before: 1.15s	after: 1.01s	 -11.92%
Q14:	before: 0.45s	after: 0.45s	 1.16%
Q15:	before: 2.53s	after: 2.51s	 -0.58%
Q16:	before: 0.92s	after: 0.90s	 -2.38%
Q17:	before: 0.24s	after: 0.24s	 -0.58%
Q18:	before: 2.03s	after: 2.02s	 -0.46%
Q19:	before: 0.48s	after: 0.48s	 -0.99%
Q20:	before: 9.74s	after: 9.71s	 -0.38%
Q21:	before: 5.05s	after: 5.01s	 -0.81%
Q22:	before: 0.58s	after: 0.57s	 -0.38%
```

Addresses: #82164.
Fixes: #103586.

Release note: None

@blathers-crl

blathers-crl bot commented May 18, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Member

This change is Reviewable

@yuzefovich yuzefovich force-pushed the streamer-avg branch 3 times, most recently from 5dc93c4 to caa6786 Compare May 19, 2023 09:20
@yuzefovich yuzefovich added backport-22.2.x backport-23.1.x Flags PRs that need to be backported to 23.1 labels May 19, 2023
@yuzefovich yuzefovich marked this pull request as ready for review May 19, 2023 09:21
@yuzefovich yuzefovich requested a review from a team as a code owner May 19, 2023 09:21
@yuzefovich yuzefovich requested review from msirek, michae2 and DrewKimball and removed request for msirek May 19, 2023 09:21
@yuzefovich yuzefovich force-pushed the streamer-avg branch 2 times, most recently from 20b1d04 to a1d1583 Compare May 22, 2023 04:17
Collaborator

@DrewKimball DrewKimball left a comment


Reviewed 9 of 9 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @yuzefovich)


pkg/kv/kvclient/kvstreamer/avg_response_estimator.go line 28 at r1 (raw file):

	// only counted once in this number (although if the request spans multiple
	// ranges, each single-range request is tracked separately here).
	numNonResumeResponses float64

[nit] Would numCompleteResponses work?


pkg/kv/kvclient/kvstreamer/avg_response_estimator.go line 62 at r1 (raw file):

// estimate includes the footprint of all those "resume" responses as well as of
// the first "non-resume" response.
// TODO(yuzefovich): we might want to have a separate estimate for Gets and

We could also consider a weighted moving average to handle cases where most responses are small, but one is very large


pkg/kv/kvclient/kvstreamer/requests_provider.go line 88 at r1 (raw file):

	//
	// isResumeScan is only allocated if at least one Scan request was enqueued.
	isResumeScan *bitmap.Bitmap

Is this bitmap necessary? Could we just check if ResponseHeader.ResumeSpan is set, and only increment the complete responses count when it's unset?

Member Author

@yuzefovich yuzefovich left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball and @michae2)


pkg/kv/kvclient/kvstreamer/avg_response_estimator.go line 28 at r1 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

[nit] Would numCompleteResponses work?

I don't think so - a ScanResponse can be "incomplete" and "non-resume" at the same time. Imagine a ScanRequest for [a, d) for which we first received ScanResponse on [a, b) - this "initial" ScanResponse is "non-resume" (because it wasn't created to evaluate a "resume" ScanRequest) and is not complete (because we need to issue another "resume" ScanRequest [b, d) to fetch remaining data). When we issue the second ScanRequest (which is the "resume" request), the response to that request can be complete, yet we don't want to include that into numNonResumeResponses number.

What this number tracks is the count of responses that are the first "chunks" (or "pages") of their corresponding requests. A Get can only return one KV, so it will always be a single "chunk", but for Scans that need to be paginated, only the first "page" is included in this count.

I'm having trouble coming up with better naming here, so curious to hear other ideas.


pkg/kv/kvclient/kvstreamer/avg_response_estimator.go line 62 at r1 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

We could also consider a weighted moving average to handle cases were most responses are small, but one is very large

True - there is an existing TODO along these lines, at the end of this function, about exploring other averaging functions.


pkg/kv/kvclient/kvstreamer/requests_provider.go line 88 at r1 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

Is this bitmap necessary? Could we just check if ResponseHeader.ResumeSpan is set, and only increment the complete responses count when it's unset?

This would change the meaning of what we're computing and could result in significant over-estimation of the average size. Consider an example where we have 3 ScanRequests like [a1, z1), [a2, z2), [a3, z3) and suppose that on the first BatchRequest we fetched 1KiB of data for each, stopping at b. None of the requests is complete, so if we do what you're suggesting, then the denominator would be 0 and the numerator is 3KiB. What should the average be in this case?

If we change one ScanRequest from [a1, z1) to [a1, b), then we still fetch 3KiB of data across 3 ScanRequests, yet we now have 1 complete, so the average response size now becomes 3KiB too, which is an over-estimate at this point (we know that 1 ScanResponse size is 1KiB, but two other requests aren't yet complete). If we proceed to paginating other two ScanRequests, then "total response bytes" will keep on growing while the "number of complete responses" will remain at 1, which will make the over-estimate much more significant.

Does this make sense?
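
A quick numeric sketch of the two denominators in this example (illustrative arithmetic only):

```go
package main

import "fmt"

func main() {
	totalBytes := 3.0 // KiB fetched so far across 3 partial ScanResponses

	// "Complete responses" denominator: only the shortened [a1, b) scan
	// has finished, so the average triples to 3KiB per response.
	fmt.Println(totalBytes / 1)

	// "Started requests" denominator: all 3 scans have begun, so the
	// average stays at the 1KiB actually observed per response so far.
	fmt.Println(totalBytes / 3)
}
```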

Collaborator

@DrewKimball DrewKimball left a comment


:lgtm: Looks great! I see you have backport tags - do you think it should be gated behind a setting (even if default on) or is it safe?

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @michae2 and @yuzefovich)


pkg/kv/kvclient/kvstreamer/avg_response_estimator.go line 28 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I don't think so - a ScanResponse can be "incomplete" and "non-resume" at the same time. Imagine a ScanRequest for [a, d) for which we first received ScanResponse on [a, b) - this "initial" ScanResponse is "non-resume" (because it wasn't created to evaluate a "resume" ScanRequest) and is not complete (because we need to issue another "resume" ScanRequest [b, d) to fetch remaining data). When we issue the second ScanRequest (which is the "resume" request), the response to that request can be complete, yet we don't want to include that into numNonResumeResponses number.

What this number tracks is the count of responses that are the first "chunks" (or "pages") of their corresponding requests. A Get can only return one KV, so it will always be a single "chunk", but for Scans that need to be paginated, only the first "page" is included in this count.

I'm having trouble coming up with better naming here, so curious to hear other ideas.

What we really care about here is whether we've started receiving results for a given request, right? So maybe we could do something like numResponsesStarted


pkg/kv/kvclient/kvstreamer/requests_provider.go line 88 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

This would change the meaning of what we're computing and could result in significant over-estimation of the average size. Consider an example where we have 3 ScanRequests like [a1, z1), [a2, z2), [a3, z3) and suppose that on the first BatchRequest we fetched 1KiB of data for each, stopping at b. None of the requests is complete, so if we do what you're suggesting, then the denominator would be 0 and the numerator is 3KiB. What should the average be in this case?

If we change one ScanRequest from [a1, z1) to [a1, b), then we still fetch 3KiB of data across 3 ScanRequests, yet we now have 1 complete, so the average response size now becomes 3KiB too, which is an over-estimate at this point (we know that 1 ScanResponse size is 1KiB, but two other requests aren't yet complete). If we proceed to paginating other two ScanRequests, then "total response bytes" will keep on growing while the "number of complete responses" will remain at 1, which will make the over-estimate much more significant.

Does this make sense?

That does make sense, thanks for the explanation. I don't see a better way to track this, in that case.

Member Author

@yuzefovich yuzefovich left a comment


I think it should be a safe change, so we can get away without introducing a setting. My rationale is that it fixes a known performance deficiency (an average response size estimate that was too low, leading to too many BatchRequests) without a real possible downside: we will start issuing BatchRequests with a larger TargetBytes budget, but that budget is always reserved up front, so there shouldn't be an impact on stability.

TFTR!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball and @michae2)


pkg/kv/kvclient/kvstreamer/avg_response_estimator.go line 28 at r1 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

What we really care about here is whether we've started receiving results for a given request, right? So maybe we could do something like numResponsesStarted

Yeah, that sounds right, I decided to go with numRequestsStarted.

@craig
Contributor

craig bot commented May 31, 2023

Build failed:

@yuzefovich
Member Author

Unrelated flake.

bors r+

@craig
Contributor

craig bot commented May 31, 2023

Build failed:

@yuzefovich
Member Author

Same flake, let's try one more time.

bors r+

@craig
Contributor

craig bot commented May 31, 2023

Build failed:

@yuzefovich
Member Author

A different flake #100240 this time.

bors r+

@craig
Contributor

craig bot commented May 31, 2023

Build succeeded:

@blathers-crl

blathers-crl bot commented May 31, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from cd8a11d to blathers/backport-release-22.2-103602: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Successfully merging this pull request may close these issues.

kvstreamer: suboptimal avg response size estimation