kvstreamer: consider setting MaxSpanRequestKeys on parallel batches issued by the Streamer #67885

Open
Tracked by #54680
yuzefovich opened this issue Jul 21, 2021 · 1 comment
Labels
C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior. T-sql-queries SQL Queries Team

yuzefovich commented Jul 21, 2021

Once #67040 is implemented, we will have a library that performs parallel scans while adhering to memory limits. To simplify the discussion on the RFC, we deliberately set aside queries with LIMIT. As a follow-up to improving the implementation and usage of the Streamer library, we should revisit the cases where we have a hard or a soft limit and set MaxSpanRequestKeys on the batches whenever appropriate.

Quoting Nathan from the RFC review:

At a minimum, the Streamer can use MaxSpanRequestKeys to place upper bounds on each
individual batch of ScanRequests. Even if we assume that all other concurrent batches will
return 0 rows, this can still be useful to place an upper bound on how far we can overshoot
the limit. Without the use of MaxSpanRequestKeys, there is no limit to how far we can
overshoot. With a large enough TargetBytes and with small keys, we can pull back
thousands of unnecessary keys and scan hundreds of unnecessary ranges (especially with
many MVCC tombstones in the way). With the most conservative use of
MaxSpanRequestKeys, we can bound the amount we can overshoot to (P - 1) * limit keys.
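
To make the bound in the quote concrete, here is a minimal Python sketch (not the Streamer's Go code). It assumes P concurrent batches, each capped at `limit` keys, and it ignores TargetBytes entirely. The function `overshoot`, its parameters, and the sample workload are hypothetical and exist only for illustration.

```python
from typing import Optional, Sequence


def overshoot(
    available_keys: Sequence[int],
    limit: int,
    max_keys_per_batch: Optional[int] = None,
) -> int:
    """Return how many keys beyond `limit` the concurrent batches fetch.

    available_keys[i] is the number of keys batch i would return if left
    uncapped. max_keys_per_batch stands in for a per-batch
    MaxSpanRequestKeys value (an assumption of this sketch); None models
    the uncapped case.
    """
    if limit < 0 or any(n < 0 for n in available_keys):
        raise ValueError("key counts and limit must be non-negative")
    if max_keys_per_batch is None:
        fetched = sum(available_keys)
    else:
        # Each batch returns at most max_keys_per_batch keys.
        fetched = sum(min(n, max_keys_per_batch) for n in available_keys)
    return max(0, fetched - limit)


# Hypothetical workload: P = 4 concurrent batches serving a LIMIT 10 query.
batches = [5000, 3000, 7000, 100]
uncapped = overshoot(batches, limit=10)
capped = overshoot(batches, limit=10, max_keys_per_batch=10)
```

With the cap set to the limit, the four batches fetch at most 40 keys, so the overshoot is (4 - 1) * 10 = 30. Without the cap, the overshoot depends only on how much data the batches happen to cover (15090 keys in this made-up workload).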

This discussion applies to lookup (not index) joins, regardless of whether the lookup columns form a key and whether there is an ON expression. Quoting Becca:

When lookup columns form a key, we don't know that all input rows will have matches. We
may select a larger number of input rows, but only want the top k that have matches.

When lookup columns don't form a key:
- empty ON expression - we could set a hard limit on each lookup.
- non-empty ON expression - the optimizer can estimate the selectivity of the ON
  expression and determine a soft limit (or "limit hint") based on that.
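
The cases above can be sketched as a small decision function. This is a hedged Python illustration, not optimizer or Streamer code: `LookupLimit` and `choose_lookup_limit` are hypothetical names, and the soft-limit formula `ceil(limit / selectivity)` is an assumption of this sketch, since the quote only says the optimizer could derive a limit hint from the estimated selectivity. The quote names no per-lookup limit for the key case, so the sketch leaves it unset there.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LookupLimit:
    keys: int   # number of keys to request per lookup
    hard: bool  # True for a hard limit, False for a soft limit (limit hint)


def choose_lookup_limit(
    limit: int,
    lookup_cols_form_key: bool,
    has_on_expr: bool,
    on_selectivity: Optional[float] = None,
) -> Optional[LookupLimit]:
    """Pick a per-lookup key limit for a lookup join under a LIMIT.

    Returns None when this sketch derives no per-lookup limit.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    if lookup_cols_form_key:
        # Not every input row is known to match, so no per-lookup limit
        # is derived here.
        return None
    if not has_on_expr:
        # Empty ON expression: a hard limit on each lookup.
        return LookupLimit(keys=limit, hard=True)
    if on_selectivity is None:
        # No selectivity estimate available, so no hint can be computed.
        return None
    if not 0.0 < on_selectivity <= 1.0:
        raise ValueError("on_selectivity must be in (0, 1]")
    # Assumed formula: fetch enough rows that about `limit` of them are
    # expected to pass the ON filter.
    return LookupLimit(keys=math.ceil(limit / on_selectivity), hard=False)
```

For example, with a limit of 10 and an estimated ON selectivity of 0.25, the sketch proposes a soft limit of 40 keys per lookup; with an empty ON expression it proposes a hard limit of 10.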

Jira issue: CRDB-8760

@yuzefovich yuzefovich added the C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior. label Jul 21, 2021
@yuzefovich yuzefovich self-assigned this Jul 21, 2021
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Jul 21, 2021
@yuzefovich yuzefovich removed their assignment Jul 7, 2022
@yuzefovich yuzefovich changed the title sql: consider setting MaxSpanRequestKeys on parallel batches issued by the Streamer kvstreamer: consider setting MaxSpanRequestKeys on parallel batches issued by the Streamer Jul 21, 2022

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

Projects
Status: Backlog