Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: adjust physical planning heuristics around small joins #137562

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

yuzefovich
Copy link
Member

This PR adjusts the physical planner heuristics so that we no longer force the plan distribution in distsql=auto mode when we have a hash or a merge join with at least one equality column (i.e. non-cross join) when we have "small" inputs (less than 1k rows combined). See each commit for details.

Epic: None

@yuzefovich yuzefovich requested review from mgartner, michae2 and a team December 16, 2024 22:18
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@yuzefovich yuzefovich force-pushed the distsql-join branch 2 times, most recently from 3c5d71c to 320e6c9 Compare December 16, 2024 23:53
Long time ago in a74611e we introduced
a change so that when we merge two plans when performing a cross join
and both plans have a single result router (i.e. a single processor), we
would plan the cross join on the same node as the _left_ input. This was
done under the assumption that the left input is processed first by the
joiner, but this assumption has changed since then (long time ago too),
so we should be choosing the _right_ input. The impact seems very minor
though, just something I noticed while working around this area.

Release note: None
Previously, whenever we had a hash or a merge join with at least one
equality column (i.e. not a cross join), we would force the distribution
of the plan. This can be suboptimal when we need to process a small
number of rows. This commit makes the heuristic a bit more configurable:
we now will choose to distribute only when performing a "large" join,
where "large" is defined as having both inputs produce at least 1k rows
(configured via new `distribute_join_row_count_threshold` session
variable). If both inputs don't have stats available, we fall back to
the old behavior of distributing the plan (which seems like a safer
option).

Note the following operations still force the plan distribution
unconditionally:
- table statistics
- inverted filter (when we have a union of inverted spans)
- inverted join
- window functions with at least one window frame with PARTITION BY
clause
- zigzag join.

I think there are cases where we don't want to force the distribution
because of these operations, but they seem less of a concern, so I left
a few TODOs. Ideally, eventually we'll have the optimizer make the
decision.

Release note (sql change): DistSQL physical planning decisions under
`distsql=auto` mode have been adjusted in the following manner:
presence of the hash or merge join no longer forces the plan to be
distributed. Namely, we might not choose to distribute the plan if both
inputs to the join are expected produce small number of rows (less than
1k combined by default, configurable via
`distribute_join_row_count_threshold` session variable).
This commit introduces some logging, hidden behind a verbosity level, to
the physical planning whenever we choose `shouldDistribute`
recommendation. This will show up in the trace and should be helpful to
understand why we decided to distribute a particular query.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants