sql: adjust physical planning heuristics around small joins #137562

yuzefovich · 2024-12-16T22:18:52Z

This PR adjusts the physical planner heuristics so that we no longer force the plan distribution in distsql=auto mode when we have a hash or a merge join with at least one equality column (i.e. non-cross join) when we have "small" inputs (less than 1k rows combined). See each commit for details.

Epic: None

cockroach-teamcity · 2024-12-16T22:19:05Z

This change is

Long time ago in a74611e we introduced a change so that when we merge two plans when performing a cross join and both plans have a single result router (i.e. a single processor), we would plan the cross join on the same node as the _left_ input. This was done under the assumption that the left input is processed first by the joiner, but this assumption has changed since then (long time ago too), so we should be choosing the _right_ input. The impact seems very minor though, just something I noticed while working around this area. Release note: None

Previously, whenever we had a hash or a merge join with at least one equality column (i.e. not a cross join), we would force the distribution of the plan. This can be suboptimal when we need to process a small number of rows. This commit makes the heuristic a bit more configurable: we now will choose to distribute only when performing a "large" join, where "large" is defined as having both inputs produce at least 1k rows (configured via new `distribute_join_row_count_threshold` session variable). If both inputs don't have stats available, we fall back to the old behavior of distributing the plan (which seems like a safer option). Note the following operations still force the plan distribution unconditionally: - table statistics - inverted filter (when we have a union of inverted spans) - inverted join - window functions with at least one window frame with PARTITION BY clause - zigzag join. I think there are cases where we don't want to force the distribution because of these operations, but they seem less of a concern, so I left a few TODOs. Ideally, eventually we'll have the optimizer make the decision. Release note (sql change): DistSQL physical planning decisions under `distsql=auto` mode have been adjusted in the following manner: presence of the hash or merge join no longer forces the plan to be distributed. Namely, we might not choose to distribute the plan if both inputs to the join are expected produce small number of rows (less than 1k combined by default, configurable via `distribute_join_row_count_threshold` session variable).

This commit introduces some logging, hidden behind a verbosity level, to the physical planning whenever we choose `shouldDistribute` recommendation. This will show up in the trace and should be helpful to understand why we decided to distribute a particular query. Release note: None

yuzefovich requested review from mgartner, michae2 and a team December 16, 2024 22:18

yuzefovich force-pushed the distsql-join branch 2 times, most recently from 3c5d71c to 320e6c9 Compare December 16, 2024 23:53

yuzefovich added 3 commits December 16, 2024 15:56

yuzefovich force-pushed the distsql-join branch from 320e6c9 to 936d9ab Compare December 17, 2024 00:01

cockroachdb deleted a comment from blathers-crl bot Dec 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sql: adjust physical planning heuristics around small joins #137562

sql: adjust physical planning heuristics around small joins #137562

yuzefovich commented Dec 16, 2024

cockroach-teamcity commented Dec 16, 2024

sql: adjust physical planning heuristics around small joins #137562

Are you sure you want to change the base?

sql: adjust physical planning heuristics around small joins #137562

Conversation

yuzefovich commented Dec 16, 2024

cockroach-teamcity commented Dec 16, 2024