Add option to replace SortMergeJoin with ShuffleHashJoin #1006

andygrove · 2024-10-08T22:16:45Z

What is the problem the feature request solves?

Other Spark accelerators, such as Spark RAPIDS and Apache Gluten, replace SortMergeJoin with ShuffleHashJoin for improved performance. We should evaluate this approach for Comet.

Spark RAPIDS

  val ENABLE_REPLACE_SORTMERGEJOIN = conf("spark.rapids.sql.replaceSortMergeJoin.enabled")
    .doc("Allow replacing sortMergeJoin with HashJoin")
    .booleanConf
    .createWithDefault(true)

Apache Gluten

  val COLUMNAR_FPRCE_SHUFFLED_HASH_JOIN_ENABLED =
    buildConf("spark.gluten.sql.columnar.forceShuffledHashJoin")
      .internal()
      .booleanConf
      .createWithDefault(true)

/**
 * If force ShuffledHashJoin, convert [[SortMergeJoinExec]] to [[ShuffledHashJoinExec]]. There is no
 * need to select a smaller table as buildSide here, it will be reselected when offloading.
 */
object RewriteJoin extends RewriteSingleNode with JoinSelectionHelper {
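
To illustrate the kind of rewrite described in the Gluten comment above, here is a minimal, self-contained sketch over a toy plan model. The case classes, the `ReplaceSortMergeJoin` object, and the `enabled` flag are all hypothetical stand-ins for illustration; they are not Spark, Gluten, RAPIDS, or Comet APIs. The sketch also assumes that a sort feeding the join on the join keys becomes unnecessary once the join is hash-based, so it drops that sort, and it ignores build-side selection entirely.

```scala
// Toy physical-plan model. These case classes are illustrative stand-ins,
// not Spark's SparkPlan operators.
sealed trait Plan
final case class Scan(table: String) extends Plan
final case class Sort(keys: Seq[String], child: Plan) extends Plan
final case class SortMergeJoin(
    leftKeys: Seq[String], rightKeys: Seq[String], left: Plan, right: Plan) extends Plan
final case class ShuffledHashJoin(
    leftKeys: Seq[String], rightKeys: Seq[String], left: Plan, right: Plan) extends Plan

// Hypothetical rule: replace every SortMergeJoin with a ShuffledHashJoin.
object ReplaceSortMergeJoin {
  // Assumption: a Sort directly under the join, on exactly the join keys,
  // existed only to satisfy the sort-merge join and can be removed.
  private def stripSort(keys: Seq[String], plan: Plan): Plan = plan match {
    case Sort(sortKeys, child) if sortKeys == keys => child
    case other => other
  }

  // `enabled` mimics a boolean config switch such as the ones quoted above.
  def apply(plan: Plan, enabled: Boolean = true): Plan =
    if (enabled) rewrite(plan) else plan

  private def rewrite(plan: Plan): Plan = plan match {
    case SortMergeJoin(lk, rk, l, r) =>
      ShuffledHashJoin(lk, rk, rewrite(stripSort(lk, l)), rewrite(stripSort(rk, r)))
    case ShuffledHashJoin(lk, rk, l, r) =>
      ShuffledHashJoin(lk, rk, rewrite(l), rewrite(r))
    case Sort(keys, child) => Sort(keys, rewrite(child))
    case scan: Scan => scan
  }
}

object Example {
  val before: Plan = SortMergeJoin(
    Seq("id"), Seq("id"),
    Sort(Seq("id"), Scan("orders")),
    Sort(Seq("id"), Scan("customers")))

  // With the switch on, the join is replaced and both input sorts are dropped.
  val after: Plan = ReplaceSortMergeJoin(before)
}
```

A real rule would operate on Spark's physical plan nodes and would also need to handle join types and build-side selection, which this sketch leaves out.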

Describe the potential solution

No response

Additional context

No response

viirya · 2024-10-09T01:22:13Z

It sounds reasonable. The vectorized implementation of SMJ (SortMergeJoin) in DataFusion looks inefficient, and I'm not sure whether an optimized SMJ algorithm exists for vectorized execution. If not, replacing SMJ with SHJ (ShuffledHashJoin) should improve performance.

andygrove added the enhancement (New feature or request) and performance labels Oct 8, 2024

This was referenced Oct 8, 2024

[EPIC] Improve performance of TPC-H queries #391

Open

[EPIC] Improve performance of TPC-DS queries #858

Open

andygrove self-assigned this Oct 9, 2024

andygrove mentioned this issue Oct 9, 2024

perf: Add experimental feature to replace SortMergeJoin with ShuffledHashJoin #1007

Merged

andygrove closed this as completed in #1007 Oct 21, 2024
