Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to replace SortMergeJoin with ShuffleHashJoin #1006

Closed
Tracked by #391 ...
andygrove opened this issue Oct 8, 2024 · 1 comment · Fixed by #1007
Closed
Tracked by #391 ...

Add option to replace SortMergeJoin with ShuffleHashJoin #1006

andygrove opened this issue Oct 8, 2024 · 1 comment · Fixed by #1007
Assignees
Labels
enhancement New feature or request performance

Comments

@andygrove
Copy link
Member

What is the problem the feature request solves?

Other Spark accelerators, such as Spark RAPIDS and Apache Gluten, replace SortMergeJoin with ShuffleHashJoin for improved performance. We should evaluate this approach for Comet.

Spark RAPIDS

  val ENABLE_REPLACE_SORTMERGEJOIN = conf("spark.rapids.sql.replaceSortMergeJoin.enabled")
    .doc("Allow replacing sortMergeJoin with HashJoin")
    .booleanConf
    .createWithDefault(true)

Apache Gluten

  val COLUMNAR_FPRCE_SHUFFLED_HASH_JOIN_ENABLED =
    buildConf("spark.gluten.sql.columnar.forceShuffledHashJoin")
      .internal()
      .booleanConf
      .createWithDefault(true)
/**
 * If force ShuffledHashJoin, convert [[SortMergeJoinExec]] to [[ShuffledHashJoinExec]]. There is no
 * need to select a smaller table as buildSide here, it will be reselected when offloading.
 */
object RewriteJoin extends RewriteSingleNode with JoinSelectionHelper {

Describe the potential solution

No response

Additional context

No response

@viirya
Copy link
Member

viirya commented Oct 9, 2024

It sounds reasonable. The vectorized implementation of SMJ looks inefficient in DataFusion. I'm not sure if there is any optimized algorithm for SMJ in vectorized execution. If not, using SHJ to replace SMJ will be good for performance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request performance
Projects
None yet
2 participants