[SPARK-26065][SQL] Change query hint from a `LogicalPlan` to a field #23036

maryannxue · 2018-11-14T19:07:33Z

What changes were proposed in this pull request?

The existing query hint implementation relies on a logical plan node ResolvedHint to store query hints in logical plans, and on Statistics in physical plans. Since ResolvedHint is not really a logical operator and can break the pattern matching for existing and future optimization rules, it is an issue for the Optimizer, just as the old AnalysisBarrier was for the Analyzer.

Given the fact that all our query hints are either 1) a join hint, i.e., broadcast hint; or 2) a re-partition hint, which is indeed an operator, we only need to add a hint field on the Join plan and that will be a good enough solution for the current hint usage.

This PR lets the Join node have a hint for its left sub-tree and another hint for its right sub-tree; each hint is the merged result of all the effective hints specified in the corresponding sub-tree. The "effectiveness" of a hint, i.e., whether that hint should be propagated to the Join node, is currently consistent with the hint propagation rules originally implemented in the Statistics approach. Note that the ResolvedHint node still has to live through the analysis stage because of the Dataset interface, but it will be removed and its hint moved to the Join node in the "pre-optimization" stage.
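The idea of moving a hint from a standalone node into a field on the join can be sketched with a toy plan model. The types below only mimic the shape of the classes named in this description; they are not the real Catalyst classes, and the `extract` and `eliminate` helpers are hypothetical names. As a simplification, this sketch only picks up hint nodes stacked directly on a join child, whereas the PR follows the existing propagation rules.

```scala
// Toy model only: simplified stand-ins, NOT the real Catalyst classes.
object HintFieldSketch {
  final case class HintInfo(broadcast: Boolean = false) {
    // Merging keeps the broadcast flag if either hint sets it.
    def merge(other: HintInfo): HintInfo = HintInfo(broadcast || other.broadcast)
  }
  final case class JoinHint(leftHint: Option[HintInfo], rightHint: Option[HintInfo])

  sealed trait Plan
  final case class Relation(name: String) extends Plan
  final case class ResolvedHint(child: Plan, hints: HintInfo) extends Plan
  final case class Join(left: Plan, right: Plan, hint: JoinHint) extends Plan

  // Strip hint nodes stacked directly on top of a plan and merge what they carried.
  private def extract(plan: Plan): (Plan, Option[HintInfo]) = plan match {
    case ResolvedHint(child, h) =>
      val (stripped, inner) = extract(child)
      (stripped, Some(inner.fold(h)(_.merge(h))))
    case other => (other, None)
  }

  // Move hints into the enclosing join; drop hints that have no join above them.
  def eliminate(plan: Plan): Plan = plan match {
    case Join(l, r, existing) =>
      val (left, lh) = extract(l)
      val (right, rh) = extract(r)
      val merged = JoinHint(lh.orElse(existing.leftHint), rh.orElse(existing.rightHint))
      Join(eliminate(left), eliminate(right), merged)
    case ResolvedHint(child, _) => eliminate(child)
    case r: Relation => r
  }
}
```

With this model, a hint wrapped around the right child ends up in `rightHint` of the join, and a hint with no join above it simply disappears.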

This PR also introduces a change in how hints work with join reordering. Before this PR, hints would stop join reordering. For example, in "a.join(b).join(c).hint("broadcast").join(d)", the broadcast hint would stop d from participating in the cost-based join reordering while still allowing reordering of the joins under the hint node. After this PR, though, the broadcast hint will not interfere with join reordering at all: if, after reordering, a relation associated with a hint stays unchanged or equivalent to the original relation, the hint is retained; otherwise it is discarded. For example, the original plan is like "a.join(b).hint("broadcast").join(c).hint("broadcast").join(d)", thus the join order is "a JOIN b JOIN c JOIN d". So if after reordering the join order becomes "a JOIN b JOIN (c JOIN d)", the plan will be like "a.join(b).hint("broadcast").join(c.join(d))"; but if after reordering the join order becomes "a JOIN c JOIN b JOIN d", the plan will be like "a.join(c).join(b).hint("broadcast").join(d)".

How was this patch tested?

Added new tests.

maryannxue · 2018-11-14T19:09:38Z

cc @gatorsmile @cloud-fan @rxin @juliuszsompolski

SparkQA · 2018-11-14T22:42:28Z

Test build #98836 has finished for PR 23036 at commit 785a423.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile

Good job! The major issue is the test case coverage.


gatorsmile · 2018-12-15T01:54:16Z

sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala

-      // returned by cache lookup should not have hint info. If we lookup the cache with a
-      // semantically same plan with a different hint info, `CacheManager.useCachedData` will take
-      // care of it and retain the hint info in the lookup input plan.
-      statsOfPlanToCache.copy(hints = HintInfo())


gatorsmile · 2018-12-15T02:02:02Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/SameResultSuite.scala

+    val df2 = testRelation.join(testRelation)
+    val df1Optimized = Optimize.execute(df1.analyze)
+    val df2Optimized = Optimize.execute(df2.analyze)
+    assertSameResult(df1Optimized, df2Optimized)


This should be a new test suite for EliminateResolvedHint. We only need to compare the plans and verify the result.

No. This is to test Join.doCanonicalize().

gatorsmile · 2018-12-15T02:27:01Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala

@@ -453,6 +454,7 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product {
        case Some(serde) => table.identifier :: serde :: Nil
        case _ => table.identifier :: Nil
      }
+    case hint: JoinHint if hint.leftHint.isEmpty && hint.rightHint.isEmpty => Nil


Can we avoid adding this? Let us add override def simpleString in case class Join? Does this help?

or we can override stringArgs in Join.

gatorsmile · 2018-12-15T02:44:31Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

-      case j @ Join(left, right, NaturalJoin(joinType), condition) if j.resolvedExceptNatural =>
+        commonNaturalJoinProcessing(left, right, joinType, usingCols, None, hint)
+      case j @ Join(left, right, NaturalJoin(joinType), condition, hint)
+        if j.resolvedExceptNatural =>


Nit: two more spaces

gatorsmile · 2018-12-15T02:49:41Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala

@@ -40,23 +40,33 @@ object CostBasedJoinReorder extends Rule[LogicalPlan] with PredicateHelper {
    if (!conf.cboEnabled || !conf.joinReorderEnabled) {
      plan
    } else {
+      // Use a map to track the hints on the join items. If a join relation turns out unchanged
+      // at the end of the join reorder, we can apply the original hint back to it if any.


This needs a few test cases to ensure this works as expected.

gatorsmile · 2018-12-15T02:58:07Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala

@@ -164,25 +167,35 @@ object ExtractFiltersAndInnerJoins extends PredicateHelper {
   * was involved in an explicit cross join. Also returns the entire list of join conditions for
   * the left-deep tree.
   */
-  def flattenJoin(plan: LogicalPlan, parentJoinType: InnerLike = Inner)
+  def flattenJoin(


For the changes in this function, we need a few test cases in the rule ReorderJoin

The tests are in JoinHintsSuite: "hint preserved after join reorder".

gatorsmile · 2018-12-15T03:03:43Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/hints.scala

+      // except and intersect are semi/anti-joins which won't return more data then
+      // their left argument, so the broadcast hint should be propagated here
+      case i: Intersect => collectHints(i.left)
+      case e: Except => collectHints(e.left)


Test cases for Intersect and Except are needed.


cloud-fan · 2018-12-18T15:09:34Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala

@@ -40,23 +40,33 @@ object CostBasedJoinReorder extends Rule[LogicalPlan] with PredicateHelper {
    if (!conf.cboEnabled || !conf.joinReorderEnabled) {
      plan
    } else {
+      // Use a map to track the hints on the join items. If a join relation turns out unchanged


How do we define "unchanged"? If (a join b) join c becomes (b join a) join c, is there any hint we want to retain?

cloud-fan · 2018-12-18T15:10:47Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala

-        val (leftPlans, leftConditions) = extractInnerJoins(left)
-        val (rightPlans, rightConditions) = extractInnerJoins(right)
+      case Join(left, right, _: InnerLike, Some(cond), hint) =>
+        hint.leftHint.map(hintMap.put(left, _))


For a purely side-effecting function, use foreach instead of map.
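A small illustration of this point, using a hypothetical `record` helper and a string-keyed map in place of the plan-keyed hint map from the diff:

```scala
import scala.collection.mutable

object OptionSideEffectSketch {
  // Hypothetical stand-in for the hint map in the diff: relation name -> hint name.
  def record(hint: Option[String], key: String): mutable.Map[String, String] = {
    val hintMap = mutable.HashMap.empty[String, String]
    // foreach returns Unit, which makes it explicit that only the side effect matters.
    // map would also run the put, but it builds a nested Option that is thrown away.
    hint.foreach(h => hintMap.put(key, h))
    hintMap
  }
}
```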

cloud-fan · 2018-12-18T15:14:27Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/hints.scala

+ * Hint that is associated with a [[Join]] node, with [[HintInfo]] on its left child and on its
+ * right child respectively.
+ */
+case class JoinHint(


The indentation is wrong here

You mean it should be 2 spaces instead of 4 before leftHint and rightHint?

https://github.com/databricks/scala-style-guide#spacing-and-indentation

SparkQA · 2019-01-02T18:33:56Z

Test build #100648 has finished for PR 23036 at commit 93f33d9.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-01-02T19:17:58Z

Test build #100652 has finished for PR 23036 at commit ee0c844.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-01-03T00:05:25Z

Test build #100653 has finished for PR 23036 at commit 470d682.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-01-04T13:21:22Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala

+      input: Seq[(LogicalPlan, InnerLike)],
+      conditions: Seq[Expression],
+      leftPlans: Seq[LogicalPlan],
+      hintMap: Map[Seq[LogicalPlan], HintInfo]): LogicalPlan = {


why the map key is Seq[LogicalPlan]?

After ReorderJoin, new conditions might be pushed into a join relation. For example, in https://github.com/apache/spark/pull/23036/files#diff-fb10f33381c6d7cc8bfbde63d7f2c557R109, the join order has remained the same, but the first join between "a" and "b" now has a new condition, "a.a1 = b.b1". I'd still wanna treat it as the same join, thus retaining its hint. If we were to compare the join LogicalPlan nodes, they would not match. Since ReorderJoin only deals with left-deep trees, as long as the seq of join child relations is fixed, the join order is fixed too. So we can compare the seq of join child relations instead, in order to accommodate this "new conditions pushed down" situation.
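The reasoning above can be sketched with a toy model: two joins over the same children differ as plan nodes once a condition is pushed in, but the sequence of their leaf relations is identical and can serve as a stable map key. The `Relation`, `Join`, and `relations` names below are illustrative only and are not the real Catalyst API.

```scala
// Toy model only: simplified stand-ins for plan nodes.
object JoinKeySketch {
  sealed trait Plan
  final case class Relation(name: String) extends Plan
  final case class Join(left: Plan, right: Plan, condition: Option[String]) extends Plan

  // Left-to-right leaf relations of a join tree; conditions are ignored on purpose.
  def relations(plan: Plan): Seq[Plan] = plan match {
    case Join(l, r, _) => relations(l) ++ relations(r)
    case leaf => Seq(leaf)
  }

  val before: Plan = Join(Relation("a"), Relation("b"), None)
  val after: Plan = Join(Relation("a"), Relation("b"), Some("a.a1 = b.b1"))

  // Keyed by child relations, so the lookup survives the pushed-down condition.
  val hintMap: Map[Seq[Plan], String] = Map(relations(before) -> "broadcast")
}
```

Here `before != after` as plan nodes, yet `hintMap.get(relations(after))` still finds the hint.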

cloud-fan · 2019-01-04T13:34:17Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/hints.scala

+  def collectHints(plan: LogicalPlan): Seq[HintInfo] = {
+    plan match {
+      case h: ResolvedHint => collectHints(h.child) :+ h.hints
+      case u: UnaryNode => collectHints(u.child)


I'm not sure if it's safe to collect hints through other operators. E.g. Generate is a unary node which produces more data than its child, and we may add more hints in the future which can't be propagated through operators.

I think a safer way is to only collect hints from the ResolvedHint operator if it's a child of Join.

This is following the original behavior (which is the original hint bottom-up propagation logic in stats visitor) although I'm more inclined to make it work as you suggested here.

I'll start a follow-up PR if any hint behavior needs to be revisited

Create an umbrella JIRA and include all these follow-up JIRAs. For example, add a new conf for enabling/disabling the silent ignorance of inapplicable hints.
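The two propagation strategies discussed in this thread can be compared on a toy model. The node types and both function names below are hypothetical; `Unary` stands in for any unary operator such as Filter, Project, or Generate.

```scala
// Toy model only: simplified stand-ins for plan nodes.
object CollectHintsSketch {
  sealed trait Plan
  final case class Relation(name: String) extends Plan
  final case class ResolvedHint(child: Plan, hint: String) extends Plan
  final case class Unary(op: String, child: Plan) extends Plan

  // Behavior kept by the PR: keep walking down through unary operators.
  def collectThroughUnary(plan: Plan): Seq[String] = plan match {
    case ResolvedHint(child, h) => collectThroughUnary(child) :+ h
    case Unary(_, child) => collectThroughUnary(child)
    case _ => Nil
  }

  // Stricter alternative from the review: only hint nodes directly on the join child.
  def collectDirect(plan: Plan): Seq[String] = plan match {
    case ResolvedHint(child, h) => collectDirect(child) :+ h
    case _ => Nil
  }
}
```

For a child shaped like `Unary("Generate", ResolvedHint(Relation("t"), "broadcast"))`, the first function still reports the broadcast hint while the second reports none, which is exactly the case the reviewer is worried about.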

SparkQA · 2019-01-05T02:59:55Z

Test build #100761 has finished for PR 23036 at commit f51e31d.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class JoinHint(leftHint: Option[HintInfo], rightHint: Option[HintInfo])

gatorsmile · 2019-01-06T01:14:26Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

@@ -115,6 +115,7 @@ abstract class Optimizer(sessionCatalog: SessionCatalog)
    // However, because we also use the analyzer to canonicalized queries (for view definition),
    // we do not eliminate subqueries or compute current time in the analyzer.
    Batch("Finish Analysis", Once,
+      EliminateResolvedHint,


Also add it to nonExcludableRules


gatorsmile · 2019-01-06T03:18:28Z

LGTM except a few minor comments.

mgaido91 · 2019-01-06T15:23:41Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala

@@ -288,7 +288,8 @@ case class Join(
    left: LogicalPlan,
    right: LogicalPlan,
    joinType: JoinType,
-    condition: Option[Expression])
+    condition: Option[Expression],
+    hint: JoinHint)


What about Option[JoinHint] with a default of None?

This is to make sure that whenever the constructor is called, the caller is clearly aware of this hint parameter and will set it right. This happens mostly in the optimizer where the rules transform a join node into a new one, and not with a copy constructor.

We can skip setting the default value then, but on the other hand the default value helps make the diff of this patch smaller, and I think it makes sense in general, since most of the time we are not concerned about the hint. Anyway, I am also fine without the default value.

mgaido91 · 2019-01-06T15:27:21Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala

+    super.doCanonicalize().asInstanceOf[Join].copy(hint = JoinHint.NONE)
+
+  // Do not include an empty join hint in string description
+  protected override def stringArgs: Iterator[Any] = super.stringArgs.filter { e =>


if we move to a Option[JoinHint] is this still needed?

I'm fine with Option[JoinHint] without the default value, if it can help us get rid of this hack.

SparkQA · 2019-01-06T23:20:07Z

Test build #100844 has finished for PR 23036 at commit 17b7cce.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.


cloud-fan · 2019-01-07T11:22:35Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/EliminateResolvedHint.scala

+ * Replaces [[ResolvedHint]] operators from the plan. Move the [[HintInfo]] to associated [[Join]]
+ * operators, otherwise remove it if no [[Join]] operator is matched.
+ */
+object EliminateResolvedHint extends Rule[LogicalPlan] {


do we have to run it at the beginning of optimizer? Can we run it at the end of analyzer?

It's because of the Dataset interface. The ResolvedHint of the join children nodes would have been gone by the time we construct a join node.


cloud-fan · 2019-01-07T11:27:12Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala

+
+  // Ignore hint for canonicalization
+  protected override def doCanonicalize(): LogicalPlan =
+    super.doCanonicalize().asInstanceOf[Join].copy(hint = JoinHint.NONE)


how about copy(hint = JoinHint.NONE). doCanonicalize()

It'd cause stack overflow.
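A toy illustration of why the order matters, using a hypothetical `Node` class rather than the real `Join`: if the override clears the hint on a copy and then canonicalizes that copy, the copy runs the same override again and never reaches the base logic.

```scala
// Toy model only: Node and its methods are hypothetical stand-ins.
object CanonicalizeSketch {
  case class Node(name: String, hint: Option[String]) {
    // Stand-in for the parent class's canonicalization logic.
    private def baseCanonicalize: Node = copy(name = name.toLowerCase)

    // Safe order: run the base logic first, then clear the hint on its result.
    def doCanonicalize: Node = baseCanonicalize.copy(hint = None)

    // Unsafe order: the copy is the same class, so it calls this method again
    // and the recursion never terminates. Not invoked anywhere in this sketch.
    def doCanonicalizeRecursive: Node = copy(hint = None).doCanonicalizeRecursive
  }
}
```

Two nodes that differ only in their hint canonicalize to the same value with the safe version.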

cloud-fan · 2019-01-07T11:30:15Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala

+    super.doCanonicalize().asInstanceOf[Join].copy(hint = JoinHint.NONE)
+
+  // Do not include an empty join hint in string description
+  protected override def stringArgs: Iterator[Any] = super.stringArgs.filter { e =>


how about

val hintArg = if (hint.leftHint.isEmpty && hint.rightHint.isEmpty) Nil else Seq(hint)
Seq(left, right, joinType, condition) ++ hintArg

I was trying to do it in a way that would be "extendable", say, it would work with any future change of the constructor (although we don't expect the constructor of logical operators to change much).
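The "extendable" variant can be sketched on a toy case class: filtering over `productIterator` keeps working if a constructor field is added later, whereas an explicit `Seq(left, right, ...)` would have to be updated by hand. The classes and the `simpleString` format below are simplified assumptions, not the real Catalyst definitions.

```scala
// Toy model only: simplified stand-ins for Join and JoinHint.
object StringArgsSketch {
  final case class JoinHint(leftHint: Option[String], rightHint: Option[String])

  final case class Join(left: String, right: String, joinType: String, hint: JoinHint) {
    // Walk every constructor field and drop only an empty JoinHint.
    def stringArgs: Iterator[Any] = productIterator.filter {
      case h: JoinHint => h.leftHint.isDefined || h.rightHint.isDefined
      case _ => true
    }

    def simpleString: String = stringArgs.mkString("Join ", ", ", "")
  }
}
```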

cloud-fan · 2019-01-07T11:35:14Z

Thanks for the nice cleanup! LGTM except some minor comments.

SparkQA · 2019-01-07T21:49:04Z

Test build #100899 has finished for PR 23036 at commit 97377dd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2019-01-07T21:58:49Z

Thanks! Merged to master.

## What changes were proposed in this pull request?

This is to fix a bug in #23036 that would cause a join hint to be applied to a node it is not supposed to be applied to after join reordering. For example,

```
val join = df.join(df, "id")
val broadcasted = join.hint("broadcast")
val join2 = join.join(broadcasted, "id").join(broadcasted, "id")
```

There should only be 2 broadcast hints on `join2`, but after join reordering there would be 4. This is because the hint application in join reordering compares the attribute set when testing relation equivalency. Moreover, it could still be problematic even if the child relations were used in testing relation equivalency, due to the potential exprId conflict in nested self-joins. As a result, this PR simply reverts the join reorder hint behavior change introduced in #23036, which means that if a join hint is present, the join node itself will not participate in the join reordering, while the sub-joins within its children still can.

## How was this patch tested?

Added new tests

Closes #23524 from maryannxue/query-hint-followup-2.

Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

Closes apache#23036 from maryannxue/query-hint.

Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

…tive Hints

## What changes were proposed in this pull request?

This is to fix a bug in apache#23036, which would lead to an exception in case of two consecutive hints.

## How was this patch tested?

Added a new test.

Closes apache#23501 from maryannxue/query-hint-followup.

Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>


maryannxue added 2 commits November 14, 2018 11:02

[SPARK-26065][SQL] Change query hint from a LogicalPlan to a field

fce106d

[SPARK-26065][SQL] Change query hint from a LogicalPlan to a field

785a423

gatorsmile reviewed Dec 15, 2018


cloud-fan reviewed Dec 18, 2018


resolve conflicts

93f33d9

fix compilation errors

ee0c844

fix compilation errors

470d682

cloud-fan reviewed Jan 4, 2019


address review comments

f51e31d