Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-20094][SQL] Preventing push down of IN subquery to Join operator #17428

Closed
wants to merge 5 commits into from

Conversation

wzhfy
Copy link
Contributor

@wzhfy wzhfy commented Mar 25, 2017

What changes were proposed in this pull request?

TPCDS q45 fails becuase:
ReorderJoin collects all predicates and try to put them into join condition when creating ordered join. If a predicate with an IN subquery (ListQuery) is in a join condition instead of a filter condition, RewritePredicateSubquery.rewriteExistentialExpr would fail to convert the subquery to an ExistenceJoin, and thus result in error.

We should prevent push down of IN subquery to Join operator.

How was this patch tested?

Add a new test case in FilterPushdownSuite.

@wzhfy
Copy link
Contributor Author

wzhfy commented Mar 25, 2017

cc @nsyca @hvanhovell

@wzhfy wzhfy changed the title [SPARK-20094][SQL] Don't put predicate with subquery into join condition in ReorderJoin because it fails RewritePredicateSubquery.rewriteExistentialExpr [SPARK-20094][SQL] Don't put predicate with IN subquery into join condition in ReorderJoin because it fails RewritePredicateSubquery.rewriteExistentialExpr Mar 25, 2017
@SparkQA
Copy link

SparkQA commented Mar 25, 2017

Test build #75210 has finished for PR 17428 at commit bd91947.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 25, 2017

Test build #75212 has finished for PR 17428 at commit 6af6b95.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Copy link
Contributor Author

wzhfy commented Mar 27, 2017

also cc @cloud-fan

@@ -90,6 +90,7 @@ trait PredicateHelper {
* Returns true iff `expr` could be evaluated as a condition within join.
*/
protected def canEvaluateWithinJoin(expr: Expression): Boolean = expr match {
case l: ListQuery => false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't the following case e: SubqueryExpression prevent it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add some comments to explain why ListQuery should always return false even the children is empty.

Copy link
Contributor Author

@wzhfy wzhfy Mar 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, done.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya Now pulling up IN predicates is deferred to RewritePredicateSubquery. As a result, ListQuery's children can be empty before that rule and falls into case e: SubqueryExpression.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. The comment looks good.

@SparkQA
Copy link

SparkQA commented Mar 27, 2017

Test build #75257 has finished for PR 17428 at commit 9f7dac5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -289,6 +289,9 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]] extends TreeNode[PlanT
case m: Map[_, _] => m
case d: DataType => d // Avoid unpacking Structs
case seq: Traversable[_] => seq.map(recursiveTransform)
case ExistenceJoin(exists: Attribute) =>
// `exists` must be an Attribute in ExistenceJoin
ExistenceJoin(transformExpression(exists).asInstanceOf[Attribute])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a particular reason we need to cast the return type of transformExpression(<instance_of_Attribute>) to type Attribute?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because ExistenceJoin only accepts Attribute as its parameter type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give an example of transformExpression of an instance of Attribute not returning a value of type Attribute or its subtypes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this change necessary? This smells for two reasons:

  • QueryPlan shouldn't have know about such a detail. This means something is wrong with the abstraction.
  • Why do we want to change the attribute in an ExistenceJoin? It is produced by the join, and we really should not replace it.

Copy link
Contributor Author

@wzhfy wzhfy Mar 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I feel the same. Fortunately, after modification as suggested by @nsyca , this change can also be removed. Because I move the test case to FilterPushdownSuite, we don't have to touch the RewritePredicateSubquery rule.

case l: ListQuery =>
// Now pulling up IN predicates is deferred to `RewritePredicateSubquery`. As a result,
// `ListQuery`'s children can be empty before that rule and falls into the next case.
false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested text in the comment:

[NOT] IN () expression cannot be evaluated as part of a Join operator. Currently the only way to evaluate a [NOT] IN subquery is to convert it to a LeftSemi or LeftAnti by RewritePredicateSubquery rule.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, thanks!

.join(w, ExistenceJoin(exists), Some("z.c".attr === "w.d".attr))
.where("x.a".attr > 1 || exists)
.select("x.a".attr, "x.b".attr, "x.c".attr, "z.a".attr, "z.b".attr, "z.c".attr)
.join(y, Inner, Some("y.d".attr === "z.a".attr))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a side question. It's not a review comment.

Is there any mechanism in this test suite that ensures that the plan Join( Join(x, z), y) is picked by the join reorder algorithm? Isn't a Join( Join(y, z), x) is also a viable plan if the join predicate (y.d = z.a) is estimated to be more selective than the predicate sets between x and z?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test suite is for JoinReorder rule. This rule is not cost based and it's not related to estimation. It only searches for joinable tables and put cartesian product to the end of the plan.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My question was why Optimizer does not pick a plan of Join( Join(y, z), x).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess your test case depends on the fact that currently we don't have an optimization to push down table y below the join of (x, z).

Essentially, this is not a problem specifically to ReorderJoin as implied from the title of the PR. The problem is at the definition canEvaluateWithinJoin. I can demonstrate the same problem without the table y in this test case.

A slight modification of your test case as shown below also reveals the problem.

    val queryPlan = x.join(z)
      .where(("x.b".attr === "z.b".attr) &&
        ("x.a".attr > 1 || "z.c".attr.in(ListQuery(w.select("w.d".attr)))))

    val expectedPlan = x.join(z, Inner, Some("x.b".attr === "z.b".attr))
      .join(w, ExistenceJoin(exists), Some("z.c".attr === "w.d".attr))
      .where("x.a".attr > 1 || exists)
      .select("x.a".attr, "x.b".attr, "x.c".attr, "z.a".attr, "z.b".attr, "z.c".attr)

which has a call to canEvaluateWithinJoin from the rule PushPredicateThroughJoin.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May I suggest the following:

  • Change the title of the PR to "Preventing push down of IN subquery to Join operator"
  • Update the test case to its bare minimum by removing the join to table y
  • Change the title of the test case to reflect the new test

On the last point, removing the "reorder inner joins - " should do.

Copy link
Contributor Author

@wzhfy wzhfy Mar 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, thanks for the suggestion. Also I will move the test case to FilterPushdownSuite since now it will not go into ReorderJoin rule.

@SparkQA
Copy link

SparkQA commented Mar 27, 2017

Test build #75265 has finished for PR 17428 at commit c32599d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy wzhfy changed the title [SPARK-20094][SQL] Don't put predicate with IN subquery into join condition in ReorderJoin because it fails RewritePredicateSubquery.rewriteExistentialExpr [SPARK-20094][SQL] Preventing push down of IN subquery to Join operator Mar 28, 2017
@hvanhovell
Copy link
Contributor

LGTM - pending jenkins.

@SparkQA
Copy link

SparkQA commented Mar 28, 2017

Test build #75308 has finished for PR 17428 at commit bf7d202.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Copy link
Contributor

Merging to master. Thanks!

@asfgit asfgit closed this in 91559d2 Mar 28, 2017
@SparkQA
Copy link

SparkQA commented Mar 28, 2017

Test build #75309 has finished for PR 17428 at commit 0cc6137.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants