
[SPARK-25363][SQL] Fix schema pruning in where clause by ignoring unnecessary root fields #22357

Closed
wants to merge 5 commits

Conversation


@viirya viirya commented Sep 7, 2018

What changes were proposed in this pull request?

Schema pruning doesn't work if a nested column is used in a where clause.

For example,

sql("select name.first from contacts where name.first = 'David'")

== Physical Plan ==
*(1) Project [name#19.first AS first#40]
+- *(1) Filter (isnotnull(name#19) && (name#19.first = David))
   +- *(1) FileScan parquet [name#19] Batched: false, Format: Parquet, PartitionFilters: [], 
    PushedFilters: [IsNotNull(name)], ReadSchema: struct<name:struct<first:string,middle:string,last:string>>

In the above query plan, the scan node reads the entire schema of the name column.

This issue is reported by:
#21320 (comment)

The cause is that we infer a root field from the expression IsNotNull(name). However, such an expression doesn't actually use the nested fields of this root field, so we can ignore the unnecessary nested fields.
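The idea can be sketched with a toy model (assumed names, not the actual Spark classes): a bare null check references a column but reads none of its nested content, so its root field is marked as not content-accessed.

```scala
// Toy model of the fix: IsNotNull(name) mentions the column `name`,
// but reads none of its nested fields, so the derived root field is
// marked contentAccessed = false and the pruner can ignore its children.
sealed trait Expr
case class Attr(name: String) extends Expr
case class IsNotNullExpr(child: Expr) extends Expr

case class RootField(name: String, contentAccessed: Boolean = true)

def getRootFields(e: Expr): Seq[RootField] = e match {
  case IsNotNullExpr(child) =>
    // A null check does not access nested content of the column.
    getRootFields(child).map(_.copy(contentAccessed = false))
  case Attr(n) => Seq(RootField(n))
}
```

The real rule walks arbitrary Catalyst expressions; this sketch only shows the contentAccessed marking that the patch introduces.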

How was this patch tested?

Unit tests.


viirya commented Sep 7, 2018

cc @dbtsai

testSchemaPruning("select a single complex field and in where clause") {
  val query = sql("select name.first from contacts where name.first = 'Jane'")
  checkScan(query, "struct<name:struct<first:string>>")
  checkAnswer(query, Row("Jane") :: Nil)
}
Member

Can you add another test that selects name.first and name.last and applies a where clause on name.first? We should only read name.first and name.last, not name.middle.

Member Author

Yes. Added a test case for it.

@@ -196,6 +196,7 @@ private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] {
*/
private def getRootFields(expr: Expression): Seq[RootField] = {
expr match {
case IsNotNull(_: Attribute) | IsNull(_: Attribute) => Seq.empty
Member

@HyukjinKwon HyukjinKwon Sep 7, 2018

Hmmmm .. shouldn't we exclude this only for filters?

Member Author

If this is in a projection, I think we also don't need to include all the nested fields?

Member

But the case mentioned here looks specific to the pushed filter itself. Can we add a simple test for the projection case as well?

Member

I mean, for instance, a case like select address is not null, name.last from contacts wouldn't work. I thought this was a quick bandaid fix to resolve a basic case.

Member Author

Thanks. This was a case I didn't test. Fixed it and added a test case.

@SparkQA

SparkQA commented Sep 7, 2018

Test build #95780 has finished for PR 22357 at commit 3de6ee2.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Sep 7, 2018

Test build #95794 has finished for PR 22357 at commit d68f808.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 7, 2018

Test build #95787 has finished for PR 22357 at commit 3de6ee2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Sep 7, 2018

retest this please.

@SparkQA

SparkQA commented Sep 7, 2018

Test build #95799 has finished for PR 22357 at commit d68f808.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mallman
Contributor

mallman commented Sep 7, 2018

Hi @viirya,

Thanks for this PR! I have an alternative implementation which I'd like to submit for comparison. My implementation was something I removed from my original patch.

I hope to have my PR submitted sometime today. I have another PR to submit, too. I'll be sure to refer to your PR in mine.

Cheers.

@mallman
Contributor

mallman commented Sep 8, 2018

I have reconstructed my original patch for this issue, but I've discovered it will require more work to complete. However, as part of that reconstruction I've discovered a couple of cases where our patches create different physical plans. The query results are the same, but I'm not sure which—if either—plan is correct. I want to go into detail on that, but it's complicated and I have to call it quits tonight. I have a flight in the morning, and I'll be on break next week.

In the meantime, I'll just copy and paste two queries—based on the data in ParquetSchemaPruningSuite.scala—with two query plans each.

First query:

select employer.id from contacts where employer is not null

This PR (as of d68f808) produces:

== Physical Plan ==
*(1) Project [employer#4442.id AS id#4452]
+- *(1) Filter isnotnull(employer#4442)
   +- *(1) FileScan parquet [employer#4442,p#4443] Batched: false, Format: Parquet,
    PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(employer)],
    ReadSchema: struct<employer:struct<id:int>>

My WIP patch produces:

== Physical Plan ==
*(1) Project [employer#4442.id AS id#4452]
+- *(1) Filter isnotnull(employer#4442)
   +- *(1) FileScan parquet [employer#4442,p#4443] Batched: false, Format: Parquet,
    PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(employer)],
    ReadSchema: struct<employer:struct<id:int,company:struct<name:string,address:string>>>

Second query:

select employer.id from contacts where employer.id = 0

This PR produces:

== Physical Plan ==
*(1) Project [employer#4297.id AS id#4308]
+- *(1) Filter (isnotnull(employer#4297) && (employer#4297.id = 0))
   +- *(1) FileScan parquet [employer#4297,p#4298] Batched: false, Format: Parquet,
    PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(employer)],
    ReadSchema: struct<employer:struct<id:int>>

My WIP patch produces:

== Physical Plan ==
*(1) Project [employer#4445.id AS id#4456]
+- *(1) Filter (isnotnull(employer#4445.id) && (employer#4445.id = 0))
   +- *(1) FileScan parquet [employer#4445,p#4446] Batched: false, Format: Parquet,
    PartitionCount: 2, PartitionFilters: [], PushedFilters: [],
    ReadSchema: struct<employer:struct<id:int>>

I wanted to give my thoughts on the differences of these in detail, but I have to wrap up my work for the night. I'll be visiting family next week. I don't know how responsive I'll be in that time, but I'll at least try to check back.

Cheers.

@viirya
Member Author

viirya commented Sep 9, 2018

Thanks! @mallman

For the first query, I think the query plan produced by your WIP patch is not correct. We don't need to read the company:struct from employer:struct.

For the second, your WIP patch doesn't push down the IsNotNull(employer) predicate into the FileScan node.

Those are the important differences I noticed for now.

@dbtsai
Member

dbtsai commented Sep 10, 2018

@mallman It would be great to have this fix in the 2.4 release, as it can dramatically reduce the data being read in many applications, which is the purpose of the original work.

As we're going to cut the 2.4 RC soon and you will not be available next week, can you post your implementation so we can work together to make this happen? We can now track co-authorship in Spark and GitHub with contribution stats, so it would be cool to see multiple people collaborate on one PR :)

Thanks.

@@ -196,6 +201,9 @@ private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] {
*/
private def getRootFields(expr: Expression): Seq[RootField] = {
expr match {
// Those expressions don't really use the nested fields of a root field.
case i@(IsNotNull(_: Attribute) | IsNull(_: Attribute)) =>
getRootFields(i.children(0)).map(_.copy(contentAccessed = false))
case att: Attribute =>
RootField(StructField(att.name, att.dataType, att.nullable), derivedFromAtt = true) :: Nil
case SelectedField(field) => RootField(field, derivedFromAtt = false) :: Nil
Member

How about

      case IsNotNull(_: Attribute) | IsNull(_: Attribute) =>
        expr.children.flatMap(getRootFields).map(_.copy(contentAccessed = false))
      case _ =>
        expr.children.flatMap(getRootFields)

@dbtsai
Member

dbtsai commented Sep 10, 2018

In general, the approach looks right to me except for a couple of minor points. Thanks.

*/
private case class RootField(field: StructField, derivedFromAtt: Boolean)
private case class RootField(field: StructField, derivedFromAtt: Boolean,
contentAccessed: Boolean = true)
Member

Fix the formatting, and please elaborate on the comment.

@@ -17,7 +17,7 @@

package org.apache.spark.sql.execution.datasources.parquet

import org.apache.spark.sql.catalyst.expressions.{And, Attribute, AttributeReference, Expression, NamedExpression}
import org.apache.spark.sql.catalyst.expressions.{And, Attribute, AttributeReference, Expression, IsNotNull, IsNull, NamedExpression}
Member

line too long.

Member

@HyukjinKwon HyukjinKwon Sep 10, 2018

which can be a wildcard import when there are more than 6 entities, per https://github.com/databricks/scala-style-guide#imports

@dbtsai
Member

dbtsai commented Sep 10, 2018

cc @beettlle

@@ -110,7 +110,12 @@ private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] {
val projectionRootFields = projects.flatMap(getRootFields)
val filterRootFields = filters.flatMap(getRootFields)

(projectionRootFields ++ filterRootFields).distinct
val (rootFields, optRootFields) = (projectionRootFields ++ filterRootFields)
.distinct.partition(_.contentAccessed)
Member

Some comments here please.
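For context, the intent of this partition might be sketched as follows (a simplified standalone model with assumed names, not the actual Spark code):

```scala
// Simplified model: root fields whose content is accessed always keep
// their (pruned) nested schema; null-check-only root fields are retained
// just so the column itself is read, and are dropped entirely when a
// content-accessed root field already covers the same column.
case class RootField(name: String, contentAccessed: Boolean)

def requestedRootFields(all: Seq[RootField]): Seq[RootField] = {
  val (rootFields, optRootFields) = all.distinct.partition(_.contentAccessed)
  rootFields ++ optRootFields.filterNot(opt => rootFields.exists(_.name == opt.name))
}

// e.g. `select name.first from contacts where name is not null`:
// the IsNotNull-derived root field for `name` is redundant and dropped.
val fields = Seq(
  RootField("name", contentAccessed = true),  // from select name.first
  RootField("name", contentAccessed = false)  // from IsNotNull(name)
)
```

In the real rule the fields carry StructField schemas rather than plain names; the coverage check is the same idea.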

checkAnswer(query2, Row("Jane", "Doe") :: Nil)

val query3 = sql("select name.first from contacts " +
"where employer.company.name = 'abc' and p = 1")
Member

Let's say a user adds where employer.company is not null. Can we still read the schema as employer:struct<company:struct<name:string>>, given that we only mark contentAccessed = false when IsNotNull is on an attribute?

Member Author

Added one query test for this case. Thanks.

Member Author

When there is a nested field access in the query like employer.company.name, we don't need any fields inside employer.company other than name.

But if there is no such access, just employer.company is not null in the where clause, it will read the full schema of employer.company.

@@ -196,6 +201,9 @@ private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] {
*/
private def getRootFields(expr: Expression): Seq[RootField] = {
expr match {
// Those expressions don't really use the nested fields of a root field.
case i@(IsNotNull(_: Attribute) | IsNull(_: Attribute)) =>
Member

nit: -> i @ (IsNotNull(_: ...

@@ -156,7 +161,7 @@ private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] {
// in the resulting schema may differ from their ordering in the logical relation's
// original schema
val mergedSchema = requestedRootFields
.map { case RootField(field, _) => StructType(Array(field)) }
.map { case RootField(field, _, _) => StructType(Array(field)) }
Member

Not a big deal but .map { root: RootField => StructType(Array(root.field)) } per https://github.com/databricks/scala-style-guide#pattern-matching

@viirya
Member Author

viirya commented Sep 10, 2018

Thanks @dbtsai and @HyukjinKwon. Your comments are addressed.

case IsNotNull(SelectedField(field)) =>
RootField(field, derivedFromAtt = false, contentAccessed = false) :: Nil
case IsNull(SelectedField(field)) =>
RootField(field, derivedFromAtt = false, contentAccessed = false) :: Nil
Member Author

@dbtsai The question you mentioned at https://github.com/apache/spark/pull/22357/files#r216204022 was addressed by this.

@viirya
Member Author

viirya commented Sep 10, 2018

Btw, this PR isn't intended to address filter pushdown for schema pruning. I think that should be a separate topic.

@SparkQA

SparkQA commented Sep 10, 2018

Test build #95884 has finished for PR 22357 at commit d0b82e6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 10, 2018

Test build #95871 has finished for PR 22357 at commit d0b82e6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Sep 11, 2018

FYI, after further checking the code and discussing predicate pushdown with @dbtsai, we found that predicate pushdown only works for primitive types in the Parquet datasource. So neither IsNotNull(employer) nor IsNotNull(employer.id) is actually pushed down to the Parquet reader.

@felixcheung
Member

If I recall correctly, the Parquet reader can do filter pushdown? It's just not done in the Spark Parquet data source?

@mallman
Contributor

mallman commented Sep 11, 2018

FYI, after further checking the code and discussing predicate pushdown with @dbtsai, we found that predicate pushdown only works for primitive types in the Parquet datasource. So neither IsNotNull(employer) nor IsNotNull(employer.id) is actually pushed down to the Parquet reader.

I would expect IsNotNull(employer.id) to be pushed down. In any case, I misunderstood what the PushedFilters metadata item means in the FileScan part of the physical plan. I thought it was a Parquet filter, but sometimes it is not. Regardless, I'm not concerned about supporting filter pushdown at this point. My concern was about its side effects, but that has been allayed.

// For these expressions, if any nested fields are accessed in the query, we don't need to add
// a root field access for them.
// For example, for a query `SELECT name.first FROM contacts WHERE name IS NOT NULL`,
// we don't need to read nested fields of the `name` struct other than the `first` field.
Contributor

@mallman mallman Sep 11, 2018

I'm having trouble accepting this statement, but perhaps I'm reading too much into it (or not enough). Let me illustrate with a couple of queries and their physical plans.

Assuming the data model in ParquetSchemaPruningSuite.scala, the physical plan for the query

select employer.id from contacts where employer is not null

is

== Physical Plan ==
*(1) Project [employer#36.id AS id#46]
+- *(1) Filter isnotnull(employer#36)
   +- *(1) FileScan parquet [employer#36,p#37] Batched: false, Format: Parquet,
    PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(employer)],
    ReadSchema: struct<employer:struct<id:int>>

The physical plan for the query

select employer.id from contacts where employer.id is not null

is

== Physical Plan ==
*(1) Project [employer#36.id AS id#47]
+- *(1) Filter (isnotnull(employer#36) && isnotnull(employer#36.id))
   +- *(1) FileScan parquet [employer#36,p#37] Batched: false, Format: Parquet,
    PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(employer)],
    ReadSchema: struct<employer:struct<id:int>>

The read schemata are the same, but the query filters are not. The file scan for the second query looks as I would expect, but the scan for the first query appears to only read employer.id even though it needs to check employer is not null. If it only reads employer.id, how does it check that employer.company is not null? Perhaps employer.id is null but employer.company is not null for some row...

I have run some tests to validate that this PR is returning the correct results for both queries, and it is. But I don't understand why.

Member

For the first query, the constraint is employer is not null.

When employer.id is not null, employer is always non-null; in that case this PR works.

However, when employer.id is null, employer can be either null or non-null, so we need to check whether employer is non-null in order to return a null employer.id.

I checked in ParquetFilter: IsNotNull(employer) will be ignored since it's not a valid Parquet filter, as Parquet doesn't support pushdown on structs; thus, with this PR, this query will return a wrong answer.

I think in this scenario, as @mallman suggested, we might need to read the full data.

Member Author

I checked in the ParquetFilter, IsNotNull(employer) will be ignored since it's not a valid parquet filter as parquet doesn't support pushdown on the struct; thus, with this PR, this query will return wrong answer.

We may not need to worry about wrong answers from a datasource like Parquet under predicate pushdown. Since not all predicates are supported by pushdown, we always have a Spark SQL Filter on top of the scan node to ensure correct answers.

Member Author

A complex column being null and its fields being null are different things. I think we don't need to read all the fields to check whether the complex column is null. In other words, in the above case, when we only read employer.id and it is null, the predicate employer is not null can still hold, because the column can be a non-null struct that merely contains a null field.
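The distinction can be modeled with nested Options (a standalone sketch; the names are hypothetical, not from the test suite):

```scala
// A struct column being null is not the same as its fields being null.
case class Employer(id: Option[Int])
case class Contact(employer: Option[Employer])

val rows = Seq(
  Contact(None),                    // employer itself is null
  Contact(Some(Employer(None))),    // employer non-null, employer.id null
  Contact(Some(Employer(Some(0))))  // both non-null
)

// `employer is not null` depends only on the outer nullness: a null
// employer.id does not make the employer struct itself null.
val notNull = rows.count(_.employer.isDefined)
```

In Parquet terms, the definition levels stored for any read leaf field are what let the reader recover the struct's own nullness without reading the struct's other fields.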

Contributor

@viirya, I see your point about the difference between a complex type being null and a subfield being null. So to answer the following query

select address from contacts where name is not null

do we need to read any of the fields in name? Or perhaps just read one arbitrary field of simple type, like name.first? That's surprising, but I'm starting to believe it's true.

Member Author

Currently, under this PR, name will be fully read. This is not perfect. However, picking one arbitrary field from name sounds a little hacky to me. WDYT? cc @dbtsai @cloud-fan

Member Author

Btw, I think this isn't a schema pruning case in the sense of the original PR, so maybe we can leave it as-is for now.

Contributor

Yeah, I'm okay with leaving it as-is.

Member

Instead of reading an arbitrary field of simple type (which may not exist if it's a deeply nested struct), I think we should implement pushdown for complex types in Parquet with similar logic, and let the Parquet reader handle it.

@viirya Can you create a followup JIRA for this?

Thanks.

@dbtsai
Member

dbtsai commented Sep 11, 2018

FYI, @mallman I'm working on having ParquetFilter support pushing IsNotNull(employer.id) into the Parquet reader.

@HyukjinKwon
Member

Can anyone point out any unaddressed comments or problems here? Looks pretty good to me. I think this is a bandaid fix, small and safe enough to get into branch-2.4.

@SparkQA

SparkQA commented Sep 11, 2018

Test build #95931 has finished for PR 22357 at commit d684a0f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

checkAnswer(query4, Row("Jane", "abc") :: Nil)
}

testSchemaPruning("select nullable complex field and having is null predicate") {
Contributor

Do you mean having is not null predicate?

Member Author

Oops, yes, thanks.

@mallman
Contributor

mallman commented Sep 11, 2018

@viirya Please amend

// Pruning with filters is currently unsupported. As-is, the file reader will read the id column
// and the entire coL2 struct. Once pruning with filters has been implemented we can uncomment
// this line
// checkScan(query, "struct<id:int,coL2:struct<B:int>>")
to remove the explanatory comment, and uncomment the commented-out line of code.

@mallman
Contributor

mallman commented Sep 11, 2018

I have some bad news. The methods testSchemaPruning and testMixedCasePruning do not set the configuration settings as expected. Fixing that reveals 6 failing mixed-case tests. One of those failures involves checking the scan and answer for a query with a filter condition.

Based on what I'm seeing, I think it's fair to say that schema pruning is broken under certain circumstances when using a table schema that includes column names with upper-case characters (note that the test schema for contacts in ParquetSchemaPruningSuite.scala includes no fields with upper-case characters).

Fortunately schema pruning is disabled by default, and I think it's still considered "experimental" technology.

I think that fixing ParquetSchemaPruningSuite.scala is pretty straightforward. Fixing the newly failing unit tests will be more effort.

In any case, I will create an issue in Jira and submit a PR.

@SparkQA

SparkQA commented Sep 11, 2018

Test build #95945 has finished for PR 22357 at commit 03bf540.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Sep 11, 2018

retest this please.

@mallman
Contributor

mallman commented Sep 11, 2018

FYI, the PR I previously mentioned about fixing the use of withSQLConf is #22394.

@mallman
Contributor

mallman commented Sep 11, 2018

This LGTM. I'm not going to submit a PR for my approach to this problem.

Thanks @viirya!

@mallman
Contributor

mallman commented Sep 11, 2018

And FYI this is the Jira issue I promised in
#22357 (comment) yesterday: https://issues.apache.org/jira/browse/SPARK-25407.

@mallman
Contributor

mallman commented Sep 11, 2018

FYI, @mallman I'm working on having ParquetFilter support pushing IsNotNull(employer.id) into the Parquet reader.

That would be pretty cool.

@dbtsai
Member

dbtsai commented Sep 11, 2018

LGTM.

Thank you all for participating in the discussion. @cloud-fan and @gatorsmile, do you have any further comments? If not, I would like to merge it tomorrow into both master and the RC branch, as it's an important performance fix for schema pruning.

Thanks.

@SparkQA

SparkQA commented Sep 11, 2018

Test build #95950 has finished for PR 22357 at commit 03bf540.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

LGTM from me too.

asfgit pushed a commit that referenced this pull request Sep 12, 2018
…ecessary root fields

## What changes were proposed in this pull request?

Schema pruning doesn't work if a nested column is used in a where clause.

For example,
```
sql("select name.first from contacts where name.first = 'David'")

== Physical Plan ==
*(1) Project [name#19.first AS first#40]
+- *(1) Filter (isnotnull(name#19) && (name#19.first = David))
   +- *(1) FileScan parquet [name#19] Batched: false, Format: Parquet, PartitionFilters: [],
    PushedFilters: [IsNotNull(name)], ReadSchema: struct<name:struct<first:string,middle:string,last:string>>
```

In the above query plan, the scan node reads the entire schema of the `name` column.

This issue is reported by:
#21320 (comment)

The cause is that we infer a root field from expression `IsNotNull(name)`. However, for such expression, we don't really use the nested fields of this root field, so we can ignore the unnecessary nested fields.

## How was this patch tested?

Unit tests.

Closes #22357 from viirya/SPARK-25363.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
(cherry picked from commit 3030b82)
Signed-off-by: DB Tsai <d_tsai@apple.com>
@dbtsai
Member

dbtsai commented Sep 12, 2018

Thanks all again. Merged into 2.4 branch and master.

@asfgit asfgit closed this in 3030b82 Sep 12, 2018
fjh100456 pushed a commit to fjh100456/spark that referenced this pull request Sep 13, 2018
otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023

Ref: LIHADOOP-48531
@viirya viirya deleted the SPARK-25363 branch December 27, 2023 18:35