
Unify the logic for column pruning, projection, and filtering of table scans. #213

Closed
wants to merge 3 commits

Conversation

marmbrus
Contributor

This removes duplicated logic, dead code, and unnecessary casting when planning Parquet and Hive table scans.

Other changes:

  • Fix tests now that column pruning is more thorough: since pruning predicates are applied before we even start scanning tuples, columns required only by these predicates no longer need to appear in the output of the scan unless they are also part of the final output of the logical plan fragment.
  • Add a rule to simplify trivial filters. This is required to keep WHERE false from being pushed into table scans, since HiveTableScan (reasonably) refuses to apply partition pruning predicates to non-partitioned tables.
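The trivial-filter rule described above can be sketched in isolation. This is a minimal toy model, not Spark's actual Catalyst API: the `Expression`/`LogicalPlan` types and the `EmptyRelation` node here are illustrative stand-ins. The idea is that a filter whose condition folds to a boolean literal either disappears (true) or short-circuits the whole subtree (false), so WHERE false never reaches a table scan.

```scala
// Toy expression and plan ADTs standing in for Catalyst's (illustrative only).
sealed trait Expression
case class Literal(value: Boolean) extends Expression
case class Predicate(sql: String) extends Expression // an arbitrary, non-constant condition

sealed trait LogicalPlan
case class TableScan(table: String) extends LogicalPlan
case class Filter(condition: Expression, child: LogicalPlan) extends LogicalPlan
case object EmptyRelation extends LogicalPlan // result of a provably-false filter

object SimplifyFilters {
  def apply(plan: LogicalPlan): LogicalPlan = plan match {
    // WHERE true: the filter is a no-op, keep simplifying the child.
    case Filter(Literal(true), child) => apply(child)
    // WHERE false: nothing can pass, so replace the subtree with an empty relation
    // instead of pushing the predicate down into the scan.
    case Filter(Literal(false), _) => EmptyRelation
    // Non-trivial condition: keep the filter, recurse into the child.
    case Filter(cond, child) => Filter(cond, apply(child))
    case other => other
  }
}
```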

@marmbrus
Contributor Author

@liancheng and @AndreSchumacher, please take a look.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13383/

@liancheng
Contributor

Two unused imports are left in HiveStrategies.scala.

object ParquetScans extends Strategy {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    // TODO: need to support writing to other types of files.
    case logical.WriteToFile(path, child) =>
Seems a little confusing to have a write operation within a scan strategy here... Maybe we should move this into a separate strategy object similar to HiveStrategies.DataSinks?
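The reviewer's suggestion of separating sinks from scans can be sketched with toy types. The shapes below (`ParquetDataSinks`, `WriteExec`, `ScanExec`) are hypothetical illustrations of the split, not Spark's real planner classes; only `HiveStrategies.DataSinks` is named in the discussion.

```scala
// Toy plan and physical-operator types (illustrative stand-ins, not Spark's).
sealed trait LogicalPlan
case class ParquetRelation(path: String) extends LogicalPlan
case class WriteToFile(path: String, child: LogicalPlan) extends LogicalPlan

sealed trait SparkPlan
case class ScanExec(path: String) extends SparkPlan
case class WriteExec(path: String) extends SparkPlan

// A strategy maps a logical plan to zero or more physical candidates.
trait Strategy { def apply(plan: LogicalPlan): Seq[SparkPlan] }

// Scans only: no write logic mixed in.
object ParquetScans extends Strategy {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case ParquetRelation(path) => ScanExec(path) :: Nil
    case _ => Nil
  }
}

// Sinks get their own strategy, analogous to HiveStrategies.DataSinks.
object ParquetDataSinks extends Strategy {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case WriteToFile(path, _) => WriteExec(path) :: Nil
    case _ => Nil
  }
}
```

With this split, a `WriteToFile` node is ignored by the scan strategy and claimed only by the sink strategy, which keeps each strategy's pattern match focused on one concern.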

@liancheng
Contributor

LGTM, much cleaner :)

…e scans for both Hive and Parquet relations. Fix tests now that we are doing a better job of column pruning.
@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13420/

@pwendell
Contributor

Merged, thanks

@asfgit asfgit closed this in b637f2d Mar 25, 2014
@marmbrus marmbrus deleted the strategyCleanup branch March 27, 2014 00:06
mccheah pushed a commit to mccheah/spark that referenced this pull request Oct 12, 2017
Igosuki pushed a commit to Adikteev/spark that referenced this pull request Jul 31, 2018
* use beta candidate

* all backports added to build

* move/add supervize with checkpointing test to hdfs

* add kerberos args to supervise test

* fix job watchers

* add native blas

* remove old supervise test
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019