
[SPARK-11153] [SQL] Disables Parquet filter push-down for string and binary columns #9152

Conversation

liancheng (Contributor):

Due to PARQUET-251, `BINARY` columns in existing Parquet files may be written with corrupted statistics information. This information is used by filter push-down optimization. Since Spark 1.5 turns on Parquet filter push-down by default, we may end up with wrong query results. PARQUET-251 has been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0.

This affects all Spark SQL data types that can be mapped to Parquet `BINARY`, namely:

  • `StringType`

  • `BinaryType`

  • `DecimalType` (but Spark SQL doesn't support pushing down filters involving `DecimalType` columns for now)

To avoid wrong query results, we should disable filter push-down for columns of `StringType` and `BinaryType` until we upgrade to parquet-mr 1.8.1.
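A minimal sketch of the idea, not the actual patch: the object `ParquetFilterSketch` and the helper `pushableType` are hypothetical names for illustration. The converter simply refuses to translate predicates on the affected types, so Spark evaluates those predicates itself on the rows it reads back instead of trusting Parquet's row-group statistics.

```scala
import org.apache.spark.sql.types._

// Illustrative sketch only: predicates on types that map to Parquet BINARY
// are not pushed down while PARQUET-251 remains unfixed in parquet-mr 1.7.0.
object ParquetFilterSketch {
  // Hypothetical helper: returns false for types whose Parquet statistics
  // may be corrupted, so the caller skips push-down for that predicate and
  // lets Spark re-evaluate it after the scan.
  def pushableType(dataType: DataType): Boolean = dataType match {
    case StringType | BinaryType => false // corrupted BINARY stats (PARQUET-251)
    case _: DecimalType => false          // push-down not supported anyway
    case _ => true
  }
}
```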


SparkQA commented Oct 16, 2015

Test build #43850 has finished for PR 9152 at commit 3584d03.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng (author):

retest this please


SparkQA commented Oct 20, 2015

Test build #43956 has finished for PR 9152 at commit 3584d03.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen (Contributor):

Jenkins, retest this please.


SparkQA commented Oct 20, 2015

Test build #43963 timed out for PR 9152 at commit 3584d03 after a configured wait of 175m.

@liancheng (author):

retest this please


SparkQA commented Oct 20, 2015

Test build #43973 has finished for PR 9152 at commit 3584d03.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng (author):

Merging to branch-1.5, and to master if the merge script lets me. Otherwise, I'll open a separate PR for master.

asfgit pushed a commit referencing this pull request on Oct 21, 2015: [SPARK-11153] [SQL] Disables Parquet filter push-down for string and binary columns


Author: Cheng Lian <lian@databricks.com>

Closes #9152 from liancheng/spark-11153.workaround-parquet-251.
@liancheng (author):

OK, merged to both branch-1.5 and master.
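For users on a Spark 1.5.x build without this patch, a coarser workaround is to disable Parquet filter push-down entirely via the `spark.sql.parquet.filterPushdown` configuration key. A sketch, assuming a `sqlContext` is in scope as in the Spark 1.5 shell:

```scala
// Coarse workaround on unpatched builds: turn off all Parquet filter
// push-down. Queries stay correct at the cost of scanning more row groups.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")
```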

asfgit closed this in 89e6db6 on Oct 21, 2015
liancheng deleted the spark-11153.workaround-parquet-251 branch on October 21, 2015 01:04