[SPARK-11153] [SQL] Disables Parquet filter push-down for string and binary columns #9152
Conversation
Test build #43850 has finished for PR 9152 at commit
retest this please
Test build #43956 has finished for PR 9152 at commit
Jenkins, retest this please.
Test build #43963 timed out for PR 9152 at commit
retest this please |
Test build #43973 has finished for PR 9152 at commit
Merging to branch-1.5, and master if the merge script lets me do that. Otherwise I will open a separate PR for master.
[SPARK-11153] [SQL] Disables Parquet filter push-down for string and binary columns

Due to PARQUET-251, `BINARY` columns in existing Parquet files may be written with corrupted statistics information. This information is used by the filter push-down optimization. Since Spark 1.5 turns on Parquet filter push-down by default, we may end up with wrong query results. PARQUET-251 has been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0.

This affects all Spark SQL data types that can be mapped to Parquet `BINARY`, namely:

- `StringType`
- `BinaryType`
- `DecimalType` (but Spark SQL doesn't support pushing down filters involving `DecimalType` columns for now)

To avoid wrong query results, we should disable filter push-down for columns of `StringType` and `BinaryType` until we upgrade to parquet-mr 1.8.

Author: Cheng Lian <lian@databricks.com>

Closes #9152 from liancheng/spark-11153.workaround-parquet-251.
OK, merged to both branch-1.5 and master.
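For users still on an affected release, filter push-down can also be disabled globally via the `spark.sql.parquet.filterPushdown` configuration, which sidesteps the corrupted-statistics problem for all column types. A minimal sketch, assuming a Spark 1.5 `SQLContext` is already in scope:

```scala
// Turn off Parquet filter push-down entirely as a mitigation for PARQUET-251.
// Predicates are then evaluated by Spark after the scan, so results stay
// correct at the cost of losing scan-time row-group pruning.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")
```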
Due to PARQUET-251, `BINARY` columns in existing Parquet files may be written with corrupted statistics information. This information is used by the filter push-down optimization. Since Spark 1.5 turns on Parquet filter push-down by default, we may end up with wrong query results. PARQUET-251 has been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0.

This affects all Spark SQL data types that can be mapped to Parquet `BINARY`, namely:

- `StringType`
- `BinaryType`
- `DecimalType` (but Spark SQL doesn't support pushing down filters involving `DecimalType` columns for now)

To avoid wrong query results, we should disable filter push-down for columns of `StringType` and `BinaryType` until we upgrade to parquet-mr 1.8.
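The workaround described above can be sketched as a type-based guard in the step that converts query predicates into Parquet filter predicates. This is an illustrative model only, not Spark's actual internals; the `Predicate` shape and `ParquetFilterPushdown` object are hypothetical names for this sketch:

```scala
// Hypothetical sketch of the workaround: when selecting which predicates to
// push down to Parquet, skip any predicate on a column whose Spark SQL type
// maps to Parquet BINARY (StringType, BinaryType), because the min/max
// statistics for such columns may be corrupted (PARQUET-251).

sealed trait DataType
case object StringType extends DataType
case object BinaryType extends DataType
case object IntegerType extends DataType

// A simplified predicate: a comparison against a single typed column.
case class Predicate(column: String, dataType: DataType)

object ParquetFilterPushdown {
  // Types whose Parquet BINARY statistics may be corrupted by PARQUET-251.
  private val unsafeTypes: Set[DataType] = Set(StringType, BinaryType)

  // Keep only the predicates that are safe to push down to the Parquet scan.
  def safeToPushDown(predicates: Seq[Predicate]): Seq[Predicate] =
    predicates.filterNot(p => unsafeTypes.contains(p.dataType))
}
```

The predicates that are filtered out here are not lost: Spark still evaluates them on the rows returned by the scan, so query results remain correct; only the scan-time pruning for string/binary columns is given up until parquet-mr 1.8 is adopted.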