[SPARK-29231][SQL] Constraints should be inferred from cast equality constraint

### What changes were proposed in this pull request?

This PR adds support for inferring constraints from cast equality constraints. For example:

```scala
scala> spark.sql("create table spark_29231_1(c1 bigint, c2 bigint)")
res0: org.apache.spark.sql.DataFrame = []

scala> spark.sql("create table spark_29231_2(c1 int, c2 bigint)")
res1: org.apache.spark.sql.DataFrame = []

scala> spark.sql("select t1.* from spark_29231_1 t1 join spark_29231_2 t2 on (t1.c1 = t2.c1 and t1.c1 = 1)").explain
== Physical Plan ==
*(2) Project [c1#5L, c2#6L]
+- *(2) BroadcastHashJoin [c1#5L], [cast(c1#7 as bigint)], Inner, BuildRight
   :- *(2) Project [c1#5L, c2#6L]
   :  +- *(2) Filter (isnotnull(c1#5L) AND (c1#5L = 1))
   :     +- *(2) ColumnarToRow
   :        +- FileScan parquet default.spark_29231_1[c1#5L,c2#6L] Batched: true, DataFilters: [isnotnull(c1#5L), (c1#5L = 1)], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehouse/spark_29231_1], PartitionFilters: [], PushedFilters: [IsNotNull(c1), EqualTo(c1,1)], ReadSchema: struct<c1:bigint,c2:bigint>
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), [id=#209]
      +- *(1) Project [c1#7]
         +- *(1) Filter isnotnull(c1#7)
            +- *(1) ColumnarToRow
               +- FileScan parquet default.spark_29231_2[c1#7] Batched: true, DataFilters: [isnotnull(c1#7)], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehouse/spark_29231_2], PartitionFilters: [], PushedFilters: [IsNotNull(c1)], ReadSchema: struct<c1:int>
```

After this PR:

```scala
scala> spark.sql("select t1.* from spark_29231_1 t1 join spark_29231_2 t2 on (t1.c1 = t2.c1 and t1.c1 = 1)").explain
== Physical Plan ==
*(2) Project [c1#0L, c2#1L]
+- *(2) BroadcastHashJoin [c1#0L], [cast(c1#2 as bigint)], Inner, BuildRight
   :- *(2) Project [c1#0L, c2#1L]
   :  +- *(2) Filter (isnotnull(c1#0L) AND (c1#0L = 1))
   :     +- *(2) ColumnarToRow
   :        +- FileScan parquet default.spark_29231_1[c1#0L,c2#1L] Batched: true, DataFilters: [isnotnull(c1#0L), (c1#0L = 1)], Format: Parquet, Location: InMemoryFileIndex[file:/root/opensource/spark/spark-warehouse/spark_29231_1], PartitionFilters: [], PushedFilters: [IsNotNull(c1), EqualTo(c1,1)], ReadSchema: struct<c1:bigint,c2:bigint>
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), [id=#99]
      +- *(1) Project [c1#2]
         +- *(1) Filter ((cast(c1#2 as bigint) = 1) AND isnotnull(c1#2))
            +- *(1) ColumnarToRow
               +- FileScan parquet default.spark_29231_2[c1#2] Batched: true, DataFilters: [(cast(c1#2 as bigint) = 1), isnotnull(c1#2)], Format: Parquet, Location: InMemoryFileIndex[file:/root/opensource/spark/spark-warehouse/spark_29231_2], PartitionFilters: [], PushedFilters: [IsNotNull(c1)], ReadSchema: struct<c1:int>
```

### Why are the changes needed?

Improve query performance.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #27252 from wangyum/SPARK-29231.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
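To illustrate the inference idea in isolation, here is a minimal, self-contained Scala sketch. It is not the Catalyst implementation; the `Expr` classes and the `InferCastConstraints.infer` helper are hypothetical and only model the substitution that the plans above demonstrate: given `t1.c1 = cast(t2.c1 as bigint)` and `t1.c1 = 1`, derive `cast(t2.c1 as bigint) = 1`, which can then be pushed to the scan of the other side of the join.

```scala
// Hypothetical, simplified expression model -- NOT the Catalyst API.
sealed trait Expr
case class Attr(name: String) extends Expr
case class Lit(value: Long) extends Expr
case class Cast(child: Expr, toType: String) extends Expr
case class EqualTo(left: Expr, right: Expr) extends Expr

object InferCastConstraints {
  // For every pair of constraints `a = cast(b as t)` and `a = lit`,
  // infer the additional constraint `cast(b as t) = lit`.
  def infer(constraints: Set[EqualTo]): Set[EqualTo] = {
    val castEqualities    = constraints.collect { case EqualTo(a: Attr, c: Cast) => (a, c) }
    val literalEqualities = constraints.collect { case EqualTo(a: Attr, l: Lit)  => (a, l) }
    val inferred = for {
      (a, c) <- castEqualities
      (b, l) <- literalEqualities
      if a == b
    } yield EqualTo(c, l)
    inferred -- constraints
  }

  def main(args: Array[String]): Unit = {
    // Models the join condition above: t1.c1 = cast(t2.c1 as bigint) and t1.c1 = 1
    val constraints = Set(
      EqualTo(Attr("t1.c1"), Cast(Attr("t2.c1"), "bigint")),
      EqualTo(Attr("t1.c1"), Lit(1L))
    )
    // Prints the extra constraint cast(t2.c1 as bigint) = 1
    println(infer(constraints))
  }
}
```

In the actual plan change, that extra constraint is what shows up as the `(cast(c1#2 as bigint) = 1)` filter on the `spark_29231_2` scan, so fewer rows reach the broadcast exchange.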