[SPARK-34819][SQL] MapType supports comparable semantics #32552

maropu · 2021-05-14T15:37:19Z

What changes were proposed in this pull request?

This PR proposes to support comparable semantics for map types.

NOTE: This PR is the rework of #31967(@WangGuangxin)/#15970(@hvanhovell).

The approach of this PR is similar to NormalizeFloatingNumbers and has the same restriction. In the plan optimization phase, a new rule named NormalizeMaps inserts a SortMapKeys expression so that two maps with the same key-value pairs but different key ordering compare as equal (e.g., Map('a' -> 1, 'b' -> 2) should be equal to Map('b' -> 2, 'a' -> 1)). For aggregates, the rule is applied in the physical planning phase because the grouping exprs are not yet extracted during the logical phase (the same restriction as NormalizeFloatingNumbers).
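
To illustrate the idea behind key sorting, here is a minimal plain-Scala sketch, not the code in this PR. The names MapEntries and sortMapKeys are hypothetical and chosen only for this example; the sketch assumes a map is stored as two parallel sequences (keys and values) in insertion order, so that two logically equal maps can have different layouts until their keys are sorted.

```scala
// Plain-Scala sketch; MapEntries and sortMapKeys are hypothetical names, not Spark APIs.
object SortMapKeysSketch {
  // A map stored as insertion-ordered parallel sequences of keys and values.
  final case class MapEntries[K, V](keys: Vector[K], values: Vector[V])

  // Reorder the entries so that keys are ascending; each value moves with its key.
  def sortMapKeys[K: Ordering, V](m: MapEntries[K, V]): MapEntries[K, V] = {
    val sorted = m.keys.zip(m.values).sortBy(_._1)
    MapEntries(sorted.map(_._1), sorted.map(_._2))
  }

  def main(args: Array[String]): Unit = {
    val m1 = MapEntries(Vector("a", "b"), Vector(1, 2))
    val m2 = MapEntries(Vector("b", "a"), Vector(2, 1))
    println(m1 == m2)                           // false: the raw layouts differ
    println(sortMapKeys(m1) == sortMapKeys(m2)) // true: the normalized layouts match
  }
}
```

Once both sides are normalized this way, an ordinary element-wise comparison is enough to decide equality.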

The major differences from NormalizeFloatingNumbers are as follows:

The rule covers all the binary comparisons (EqualTo, GreaterThan, ...) and In/InSet in a plan (NormalizeFloatingNumbers is applied only to the EqualTo comparison in a join plan, i.e., an equi-join).
This rule does not apply normalization recursively; it adds a SortMapKeys expr only on each top-level expr (e.g., a top-level grouping expr or the left/right side of a binary comparison).
This rule additionally handles SortOrders in sort-related plans.
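
The non-recursive, top-level wrapping can be sketched on a toy expression tree. This is only an illustration under stated assumptions: the Expr ADT below is hypothetical and far simpler than Catalyst's, "map-typed" is reduced to a boolean flag on a column, and only EqualTo is modeled among the binary comparisons.

```scala
// Toy model of the rewrite; these classes are hypothetical, not Catalyst classes.
object NormalizeMapsSketch {
  sealed trait Expr
  final case class Col(name: String, isMap: Boolean) extends Expr
  final case class SortMapKeys(child: Expr) extends Expr
  final case class EqualTo(left: Expr, right: Expr) extends Expr
  final case class And(left: Expr, right: Expr) extends Expr

  // Toy type check: only a bare map-typed column counts as a map here.
  private def isMapTyped(e: Expr): Boolean = e match {
    case Col(_, isMap) => isMap
    case _             => false
  }

  // Wrap a comparison operand once, at its top level; do not descend into it.
  private def wrap(e: Expr): Expr = if (isMapTyped(e)) SortMapKeys(e) else e

  // Walk down through And to reach each comparison, then wrap both of its sides.
  def normalize(e: Expr): Expr = e match {
    case And(l, r)     => And(normalize(l), normalize(r))
    case EqualTo(l, r) => EqualTo(wrap(l), wrap(r))
    case other         => other
  }

  // Count how many times `target` occurs as a subtree of `e`.
  def count(e: Expr, target: Expr): Int = {
    val self = if (e == target) 1 else 0
    val inChildren = e match {
      case And(l, r)      => count(l, target) + count(r, target)
      case EqualTo(l, r)  => count(l, target) + count(r, target)
      case SortMapKeys(c) => count(c, target)
      case _: Col         => 0
    }
    self + inChildren
  }

  def main(args: Array[String]): Unit = {
    val (a, b, c) = (Col("a", isMap = true), Col("b", isMap = true), Col("c", isMap = true))
    val normalized = normalize(And(EqualTo(a, b), EqualTo(a, c)))
    println(count(normalized, SortMapKeys(a))) // 2: column a is wrapped once per comparison
  }
}
```

Because each comparison is rewritten independently, a column that appears in two comparisons is wrapped twice in this toy model.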

For sorting map entries, I reused the array ordering logic (see MapType.compare and CodegenContext.genComp) because keys and values in map entries follow the array format; it first checks whether the key arrays of the two maps are the same, and then checks whether the value arrays are the same.
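
The "keys first, then values" comparison can be sketched in plain Scala as follows. This is a hedged illustration rather than the actual MapType.compare: compareArrays and compareMaps are hypothetical names, both maps are assumed to have their keys already sorted, nulls are ignored, and the array ordering is assumed to be lexicographic with the shorter array sorting first on a tie.

```scala
// Plain-Scala sketch; compareArrays and compareMaps are hypothetical helper names.
object MapCompareSketch {
  // Lexicographic comparison: the first differing element decides;
  // if one sequence is a prefix of the other, the shorter one sorts first.
  def compareArrays[T](a: Seq[T], b: Seq[T])(implicit ord: Ordering[T]): Int = {
    val firstDiff = a.iterator
      .zip(b.iterator)
      .map { case (x, y) => ord.compare(x, y) }
      .find(_ != 0)
    firstDiff.getOrElse(Integer.compare(a.length, b.length))
  }

  // Compare the key arrays first; fall back to the value arrays only on a tie.
  // Assumes the keys of both maps are already sorted.
  def compareMaps[K: Ordering, V: Ordering](
      keysA: Seq[K], valuesA: Seq[V],
      keysB: Seq[K], valuesB: Seq[V]): Int = {
    val byKeys = compareArrays(keysA, keysB)
    if (byKeys != 0) byKeys else compareArrays(valuesA, valuesB)
  }

  def main(args: Array[String]): Unit = {
    // Same keys and values: equal.
    println(compareMaps(Seq(1, 2), Seq(10, 20), Seq(1, 2), Seq(10, 20)) == 0)
    // Keys differ: the keys decide, the values are never consulted.
    println(compareMaps(Seq(1), Seq(5), Seq(2), Seq(0)) < 0)
    // Keys tie: the values decide.
    println(compareMaps(Seq(1), Seq(1), Seq(1), Seq(2)) < 0)
  }
}
```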

NOTE: Adding duplicate SortMapKeys exprs in a binary comparison tree is a known issue; for example, in the query below, the MapType column a is sorted twice:

scala> Seq((Map(1->1), Map(1->2), Map(1->1))).toDF("a", "b", "c").write.saveAsTable("t")
scala> sql("select * from t where a = b and a = c").explain()
== Physical Plan ==
*(1) Filter ((sortmapkeys(a#35) = sortmapkeys(b#36)) AND (sortmapkeys(a#35) = sortmapkeys(c#37)))
+- FileScan parquet default.t[a#35,b#36,c#37] Batched: false, DataFilters: [(sortmapkeys(a#35) = sortmapkeys(b#36)), (sortmapkeys(a#35) = sortmapkeys(c#37))], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/maropu/Repositories/spark/spark-master/spark-warehouse/t], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:map<int,int>,b:map<int,int>,c:map<int,int>>

However, I don't have a good idea for avoiding it in this PR for now. Common subexpression elimination in filter plans could probably solve it, but Spark does not have that optimization now. (For more details, see the previous PR by @viirya: #30565).

Why are the changes needed?

To improve map usability.

Does this PR introduce any user-facing change?

Yes, a user can use map-typed data in GROUP BY, ORDER BY, and PARTITION BY in WINDOW clauses.

How was this patch tested?

Add unit tests.

sql/core/src/test/resources/sql-tests/results/map.sql.out

SparkQA · 2021-05-14T16:18:34Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43083/

SparkQA · 2021-05-14T16:51:34Z

Test build #138562 has finished for PR 32552 at commit 33eb7c4.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-05-15T05:14:53Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43091/

SparkQA · 2021-05-15T05:14:54Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43091/

SparkQA · 2021-05-15T09:03:15Z

Test build #138570 has finished for PR 32552 at commit f3f8019.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-05-15T16:23:30Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43094/

SparkQA · 2021-05-15T16:23:31Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43094/

SparkQA · 2021-05-15T17:17:08Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43096/

SparkQA · 2021-05-15T17:22:45Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43096/

SparkQA · 2021-05-15T18:26:42Z

Test build #138573 has finished for PR 32552 at commit 7d6ab65.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-05-15T19:18:59Z

Test build #138575 has finished for PR 32552 at commit 539a1e6.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-05-16T01:15:10Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43100/

SparkQA · 2021-05-16T02:07:14Z

Test build #138579 has finished for PR 32552 at commit d08942f.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-05-16T04:18:04Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43104/

SparkQA · 2021-05-16T04:18:06Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43104/

maropu · 2021-05-16T06:54:04Z

cc: @hvanhovell @cloud-fan @viirya @WangGuangxin

SparkQA · 2021-06-29T21:36:35Z

Test build #140384 has finished for PR 32552 at commit 29dd475.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class SortMapKeys(child: Expression) extends UnaryExpression with ExpectsInputTypes

maropu · 2021-06-30T00:16:00Z

retest this please

SparkQA · 2021-06-30T03:22:42Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44926/

SparkQA · 2021-06-30T03:57:44Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44926/

SparkQA · 2021-06-30T05:57:56Z

Test build #140411 has finished for PR 32552 at commit 29dd475.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class SortMapKeys(child: Expression) extends UnaryExpression with ExpectsInputTypes

maropu · 2021-07-12T00:55:55Z

retest this please

SparkQA · 2021-07-12T02:02:50Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45400/

SparkQA · 2021-07-12T02:36:10Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45400/

SparkQA · 2021-07-12T04:05:04Z

Test build #140889 has finished for PR 32552 at commit 29dd475.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class SortMapKeys(child: Expression) extends UnaryExpression with ExpectsInputTypes

SparkQA · 2021-07-16T05:44:36Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45636/

SparkQA · 2021-07-16T06:22:44Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45636/

SparkQA · 2021-07-16T07:43:15Z

Test build #141123 has finished for PR 32552 at commit 3be3882.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2021-07-16T07:49:43Z

retest this please

SparkQA · 2021-07-16T09:36:33Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45655/

SparkQA · 2021-07-16T09:46:17Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45656/

SparkQA · 2021-07-16T10:09:56Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45655/

SparkQA · 2021-07-16T10:23:10Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45656/

SparkQA · 2021-07-16T12:27:57Z

Test build #141144 has finished for PR 32552 at commit 3be3882.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-08-26T10:58:14Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47311/

SparkQA · 2021-08-26T11:06:01Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47311/

github-actions · 2021-12-05T00:12:04Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

rekbun · 2024-07-31T15:01:58Z

Folks, what is the state of this PR? Do we expect to make progress on this?

github-actions bot added the SQL label May 14, 2021

maropu force-pushed the pr31967 branch from 33eb7c4 to f3f8019 Compare May 15, 2021 04:17

maropu force-pushed the pr31967 branch 2 times, most recently from 7d6ab65 to 539a1e6 Compare May 15, 2021 15:56

maropu force-pushed the pr31967 branch 4 times, most recently from 3c8b19a to d08942f Compare May 16, 2021 00:23

maropu force-pushed the pr31967 branch 2 times, most recently from d22a6e1 to 38e42c4 Compare May 16, 2021 03:28

maropu marked this pull request as ready for review May 16, 2021 06:52

maropu changed the title ~~[WIP][SPARK-34819][SPARK-18134][SQL] MapType supports comparable semantics~~ [SPARK-34819][SPARK-18134][SQL] MapType supports comparable semantics May 16, 2021

maropu force-pushed the pr31967 branch from ab6237c to 29dd475 Compare June 28, 2021 01:33

WangGuangxin and others added 2 commits July 16, 2021 11:14

MapType supports comparable/orderable semantics

9ae4c53

Update the golden file

3be3882

maropu force-pushed the pr31967 branch from 29dd475 to 3be3882 Compare July 16, 2021 02:22

c21 mentioned this pull request Aug 9, 2021

[SPARK-36452][SQL]: Add the support in Spark for having group by map datatype column for the scenario that works in Hive #33679

Closed

github-actions bot added the Stale label Dec 5, 2021

github-actions bot closed this Dec 6, 2021

c27kwan mentioned this pull request Sep 6, 2022

[SPARK-40315][SQL] Add equals() and hashCode() to ArrayBasedMapData #37771

Closed
