Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-34819][SQL] MapType supports comparable semantics #32552

Closed
wants to merge 2 commits into from

Conversation

maropu
Copy link
Member

@maropu maropu commented May 14, 2021

What changes were proposed in this pull request?

This PR proposes to support comparable semantics for map types.

NOTE: This PR is the rework of #31967(@WangGuangxin)/#15970(@hvanhovell).

The approach of the PR is similar to NormalizeFloatingNumbers and it has the same restriction; in the plan optimizing phase, a new rule named NormalizeMaps inserts an expression SortMapKeys to make sure two maps having the same key value pairs but with different key ordering are equal (e.g., Map('a' -> 1, 'b' -> 2) should equal to Map('b' -> 2, 'a' -> 1). As for aggregates, this rule is applied in the physical planning phase because all the grouping exprs are not extracted during the logical phase (This is the same restriction with NormalizeFloatingNumbers).

The major differences from NormalizeFloatingNumbers are as follows;

  • The rule covers all the binary comparisons (EqualTo, GreaterThan, ...) and In/InSet in a plan (NormalizeFloatingNumbers is applied only into the EqualTo comparison in a join plan, an equi-join).
  • This rule does not apply normalize recursively and just adds a SortMapKeys expr just on each top-level expr (e.g., top-level grouping expr and left/right side expr of binary comparisons).
  • This rule additionally handles SortOrders in sort-related plans.

For sorting map entries, I reused the array ordering logic (See: MapType.compare and CodegenContext.genComp) because keys and values in map entries follow the array format; it checks if key arrays in two maps are the same first, an then check if value arrays are the same.

NOTE: Adding duplicate SortMapKeys exprs in a binary comparison tree is a known issue; for example, in a query below, MapType's column, a, is sorted twice;

scala> Seq((Map(1->1), Map(1->2), Map(1->1))).toDF("a", "b", "c").write.saveAsTable("t")
scala> sql("select * from t where a = b and a = c").explain()
== Physical Plan ==
*(1) Filter ((sortmapkeys(a#35) = sortmapkeys(b#36)) AND (sortmapkeys(a#35) = sortmapkeys(c#37)))
+- FileScan parquet default.t[a#35,b#36,c#37] Batched: false, DataFilters: [(sortmapkeys(a#35) = sortmapkeys(b#36)), (sortmapkeys(a#35) = sortmapkeys(c#37))], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/maropu/Repositories/spark/spark-master/spark-warehouse/t], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:map<int,int>,b:map<int,int>,c:map<int,int>>

But, I don't have a smart idea to avoid it in this PR for now. Probably, I think common subexpression elimination in filter plans can solve it, but Spark does not have the optimization now. (Fro more details, see the previous @viirya PR: #30565).

Why are the changes needed?

To improve map usability.

Does this PR introduce any user-facing change?

Yes, a user can use map-typed data in GROUP BY, ORDER BY, and PARTITION BY in WINDOW clauses.

How was this patch tested?

Add unit tests.

@SparkQA
Copy link

SparkQA commented May 14, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43083/

@github-actions github-actions bot added the SQL label May 14, 2021
@SparkQA
Copy link

SparkQA commented May 14, 2021

Test build #138562 has finished for PR 32552 at commit 33eb7c4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43091/

@SparkQA
Copy link

SparkQA commented May 15, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43091/

@SparkQA
Copy link

SparkQA commented May 15, 2021

Test build #138570 has finished for PR 32552 at commit f3f8019.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu force-pushed the pr31967 branch 2 times, most recently from 7d6ab65 to 539a1e6 Compare May 15, 2021 15:56
@SparkQA
Copy link

SparkQA commented May 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43094/

@SparkQA
Copy link

SparkQA commented May 15, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43094/

@SparkQA
Copy link

SparkQA commented May 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43096/

@SparkQA
Copy link

SparkQA commented May 15, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43096/

@SparkQA
Copy link

SparkQA commented May 15, 2021

Test build #138573 has finished for PR 32552 at commit 7d6ab65.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 15, 2021

Test build #138575 has finished for PR 32552 at commit 539a1e6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu force-pushed the pr31967 branch 4 times, most recently from 3c8b19a to d08942f Compare May 16, 2021 00:23
@SparkQA
Copy link

SparkQA commented May 16, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43100/

@SparkQA
Copy link

SparkQA commented May 16, 2021

Test build #138579 has finished for PR 32552 at commit d08942f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu force-pushed the pr31967 branch 2 times, most recently from d22a6e1 to 38e42c4 Compare May 16, 2021 03:28
@SparkQA
Copy link

SparkQA commented May 16, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43104/

@SparkQA
Copy link

SparkQA commented May 16, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43104/

@maropu maropu marked this pull request as ready for review May 16, 2021 06:52
@maropu maropu changed the title [WIP][SPARK-34819][SPARK-18134][SQL] MapType supports comparable semantics [SPARK-34819][SPARK-18134][SQL] MapType supports comparable semantics May 16, 2021
@maropu
Copy link
Member Author

maropu commented May 16, 2021

@SparkQA
Copy link

SparkQA commented Jun 29, 2021

Test build #140384 has finished for PR 32552 at commit 29dd475.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class SortMapKeys(child: Expression) extends UnaryExpression with ExpectsInputTypes

@maropu
Copy link
Member Author

maropu commented Jun 30, 2021

retest this please

@SparkQA
Copy link

SparkQA commented Jun 30, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44926/

@SparkQA
Copy link

SparkQA commented Jun 30, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44926/

@SparkQA
Copy link

SparkQA commented Jun 30, 2021

Test build #140411 has finished for PR 32552 at commit 29dd475.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class SortMapKeys(child: Expression) extends UnaryExpression with ExpectsInputTypes

@maropu
Copy link
Member Author

maropu commented Jul 12, 2021

retest this please

@SparkQA
Copy link

SparkQA commented Jul 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45400/

@SparkQA
Copy link

SparkQA commented Jul 12, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45400/

@SparkQA
Copy link

SparkQA commented Jul 12, 2021

Test build #140889 has finished for PR 32552 at commit 29dd475.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class SortMapKeys(child: Expression) extends UnaryExpression with ExpectsInputTypes

@SparkQA
Copy link

SparkQA commented Jul 16, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45636/

@SparkQA
Copy link

SparkQA commented Jul 16, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45636/

@SparkQA
Copy link

SparkQA commented Jul 16, 2021

Test build #141123 has finished for PR 32552 at commit 3be3882.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Jul 16, 2021

retest this please

@SparkQA
Copy link

SparkQA commented Jul 16, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45655/

@SparkQA
Copy link

SparkQA commented Jul 16, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45656/

@SparkQA
Copy link

SparkQA commented Jul 16, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45655/

@SparkQA
Copy link

SparkQA commented Jul 16, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45656/

@SparkQA
Copy link

SparkQA commented Jul 16, 2021

Test build #141144 has finished for PR 32552 at commit 3be3882.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47311/

@SparkQA
Copy link

SparkQA commented Aug 26, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47311/

@github-actions
Copy link

github-actions bot commented Dec 5, 2021

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@rekbun
Copy link

rekbun commented Jul 31, 2024

Folks, what is the state of this PR? Do we expect to make progress on this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants