[SPARK-34819][SQL] MapType supports comparable semantics #32552
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
Folks, what is the state of this PR? Do we expect to make progress on this? |
What changes were proposed in this pull request?
This PR proposes to support comparable semantics for map types.
NOTE: This PR is a rework of #31967 (@WangGuangxin) and #15970 (@hvanhovell).
The approach of this PR is similar to `NormalizeFloatingNumbers`, and it has the same restriction: in the plan optimizing phase, a new rule named `NormalizeMaps` inserts a `SortMapKeys` expression to make sure that two maps having the same key-value pairs but a different key ordering are equal (e.g., Map('a' -> 1, 'b' -> 2) should be equal to Map('b' -> 2, 'a' -> 1)). For aggregates, the rule is applied in the physical planning phase because grouping expressions are not fully extracted during the logical phase (the same restriction as `NormalizeFloatingNumbers`).

The major differences from `NormalizeFloatingNumbers` are as follows:
- The rule covers all binary comparisons (`EqualTo`, `GreaterThan`, ...) and `In`/`InSet` in a plan (`NormalizeFloatingNumbers` is applied only to the `EqualTo` comparison in a join plan, i.e., an equi-join).
- The rule does not apply `normalize` recursively; it only adds a `SortMapKeys` expression on each top-level expression (e.g., a top-level grouping expression, or the left/right side of a binary comparison).
- The rule also covers `SortOrder`s in sort-related plans.

For sorting map entries, I reused the array ordering logic (see `MapType.compare` and `CodegenContext.genComp`) because the keys and values of map entries follow the array format; it first checks whether the key arrays of the two maps are the same, and then checks whether the value arrays are the same.

NOTE: Adding duplicate `SortMapKeys` expressions in a binary comparison tree is a known issue; for example, a single query can end up sorting a `MapType` column, `a`, twice. I don't have a smart idea for avoiding this in this PR for now. Common subexpression elimination in filter plans could probably solve it, but Spark does not have that optimization yet. (For more details, see @viirya's earlier PR: #30565.)
Why are the changes needed?
To improve map usability.
Does this PR introduce any user-facing change?
Yes, a user can use map-typed data in GROUP BY, ORDER BY, and PARTITION BY in WINDOW clauses.
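As a rough illustration of what grouping by a map-typed value implies, here is a pure-Python sketch under the assumption that rows carry maps as ordered (key, value) entry lists. The names `normalized_key` and `group_count` are hypothetical and are not part of Spark's API; the sketch only shows that rows whose maps differ in key order land in the same group once keys are sorted.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

Entries = List[Tuple[Any, Any]]  # a map modeled as ordered (key, value) entries


def normalized_key(entries: Entries) -> Tuple[Tuple[Any, Any], ...]:
    """Sort entries by key and freeze them so they can serve as a group key."""
    return tuple(sorted(entries, key=lambda kv: kv[0]))


def group_count(rows: List[Entries]) -> Dict[Tuple[Tuple[Any, Any], ...], int]:
    """Count rows per map value, ignoring the original key order."""
    counts: Dict[Tuple[Tuple[Any, Any], ...], int] = defaultdict(int)
    for entries in rows:
        counts[normalized_key(entries)] += 1
    return dict(counts)


rows = [
    [("a", 1), ("b", 2)],
    [("b", 2), ("a", 1)],  # same pairs, different key order
    [("c", 3)],
]
result = group_count(rows)
print(result)
```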
How was this patch tested?
Add unit tests.