[SPARK-2096][SQL] support dot notation on array of struct #2405
Can one of the admins verify this patch? |
c42dcc7
to
b19bbd6
Compare
type EvaluatedType = Any

def dataType = field.dataType
def dataType = buildDataType(containsNullList, field.dataType)
use `def` or `lazy val`?
ok to test |
QA tests have started for PR 2405 at commit
|
QA tests have finished for PR 2405 at commit
|
Hmmm, I didn't create the class |
What will happen if I use this syntax in predicates? Let's say we have
What is the semantic of |
@yhuai It's hard to define the semantics of `f1.f11 > f2.f22`, since both sides can be arbitrarily nested arrays. What if the array sizes are not equal? What if the nesting levels are not equal? Currently I just leave this case as it is; maybe we could give users a meaningful error message that prohibits it? |
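The size-mismatch concern can be made concrete with a small sketch in plain Scala, without Spark. The data for `f1` and `f2` below is hypothetical (the schema from the question is not shown in this thread), and neither of the two readings is something this PR proposes; they only show that more than one interpretation is possible.

```scala
object PredicateAmbiguity {
  // Hypothetical data: two arrays of struct with different lengths.
  val f1 = Seq(Map("f11" -> 1), Map("f11" -> 5), Map("f11" -> 9))
  val f2 = Seq(Map("f22" -> 3), Map("f22" -> 4))

  // Dot notation on an array of struct yields an array of field values.
  val left: Seq[Int] = f1.map(_("f11"))   // three values
  val right: Seq[Int] = f2.map(_("f22"))  // two values

  // Reading 1: compare element by element. `zip` silently drops the
  // unmatched third element of `left`.
  val elementWise: Seq[Boolean] = left.zip(right).map { case (l, r) => l > r }

  // Reading 2: true if any pair of elements satisfies the comparison.
  val anyPair: Boolean = left.exists(l => right.exists(r => l > r))
}
```

The two readings give different kinds of result (an array of booleans versus a single boolean), which is why rejecting such predicates with a clear error message may be the safer choice.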
Okay here are some thoughts and questions:
|
I think we can just handle one level of nested array to fix SPARK-2096. What about adding a rule to use another type of |
that sounds pretty reasonable to me |
are there any plans on merging this soon? This is a pretty useful feature. |
+1 |
This PR is blocked by #2543. I'll update the code tomorrow and make it work :) |
b19bbd6
to
b016a81
Compare
Test build #24360 has started for PR 2405 at commit
|
Test build #24360 has finished for PR 2405 at commit
|
Test FAILed. |
Hi @marmbrus @liancheng, I have updated this PR to support |
Test build #24362 has started for PR 2405 at commit
|
Test build #24362 has finished for PR 2405 at commit
|
Test FAILed. |
6e9f94b
to
fa0d2c7
Compare
Test build #24363 has started for PR 2405 at commit
|
Test build #24363 has finished for PR 2405 at commit
|
Test PASSed. |
Hi @marmbrus @liancheng will this PR be part of the 1.2.0 branch ? |
It's not ready to be merged yet, and 1.2.0 has already been finalized. I think it would be great to revisit the implementation once #3724 goes in (hopefully today). I think we can add this feature simply by adding a few cases to the matches in |
Hi @marmbrus, the key reason I want to introduce At first, everything is good as For now, the field-searching logic is duplicated in So I think we need to extract the logic of resolving |
fa0d2c7
to
a2057e7
Compare
Test build #24819 has started for PR 2405 at commit
|
Test build #24819 has finished for PR 2405 at commit
|
Test PASSed. |
/cc @rxin Another API question. |
… of ambiguous reference to fields

When the `GetField` chain (`a.b.c.d.....`) is interrupted by `GetItem`, as in `a.b[0].c.d....`, the check for ambiguous references to fields is broken. The reason is that something like `a.b[0].c.d` is first parsed to `GetField(GetField(GetItem(Unresolved("a.b"), 0), "c"), "d")`. Then in `LogicalPlan#resolve`, we resolve `"a.b"` and build a `GetField` chain from the bottom (the relation). But the 2 outer `GetField`s have to be resolved in `Analyzer`, or lazily in `GetField` itself: check the data type of the child, search for the needed field, etc., which is similar to what we have done in `LogicalPlan#resolve`.

So in this PR, the fix is just to copy the same logic from `LogicalPlan#resolve` to `Analyzer`, which is simple and quick, but I do suggest introducing `UnresolvedGetField` as I explained in #2405.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #4068 from cloud-fan/simple and squashes the following commits:

a6857b5 [Wenchen Fan] fix import order
8411c40 [Wenchen Fan] use UnresolvedGetField
… of ambiguous reference to fields

When the `GetField` chain (`a.b.c.d.....`) is interrupted by `GetItem`, as in `a.b[0].c.d....`, the check for ambiguous references to fields is broken. The reason is that something like `a.b[0].c.d` is first parsed to `GetField(GetField(GetItem(Unresolved("a.b"), 0), "c"), "d")`. Then in `LogicalPlan#resolve`, we resolve `"a.b"` and build a `GetField` chain from the bottom (the relation). But the 2 outer `GetField`s have to be resolved in `Analyzer`, or lazily in `GetField` itself: check the data type of the child, search for the needed field, etc., which is similar to what we have done in `LogicalPlan#resolve`.

So in this PR, the fix is just to copy the same logic from `LogicalPlan#resolve` to `Analyzer`, which is simple and quick, but I do suggest introducing `UnresolvedGetField` as I explained in #2405.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #4068 from cloud-fan/simple and squashes the following commits:

a6857b5 [Wenchen Fan] fix import order
8411c40 [Wenchen Fan] use UnresolvedGetField

(cherry picked from commit 4793c84)
Signed-off-by: Michael Armbrust <michael@databricks.com>
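The parse shape described in the commit message above can be sketched with a few stand-in case classes. These are hypothetical simplifications for illustration, not Catalyst's real expression classes (here, for instance, the `GetItem` ordinal is a plain `Int`), and the helper `outerFields` is invented for this sketch.

```scala
object ParseShape {
  // Simplified stand-ins for the expression nodes named in the commit message.
  sealed trait Expr
  case class Unresolved(name: String) extends Expr
  case class GetItem(child: Expr, ordinal: Int) extends Expr
  case class GetField(child: Expr, field: String) extends Expr

  // `a.b[0].c.d` as described: the part before the index stays one
  // unresolved name, and the trailing fields wrap the GetItem.
  val parsed: Expr =
    GetField(GetField(GetItem(Unresolved("a.b"), 0), "c"), "d")

  // Collect, innermost first, the field names of the GetField nodes in the
  // tree. These sit outside the unresolved name, so resolving "a.b" alone
  // does not check them.
  def outerFields(e: Expr): List[String] = e match {
    case GetField(child, f) => outerFields(child) :+ f
    case GetItem(child, _)  => outerFields(child)
    case Unresolved(_)      => Nil
  }
}
```

Calling `ParseShape.outerFields(ParseShape.parsed)` returns `List("c", "d")`, the two fields that need the same ambiguity check as the names inside `"a.b"`.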
a2057e7
to
08a228a
Compare
Test build #26994 has started for PR 2405 at commit
|
Test build #26994 has finished for PR 2405 at commit
|
Test PASSed. |
Thanks! Merging to master and 1.3 |
~~The rule is simple: If you want `a.b` work, then `a` must be some level of nested array of struct(level 0 means just a StructType). And the result of `a.b` is same level of nested array of b-type. An optimization is: the resolve chain looks like `Attribute -> GetItem -> GetField -> GetField ...`, so we could transmit the nested array information between `GetItem` and `GetField` to avoid repeated computation of `innerDataType` and `containsNullList` of that nested array.~~

marmbrus Could you take a look?

to evaluate `a.b`, if `a` is array of struct, then `a.b` means get field `b` on each element of `a`, and return a result of array.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #2405 from cloud-fan/nested-array-dot and squashes the following commits:

08a228a [Wenchen Fan] support dot notation on array of struct

(cherry picked from commit 0ee53eb)
Signed-off-by: Michael Armbrust <michael@databricks.com>
~~The rule is simple: if you want `a.b` to work, then `a` must be some level of nested array of struct (level 0 means just a StructType). The result of `a.b` is the same level of nested array of b-type. An optimization is: the resolve chain looks like `Attribute -> GetItem -> GetField -> GetField ...`, so we could transmit the nested array information between `GetItem` and `GetField` to avoid repeated computation of `innerDataType` and `containsNullList` of that nested array.~~

@marmbrus Could you take a look?

To evaluate `a.b`: if `a` is an array of struct, then `a.b` means getting field `b` on each element of `a` and returning the results as an array.
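As a rough illustration of this rule, here is a minimal sketch in plain Scala that models a struct as a map. The names `Struct`, `getField` and `getFieldOnArray` are hypothetical and are not the classes added by this PR; the handling of a null element is also an assumption made for the sketch.

```scala
object ArrayOfStructDot {
  // A struct value modeled as a map from field name to value (illustrative only).
  type Struct = Map[String, Any]

  // `a.b` when `a` is a single struct: plain field lookup.
  def getField(a: Struct, name: String): Any = a(name)

  // `a.b` when `a` is an array of struct: look up `b` on every element and
  // return the results as an array. Assumption: a null element yields null.
  def getFieldOnArray(a: Seq[Struct], name: String): Seq[Any] =
    a.map(s => if (s == null) null else s(name))
}
```

For example, with `a = Seq(Map("b" -> 1), Map("b" -> 2))`, `ArrayOfStructDot.getFieldOnArray(a, "b")` gives `Seq(1, 2)`: one value of `b` per element of `a`.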