
[SPARK-2096][SQL] support dot notation on array of struct #2405

Closed
wants to merge 1 commit

Conversation

cloud-fan
Contributor

The rule is simple: if you want a.b to work, then a must be some level of nested array of struct (level 0 means just a StructType), and the result of a.b is the same level of nested array of b's type.
An optimization: the resolve chain looks like Attribute -> GetItem -> GetField -> GetField ..., so we could pass the nested-array information between GetItem and GetField to avoid repeatedly computing the innerDataType and containsNullList of that nested array.

@marmbrus Could you take a look?

To evaluate a.b: if a is an array of structs, then a.b means getting field b from each element of a and returning the results as an array.
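
A rough illustration of the intended behavior (the table, column names, and JSON record below are made up for this example and are not part of the PR):

import org.apache.spark.sql.SQLContext

// Sketch only: assumes a JSON record like
//   {"pets": [{"kind": "cat", "age": 3}, {"kind": "dog", "age": 5}]}
// has been registered as the temporary table "people".
def demo(sqlContext: SQLContext): Unit = {
  // `pets` is ArrayType(StructType(kind, age)); with this patch, `pets.age`
  // selects `age` from every element (roughly pets.map(_.age)) and yields an
  // array column containing [3, 5] for the record above.
  sqlContext.sql("SELECT pets.age FROM people").collect().foreach(println)
}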

@SparkQA

SparkQA commented Sep 16, 2014

Can one of the admins verify this patch?

type EvaluatedType = Any

def dataType = field.dataType
def dataType = buildDataType(containsNullList, field.dataType)
Contributor Author


use def or lazy val?
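
For context on the question above, a REPL-style sketch (not code from this PR): a def re-evaluates its body on every call, while a lazy val evaluates it once on first access and caches the result.

var computations = 0
def asDef: Int = { computations += 1; 42 }
lazy val asLazyVal: Int = { computations += 1; 42 }

asDef; asDef            // body runs twice: computations == 2
asLazyVal; asLazyVal    // body runs once, then is cached: computations == 3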

@marmbrus
Contributor

ok to test

@SparkQA

SparkQA commented Sep 16, 2014

QA tests have started for PR 2405 at commit b19bbd6.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 16, 2014

QA tests have finished for PR 2405 at commit b19bbd6.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class NonASCIICharacterChecker extends ScalariformChecker
    • case class GetItem(child: Expression, ordinal: Expression) extends Expression
    • case class GetField(child: Expression, fieldName: String) extends UnaryExpression

@cloud-fan
Contributor Author

Hmmm, I didn't create the class NonASCIICharacterChecker...
This fix also works for HQL, but I'm not sure where to put the test case. Any ideas?

@yhuai
Contributor

yhuai commented Sep 18, 2014

What will happen if I use this syntax in predicates?

Let's say we have

{
  "f1":[{"f11":1}, {"f11":2}],
  "f2":[{"f22":0}, {"f22":3}],
}

What are the semantics of f1.f11 > f2.f22?
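
To make the question concrete, here is what the two sides would evaluate to under the patch's per-element semantics (toy Scala stand-ins, not code from the PR):

case class E1(f11: Int)
case class E2(f22: Int)

val f1 = Seq(E1(1), E1(2))
val f2 = Seq(E2(0), E2(3))

val lhs = f1.map(_.f11)   // Seq(1, 2) -- what f1.f11 would return
val rhs = f2.map(_.f22)   // Seq(0, 3) -- what f2.f22 would return
// So f1.f11 > f2.f22 would be a comparison between two arrays, which is the
// undefined case discussed in the replies below.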

@cloud-fan
Contributor Author

@yhuai It's hard to define the semantics of f1.f11 > f2.f22, as they are arbitrarily nested arrays. What if the array sizes are not equal? What if the nesting levels are not equal? Currently I just leave it undefined; maybe we could give users a meaningful error message to prohibit them from doing so?

@marmbrus
Contributor

marmbrus commented Oct 2, 2014

Okay here are some thoughts and questions:

  • I don't think it really matters that we can't handle f1.f11 > f2.f22, because we already don't know what to do if a user writes [1,2] > [0,3] even without this new syntax.
  • Am I correct in saying that Hive doesn't support this syntax at all and that we are inventing new functionality? I'm not strictly opposed to this, but we should be careful, as once we support something we can't get rid of it later.
  • I'm not convinced that we need to handle arbitrary array nesting here. The case of getting all of one field from an array (which I guess makes this SQL shorthand for array.map(_.fieldName)) seems reasonable, but is there a use case for the arbitrary-nesting version?
  • This ends up complicating GetField quite a bit. What about creating a new expression type ArrayGetField and adding something to the analyzer that switches expression types when an array is detected? The idea here is to keep each expression simple so we can code-gen on a case-by-case basis. (A rough sketch of this idea follows the list.)
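
A minimal sketch of that last point, using toy stand-ins rather than Catalyst's real Expression API (all names below are illustrative, not the PR's classes): the analyzer inspects the child's data type and switches a plain field access to an array-aware one, so each expression stays simple.

// Toy data types and expressions; not the classes added by this PR.
sealed trait DType
case class TStruct(fields: Map[String, DType]) extends DType
case class TArray(element: DType, containsNull: Boolean) extends DType
case object TInt extends DType

sealed trait Expr { def dataType: DType }
case class StructGet(fieldName: String, fieldType: DType) extends Expr {
  def dataType = fieldType
}
case class ArrayGet(fieldName: String, fieldType: DType, containsNull: Boolean) extends Expr {
  def dataType = TArray(fieldType, containsNull)  // result is an array of the field's type
}

// Analyzer-side switch: pick the expression type based on the child's data type.
def resolveField(childType: DType, name: String): Expr = childType match {
  case TStruct(fields) =>
    StructGet(name, fields(name))           // ordinary struct access, left unchanged
  case TArray(TStruct(fields), nullable) =>
    ArrayGet(name, fields(name), nullable)  // one level of array of struct
  case other =>
    sys.error(s"GetField is not valid on a child of type $other")
}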

@cloud-fan
Contributor Author

I think we can just handle one level of nested array to fix SPARK-2096. What about adding a rule that uses another type of GetField to handle array of struct, so that we can leave GetField unchanged?

@marmbrus
Contributor

marmbrus commented Oct 2, 2014

that sounds pretty reasonable to me

@sziep

sziep commented Dec 9, 2014

Are there any plans to merge this soon? This is a pretty useful feature.

@ayoub-benali

+1

@cloud-fan
Contributor Author

This PR is blocked by #2543. I'll update the code tomorrow and make it work :)

@cloud-fan cloud-fan changed the title [SPARK-2096][SQL] support dot notation on arbitrarily nested array of struct [SPARK-2096][SQL] support dot notation on array of struct Dec 11, 2014
@SparkQA

SparkQA commented Dec 11, 2014

Test build #24360 has started for PR 2405 at commit b016a81.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 11, 2014

Test build #24360 has finished for PR 2405 at commit b016a81.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class UnresolvedGetField(child: Expression, fieldName: String) extends UnaryExpression
    • case class StructGetField(child: Expression, field: StructField, ordinal: Int) extends UnaryExpression
    • case class ArrayGetField(child: Expression, field: StructField, ordinal: Int, containsNull: Boolean)

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24360/
Test FAILed.

@cloud-fan
Contributor Author

Hi @marmbrus @liancheng, I have updated this PR to support GetField on one level of array of struct for now. As I mentioned in #2543, resolving GetField during the analysis phase makes things like this PR easier. Please let me know if you think something is wrong here. Thanks!

@SparkQA

SparkQA commented Dec 11, 2014

Test build #24362 has started for PR 2405 at commit 6e9f94b.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 11, 2014

Test build #24362 has finished for PR 2405 at commit 6e9f94b.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class UnresolvedGetField(child: Expression, fieldName: String) extends UnaryExpression
    • case class StructGetField(child: Expression, field: StructField, ordinal: Int)
    • case class ArrayGetField(child: Expression, field: StructField, ordinal: Int, containsNull: Boolean)

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24362/
Test FAILed.

@SparkQA

SparkQA commented Dec 11, 2014

Test build #24363 has started for PR 2405 at commit fa0d2c7.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 11, 2014

Test build #24363 has finished for PR 2405 at commit fa0d2c7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class UnresolvedGetField(child: Expression, fieldName: String) extends UnaryExpression
    • trait GetField extends UnaryExpression
    • case class StructGetField(child: Expression, field: StructField, ordinal: Int)
    • case class ArrayGetField(child: Expression, field: StructField, ordinal: Int, containsNull: Boolean)

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24363/
Test PASSed.

@ayoub-benali

Hi @marmbrus @liancheng, will this PR be part of the 1.2.0 branch?

@marmbrus
Contributor

It's not ready to be merged yet, and 1.2.0 has already been finalized. I think it would be great to revisit the implementation once #3724 goes in (hopefully today). I think we can add this feature simply by adding a few cases to the matches in GetField and in the Analyzer.

@cloud-fan
Contributor Author

Hi @marmbrus, the key point of why I want to introduce UnresolvedGetField is this: for something like a.b[0].c.d, we first parse it to GetField(GetField(GetItem(Unresolved("a.b"), 0), "c"), "d"). Then in LogicalPlan#resolve, we resolve "a.b" and build a GetField chain from the bottom (the relation). But for the two outer GetField expressions, we have to resolve them in the Analyzer or do it lazily in GetField: check the child's data type, search for the needed field, etc., which is similar to what we have already done in LogicalPlan#resolve.

At first everything is fine, since GetField is quite simple. But as GetField gets more complex (adding resolver logic, supporting array types, etc.), there will be more and more duplicated code across LogicalPlan#resolve, the Analyzer, and GetField.

Right now the field-searching logic is duplicated in LogicalPlan#resolveNesting and GetField#field, but not consistently: in resolveNesting we check for ambiguous references using fields.filter; in GetField#field we do not.

So I think we need to extract the logic of resolving GetField to prevent further trouble.
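
A self-contained sketch of the shape being described (toy classes with simplified constructors, not Catalyst's real ones; attribute resolution is omitted for brevity): the parser emits unresolved field accesses, and a single analyzer-side step resolves them once the child's type is known, instead of spreading that lookup across LogicalPlan#resolve, the Analyzer, and GetField.

sealed trait Expr
case class UnresolvedAttribute(name: String) extends Expr
case class GetItem(child: Expr, ordinal: Int) extends Expr
case class UnresolvedGetField(child: Expr, fieldName: String) extends Expr
case class StructGetField(child: Expr, fieldName: String, ordinal: Int) extends Expr

// `a.b[0].c.d` as parsed: every field access is still unresolved.
val parsed: Expr =
  UnresolvedGetField(
    UnresolvedGetField(
      GetItem(UnresolvedAttribute("a.b"), 0),
      "c"),
    "d")

// After a (hypothetical) analyzer rule has looked up the schema of `a.b`, each
// access carries its field's ordinal; the ordinals 1 and 0 are made up here.
val analyzed: Expr =
  StructGetField(
    StructGetField(
      GetItem(UnresolvedAttribute("a.b"), 0),
      "c", 1),
    "d", 0)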

@SparkQA

SparkQA commented Dec 25, 2014

Test build #24819 has started for PR 2405 at commit a2057e7.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 25, 2014

Test build #24819 has finished for PR 2405 at commit a2057e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class UnresolvedGetField(child: Expression, fieldName: String) extends UnaryExpression
    • case class StructGetField(child: Expression, field: StructField, ordinal: Int) extends GetField
    • case class ArrayGetField(child: Expression, field: StructField, ordinal: Int, containsNull: Boolean)
    • trait GetField extends UnaryExpression

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24819/
Test PASSed.

@marmbrus
Contributor

/cc @rxin

Another API question.

asfgit pushed a commit that referenced this pull request Feb 6, 2015
… of ambiguous reference to fields

When the `GetField` chain (`a.b.c.d...`) is interrupted by a `GetItem` like `a.b[0].c.d...`, the check for ambiguous references to fields is broken.
The reason is that for something like `a.b[0].c.d`, we first parse it to `GetField(GetField(GetItem(Unresolved("a.b"), 0), "c"), "d")`. Then in `LogicalPlan#resolve`, we resolve `"a.b"` and build a `GetField` chain from the bottom (the relation). But for the two outer `GetField` expressions, we have to resolve them in the `Analyzer` or do it lazily in `GetField`: check the child's data type, search for the needed field, etc., which is similar to what we have already done in `LogicalPlan#resolve`.
So in this PR, the fix just copies the same logic from `LogicalPlan#resolve` into the `Analyzer`, which is simple and quick, but I do suggest introducing `UnresolvedGetField` as I explained in #2405.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #4068 from cloud-fan/simple and squashes the following commits:

a6857b5 [Wenchen Fan] fix import order
8411c40 [Wenchen Fan] use UnresolvedGetField
asfgit pushed a commit that referenced this pull request Feb 6, 2015
… of ambiguous reference to fields

(cherry picked from commit 4793c84)
Signed-off-by: Michael Armbrust <michael@databricks.com>
@cloud-fan
Contributor Author

Hi @marmbrus , since #4068 is merged, it's much simpler to implement this now. Do you have time to review it? Thanks!

@SparkQA

SparkQA commented Feb 7, 2015

Test build #26994 has started for PR 2405 at commit 08a228a.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 7, 2015

Test build #26994 has finished for PR 2405 at commit 08a228a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait GetField extends UnaryExpression
    • case class StructGetField(child: Expression, field: StructField, ordinal: Int) extends GetField
    • case class ArrayGetField(child: Expression, field: StructField, ordinal: Int, containsNull: Boolean)

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26994/
Test PASSed.

@marmbrus
Contributor

Thanks! Merging to master and 1.3

asfgit pushed a commit that referenced this pull request Feb 10, 2015
~~The rule is simple: if you want `a.b` to work, then `a` must be some level of nested array of struct (level 0 means just a StructType), and the result of `a.b` is the same level of nested array of b's type.
An optimization: the resolve chain looks like `Attribute -> GetItem -> GetField -> GetField ...`, so we could pass the nested-array information between `GetItem` and `GetField` to avoid repeatedly computing the `innerDataType` and `containsNullList` of that nested array.~~
marmbrus Could you take a look?

To evaluate `a.b`: if `a` is an array of structs, then `a.b` means getting field `b` from each element of `a` and returning the results as an array.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #2405 from cloud-fan/nested-array-dot and squashes the following commits:

08a228a [Wenchen Fan] support dot notation on array of struct

(cherry picked from commit 0ee53eb)
Signed-off-by: Michael Armbrust <michael@databricks.com>
@asfgit asfgit closed this in 0ee53eb Feb 10, 2015
@cloud-fan cloud-fan deleted the nested-array-dot branch February 10, 2015 00:42