
[SPARK-2096][SQL] support dot notation on array of struct #2405

Closed
wants to merge 1 commit

Conversation

cloud-fan
Contributor

The rule is simple: if you want a.b to work, then a must be some level of nested array of struct (level 0 means just a StructType), and the result of a.b is the same level of nested array of b's type.
An optimization: the resolve chain looks like Attribute -> GetItem -> GetField -> GetField ..., so we could pass the nested-array information between GetItem and GetField to avoid repeatedly computing the innerDataType and containsNullList of that nested array.

@marmbrus Could you take a look?

To evaluate a.b: if a is an array of structs, then a.b means getting field b from each element of a and returning the results as an array.
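
A rough illustration of the intended behavior (the table, column names, and JSON record below are made up for this example and are not part of the PR):

import org.apache.spark.sql.SQLContext

// Sketch only: assumes a JSON record like
//   {"pets": [{"kind": "cat", "age": 3}, {"kind": "dog", "age": 5}]}
// has been registered as the temporary table "people".
def demo(sqlContext: SQLContext): Unit = {
  // `pets` is ArrayType(StructType(kind, age)); with this patch, `pets.age`
  // selects `age` from every element (roughly pets.map(_.age)) and yields an
  // array column containing [3, 5] for the record above.
  sqlContext.sql("SELECT pets.age FROM people").collect().foreach(println)
}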

@SparkQA

SparkQA commented Sep 16, 2014

Can one of the admins verify this patch?

type EvaluatedType = Any

def dataType = field.dataType
def dataType = buildDataType(containsNullList, field.dataType)
Contributor Author


use def or lazy val?
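
For context on the question above, a REPL-style sketch (not code from this PR): a def re-evaluates its body on every call, while a lazy val evaluates it once on first access and caches the result.

var computations = 0
def asDef: Int = { computations += 1; 42 }
lazy val asLazyVal: Int = { computations += 1; 42 }

asDef; asDef            // body runs twice: computations == 2
asLazyVal; asLazyVal    // body runs once, then is cached: computations == 3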

@marmbrus
Contributor

ok to test

@SparkQA

SparkQA commented Sep 16, 2014

QA tests have started for PR 2405 at commit b19bbd6.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 16, 2014

QA tests have finished for PR 2405 at commit b19bbd6.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class NonASCIICharacterChecker extends ScalariformChecker
    • case class GetItem(child: Expression, ordinal: Expression) extends Expression
    • case class GetField(child: Expression, fieldName: String) extends UnaryExpression

@cloud-fan
Contributor Author

Hmmm, I didn't create the class NonASCIICharacterChecker...
This fix also works for HQL, but I'm not sure where to put the test case. Any ideas?

@yhuai
Contributor

yhuai commented Sep 18, 2014

What will happen if I use this syntax in predicates?

Let's say we have

{
  "f1":[{"f11":1}, {"f11":2}],
  "f2":[{"f22":0}, {"f22":3}],
}

What are the semantics of f1.f11 > f2.f22?
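
To make the question concrete, here is what the two sides would evaluate to under the patch's per-element semantics (toy Scala stand-ins, not code from the PR):

case class E1(f11: Int)
case class E2(f22: Int)

val f1 = Seq(E1(1), E1(2))
val f2 = Seq(E2(0), E2(3))

val lhs = f1.map(_.f11)   // Seq(1, 2) -- what f1.f11 would return
val rhs = f2.map(_.f22)   // Seq(0, 3) -- what f2.f22 would return
// So f1.f11 > f2.f22 would be a comparison between two arrays, which is the
// undefined case discussed in the replies below.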

@cloud-fan
Contributor Author

@yhuai It's hard to define the semantics of f1.f11 > f2.f22, as they are arbitrarily nested arrays. What if the array sizes are not equal? What if the nesting levels are not equal? Currently I just leave it undefined; maybe we could give users a meaningful error message to prohibit them from doing so?

@marmbrus
Contributor

marmbrus commented Oct 2, 2014

Okay here are some thoughts and questions:

  • I don't think it really matters that we can't handle f1.f11 > f2.f22, because we already don't know what to do if a user writes [1,2] > [0,3] even without this new syntax.
  • Am I correct in saying that Hive doesn't support this syntax at all and that we are inventing new functionality? I'm not strictly opposed to this, but we should be careful, as once we support something we can't get rid of it later.
  • I'm not convinced that we need to handle arbitrary array nesting here. The case of getting all of one field from an array (which I guess makes this SQL shorthand for array.map(_.fieldName)) seems reasonable, but is there a use case for the arbitrary-nesting version?
  • This ends up complicating GetField quite a bit. What about creating a new expression type ArrayGetField and adding something to the analyzer that switches expression types when an array is detected? The idea here is to keep each expression simple so we can code-gen on a case-by-case basis. (A rough sketch of this idea follows the list.)
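
A minimal sketch of that last point, using toy stand-ins rather than Catalyst's real Expression API (all names below are illustrative, not the PR's classes): the analyzer inspects the child's data type and switches a plain field access to an array-aware one, so each expression stays simple.

// Toy data types and expressions; not the classes added by this PR.
sealed trait DType
case class TStruct(fields: Map[String, DType]) extends DType
case class TArray(element: DType, containsNull: Boolean) extends DType
case object TInt extends DType

sealed trait Expr { def dataType: DType }
case class StructGet(fieldName: String, fieldType: DType) extends Expr {
  def dataType = fieldType
}
case class ArrayGet(fieldName: String, fieldType: DType, containsNull: Boolean) extends Expr {
  def dataType = TArray(fieldType, containsNull)  // result is an array of the field's type
}

// Analyzer-side switch: pick the expression type based on the child's data type.
def resolveField(childType: DType, name: String): Expr = childType match {
  case TStruct(fields) =>
    StructGet(name, fields(name))           // ordinary struct access, left unchanged
  case TArray(TStruct(fields), nullable) =>
    ArrayGet(name, fields(name), nullable)  // one level of array of struct
  case other =>
    sys.error(s"GetField is not valid on a child of type $other")
}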

@cloud-fan
Contributor Author

I think we can just handle one level of nested array to fix SPARK-2096. What about adding a rule that uses another type of GetField to handle array of struct, so that we can leave GetField unchanged?

@marmbrus
Contributor

marmbrus commented Oct 2, 2014

that sounds pretty reasonable to me

@sziep

sziep commented Dec 9, 2014

Are there any plans to merge this soon? This is a pretty useful feature.

@ayoub-benali

+1

@cloud-fan
Contributor Author

This PR is blocked by #2543. I'll update the code tomorrow and make it work :)

@cloud-fan cloud-fan changed the title [SPARK-2096][SQL] support dot notation on arbitrarily nested array of struct [SPARK-2096][SQL] support dot notation on array of struct Dec 11, 2014
@SparkQA

SparkQA commented Dec 11, 2014

Test build #24360 has started for PR 2405 at commit b016a81.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 11, 2014

Test build #24360 has finished for PR 2405 at commit b016a81.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class UnresolvedGetField(child: Expression, fieldName: String) extends UnaryExpression
    • case class StructGetField(child: Expression, field: StructField, ordinal: Int) extends UnaryExpression
    • case class ArrayGetField(child: Expression, field: StructField, ordinal: Int, containsNull: Boolean)

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24360/
Test FAILed.

@cloud-fan
Contributor Author

Hi @marmbrus @liancheng, I have updated this PR to support GetField on one level of array of struct for now. As I mentioned in #2543, resolving GetField during the analysis phase makes things like this PR easier. Please let me know if you think something is wrong here. Thanks!

@SparkQA

SparkQA commented Dec 11, 2014

Test build #24362 has started for PR 2405 at commit 6e9f94b.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 11, 2014

Test build #24362 has finished for PR 2405 at commit 6e9f94b.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class UnresolvedGetField(child: Expression, fieldName: String) extends UnaryExpression
    • case class StructGetField(child: Expression, field: StructField, ordinal: Int)
    • case class ArrayGetField(child: Expression, field: StructField, ordinal: Int, containsNull: Boolean)

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24362/
Test FAILed.

@SparkQA

SparkQA commented Dec 11, 2014

Test build #24363 has started for PR 2405 at commit fa0d2c7.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 11, 2014

Test build #24363 has finished for PR 2405 at commit fa0d2c7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class UnresolvedGetField(child: Expression, fieldName: String) extends UnaryExpression
    • trait GetField extends UnaryExpression
    • case class StructGetField(child: Expression, field: StructField, ordinal: Int)
    • case class ArrayGetField(child: Expression, field: StructField, ordinal: Int, containsNull: Boolean)

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24363/
Test PASSed.

@ayoub-benali

Hi @marmbrus @liancheng, will this PR be part of the 1.2.0 branch?

@marmbrus
Contributor

It's not ready to be merged yet, and 1.2.0 has already been finalized. I think it would be great to revisit the implementation once #3724 goes in (hopefully today). I think we can add this feature simply by adding a few cases to the matches in GetField and in the Analyzer.

@cloud-fan
Contributor Author

Hi @marmbrus, the key point of why I want to introduce UnresolvedGetField is this: for something like a.b[0].c.d, we first parse it to GetField(GetField(GetItem(Unresolved("a.b"), 0), "c"), "d"). Then in LogicalPlan#resolve, we resolve "a.b" and build a GetField chain from the bottom (the relation). But for the two outer GetField expressions, we have to resolve them in the Analyzer or do it lazily in GetField: check the child's data type, search for the needed field, etc., which is similar to what we have already done in LogicalPlan#resolve.

At first everything is fine, since GetField is quite simple. But as GetField gets more complex (adding resolver logic, supporting array types, etc.), there will be more and more duplicated code across LogicalPlan#resolve, the Analyzer, and GetField.

Right now the field-searching logic is duplicated in LogicalPlan#resolveNesting and GetField#field, but not consistently: in resolveNesting we check for ambiguous references using fields.filter; in GetField#field we do not.

So I think we need to extract the logic of resolving GetField to prevent further trouble.
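
A self-contained sketch of the shape being described (toy classes with simplified constructors, not Catalyst's real ones; attribute resolution is omitted for brevity): the parser emits unresolved field accesses, and a single analyzer-side step resolves them once the child's type is known, instead of spreading that lookup across LogicalPlan#resolve, the Analyzer, and GetField.

sealed trait Expr
case class UnresolvedAttribute(name: String) extends Expr
case class GetItem(child: Expr, ordinal: Int) extends Expr
case class UnresolvedGetField(child: Expr, fieldName: String) extends Expr
case class StructGetField(child: Expr, fieldName: String, ordinal: Int) extends Expr

// `a.b[0].c.d` as parsed: every field access is still unresolved.
val parsed: Expr =
  UnresolvedGetField(
    UnresolvedGetField(
      GetItem(UnresolvedAttribute("a.b"), 0),
      "c"),
    "d")

// After a (hypothetical) analyzer rule has looked up the schema of `a.b`, each
// access carries its field's ordinal; the ordinals 1 and 0 are made up here.
val analyzed: Expr =
  StructGetField(
    StructGetField(
      GetItem(UnresolvedAttribute("a.b"), 0),
      "c", 1),
    "d", 0)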

@SparkQA

SparkQA commented Dec 25, 2014

Test build #24819 has started for PR 2405 at commit a2057e7.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 25, 2014

Test build #24819 has finished for PR 2405 at commit a2057e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class UnresolvedGetField(child: Expression, fieldName: String) extends UnaryExpression
    • case class StructGetField(child: Expression, field: StructField, ordinal: Int) extends GetField
    • case class ArrayGetField(child: Expression, field: StructField, ordinal: Int, containsNull: Boolean)
    • trait GetField extends UnaryExpression

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24819/
Test PASSed.

@marmbrus
Contributor

/cc @rxin

Another API question.

asfgit pushed a commit that referenced this pull request Feb 6, 2015
… of ambiguous reference to fields

When the `GetField` chain (`a.b.c.d...`) is interrupted by a `GetItem` like `a.b[0].c.d...`, the check for ambiguous references to fields is broken.
The reason is that for something like `a.b[0].c.d`, we first parse it to `GetField(GetField(GetItem(Unresolved("a.b"), 0), "c"), "d")`. Then in `LogicalPlan#resolve`, we resolve `"a.b"` and build a `GetField` chain from the bottom (the relation). But for the two outer `GetField` expressions, we have to resolve them in the `Analyzer` or do it lazily in `GetField`: check the child's data type, search for the needed field, etc., which is similar to what we have already done in `LogicalPlan#resolve`.
So in this PR, the fix just copies the same logic from `LogicalPlan#resolve` into the `Analyzer`, which is simple and quick, but I do suggest introducing `UnresolvedGetField` as I explained in #2405.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #4068 from cloud-fan/simple and squashes the following commits:

a6857b5 [Wenchen Fan] fix import order
8411c40 [Wenchen Fan] use UnresolvedGetField
asfgit pushed a commit that referenced this pull request Feb 6, 2015
… of ambiguous reference to fields

(cherry picked from commit 4793c84)
Signed-off-by: Michael Armbrust <michael@databricks.com>
@cloud-fan
Contributor Author

Hi @marmbrus , since #4068 is merged, it's much simpler to implement this now. Do you have time to review it? Thanks!

@SparkQA

SparkQA commented Feb 7, 2015

Test build #26994 has started for PR 2405 at commit 08a228a.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 7, 2015

Test build #26994 has finished for PR 2405 at commit 08a228a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait GetField extends UnaryExpression
    • case class StructGetField(child: Expression, field: StructField, ordinal: Int) extends GetField
    • case class ArrayGetField(child: Expression, field: StructField, ordinal: Int, containsNull: Boolean)

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26994/
Test PASSed.

@marmbrus
Contributor

Thanks! Merging to master and 1.3

asfgit pushed a commit that referenced this pull request Feb 10, 2015
~~The rule is simple: if you want `a.b` to work, then `a` must be some level of nested array of struct (level 0 means just a StructType), and the result of `a.b` is the same level of nested array of b's type.
An optimization: the resolve chain looks like `Attribute -> GetItem -> GetField -> GetField ...`, so we could pass the nested-array information between `GetItem` and `GetField` to avoid repeatedly computing the `innerDataType` and `containsNullList` of that nested array.~~
marmbrus Could you take a look?

To evaluate `a.b`: if `a` is an array of structs, then `a.b` means getting field `b` from each element of `a` and returning the results as an array.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #2405 from cloud-fan/nested-array-dot and squashes the following commits:

08a228a [Wenchen Fan] support dot notation on array of struct

(cherry picked from commit 0ee53eb)
Signed-off-by: Michael Armbrust <michael@databricks.com>
@asfgit asfgit closed this in 0ee53eb Feb 10, 2015
@cloud-fan cloud-fan deleted the nested-array-dot branch February 10, 2015 00:42