[SPARK-2096][SQL] Correctly parse dot notations #2230
Can one of the admins verify this patch? |
ok to test |
QA tests have started for PR 2230 at commit
|
QA tests have finished for PR 2230 at commit
|
Sorry about the code style; fixed! Please test again.
QA tests have started for PR 2230 at commit
|
QA tests have finished for PR 2230 at commit
|
Thanks for working on this! The changes made to the parser seem reasonable to me. Thanks for the thorough explanation. Can you explain your changes to LogicalPlan a little more and add some inline comments? That's a very crucial piece of code and I'm a little nervous about changing it. Also, it seems like we might be missing the distinct logic, based on the failing test case.
QA tests have started for PR 2230 at commit
|
@marmbrus Sorry for missing the |
QA tests have finished for PR 2230 at commit
|
Yeah, I'd like to simplify this, but unfortunately I think this version introduces a regression for Hive queries. I've made a PR (against your PR) that shows this regression: cloud-fan#1. It would be great if you could merge that and either roll back or propose an alternative. Thanks :)
@marmbrus Seems hive parser will pass something like "a.b.c..." to |
I'm not sure how to modify |
Can one of the admins verify this patch? |
ok to test |
Hmm, does Hive support using dot notation to access fields that are arbitrarily nested in arrays? If not I think it would be better to just support one level. Also, the code added for that feature uses a lot of mutable state and is a little hard to follow (in addition to removing the type check). Since I'd really like to include your parser fixes and test cleanup, what do you think about breaking out the GetField on arrays change into another PR? |
QA tests have started for PR 2230 at commit
|
Actually, Hive doesn't support using dot notation to access fields of a nested array, not even at one level. Anyway, I will move this support into another PR to keep this one simple and clear :)
QA tests have finished for PR 2230 at commit
|
The failed test case seems to be a regression test for a recent fix. I have rebased to include that fix. Please test again.
QA tests have started for PR 2230 at commit
|
QA tests have finished for PR 2230 at commit
|
Yeah I think the test failure was unrelated, though unfortunately this is out of date again. Mind updating one more time? Thanks! |
Rebase done; please test again.
Jenkins, test this please |
QA tests have started for PR 2230 at commit
|
QA tests have finished for PR 2230 at commit
|
Thanks for working on this! Merged to master. |
First, let me write down the current projections grammar of Spark SQL:

For something like a.b.c[1], it will be parsed as:

But for something like a[1].b, the current grammar can't parse it correctly.

A simple solution is written in ParquetQuerySuite#NestedSqlParser; the changed grammars are:

This works well, but it can't cover some corner cases like select t.a.b from table as t: with this new grammar, t.a.b is parsed as GetField(GetField(UnResolved("t"), "a"), "b") instead of GetField(UnResolved("t.a"), "b").

However, we can't resolve t, as it's not a field but the whole table. (If we could, then select t from table as t would be legal, which is unexpected.)

My solution is:
All test cases under sql pass locally, and I added a more complex case.
"arrayOfStruct.field1 to access all values of field1" is not supported yet. Since this PR has changed a lot of code, I will open another PR for it.
I'm not familiar with the later optimization phase; please correct me if I missed something.
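
To make the two parse shapes discussed in this description concrete, here is a minimal Scala sketch. It is not the code from this PR and does not use Catalyst's real classes: Unresolved, GetField, and GetItem are simplified stand-ins, and tokenize, naive, aliasAware, and the tableAliases parameter are hypothetical names introduced only for illustration. It assumes paths consist of an identifier followed by .field or [integer] steps.

```scala
object DotNotationSketch {
  // Simplified stand-ins for expression nodes (assumption: not Catalyst's real classes).
  sealed trait Expr
  case class Unresolved(name: String) extends Expr
  case class GetField(child: Expr, field: String) extends Expr
  case class GetItem(child: Expr, index: Int) extends Expr

  // One postfix step applied to an expression: `.field` or `[index]`.
  sealed trait Step
  case class Field(name: String) extends Step
  case class Index(i: Int) extends Step

  private val Ident = "([A-Za-z_][A-Za-z0-9_]*)(.*)".r
  private val Dot   = "\\.([A-Za-z_][A-Za-z0-9_]*)(.*)".r
  private val Sub   = "\\[(\\d+)\\](.*)".r

  // Split e.g. "a[1].b" into ("a", List(Index(1), Field("b"))).
  def tokenize(path: String): (String, List[Step]) = {
    def parseSteps(rest: String): List[Step] = rest match {
      case ""           => Nil
      case Dot(name, r) => Field(name) :: parseSteps(r)
      case Sub(i, r)    => Index(i.toInt) :: parseSteps(r)
      case other        => sys.error(s"cannot parse: $other")
    }
    path match {
      case Ident(head, rest) => (head, parseSteps(rest))
      case other             => sys.error(s"cannot parse: $other")
    }
  }

  // Apply the steps left to right on top of a base expression.
  private def build(base: Expr, steps: List[Step]): Expr =
    steps.foldLeft[Expr](base) {
      case (e, Field(n)) => GetField(e, n)
      case (e, Index(i)) => GetItem(e, i)
    }

  // Treats the first identifier as the attribute: t.a.b becomes nested GetFields on "t".
  def naive(path: String): Expr = {
    val (head, steps) = tokenize(path)
    build(Unresolved(head), steps)
  }

  // Hypothetical variant: if the first identifier is a known table alias,
  // keep "alias.column" together inside the unresolved name.
  def aliasAware(path: String, tableAliases: Set[String]): Expr = {
    val (head, steps) = tokenize(path)
    steps match {
      case Field(col) :: rest if tableAliases(head) =>
        build(Unresolved(s"$head.$col"), rest)
      case _ =>
        build(Unresolved(head), steps)
    }
  }
}
```

With these definitions, naive("t.a.b") yields GetField(GetField(Unresolved("t"), "a"), "b"), while aliasAware("t.a.b", Set("t")) yields GetField(Unresolved("t.a"), "b"), which is the shape the description says is wanted when t is a table alias. In a real system the parser usually does not know the aliases, so this decision would more plausibly live in the resolution step; the sketch only shows the difference between the two trees.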