feat(bigquery): unnest redux #7157
cc @tswast for thoughts/review
BigQuery tests are all passing:
I'm going to PR the literal refactor separately; it's independent of this PR.
BigQuery tests are passing:
BigQuery tests successful after rebasing on #7166:
I confess I can't fully review this code. Generally things look fine though (sans a question about the TODO).
Before merging, do you intend to squash/rewrite commit messages? Currently one reads as "feat(bigquery): hack in unnest", which feels a bit spooky to include in our release notes 😬.
I haven't looked much at the implementation, but I figured I'd hit it hard against some weird schemas.

Schema:

[
  {
    "fields": [
      {
        "fields": [
          {
            "mode": "REPEATED",
            "name": "doubly_nested_array",
            "type": "INTEGER"
          },
          {
            "mode": "NULLABLE",
            "name": "doubly_nested_field",
            "type": "STRING"
          }
        ],
        "mode": "REPEATED",
        "name": "nested_struct_col",
        "type": "RECORD"
      }
    ],
    "mode": "REPEATED",
    "name": "repeated_struct_col",
    "type": "RECORD"
  },
  {
    "mode": "NULLABLE",
    "name": "rowindex",
    "type": "INTEGER"
  }
]

Data: https://gist.github.com/tswast/a872310b88ba8406932017cdf13991fd

Ibis sample:

import ibis

bq = ibis.bigquery.connect(project_id="swast-scratch")
table = bq.table("swast-scratch.my_dataset.array_test")
repeated_struct_col = table["repeated_struct_col"]

# Works as expected :-)
table.select([table["rowindex"], repeated_struct_col.unnest()]).to_pandas()

# Fails with AssertionError: got more than one unnest node: 2
# on File ~/src/ibis/ibis/backends/base/sql/compiler/query_builder.py:322, in Select.format_select_set(self)
table.select([table["rowindex"], repeated_struct_col.unnest()["nested_struct_col"].unnest()]).to_pandas()
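For anyone following along, here is a rough pure-Python sketch of the row-level result the two queries above are after. It is not ibis or BigQuery code: the `unnest` helper and the sample values are made up for illustration, and it assumes join-style semantics where a row with an empty or NULL array produces no output rows.

```python
from typing import Any


def unnest(rows: list[dict[str, Any]], column: str) -> list[dict[str, Any]]:
    """Emit one output row per element of row[column] (hypothetical helper).

    Rows whose array is empty or None produce no output, which mirrors a
    correlated CROSS JOIN against UNNEST(column).
    """
    out = []
    for row in rows:
        for element in row[column] or []:
            out.append({**row, column: element})
    return out


# One row shaped like the schema above; the values are invented.
table = [
    {
        "rowindex": 0,
        "repeated_struct_col": [
            {
                "nested_struct_col": [
                    {"doubly_nested_array": [1, 2], "doubly_nested_field": "a"},
                    {"doubly_nested_array": [3], "doubly_nested_field": "b"},
                ]
            },
            {
                "nested_struct_col": [
                    {"doubly_nested_array": [], "doubly_nested_field": "c"},
                ]
            },
        ],
    }
]

# First query: a single unnest of the outer repeated struct.
level1 = unnest(table, "repeated_struct_col")

# Second query: pull the inner array out of each unnested struct, then
# unnest again. This is the chained case that currently fails.
level2 = unnest(
    [
        {
            "rowindex": r["rowindex"],
            "nested_struct_col": r["repeated_struct_col"]["nested_struct_col"],
        }
        for r in level1
    ],
    "nested_struct_col",
)
```

The second unnest depends on the output of the first, so the two cannot be treated as independent nodes in one select list.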
Your hack gives me an idea. I wonder if we can use a correlated table subquery with an UNNEST as a named table expression, or if we actually do need the "CROSS JOIN"? I'll do some experiments in SQL next week.
I missed this part of correlated table subqueries: "You can only use these in the FROM clause." The following query does what I want, which is a correlated join:
That seems pretty well aligned with the approach you have so far:
We're on the right track, but I see why you have the restriction of a single node for the string replacement hack. I'll look more closely to see what we could do.
@tswast I took a totally different approach with the latest set of changes and I believe it's a […]. I'm using sqlglot to transform functional unnest calls in select position to […]. The way this works is: […]
The BigQuery unnest tests are all passing, which is pretty nice :)
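To give a feel for the kind of rewrite described here, below is a toy sketch that moves an unnest call out of select position and into the FROM clause. It does not use sqlglot and is not the code in this PR: `Column`, `Unnest`, `hoist_unnests`, and the `_unnest_N` aliases are all hypothetical names, and the CROSS JOIN UNNEST target form is an assumption based on the discussion earlier in the thread.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Column:
    """A plain column reference in the select list (hypothetical node type)."""
    name: str


@dataclass(frozen=True)
class Unnest:
    """A functional unnest call in select position (hypothetical node type)."""
    array: str


def hoist_unnests(projections: list[Union[Column, Unnest]], table: str) -> str:
    """Replace each Unnest projection with an alias and add a matching join."""
    select_list = []
    joins = []
    for projection in projections:
        if isinstance(projection, Unnest):
            # Assumed target form: one CROSS JOIN UNNEST(...) per unnest call.
            alias = f"_unnest_{len(joins)}"
            joins.append(f"CROSS JOIN UNNEST({projection.array}) AS {alias}")
            select_list.append(alias)
        else:
            select_list.append(projection.name)
    return " ".join([f"SELECT {', '.join(select_list)}", f"FROM {table}", *joins])


sql = hoist_unnests(
    [Column("t.rowindex"), Unnest("t.repeated_struct_col")],
    "my_dataset.array_test AS t",
)
print(sql)
```

Working on a parsed tree instead of the SQL string is what lifts the single-node restriction of the string replacement hack: each unnest call gets its own alias and join.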
We're still not able to chain unnests yet (you need to select an […])
@mark.broken(
    ["bigquery"],
    raises=GoogleBadRequest,
    reason='400 Syntax error: Expected keyword JOIN but got identifier "SEMI"',
sqlglot is also giving us the SEMI/ANTI join transformation to EXISTS/NOT EXISTS as well!
@@ -972,11 +972,6 @@ def query(t, group_cols):

@pytest.mark.notimpl(["dask", "pandas", "oracle"], raises=com.OperationNotDefinedError)
@pytest.mark.notimpl(["druid"], raises=AssertionError)
@pytest.mark.notyet(
    ["bigquery"],
pivot_longer FTW!
The one failure from the BigQuery backend tests should be addressed when tobymao/sqlglot#2290 is fixed.
I like this approach.
In my tests in googleapis/python-bigquery-dataframes#53, it seems the sqlglot piece isn't causing any regressions that weren't already there in the master branch.
sqlglot had another release (18.7.0) that brings in the fixes we need for this PR. As soon as everything is green I'll merge. In the meantime, I'll run the cloud backend tests and post the results here.
Cloud backend tests are passing:
This PR is a revival of #5767, with all the same caveats except that it employs a string replacement hack to make unnest work in subqueries.

I'm not sure if this is the best way to go about this, given our efforts to move to sqlglot, but code has a way of sticking around longer than we'd like and people have been asking for unnest for ... years.

I'd love to get some folks to kick the tires on this implementation before merging it to master if possible.

Depends on #7166.