Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(bigquery): unnest redux #7157

Merged
merged 3 commits into from
Sep 24, 2023
Merged

Conversation

cpcloud
Copy link
Member

@cpcloud cpcloud commented Sep 14, 2023

This PR is a revival of #5767, with all the same caveats except that it employs
a string replacement hack to make unnest work in subqueries.

I'm not sure if this is the best way to go about this, given our efforts to
move to sqlglot, but code has a way of sticking around longer than we'd like
and people have been asking for unnest for ... years.

I'd love to get some folks to kick the tires on this implementation before merging
it to master if possible.

Depends on #7166.

@cpcloud cpcloud added feature Features or general enhancements bigquery The BigQuery backend labels Sep 14, 2023
@cpcloud
Copy link
Member Author

cpcloud commented Sep 14, 2023

cc @tswast for thoughts/review

@cpcloud
Copy link
Member Author

cpcloud commented Sep 14, 2023

BigQuery tests are all passing:

cloud in 🌐 falcon in …/ibis on  bigquery-unnest is 📦 v6.1.0 via 🐍 v3.10.12 via ❄️  impure (ibis-3.10.12-env) took 2m50s
❯ pytest -m bigquery -n auto --dist loadgroup --snapshot-update -q
bringing up nodes...
x.....x..........x.x..x.......x.x.xx...x.......x.x.........x...x.............x.xxxxxx....xxx.xxxxxxxxxxx..x..x.xx.x.x.x.x..x..x............x.x...x.xxxxx..x.x.x.xxx.x...x.x...x....xx......x.x [ 12%]
xxx.......x.xx........xxx..x...x..xxx.x..x......x.....x...........xx....xx.....xx.x.x.xxx.x.......xsxx...............x.x.....x...x......x..x.......x................xx.....x...x....x..xxx..x. [ 25%]
..x........x..x.xxx...x...x....x............x.x....x..x.....x............x....x.................x....x......xx.x.xx..xx..xxxxxx.x.xxxxxxssxsxxxx..x.x....xx.....xx..x..xx.......x....xx.x..x.. [ 38%]
.s.x.x...................x..x....x..................xxsxxx...x.x...x....x..xx....xx.....s.x....x....x..........x...x..x.....s........x...x......x....x.........xxx....xx..........x..x........ [ 51%]
...x.......x.......x.....x.............x.x..xxx........x........x.x..........x....x..x.......xx...........x...x..xx......x.........x........................x................................. [ 64%]
....x..............................x.............................................................x.....................x...................................................................... [ 77%]
.....................x.........x...............................................x......x.............x.....................................x.x..xxx..xx........................................ [ 89%]
..............................................................................................................................................s.......                                         [100%]
1223 passed, 9 skipped, 248 xfailed in 159.11s (0:02:39)

@cpcloud
Copy link
Member Author

cpcloud commented Sep 15, 2023

I'm going to PR the literal refactor separately, it's independent of this PR.

@cpcloud cpcloud force-pushed the bigquery-unnest branch 4 times, most recently from 09a9280 to c66ea66 Compare September 15, 2023 14:27
@cpcloud
Copy link
Member Author

cpcloud commented Sep 15, 2023

BigQuery tests are passing:

…/ibis on  bigquery-unnest is 📦 v6.1.0 via 🐍 v3.10.12 via ❄️  impure (ibis-3.10.12-env)
❯ pytest -m bigquery -n auto --dist loadgroup --snapshot-update -q
bringing up nodes...
.x........x......x.......x...xxx.xx.x.xxxxx.....x.x.xx.xxx...xx........x..xxx...xxx.x....x.xxx.x..xxx.xx....xx.........xxx.....x...xxx..x....x..x..xx.....xx..x.x.xx....xx.....x..x...x....... [ 12%]
.........s...x.x...xx...xx..x...x..x....x.x..x.x.........xx.............xx........x..x.....xx...xx.x..xx...xxx.xx...........xx......x.........x........x.....................x................ [ 25%]
...............................x...xs...x.xx..x.x...x.....x....x..sx....x.........xxsx......xx...xx......xx........x........x.........x....x................x..xx...........x................. [ 38%]
........x.................xx.x.xxxxxxxxx..xxxx.xxxxxxxx....x...xx....x...xx...x.....x.xx.x......x.............x...x.x.x..x..x...x..x...x.xx..................xx...x....sxxxx.sxxxxx..sxxxxxxx. [ 51%]
.xx..x.xx....x....x...xx.xx....x...sxxx....x.x...x...........x.........xxx.......................x....x..........x...x.x..x..x..x........x.........x...........x...x.................xx....... [ 64%]
.............x................................................................................................................................................................................ [ 76%]
...........................................................................................x..........x......................................................x................................ [ 89%]
....x..............x..........................................................................................xxxx..x.x...............s...............                                         [100%]
1223 passed, 9 skipped, 248 xfailed in 209.95s (0:03:29)

@cpcloud cpcloud force-pushed the bigquery-unnest branch 2 times, most recently from 8ea3cb0 to 79e0e85 Compare September 15, 2023 14:53
@cpcloud cpcloud force-pushed the bigquery-unnest branch 2 times, most recently from 1c4120b to 5889388 Compare September 15, 2023 16:36
@cpcloud
Copy link
Member Author

cpcloud commented Sep 15, 2023

BigQuery tests successful after rebasing on #7166:

cloud in 🌐 falcon in …/ibis on  bigquery-unnest is 📦 v6.1.0 via 🐍 v3.10.12 via ❄️  impure (ibis-3.10.12-env)
❯ pytest -m bigquery -n auto --dist loadgroup --snapshot-update -q
bringing up nodes...
.xxxx.........x..x......x..x..x....x..x...x..x.....x...x...x...x....x..xxx.........x....x...........x...........xx.x..x.x...x.xx..s.x.....xx.x...sx...xxx.x.x.s.xx......x....xxxxx.x.xxxxxxxxx [ 12%]
xxxxxxxxx.xx.x....x.x....xx.......x...x..x......xx....x..........x.........x.......xx.x...............x.......x.......x...x........xx..x................x.....x............x....x...x......... [ 25%]
..........x............x............x..............x..............x..........x...x.x...x.....x....x.....x...xxx.........................x......x.x....x...x.......x...........x............... [ 38%]
.x.......x.........xx.x....x.....x.x..x.x...x....x.....xx....x.................x.xx.xx..xx...xxxx.xxx..x.xx.xxx..xx..xx.x...x....x...x.xx.xxx....x.x....x..s..x.......x.......x......x........ [ 51%]
......x...x..x....x......xx..x.........x....x................x.x...........................x.x....xx.x..x.xx.x...x.........x........x..x...x.x..........................................x.x... [ 64%]
..............................xx............................s.....x.........s.s....................................................................x.......................................... [ 76%]
.............xx...x..x..xx.........s.....x...x...x.............................................................................x............................x................................. [ 89%]
.................x....xx..xx.......x.................................x..................s........x....x...x........xx.....x...x.xxx..xx.xxxxxxxxxx.....                                        [100%]
1224 passed, 9 skipped, 248 xfailed in 171.13s (0:02:51)

Copy link
Member

@jcrist jcrist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I confess I can't fully review this code. Generally things look fine though (sans a question about the TODO).

Before merging do you intend to squash/rewrite commit messages? Currently one reads as feat(bigquery): hack in unnest, which feels a bit spooky to include in our release notes 😬.

ibis/backends/base/sql/compiler/query_builder.py Outdated Show resolved Hide resolved
@tswast
Copy link
Collaborator

tswast commented Sep 15, 2023

I haven't looked much at the implementation, but I figured I'd hit it hard against some weird schemas.

Schema:

[
      {
        "fields": [
          {
            "fields": [
              {
                "mode": "REPEATED",
                "name": "doubly_nested_array",
                "type": "INTEGER"
              },
              {
                "mode": "NULLABLE",
                "name": "doubly_nested_field",
                "type": "STRING"
              }
            ],
            "mode": "REPEATED",
            "name": "nested_struct_col",
            "type": "RECORD"
          }
        ],
        "mode": "REPEATED",
        "name": "repeated_struct_col",
        "type": "RECORD"
      },
      {
        "mode": "NULLABLE",
        "name": "rowindex",
        "type": "INTEGER"
      }
]

Data: https://gist.github.com/tswast/a872310b88ba8406932017cdf13991fd

Ibis sample:

bq = ibis.bigquery.connect(project_id="swast-scratch")
table = bq.table("swast-scratch.my_dataset.array_test")
repeated_struct_col = table["repeated_struct_col"]

# Works as expected :-)
table.select([table["rowindex"], repeated_struct_col.unnest()]).to_pandas()

# Fails with AssertionError: got more than one unnest node: 2
# on File ~/src/ibis/ibis/backends/base/sql/compiler/query_builder.py:322, in Select.format_select_set(self)
table.select([table["rowindex"], repeated_struct_col.unnest()["nested_struct_col"].unnest()]).to_pandas()

@tswast
Copy link
Collaborator

tswast commented Sep 15, 2023

Your hack gives me an idea. I wonder if we can use correlated table subquery with an UNNEST as a named table expression or if we actually do need the "CROSS JOIN"? I'll do some experiments in SQL next week.

@tswast
Copy link
Collaborator

tswast commented Sep 16, 2023

I missed this part of correlated table subqueries: "You can only use these in the FROM clause."

The following query does what I want, which is a correlated join:

SELECT
  rowindex,
  t3 AS doubly_nested_array,
  t2.doubly_nested_field AS doubly_nested_field
FROM `swast-scratch.my_dataset.array_test` t0
CROSS JOIN UNNEST(t0.repeated_struct_col) t1
CROSS JOIN UNNEST(t1.nested_struct_col) t2
CROSS JOIN UNNEST(t2.doubly_nested_array) t3

That seems pretty well aligned with the approach you have so far:

SELECT t0.`rowindex`, `Unnest_repeated_struct_col` AS `repeated_struct_col`
FROM `swast-scratch.my_dataset.array_test` t0
  CROSS JOIN UNNEST(t0.`repeated_struct_col`) `Unnest_repeated_struct_col`

We're on the right track, but I see why you have the restriction of a single node for the string replacement hack. I'll look more closely to see what we could do.

@cpcloud
Copy link
Member Author

cpcloud commented Sep 22, 2023

@tswast I took a totally different approach with the latest set of changes and I believe it's a
more robust solution.

I'm using sqlglot to transform functional unnest calls in select position to
relational unnest calls.

The way this works is:

  1. Ibis treats UNNEST as a unary function call and generates syntactically
    correct, but semantically invalid BigQuery SQL: if we did run the SQL
    it would parse but result in a semantic error from the BigQuery API.
  2. After Ibis generates the SQL string, we parse it with sqlglot.
  3. The parse results in a sqlglot expression, which I then edit to replace all
    functional unnest calls with sqlglot's generic expression type for this
    operation.
  4. Re-emit the SQL string using the BigQuery dialect from the sqlglot expression.

The BigQuery unnest tests are all passing, which is pretty nice :)

pyproject.toml Outdated Show resolved Hide resolved
@cpcloud
Copy link
Member Author

cpcloud commented Sep 22, 2023

We're still not able to chain unnests yet (you need to select an unnest() then use another .select to unnest further), but I think that should be addressed in a separate PR, as it is not a bigquery-specific problem.

@mark.broken(
["bigquery"],
raises=GoogleBadRequest,
reason='400 Syntax error: Expected keyword JOIN but got identifier "SEMI"',
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sqlglot is also giving us the SEMI/ANTI join transformation to EXISTS/NOT EXISTS as well!

@@ -972,11 +972,6 @@ def query(t, group_cols):

@pytest.mark.notimpl(["dask", "pandas", "oracle"], raises=com.OperationNotDefinedError)
@pytest.mark.notimpl(["druid"], raises=AssertionError)
@pytest.mark.notyet(
["bigquery"],
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pivot_longer FTW!

@cpcloud cpcloud marked this pull request as ready for review September 22, 2023 14:05
@cpcloud cpcloud force-pushed the bigquery-unnest branch 2 times, most recently from ebb8d23 to ea8371d Compare September 22, 2023 16:18
@cpcloud
Copy link
Member Author

cpcloud commented Sep 22, 2023

The one failure from the BigQuery backend tests should be addressed when tobymao/sqlglot#2290 is fixed.

Copy link
Collaborator

@tswast tswast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this approach.

In my tests in googleapis/python-bigquery-dataframes#53 it seems the sqlglot piece isn't causing any regressions that weren't already there in the master branch.

@cpcloud cpcloud force-pushed the bigquery-unnest branch 2 times, most recently from f0b1ade to 40f2d63 Compare September 24, 2023 10:17
@cpcloud cpcloud added this to the 7.0 milestone Sep 24, 2023
@cpcloud
Copy link
Member Author

cpcloud commented Sep 24, 2023

sqlglot had another release (18.7.0) that brings in the fixes we need for this PR.

As soon as everything is green I'll merge.

In the meantime, I'll run the cloud backend tests and post the results here.

@cpcloud
Copy link
Member Author

cpcloud commented Sep 24, 2023

Cloud backend tests are passing:

…/ibis on  bigquery-unnest is 📦 v6.1.0 via 🐍 v3.10.12 via ❄️  impure (ibis-3.10.12-env) took 11s
❯ pytest -m 'bigquery or snowflake' -n auto --dist loadgroup --snapshot-update -q
bringing up nodes...
.x......x..........x...x......x.x......x.x....x...x..........................................x...............x......................x....x.................xx.x..x.x..x.................x...x [  6%]
x........x..............x......x............................................x.............x..........x.................x............................x......x....x..xx..................x..... [ 13%]
.....x..x..........................................................................................x.....................x.....xx.x......x.x.........x..x...x.xx.x......x...x.xx....xx.x.x... [ 20%]
.x..xx.x...x..xx.x.x.x..x...xx.xx....x..............x........xx.x.x.....x.xx.x.....x..x.x...x..........x........x.xx......x...x.s..s..s.s.........x....xx..........xx...x................x... [ 27%]
s........x......xxx...xxxx.xxx.x..x.....xxxx..x.x.x.x.x...x...x..xxxxx.x...x.x...................xx.x....x.x..xx.x....x...........xx...........x......x........x.........xx..xxx..x....x... [ 34%]
x..x......x.x.x.................x......x.......x....x............x.x........s...x..x..x..x...........................x...x......x........x..xxxxxxxx.xxxxx.x.xxx.x.x.xxx.......x......x.xxx.x [ 41%]
x....xxxxx..xx.xxxx.xx.x...x.x.x.x..x.xx.xxx.x....x....x..x.xx......x.....x...x.....x.xxxx.xxx.x..xxxxxxxx.xxxxxx..x.xxx.x.xx.x..x...xx.xxxx.x..x...x..x...x..x.x..x......x......x...x....... [ 47%]
..x....x........x..x...............x..x....................x.x........x................................................................................................x..................... [ 54%]
..................................................................................x....x......x.....x.......x........x.xx.x.............x..........x.....x..............x.x...x............x. [ 61%]
x..x.....x.......xx............x...x..x......x....................x............xxx...x..............sxsx....s.......s.x..xxsx...sx.x.....x.........x.x.........x.....x.sx..xxxxxxx.x.xxxxxxxx [ 68%]
xxx.xx..xxxx...x.xxx..xx.xx..............x.x...................x..xxx............................x..x.........xx..x....xx.......x.........xxx..x.......x.x..xx.x........x.......xx........... [ 75%]
...........xx...x.x...................x.sx...........................................x............x.......................................................................................... [ 82%]
...x...........................................................xxx.xx..x..............x.....x........................................................x...................x....s.............. [ 89%]
........................................................................................................................................................x.................................... [ 95%]
....................x...x......x............................x.x........x......xx..s..x..........................                                                                              [100%]
2315 passed, 16 skipped, 427 xfailed in 432.33s (0:07:12)

@cpcloud cpcloud merged commit 0908253 into ibis-project:master Sep 24, 2023
@cpcloud cpcloud deleted the bigquery-unnest branch September 24, 2023 11:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bigquery The BigQuery backend feature Features or general enhancements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants