Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(bigquery): ArrayValue.as_table(offset_name: str | None) as a table-valued function #7781

Closed
1 task done
tswast opened this issue Dec 15, 2023 · 4 comments · Fixed by #9423
Closed
1 task done
Labels
bigquery The BigQuery backend feature Features or general enhancements

Comments

@tswast
Copy link
Collaborator

tswast commented Dec 15, 2023

Is your feature request related to a problem?

It's difficult for me to do some things that I'm used to doing in BigQuery, such as UNNEST([...]) on a bunch of literal values or a generated array, similar to how memtable worked in Ibis 6.x.

Also, the current implementation of unnest() doesn't work on deeply nested arrays of structs of arrays in the current implementation. An alternative API closer to how BQ semantics work for unnest would be really useful, IMO.

Describe the solution you'd like

As an alternative to the current unnest() semantics, which transforms an ArrayColumn into XColumn, I'd like to see ArrayValue.as_table() which works similarly to UNNEST in BigQuery, where it's primarily used in a FROM clause, often in a correlated join but not always.

I suspect the "treat this unnested array as a table with/without offsets" could be doable in other engines besides BigQuery, especially if correlated joins are ignored for now.

What version of ibis are you running?

7.1.0

What backend(s) are you using, if any?

BigQuery

Code of Conduct

  • I agree to follow this project's Code of Conduct

Edit: Updated to as_table to reflect similar function in StructValue.

@tswast tswast added the feature Features or general enhancements label Dec 15, 2023
@tswast tswast changed the title feat(bigquery): ArrayValue.to_table(with_offset: bool) as a table-valued function feat(bigquery): ArrayValue.as_table(with_offset: bool) as a table-valued function Dec 18, 2023
@tswast
Copy link
Collaborator Author

tswast commented Dec 18, 2023

There's a few potential use cases embedded in this request that would be unlocked. I use the following randomly-generated newline-delimited JSON file to demonstrate these use cases:

https://gist.github.com/tswast/f27c1a6082c54150e6353c9f6a2bd423

import ibis

bq = ibis.bigquery.connect(project_id="swast-scratch")
table = bq.table("swast-scratch.ibis7781_array_as_table.doubly_nested")

include offset so that ordering can be maintained post-unnest

It's not possible to express that I want to save the original order, such as with the WITH OFFSET AS clause for UNNEST in BigQuery. With the current version of unnest, it only returns a single column. I suppose it would be possible to add a with_offset parameter here, but then the return type might have to be a struct column or maybe a tuple of two columns.

SELECT
  event_order_id, event
FROM `swast-scratch.ibis7781_array_as_table.doubly_nested` t0
CROSS JOIN
  UNNEST(t0.event_sequence) AS event
  WITH OFFSET AS event_order_id
ORDER BY
  customer_id, day, flag, event_order_id

Proposed interface:

events = table.event_sequence.as_table(offset_name="event_order_id")
joined = table.cross_join(events)
result = joined[
    events["event_order_id"], events["event"]
].order_by([
    ibis.asc(table.customer_id),
    ibis.asc(table.day),
    ibis.asc(table.flag),
    ibis.asc(events.event_order_id),
])

unnest deeply nested columns

The following should be possible, but currently fails.

level_1 = table.event_sequence.unnest()
level_2 = level_1["data"].unnest()
print(bq.compile(level_2))
bq.execute(level_2)

Fails with BadRequest: 400 Syntax error: Expected ")" but got keyword UNNEST at [3:114], and it generates the following SQL:

SELECT
  IF(pos = pos_2, `data`, NULL) AS `data`
FROM `swast-scratch`.ibis7781_array_as_table.doubly_nested AS t0, UNNEST(GENERATE_ARRAY(0, GREATEST(ARRAY_LENGTH(UNNEST(t0.`event_sequence`).`data`)) - 1)) AS pos
CROSS JOIN UNNEST(UNNEST(t0.`event_sequence`).`data`) AS `data` WITH OFFSET AS pos_2
WHERE
  pos = pos_2
  OR (
    pos > (
      ARRAY_LENGTH(UNNEST(t0.`event_sequence`).`data`) - 1
    )
    AND pos_2 = (
      ARRAY_LENGTH(UNNEST(t0.`event_sequence`).`data`) - 1
    )
  )

I believe representing the array as a table expression would allow for a more direct translation to BigQuery SQL.

Proposed interface:

events = table.event_sequence.as_table()
level_1 = table.cross_join(events)[
    table["customer_id"], events["data"], events["timestamp"]
]
data = level_1["data"].as_table()
level_2 = level_1.cross_join(data)[
    level_1["customer_id"], level_1["timestamp"], data["key"], data["value"]
]

This would generate SQL like the following:

SELECT
  customer_id,
  event_timestamp,
  data.key AS `data_key`,
  data.value AS `data_value`
FROM (
  SELECT
    customer_id,
    event.timestamp AS `event_timestamp`,
    DATA
  FROM
    `swast-scratch.ibis7781_array_as_table.doubly_nested` t0
  CROSS JOIN
    UNNEST(t0.event_sequence) AS event ) t1
CROSS JOIN
  UNNEST(t1.data) AS data

Or more simply, all in one cross join:

events = table.event_sequence.as_table()
data = events["data"].as_table()
result = table.cross_join(events, data)[
    table["customer_id"], events["timestamp"], data["key"], data["value"]
]

Which would generate the following SQL:

SELECT
  customer_id,
  event.timestamp as `event_timestamp`,
  data.key as `data_key`,
  data.value as `data_value`
FROM `swast-scratch.ibis7781_array_as_table.doubly_nested` t0
CROSS JOIN
  UNNEST(t0.event_sequence) AS event
CROSS JOIN
  UNNEST(event.data) AS data

using array literals as an alternative to memtable for cases where we explicitly want to embed the data in SQL

Previously (ibis 6.x), using memtable would generate SQL like the following:

SELECT
  t0.*
FROM UNNEST(
  ARRAY<STRUCT<`Column One` INT64, `Column 2` STRING>>[
    STRUCT(1, 'hello'),
    STRUCT(2, 'world'),
    STRUCT(3, '!')
]) AS t0

Now memtable loads the data to a temporary table in the anonymous dataset, which is more flexible with regards to larger input sizes, but less efficient for very small arrays like this one. With the proposed API, we should be able to generate this SQL again with the following syntax:

array = ibis.array([
    ibis.struct({"Column 1": 1, "Column 2": "hello"}),
    ibis.struct({"Column 1": 2, "Column 2": "world"}),
    ibis.struct({"Column 1": 3, "Column 2": "!"}),
])
table = array.as_table()

Note: this currently fails with unnest(), as there's no table expression in the tree:

bq.execute(array.unnest())
File /opt/miniconda3/envs/scratch/lib/python3.10/site-packages/ibis/expr/types/generic.py:1174, in Value.as_table(self)
   1168 if len(roots) > 1:
   1169     raise com.RelationError(
   1170         f"Cannot convert {type(self)} expression "
   1171         "involving multiple base table references "
   1172         "to a projection"
   1173     )
-> 1174 table = roots[0].to_expr()
   1175 return table.select(self)

IndexError: list index out of range

@tswast
Copy link
Collaborator Author

tswast commented Dec 18, 2023

One more use case:

keep rows after unnest with empty arrays

It's not always desired to eliminate rows where there are no values in the array. As seen in #7590, the default in pandas and snowflake is to preserve these empty arrays as NULL.

By treating arrays as tables, it's possible to keep rows with empty arrays by doing a LEFT JOIN instead of a CROSS JOIN.

SELECT
  customer_id, day, event.timestamp as `event_timestamp`
FROM `swast-scratch.ibis7781_array_as_table.doubly_nested` t0
LEFT JOIN
  UNNEST(t0.event_sequence) AS event

Returns 284 rows, compared to 268 with CROSS JOIN.

Proposed interface:

events = table.event_sequence.as_table()
joined = table.left_join(events)
result = joined[
    table["customer_id"],
    events["event"]["timestamp"],
]

@tswast tswast changed the title feat(bigquery): ArrayValue.as_table(with_offset: bool) as a table-valued function feat(bigquery): ArrayValue.as_table(offset_name: str | None) as a table-valued function Dec 18, 2023
@jcrist jcrist added the bigquery The BigQuery backend label Jan 26, 2024
@kszucs
Copy link
Member

kszucs commented Jan 28, 2024

array = ibis.array([
    ibis.struct({"Column 1": 1, "Column 2": "hello"}),
    ibis.struct({"Column 1": 2, "Column 2": "world"}),
    ibis.struct({"Column 1": 3, "Column 2": "!"}),
])
table = array.as_table()

Note: this currently fails with unnest(), as there's no table expression in the tree:

bq.execute(array.unnest())
File /opt/miniconda3/envs/scratch/lib/python3.10/site-packages/ibis/expr/types/generic.py:1174, in Value.as_table(self)
  1168 if len(roots) > 1:
  1169     raise com.RelationError(
  1170         f"Cannot convert {type(self)} expression "
  1171         "involving multiple base table references "
  1172         "to a projection"
  1173     )
-> 1174 table = roots[0].to_expr()
  1175 return table.select(self)

IndexError: list index out of range

This shouldn't be a problem anymore (tried with both TES and main) since unbound scalar expressions are turned into a dummy table:

In [7]: array.as_table()
Out[7]:
DummyTable
  ArrayColumn() array<struct<Column 1: int8, Column 2: string>>

I think this use case should be supported.

Let me know if there is more to do here, especially if this is the only blocker for you to migrate then I can prioritize this.

@chelsea-lin
Copy link
Contributor

chelsea-lin commented Jan 29, 2024

@kszucs Thanks for following up this issue. You're right. The bq.execute(array.unnest()) issue has been fixed.

However, we still needs other user cases to be supported. Here, @tswast is trying to propose a new API to fill the gaps of current unnest function. Below are my understanding from the BigQuery SQL aspect. Hope that can help you understand the issue as well.

Taken the array_types table from the test_unnest_no_nulls ibis test as example, ibis.to_sql(array_types.x.unnest()) can generate the BigQuery SQL as below:

SELECT
  IF(pos = pos_2, `x`, NULL) AS `x`
FROM `bigframes-dev`.ibis_gbq_testing.array_types AS t0
CROSS JOIN UNNEST(GENERATE_ARRAY(0, GREATEST(ARRAY_LENGTH(t0.`x`)) - 1)) AS pos
CROSS JOIN UNNEST(t0.`x`) AS `x` WITH OFFSET AS pos_2
WHERE
  pos = pos_2
  OR (
   pos > (
     ARRAY_LENGTH(t0.`x`) - 1
    ) AND pos_2 = (
      ARRAY_LENGTH(t0.`x`) - 1
    )
  )

The above SQL is generated by sqlglot by converting EXPLODE to UNNEST (related ibis commit). The SQL is a little bit nagging and can be simplified as below for better explaining:

SELECT x
FROM `bigframes-dev`.ibis_gbq_testing.array_types AS t0
CROSS JOIN UNNEST(t0.`x`) AS `x`

The mapped SQL for the proposed user case would be:

  • Case 1: include offset so that ordering can be maintained post-unnest
    The expected SQL would be (adding WITH OFFSET AS pos):
SELECT x, pos
FROM `bigframes-dev`.ibis_gbq_testing.array_types AS t0
CROSS JOIN UNNEST(t0.`x`) AS `x`
WITH OFFSET AS pos
  • Case 2: unnest deeply nested columns
    Skip here as array_types does not have nested columns.

  • Case 3: using array literals as an alternative to memtable for cases where we explicitly want to embed the data in SQL
    The issue is gone in the unnest() API, but still need to be supported in the `as_table() API.

  • Case 4: keep rows after unnest with empty arrays
    The expected SQL would be (switching CROSS JOIN to LEFT JOIN):

SELECT x
FROM `bigframes-dev`.ibis_gbq_testing.array_types AS t0
LEFT JOIN UNNEST(t0.`x`) AS `x`

As I asked in the Zulip, I am trying to implement the proposed API but need some helps. Thanks for your patience here!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bigquery The BigQuery backend feature Features or general enhancements
Projects
Archived in project
4 participants