feat(bigquery): ArrayValue.as_table(offset_name: str | None)
as a table-valued function
#7781
Comments
There are a few potential use cases embedded in this request that would be unlocked. I use the following randomly generated newline-delimited JSON file to demonstrate these use cases: https://gist.github.com/tswast/f27c1a6082c54150e6353c9f6a2bd423

import ibis
bq = ibis.bigquery.connect(project_id="swast-scratch")
table = bq.table("swast-scratch.ibis7781_array_as_table.doubly_nested")

Include offset so that ordering can be maintained post-unnest

It's not possible to express that I want to preserve the original order, such as with the WITH OFFSET AS clause for UNNEST in BigQuery. The current version of unnest returns only a single column. I suppose it would be possible to add a

SELECT
event_order_id, event
FROM `swast-scratch.ibis7781_array_as_table.doubly_nested` t0
CROSS JOIN
UNNEST(t0.event_sequence) AS event
WITH OFFSET AS event_order_id
ORDER BY
customer_id, day, flag, event_order_id

Proposed interface:

events = table.event_sequence.as_table(offset_name="event_order_id")
joined = table.cross_join(events)
result = joined[
events["event_order_id"], events["event"]
].order_by([
ibis.asc(table.customer_id),
ibis.asc(table.day),
ibis.asc(table.flag),
ibis.asc(events.event_order_id),
])

Unnest deeply nested columns

The following should be possible, but currently fails.

level_1 = table.event_sequence.unnest()
level_2 = level_1["data"].unnest()
print(bq.compile(level_2))
bq.execute(level_2)

Fails with

SELECT
IF(pos = pos_2, `data`, NULL) AS `data`
FROM `swast-scratch`.ibis7781_array_as_table.doubly_nested AS t0, UNNEST(GENERATE_ARRAY(0, GREATEST(ARRAY_LENGTH(UNNEST(t0.`event_sequence`).`data`)) - 1)) AS pos
CROSS JOIN UNNEST(UNNEST(t0.`event_sequence`).`data`) AS `data` WITH OFFSET AS pos_2
WHERE
pos = pos_2
OR (
pos > (
ARRAY_LENGTH(UNNEST(t0.`event_sequence`).`data`) - 1
)
AND pos_2 = (
ARRAY_LENGTH(UNNEST(t0.`event_sequence`).`data`) - 1
)
)

I believe representing the array as a table expression would allow for a more direct translation to BigQuery SQL.

Proposed interface:

events = table.event_sequence.as_table()
level_1 = table.cross_join(events)[
table["customer_id"], events["data"], events["timestamp"]
]
data = level_1["data"].as_table()
level_2 = level_1.cross_join(data)[
level_1["customer_id"], level_1["timestamp"], data["key"], data["value"]
]

This would generate SQL like the following:

SELECT
customer_id,
event_timestamp,
data.key AS `data_key`,
data.value AS `data_value`
FROM (
SELECT
customer_id,
event.timestamp AS `event_timestamp`,
DATA
FROM
`swast-scratch.ibis7781_array_as_table.doubly_nested` t0
CROSS JOIN
UNNEST(t0.event_sequence) AS event ) t1
CROSS JOIN
UNNEST(t1.data) AS data

Or more simply, all in one cross join:

events = table.event_sequence.as_table()
data = events["data"].as_table()
result = table.cross_join(events, data)[
table["customer_id"], events["timestamp"], data["key"], data["value"]
]

Which would generate the following SQL:

SELECT
customer_id,
event.timestamp as `event_timestamp`,
data.key as `data_key`,
data.value as `data_value`
FROM `swast-scratch.ibis7781_array_as_table.doubly_nested` t0
CROSS JOIN
UNNEST(t0.event_sequence) AS event
CROSS JOIN
UNNEST(event.data) AS data

Using array literals as an alternative to memtable for cases where we explicitly want to embed the data in SQL

Previously (ibis 6.x), using memtable would generate SQL like the following:

SELECT
t0.*
FROM UNNEST(
ARRAY<STRUCT<`Column One` INT64, `Column 2` STRING>>[
STRUCT(1, 'hello'),
STRUCT(2, 'world'),
STRUCT(3, '!')
]) AS t0

Now

array = ibis.array([
ibis.struct({"Column 1": 1, "Column 2": "hello"}),
ibis.struct({"Column 1": 2, "Column 2": "world"}),
ibis.struct({"Column 1": 3, "Column 2": "!"}),
])
table = array.as_table()

Note: this currently fails with

One more use case: keep rows after unnest with empty arrays

It's not always desirable to eliminate rows where there are no values in the array. As seen in #7590, the default in pandas and Snowflake is to preserve these empty arrays as NULL. By treating arrays as tables, it's possible to keep rows with empty arrays by doing a

SELECT
customer_id, day, event.timestamp as `event_timestamp`
FROM `swast-scratch.ibis7781_array_as_table.doubly_nested` t0
LEFT JOIN
UNNEST(t0.event_sequence) AS event

Returns 284 rows, compared to 268 with

Proposed interface:

events = table.event_sequence.as_table()
joined = table.left_join(events)
result = joined[
table["customer_id"],
events["event"]["timestamp"],
]

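To illustrate why a LEFT JOIN against the unnested array can return more rows than a CROSS JOIN, here is a minimal pure-Python sketch of the two join semantics. It does not use ibis or BigQuery; the helper names `cross_join_unnest` and `left_join_unnest` and the sample rows are hypothetical and exist only for this illustration.

```python
from typing import Any, Dict, Iterator, List

Row = Dict[str, Any]


def cross_join_unnest(rows: List[Row], array_col: str, alias: str) -> Iterator[Row]:
    """Model `FROM t CROSS JOIN UNNEST(t.array_col) AS alias` (hypothetical helper)."""
    for row in rows:
        # An empty or NULL array yields no elements, so the parent row disappears.
        for element in row[array_col] or []:
            yield {**row, alias: element}


def left_join_unnest(rows: List[Row], array_col: str, alias: str) -> Iterator[Row]:
    """Model `FROM t LEFT JOIN UNNEST(t.array_col) AS alias` (hypothetical helper)."""
    for row in rows:
        elements = row[array_col] or []
        if not elements:
            # Keep the parent row and fill the unnested column with NULL (None).
            yield {**row, alias: None}
            continue
        for element in elements:
            yield {**row, alias: element}


# Made-up sample data: one populated array, one empty array, one NULL array.
rows = [
    {"customer_id": 1, "event_sequence": [{"timestamp": "t1"}, {"timestamp": "t2"}]},
    {"customer_id": 2, "event_sequence": []},
    {"customer_id": 3, "event_sequence": None},
]

crossed = list(cross_join_unnest(rows, "event_sequence", "event"))
lefted = list(left_join_unnest(rows, "event_sequence", "event"))
print(len(crossed), len(lefted))
```

In this toy data, customers 2 and 3 survive only in the left-join result, each with a NULL event, which is the same kind of difference as the two row counts quoted above.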
This shouldn't be a problem anymore (tried with both

In [7]: array.as_table()
Out[7]:
DummyTable
ArrayColumn() array<struct<Column 1: int8, Column 2: string>>

I think this use case should be supported. Let me know if there is more to do here; if this is the only blocker for you to migrate, I can prioritize it.

@kszucs Thanks for following up on this issue. You're right. The
However, we still need the other use cases to be supported. Here, @tswast is trying to propose a new API to fill the gaps of current
Taken the
The above SQL is generated by sqlglot by converting
The mapped SQL for the proposed use case would be:
As I asked on Zulip, I am trying to implement the proposed API but need some help. Thanks for your patience here!

Is your feature request related to a problem?

It's difficult for me to do some things that I'm used to doing in BigQuery, such as UNNEST([...]) on a bunch of literal values or a generated array, similar to how memtable worked in Ibis 6.x.

Also, the current implementation of unnest() doesn't work on deeply nested arrays of structs of arrays. An alternative API closer to how BQ semantics work for unnest would be really useful, IMO.

Describe the solution you'd like

As an alternative to the current unnest() semantics, which transforms an ArrayColumn into an XColumn, I'd like to see ArrayValue.as_table(), which works similarly to UNNEST in BigQuery, where it's primarily used in a FROM clause, often in a correlated join but not always.

I suspect the "treat this unnested array as a table with/without offsets" approach could be doable in other engines besides BigQuery, especially if correlated joins are ignored for now.
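As a rough illustration of the requested semantics, the following pure-Python sketch models an array as a small table, optionally with a zero-based offset column, and then uses it in a correlated cross join over two levels of nesting. It is not ibis code and makes no claim about how ibis would implement this; `array_as_table`, `correlated_cross_join`, and the sample data are hypothetical names and values chosen for the example.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def array_as_table(
    values: Optional[List[Any]], name: str, offset_name: Optional[str] = None
) -> List[Row]:
    """Model `UNNEST(values) AS name [WITH OFFSET AS offset_name]` (hypothetical)."""
    rows = []
    for offset, element in enumerate(values or []):
        row = {name: element}
        if offset_name is not None:
            row[offset_name] = offset  # zero-based, like BigQuery's WITH OFFSET
        rows.append(row)
    return rows


def correlated_cross_join(
    left_rows: List[Row], array_col: str, name: str, offset_name: Optional[str] = None
) -> List[Row]:
    """Model `FROM left CROSS JOIN UNNEST(left.array_col) AS name` (hypothetical)."""
    out = []
    for left in left_rows:
        for right in array_as_table(left[array_col], name, offset_name):
            out.append({**left, **right})
    return out


# Made-up data shaped like the issue's table: arrays of structs of arrays.
table = [
    {
        "customer_id": 1,
        "event_sequence": [
            {"timestamp": 10, "data": [{"key": "a", "value": "x"}, {"key": "b", "value": "y"}]},
            {"timestamp": 20, "data": [{"key": "c", "value": "z"}]},
        ],
    },
    {"customer_id": 2, "event_sequence": []},
]

# First level: one row per event, keeping the original position as event_order_id.
events = correlated_cross_join(table, "event_sequence", "event", offset_name="event_order_id")
level_1 = [
    {
        "customer_id": r["customer_id"],
        "event_order_id": r["event_order_id"],
        "timestamp": r["event"]["timestamp"],
        "data": r["event"]["data"],
    }
    for r in events
]

# Second level: unnest the inner `data` array of each event.
level_2 = correlated_cross_join(level_1, "data", "item")
result = [
    (r["customer_id"], r["event_order_id"], r["timestamp"], r["item"]["key"], r["item"]["value"])
    for r in level_2
]
print(result)
```

Because each unnest step is just another table in the join, the second level needs no special handling, and the offset column carries the original array order through to the final rows.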
What version of ibis are you running?
7.1.0
What backend(s) are you using, if any?
BigQuery
Code of Conduct
Edit: Updated to as_table to reflect the similar function in StructValue.