feat(bigquery): Include `offset` parameter to unnest #8356

chelsea-lin · 2024-02-14T18:27:16Z

Is your feature request related to a problem?

It's not possible to express that I want to save the original order, such as with the WITH OFFSET AS clause for UNNEST in BigQuery. With the current version of unnest, it only returns a single column. I suppose it would be possible to add a with_offset parameter here, but then the return type might have to be a struct column or maybe a tuple of two columns.

This issue has the same request as #7781

Describe the solution you'd like

I'd like to add an offset parameter to the unnest function. When offset is True, the compiler can returns a sge.Posexplode() rather than sge.Explode() node. Then, sqlglot can handle the SQL translation as expected. Here is the draft PR: https://github.com/ibis-project/ibis/pull/8354/files
With that change, the translated SQL looks good to me:

In [1]: import ibis
   ...: bq = ibis.bigquery.connect(project_id="bigframes-dev")
   ...: table = bq.table("bigframes-dev.ibis_gbq_testing.array_types")
   ...: ibis.to_sql(table.z.unnest(offset=True), dialect="bigquery")
Out[1]: 
SELECT
  IF(pos = pos_2, `z`, NULL) AS `z`,
  IF(pos = pos_2, pos_2, NULL) AS pos_2
FROM `bigframes-dev`.`ibis_gbq_testing`.`array_types` AS `t0`
CROSS JOIN UNNEST(GENERATE_ARRAY(0, GREATEST(ARRAY_LENGTH(`t0`.`z`)) - 1)) AS pos
CROSS JOIN UNNEST(`t0`.`z`) AS `z` WITH OFFSET AS pos_2
WHERE
  pos = pos_2
  OR (
    pos > (
      ARRAY_LENGTH(`t0`.`z`) - 1
    ) AND pos_2 = (
      ARRAY_LENGTH(`t0`.`z`) - 1
    )
  )

However, the ibis node type is not correct. The ibis assumes this node only returns a single value and ignore the pos_2 here. Here is the repro code:

In [3]: type(table.z.unnest(offset=True))
Out[3]: ibis.expr.types.numeric.FloatingColumn

In [7]: table.select(["scalar_column", table.z.unnest(offset=True)])
Out[7]: 
r0 := DatabaseTable: array_types
  x             array<!int64>
  y             array<!string>
  z             array<!float64>
  grouper       string
  scalar_column float64

Project[r0]
  scalar_column: r0.scalar_column
  z:             Unnest(r0.z, offset=True)

In [10]: table.select(["scalar_column", table.z.unnest(offset=True)]).pos_2
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[10], line 1
----> 1 table.select(["scalar_column", table.z.unnest(offset=True)]).pos_2

File ~/src/forkibis2/ibis/expr/types/relations.py:859, in Table.__getattr__(self, key)
    854     hint = common_typos[key]
    855     raise AttributeError(
    856         f"{type(self).__name__} object has no attribute {key!r}, did you mean {hint!r}"
    857     )
--> 859 raise AttributeError(f"'Table' object has no attribute {key!r}")

AttributeError: 'Table' object has no attribute 'pos_2'

Any suggestions to correct the change so that ibis can return pos_2 as the offset?

What version of ibis are you running?

8.0.0

What backend(s) are you using, if any?

BigQuery

Code of Conduct

I agree to follow this project's Code of Conduct

The text was updated successfully, but these errors were encountered:

cpcloud · 2024-02-15T15:32:56Z

Let's work through the API design first before jumping into compiled SQL.

The problem we need to solve to expose offset functionality may or may not have anything to do with the details of the compiled SQL.

cpcloud · 2024-02-15T15:59:24Z

Thinking about this some more I think this functionality warrants a new API.

I'll propose one: ArrayValue.enumerate(start=0).

This API would be analogous to Python's built-in function enumerate which accepts an Iterable[T] and returns Iterable[tuple[int, T]].

We'd do something similar, so the full signature of enumerate would be (using pseudocode):

# for ArrayValue[T]
def enumerate(self, start: int = 0) -> dt.StructValue[index: int, value: T]:
    ...

Thoughts?

We should probably experiment with two different flavors of backends (BigQuery and DuckDB) to determine feasibility of compilation here.

chelsea-lin · 2024-02-15T18:18:02Z

While the proposed enumerate is familier to Python users, it is important to consider how this will translate into SQL and work with database execution. Let's discuss a few concerns:

Efficiency: If a user calls table.x.enumerate(0) and table.x.enumerate(1) in succession, could this generate two SQL queries to be executed?
What's the equivalent API to table.select([table["grouper"], table.x.unnest(offset=True)])?
Implement difficulty: I feel it's not easy to implement it (maybe I am too new to ibis expression tree). Currently, ibis expression tree presents table.arr.unnest() AS SELECT EXPLODE(arr) FROM table and let explode_to_unnest sqlglot.transformer to compile it into SELECT UNNEST(arr) FROM table CROSS JOIN UNNEST(arr)(not exactly but equivalent).

chelsea-lin · 2024-02-15T18:19:46Z

In case you didn't see that, @tswast also proposed a new API at #7781 calling ArrayTable.as_table(offset) if we want to keep current unnest API unchanged.

chelsea-lin · 2024-02-15T18:25:34Z

If we want to continue the implementation for unnest(offset=True) API, I am trying to find a solution to return a struct in sqlglot as you commented at #8354
The solution is asking ibis compiler returns SELECT (SELECT AS STRUCT POSEXPLODE(arr)) FROM table and sqlglot can convert it into SELECT (SELECT AS STRUCT x, pos) FROM table FROM table CROSS JOIN UNNEST(x). But sqlglot does not work as expected, and I am trying to thinking about the solution.

NickCrews · 2024-02-15T23:48:22Z

driveby-ing: def enumerate(arr: ArrayValue) feels very close to def zip(arr: ArrayValue, *others: ArrayValue), and I have wanted zip before but did workarounds.

cpcloud · 2024-02-16T00:36:23Z

@NickCrews We do have array zip, not sure if you've seen that: https://ibis-project.org/reference/expression-collections#ibis.expr.types.arrays.ArrayValue.zip

cpcloud · 2024-02-21T18:40:58Z

@chelsea-lin I'll try to address your concerns here and provide a path forward:

Efficiency: If a user calls table.x.enumerate(0) and table.x.enumerate(1) in succession, could this generate two SQL queries to be executed?

Yes. Is that different from any other expression that takes input arguments?

What's the equivalent API to table.select([table["grouper"], table.x.unnest(offset=True)])?

table.select("grouper", table.x.enumerate())

Implement difficulty: I feel it's not easy to implement it (maybe I am too new to ibis expression tree). Currently, ibis expression tree presents table.arr.unnest() AS SELECT EXPLODE(arr) FROM table and let explode_to_unnest sqlglot.transformer to compile it into SELECT UNNEST(arr) FROM table CROSS JOIN UNNEST(arr)(not exactly but equivalent).

Start by implementing the enumerate API and don't worry about sqlglot until you can produce an Ibis expression.

Here's the approach I would take:

Define a new ops.Value subclass (class Enumerate(Value)) in ibis/expr/operations/arrays.py
Define an enumerate method on ArrayValue (in ibis/expr/types/arrays.py) that constructs ops.Enumerate(...).to_expr()
Figure out the SQL syntax that corresponds to the semantics of the operation
Figure out the sqlglot objects that correspond to that SQL syntax.

chelsea-lin · 2024-02-22T00:51:36Z

@cpcloud thanks! The implementation looks good overall. My remaining question is about ops.Enumerate.
It needs to return a dt.Struct and expect a sqlglot expression equaling to STRUCT(POSEXPLODE(arr)). Unfortunately, this cannot be directly converted to BigQuery syntax correctly (please check tobymao/sqlglot#2978). Also the issue is not reasonable for sqlglot because this is also not a correct spark syntax. This is preventing class Enumerate(Value) from producing the correct SQL. I am not familiar with spark sql, so do you have any thoughts here?
Also, if we cannot find an equal sqlglot expression, could using class Enumerate(Relation) be a better solution?

chelsea-lin · 2024-02-22T18:41:33Z

@cpcloud Trying to simple my previous question. If sqlglot cannot return a struct including arrays' items and positions, which dtype ops.Enumerate can return to respect the two columns from the generated SQL?

cpcloud · 2024-02-22T19:55:18Z

Ok, after experimenting in #8429, I don't think enumerate implemented like this is possible.

I'm going to take a different approach, and experiment with adding a Table.unnest(column) method instead, that will be the "tabular" unnest implementation for Ibis.

chelsea-lin · 2024-02-22T23:19:16Z

Thanks @cpcloud for prioritizing the offset parameter! While working on the Table.unnest(column) implementation, could you please also consider another feature requests to add preserved_empty parameter. This would ensure empty arrays are retained as null after the unnest operation.
As for the implementation, we can replace sqlglot expression from EXPLODE(x) into EXPLODE_OUTER(x), or POSEXPLODE(x) into POSEXPLODE_OUTER(x) if offset=True sets.
Thanks again here!

chelsea-lin · 2024-04-04T17:29:25Z

@cpcloud We've found a workaround solution (reliant on the feature in #8892). This code effectively unnests two columns sequentially, includes an offset ID, and handles empty arrays as null. Thanks to @TrevorBergeron for the helpful ideas!

import ibis
bq = ibis.bigquery.connect(project_id="bigframes-dev")
table = bq.table("bigframes-dev.ibis_gbq_testing.array_types")

offset_array_id = "offset_array_0"
offset_array = ibis.generate_array(
    0, ibis.greatest(
        0, 
        ibis.least(table.x.length() - 1, table.y.length() - 1)
    )).name(offset_array_id)
table_w_offset_array = table.select(offset_array, table.x, table.y)

unnest_offset_id = "unnest_offset_0"
unnest_offset = table_w_offset_array[offset_array_id].unest().name(unnest_offset_id)
table_w_offset = table_w_offset_array.select(unnest_offset, table.x, table.y)

table_w_unnest = table_w_offset.select(
    table_w_offset[unnest_offset_id].name(unnest_offset_id),
    table_w_offset.x[table_w_offset[unnest_offset_id]].name("x"),
    table_w_offset.y[table_w_offset[unnest_offset_id]].name("y"),
)

cpcloud · 2024-07-26T15:03:07Z

This was closed by #9423 (Table.unnest), which supports an offset parameter.

chelsea-lin added the feature Features or general enhancements label Feb 14, 2024

github-project-automation bot added this to Ibis planning and roadmap Feb 14, 2024

github-project-automation bot moved this to backlog in Ibis planning and roadmap Feb 14, 2024

chelsea-lin mentioned this issue Feb 14, 2024

feat(bigquery): add the 'offset' parameter to unnest #8354

Closed

jcrist added the bigquery The BigQuery backend label Feb 15, 2024

cpcloud mentioned this issue Feb 22, 2024

feat(duckdb): add enumerate function to provide access to array element offsets #8429

Closed

cpcloud self-assigned this Feb 23, 2024

jcrist moved this from backlog to todo in Ibis planning and roadmap Feb 23, 2024

cpcloud closed this as completed Jul 26, 2024

github-project-automation bot moved this from todo to done in Ibis planning and roadmap Jul 26, 2024

cpcloud added this to the 9.2 milestone Aug 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(bigquery): Include `offset` parameter to unnest #8356

feat(bigquery): Include `offset` parameter to unnest #8356

chelsea-lin commented Feb 14, 2024

cpcloud commented Feb 15, 2024

cpcloud commented Feb 15, 2024

chelsea-lin commented Feb 15, 2024 •

edited

Loading

chelsea-lin commented Feb 15, 2024 •

edited

Loading

chelsea-lin commented Feb 15, 2024

NickCrews commented Feb 15, 2024

cpcloud commented Feb 16, 2024

cpcloud commented Feb 21, 2024

chelsea-lin commented Feb 22, 2024 •

edited

Loading

chelsea-lin commented Feb 22, 2024 •

edited

Loading

cpcloud commented Feb 22, 2024

chelsea-lin commented Feb 22, 2024

chelsea-lin commented Apr 4, 2024

cpcloud commented Jul 26, 2024

feat(bigquery): Include offset parameter to unnest #8356

feat(bigquery): Include offset parameter to unnest #8356

Comments

chelsea-lin commented Feb 14, 2024

Is your feature request related to a problem?

Describe the solution you'd like

What version of ibis are you running?

What backend(s) are you using, if any?

Code of Conduct

cpcloud commented Feb 15, 2024

cpcloud commented Feb 15, 2024

chelsea-lin commented Feb 15, 2024 • edited Loading

chelsea-lin commented Feb 15, 2024 • edited Loading

chelsea-lin commented Feb 15, 2024

NickCrews commented Feb 15, 2024

cpcloud commented Feb 16, 2024

cpcloud commented Feb 21, 2024

chelsea-lin commented Feb 22, 2024 • edited Loading

chelsea-lin commented Feb 22, 2024 • edited Loading

cpcloud commented Feb 22, 2024

chelsea-lin commented Feb 22, 2024

chelsea-lin commented Apr 4, 2024

cpcloud commented Jul 26, 2024

feat(bigquery): Include `offset` parameter to unnest #8356

feat(bigquery): Include `offset` parameter to unnest #8356

chelsea-lin commented Feb 15, 2024 •

edited

Loading

chelsea-lin commented Feb 15, 2024 •

edited

Loading

chelsea-lin commented Feb 22, 2024 •

edited

Loading

chelsea-lin commented Feb 22, 2024 •

edited

Loading