Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(bigquery): Include offset parameter to unnest #8356

Closed
1 task done
chelsea-lin opened this issue Feb 14, 2024 · 14 comments
Closed
1 task done

feat(bigquery): Include offset parameter to unnest #8356

chelsea-lin opened this issue Feb 14, 2024 · 14 comments
Assignees
Labels
bigquery The BigQuery backend feature Features or general enhancements
Milestone

Comments

@chelsea-lin
Copy link
Contributor

Is your feature request related to a problem?

It's not possible to express that I want to save the original order, such as with the WITH OFFSET AS clause for UNNEST in BigQuery. With the current version of unnest, it only returns a single column. I suppose it would be possible to add a with_offset parameter here, but then the return type might have to be a struct column or maybe a tuple of two columns.

This issue has the same request as #7781

Describe the solution you'd like

I'd like to add an offset parameter to the unnest function. When offset is True, the compiler can returns a sge.Posexplode() rather than sge.Explode() node. Then, sqlglot can handle the SQL translation as expected. Here is the draft PR: https://github.com/ibis-project/ibis/pull/8354/files
With that change, the translated SQL looks good to me:

In [1]: import ibis
   ...: bq = ibis.bigquery.connect(project_id="bigframes-dev")
   ...: table = bq.table("bigframes-dev.ibis_gbq_testing.array_types")
   ...: ibis.to_sql(table.z.unnest(offset=True), dialect="bigquery")
Out[1]: 
SELECT
  IF(pos = pos_2, `z`, NULL) AS `z`,
  IF(pos = pos_2, pos_2, NULL) AS pos_2
FROM `bigframes-dev`.`ibis_gbq_testing`.`array_types` AS `t0`
CROSS JOIN UNNEST(GENERATE_ARRAY(0, GREATEST(ARRAY_LENGTH(`t0`.`z`)) - 1)) AS pos
CROSS JOIN UNNEST(`t0`.`z`) AS `z` WITH OFFSET AS pos_2
WHERE
  pos = pos_2
  OR (
    pos > (
      ARRAY_LENGTH(`t0`.`z`) - 1
    ) AND pos_2 = (
      ARRAY_LENGTH(`t0`.`z`) - 1
    )
  )

However, the ibis node type is not correct. The ibis assumes this node only returns a single value and ignore the pos_2 here. Here is the repro code:

In [3]: type(table.z.unnest(offset=True))
Out[3]: ibis.expr.types.numeric.FloatingColumn

In [7]: table.select(["scalar_column", table.z.unnest(offset=True)])
Out[7]: 
r0 := DatabaseTable: array_types
  x             array<!int64>
  y             array<!string>
  z             array<!float64>
  grouper       string
  scalar_column float64

Project[r0]
  scalar_column: r0.scalar_column
  z:             Unnest(r0.z, offset=True)

In [10]: table.select(["scalar_column", table.z.unnest(offset=True)]).pos_2
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[10], line 1
----> 1 table.select(["scalar_column", table.z.unnest(offset=True)]).pos_2

File ~/src/forkibis2/ibis/expr/types/relations.py:859, in Table.__getattr__(self, key)
    854     hint = common_typos[key]
    855     raise AttributeError(
    856         f"{type(self).__name__} object has no attribute {key!r}, did you mean {hint!r}"
    857     )
--> 859 raise AttributeError(f"'Table' object has no attribute {key!r}")

AttributeError: 'Table' object has no attribute 'pos_2'

Any suggestions to correct the change so that ibis can return pos_2 as the offset?

What version of ibis are you running?

8.0.0

What backend(s) are you using, if any?

BigQuery

Code of Conduct

  • I agree to follow this project's Code of Conduct
@cpcloud
Copy link
Member

cpcloud commented Feb 15, 2024

Let's work through the API design first before jumping into compiled SQL.

The problem we need to solve to expose offset functionality may or may not have anything to do with the details of the compiled SQL.

@cpcloud
Copy link
Member

cpcloud commented Feb 15, 2024

Thinking about this some more I think this functionality warrants a new API.

I'll propose one: ArrayValue.enumerate(start=0).

This API would be analogous to Python's built-in function enumerate which accepts an Iterable[T] and returns Iterable[tuple[int, T]].

We'd do something similar, so the full signature of enumerate would be (using pseudocode):

# for ArrayValue[T]
def enumerate(self, start: int = 0) -> dt.StructValue[index: int, value: T]:
    ...

Thoughts?

We should probably experiment with two different flavors of backends (BigQuery and DuckDB) to determine feasibility of compilation here.

@chelsea-lin
Copy link
Contributor Author

chelsea-lin commented Feb 15, 2024

While the proposed enumerate is familier to Python users, it is important to consider how this will translate into SQL and work with database execution. Let's discuss a few concerns:

  1. Efficiency: If a user calls table.x.enumerate(0) and table.x.enumerate(1) in succession, could this generate two SQL queries to be executed?
  2. What's the equivalent API to table.select([table["grouper"], table.x.unnest(offset=True)])?
  3. Implement difficulty: I feel it's not easy to implement it (maybe I am too new to ibis expression tree). Currently, ibis expression tree presents table.arr.unnest() AS SELECT EXPLODE(arr) FROM table and let explode_to_unnest sqlglot.transformer to compile it into SELECT UNNEST(arr) FROM table CROSS JOIN UNNEST(arr)(not exactly but equivalent).

@chelsea-lin
Copy link
Contributor Author

chelsea-lin commented Feb 15, 2024

In case you didn't see that, @tswast also proposed a new API at #7781 calling ArrayTable.as_table(offset) if we want to keep current unnest API unchanged.

@chelsea-lin
Copy link
Contributor Author

If we want to continue the implementation for unnest(offset=True) API, I am trying to find a solution to return a struct in sqlglot as you commented at #8354
The solution is asking ibis compiler returns SELECT (SELECT AS STRUCT POSEXPLODE(arr)) FROM table and sqlglot can convert it into SELECT (SELECT AS STRUCT x, pos) FROM table FROM table CROSS JOIN UNNEST(x). But sqlglot does not work as expected, and I am trying to thinking about the solution.

@jcrist jcrist added the bigquery The BigQuery backend label Feb 15, 2024
@NickCrews
Copy link
Contributor

driveby-ing: def enumerate(arr: ArrayValue) feels very close to def zip(arr: ArrayValue, *others: ArrayValue), and I have wanted zip before but did workarounds.

@cpcloud
Copy link
Member

cpcloud commented Feb 16, 2024

@NickCrews We do have array zip, not sure if you've seen that: https://ibis-project.org/reference/expression-collections#ibis.expr.types.arrays.ArrayValue.zip

@cpcloud
Copy link
Member

cpcloud commented Feb 21, 2024

@chelsea-lin I'll try to address your concerns here and provide a path forward:

Efficiency: If a user calls table.x.enumerate(0) and table.x.enumerate(1) in succession, could this generate two SQL queries to be executed?

Yes. Is that different from any other expression that takes input arguments?

What's the equivalent API to table.select([table["grouper"], table.x.unnest(offset=True)])?

table.select("grouper", table.x.enumerate())

Implement difficulty: I feel it's not easy to implement it (maybe I am too new to ibis expression tree). Currently, ibis expression tree presents table.arr.unnest() AS SELECT EXPLODE(arr) FROM table and let explode_to_unnest sqlglot.transformer to compile it into SELECT UNNEST(arr) FROM table CROSS JOIN UNNEST(arr)(not exactly but equivalent).

Start by implementing the enumerate API and don't worry about sqlglot until you can produce an Ibis expression.

Here's the approach I would take:

  1. Define a new ops.Value subclass (class Enumerate(Value)) in ibis/expr/operations/arrays.py
  2. Define an enumerate method on ArrayValue (in ibis/expr/types/arrays.py) that constructs ops.Enumerate(...).to_expr()
  3. Figure out the SQL syntax that corresponds to the semantics of the operation
  4. Figure out the sqlglot objects that correspond to that SQL syntax.

@chelsea-lin
Copy link
Contributor Author

chelsea-lin commented Feb 22, 2024

@cpcloud thanks! The implementation looks good overall. My remaining question is about ops.Enumerate.
It needs to return a dt.Struct and expect a sqlglot expression equaling to STRUCT(POSEXPLODE(arr)). Unfortunately, this cannot be directly converted to BigQuery syntax correctly (please check tobymao/sqlglot#2978). Also the issue is not reasonable for sqlglot because this is also not a correct spark syntax. This is preventing class Enumerate(Value) from producing the correct SQL. I am not familiar with spark sql, so do you have any thoughts here?
Also, if we cannot find an equal sqlglot expression, could using class Enumerate(Relation) be a better solution?

@chelsea-lin
Copy link
Contributor Author

chelsea-lin commented Feb 22, 2024

@cpcloud Trying to simple my previous question. If sqlglot cannot return a struct including arrays' items and positions, which dtype ops.Enumerate can return to respect the two columns from the generated SQL?

@cpcloud
Copy link
Member

cpcloud commented Feb 22, 2024

Ok, after experimenting in #8429, I don't think enumerate implemented like this is possible.

I'm going to take a different approach, and experiment with adding a Table.unnest(column) method instead, that will be the "tabular" unnest implementation for Ibis.

@chelsea-lin
Copy link
Contributor Author

Thanks @cpcloud for prioritizing the offset parameter! While working on the Table.unnest(column) implementation, could you please also consider another feature requests to add preserved_empty parameter. This would ensure empty arrays are retained as null after the unnest operation.
As for the implementation, we can replace sqlglot expression from EXPLODE(x) into EXPLODE_OUTER(x), or POSEXPLODE(x) into POSEXPLODE_OUTER(x) if offset=True sets.
Thanks again here!

@cpcloud cpcloud self-assigned this Feb 23, 2024
@jcrist jcrist moved this from backlog to todo in Ibis planning and roadmap Feb 23, 2024
@chelsea-lin
Copy link
Contributor Author

@cpcloud We've found a workaround solution (reliant on the feature in #8892). This code effectively unnests two columns sequentially, includes an offset ID, and handles empty arrays as null. Thanks to @TrevorBergeron for the helpful ideas!

import ibis
bq = ibis.bigquery.connect(project_id="bigframes-dev")
table = bq.table("bigframes-dev.ibis_gbq_testing.array_types")

offset_array_id = "offset_array_0"
offset_array = ibis.generate_array(
    0, ibis.greatest(
        0, 
        ibis.least(table.x.length() - 1, table.y.length() - 1)
    )).name(offset_array_id)
table_w_offset_array = table.select(offset_array, table.x, table.y)

unnest_offset_id = "unnest_offset_0"
unnest_offset = table_w_offset_array[offset_array_id].unest().name(unnest_offset_id)
table_w_offset = table_w_offset_array.select(unnest_offset, table.x, table.y)

table_w_unnest = table_w_offset.select(
    table_w_offset[unnest_offset_id].name(unnest_offset_id),
    table_w_offset.x[table_w_offset[unnest_offset_id]].name("x"),
    table_w_offset.y[table_w_offset[unnest_offset_id]].name("y"),
)

@cpcloud
Copy link
Member

cpcloud commented Jul 26, 2024

This was closed by #9423 (Table.unnest), which supports an offset parameter.

@cpcloud cpcloud closed this as completed Jul 26, 2024
@cpcloud cpcloud added this to the 9.2 milestone Aug 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bigquery The BigQuery backend feature Features or general enhancements
Projects
Archived in project
Development

No branches or pull requests

4 participants