Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(api): refactor the implementation of windowing #9200

Merged
merged 13 commits into from
Jul 18, 2024

Conversation

chloeh13q
Copy link
Contributor

@chloeh13q chloeh13q commented May 15, 2024

Description of changes

Some design considerations:

  • Removed the original implementation of windowing to avoid confusion.
  • I thought about creating a window class to hold all the relevant windowing info like window type, window size, window slide, etc. The API would be slightly cleaner that way but I worry that it would be confused with the existing ibis.window().
  • Tumble and hop are the most basic types of windows that we should support across the board. Session windows and cumulate windows are not supported in every backend and the API is slightly different.

The new API:

>>> import ibis
>>> from ibis import _
>>> t = ibis.table(
...     ibis.schema(
...         {
...             "createTime": "timestamp(3)",
...             "orderId": "int64",
...             "payAmount": "float64",
...             "payPlatform": "int32",
...             "provinceId": "int32",
...         }
...     ),
...     name="payment_msg",
... )
>>> expr = (
...         t.window_by(time_col="createTime")
...         .tumble(size=ibis.interval(seconds=30))
...         .agg(by=["provinceId"], avgPayAmount=_.payAmount.mean())
...     )
>>> expr
r0 := UnboundTable: payment_msg
  createTime  timestamp(3)
  orderId     int64
  payAmount   float64
  payPlatform int32
  provinceId  int32

WindowAggregate[r0]
  window_type:
    tumble
  time_col:
    r0.createTime
  groups:
    provinceId: r0.provinceId
  metrics:
    avgPayAmount: Mean(r0.payAmount)
  window_size:
    30 s
  schema:
    window_start timestamp
    window_end   timestamp
    provinceId   int32
    avgPayAmount float64

Issues closed

#8847

@chloeh13q chloeh13q force-pushed the refactor/windowing branch from edae581 to 84bac37 Compare May 15, 2024 22:22
@chloeh13q chloeh13q force-pushed the refactor/windowing branch 3 times, most recently from fc5d2fd to 25fe86d Compare May 31, 2024 21:51
@chloeh13q chloeh13q marked this pull request as ready for review June 5, 2024 19:52
@cpcloud cpcloud added this to the 9.2 milestone Jun 13, 2024
@chloeh13q chloeh13q force-pushed the refactor/windowing branch 2 times, most recently from 00e09e8 to 6f14e43 Compare July 1, 2024 23:24
@cpcloud
Copy link
Member

cpcloud commented Jul 5, 2024

@chloeh13q This LGTM! Just to clarify, are there any user-facing API changes, or is it only internals that have changed?

@cpcloud cpcloud added refactor Issues or PRs related to refactoring the codebase streaming Issue related to streaming APIs or backends labels Jul 5, 2024
@chloeh13q chloeh13q force-pushed the refactor/windowing branch from 769fb23 to dec2c92 Compare July 9, 2024 20:53
@chloeh13q
Copy link
Contributor Author

chloeh13q commented Jul 9, 2024

@cpcloud Thanks for reviewing!

are there any user-facing API changes, or is it only internals that have changed?

There are user-facing API changes, although the changes are pretty subtle. We're changing:

expr = (
    t.window_by(time_col="i")
    .tumble(window_size=ibis.interval(minutes=15))
    .group_by(["window_start", "window_end", "g"])
    .aggregate(mean=_.d.mean())
)

to something like

expr = (
    t.window_by(time_col="i")
    .tumble(window_size=ibis.interval(minutes=15))
    .group_by(["g"])
    .aggregate(mean=_.d.mean())
)

The big difference is that we're trying to hide columns like "window_start" "window_end" from the user (as well as providing more alternative syntax in a more Ibis-like manner, e.g., you can use both .agg and .aggregate, and you can put the groupby into aggregate as well now).

[ps: only implemented windowed aggregations so far to see if we like this new API and IR changes. Planning to implement window top-n, window join, and cascading windows (aka chained window aggregations) (note that we didn't have alternative syntax for these previously, but you could do them with underlying ops like row number, so these are going to be new APIs) in subsequent PRs]

With changes merged from another PR, I was able to add tests for window aggregations in the Spark backend too.

@chloeh13q chloeh13q requested review from cpcloud and jcrist July 9, 2024 21:07
@cpcloud
Copy link
Member

cpcloud commented Jul 16, 2024

@chloeh13q Can you rebase? I'll give a final review and this should be good to go after that.

@chloeh13q chloeh13q force-pushed the refactor/windowing branch from b64029c to 34f61e0 Compare July 16, 2024 18:00
@chloeh13q
Copy link
Contributor Author

@cpcloud Just rebased!

Copy link
Member

@cpcloud cpcloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't catch the warnings usage before, let's clean that up and then this is good to go!

ibis/backends/pyspark/__init__.py Outdated Show resolved Hide resolved
ibis/backends/pyspark/__init__.py Outdated Show resolved Hide resolved
ibis/backends/pyspark/__init__.py Outdated Show resolved Hide resolved
@chloeh13q chloeh13q force-pushed the refactor/windowing branch from f565935 to f510d6a Compare July 16, 2024 23:26
@chloeh13q chloeh13q requested a review from cpcloud July 16, 2024 23:28
@chloeh13q
Copy link
Contributor Author

@cpcloud Thanks for the review and feedback! I changed all the warnings to errors.

@chloeh13q chloeh13q force-pushed the refactor/windowing branch from f510d6a to 28d34ee Compare July 16, 2024 23:29
@cpcloud cpcloud added the pyspark The Apache PySpark backend label Jul 18, 2024
@cpcloud cpcloud merged commit eaa1301 into ibis-project:main Jul 18, 2024
84 checks passed
@cpcloud
Copy link
Member

cpcloud commented Jul 18, 2024

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pyspark The Apache PySpark backend refactor Issues or PRs related to refactoring the codebase streaming Issue related to streaming APIs or backends
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants