Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(dask): port the dask backend to the new execution model #8005

Merged
merged 1 commit into from
Feb 1, 2024

Conversation

kszucs
Copy link
Member

@kszucs kszucs commented Jan 17, 2024

Reimplementation of the dask backend on top of the new pandas executor. I had to adjust the pandas backend to support extending. This way the new dask implementation turned out to be pretty tidy.

There are a couple of features which are not implemented using proper dask constructs, but rather have a fallback to local execution using pandas. The most notable are the window functions. The previous dask implementation supported just a couple of window cases, but this way we have full coverage at least.

Thanks to the new pandas base we have a wider feature coverage, see the removed xfails in the test suite.

@kszucs kszucs changed the title refactor(dask): initial commit for porting the dask backend to the ne… refactor(dask): initial commit for porting the dask backend to the new execution model Jan 17, 2024
@kszucs kszucs changed the title refactor(dask): initial commit for porting the dask backend to the new execution model refactor(dask): initial commit for porting the dask backend to the new execution model [WIP] Jan 17, 2024
@cpcloud cpcloud added refactor Issues or PRs related to refactoring the codebase dask The Dask backend tes-required-for-release Things that must be addressed before *release* of `main` after merging in `the-epic-split` tes-required-for-merge Issues that must addressed before merging the-epic-split branch into main labels Jan 17, 2024
ibis/backends/dask/helpers.py Outdated Show resolved Hide resolved
@kszucs kszucs force-pushed the tes-dask branch 4 times, most recently from 0f1dec5 to f4abfb5 Compare January 18, 2024 17:15
@kszucs kszucs changed the title refactor(dask): initial commit for porting the dask backend to the new execution model [WIP] refactor(dask): port the dask backend to the new execution model [WIP] Jan 18, 2024
@kszucs kszucs force-pushed the tes-dask branch 14 times, most recently from 28850d7 to 87e14ee Compare January 25, 2024 13:06
Copy link
Member

@jcrist jcrist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few comments. Overall I'd be very happy to get this in and then iterate. This is much cleaner than the current dask backend, thanks for all the work here!

I do think many of these places where we're explicitly calling .compute() as part of handling an operation are places where we should just drop support for that operation until we can spell it in a more efficient way (for many of these dask itself can do so efficiently, so the limit is more on our executor design than on dask). I'd rather have an UnsupportedOperationError for now than something that is made inefficient by something ibis is doing rather than the backend itself.

ibis/backends/dask/executor.py Show resolved Hide resolved
ibis/backends/dask/executor.py Show resolved Hide resolved
ibis/backends/dask/helpers.py Show resolved Hide resolved
):
def agg(df):
# if df is a dask dataframe then we collect it to a pandas dataframe
# because the user-defined function expects a pandas dataframe
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that computing here like this would be unexpected and not something we should do.

IIUC these are the legacy udfs. IMO we should remove these as part of this release. They were only supported by the pandas and dask backends, and given the rewrite changed a bunch of the internals (and the only major user of these we know of made heavy use of the internals) I suspect there's no user that would be able to upgrade to this release that was also using the legacy udfs. We should drop them for them IMO for the new udfs alone.

ibis/backends/dask/executor.py Show resolved Hide resolved
ibis/backends/dask/executor.py Show resolved Hide resolved
@cpcloud
Copy link
Member

cpcloud commented Feb 1, 2024

You can xfail test_unnest_range for now, it's a new test since the last rebase of the-epic-split.

@kszucs
Copy link
Member Author

kszucs commented Feb 1, 2024

I do think many of these places where we're explicitly calling .compute() as part of handling an operation are places where we should just drop support for that operation until we can spell it in a more efficient way (for many of these dask itself can do so efficiently, so the limit is more on our executor design than on dask). I'd rather have an UnsupportedOperationError for now than something that is made inefficient by something ibis is doing rather than the backend itself.

On one hand I agree, simply because it should be the ideal way to go. On the other hand at least these operations are usable, even though they are not as performant as they should be. Also some of the window functionality used to be supported, so raising would cause a regression. There is a third option is to raise a warning so that the user is aware about the performance problems and gradually improve them (if there is interest).

@cpcloud cpcloud merged commit d87bf3e into ibis-project:the-epic-split Feb 1, 2024
66 checks passed
@jcrist
Copy link
Member

jcrist commented Feb 1, 2024

On the other hand at least these operations are usable, even though they are not as performant as they should be. Also some of the window functionality used to be supported, so raising would cause a regression.

I'd argue that the previous dask backend was both buggy and slow enough that no one is probably using it (we certainly haven't heard from any users successfully using it). A regression in functionality with an improvement in correctness of dask code (including not calling .compute() multiple times) feels worth it to me. I consider anywhere we're manually calling compute outside of returning results to the user a bug.

kszucs added a commit to kszucs/ibis that referenced this pull request Feb 1, 2024
…s-project#8005)

Reimplementation of the dask backend on top of the new pandas executor.
I had to adjust the pandas backend to support extending. This way the
new dask implementation turned out to be pretty tidy.

There are a couple of features which are not implemented using proper
dask constructs, but rather have a fallback to local execution using
pandas. The most notable are the window functions. The previous dask
implementation supported just a couple of window cases, but this way we
have full coverage at least.

Thanks to the new pandas base we have a wider feature coverage, see the
removed xfails in the test suite.
kszucs added a commit to kszucs/ibis that referenced this pull request Feb 2, 2024
…s-project#8005)

Reimplementation of the dask backend on top of the new pandas executor.
I had to adjust the pandas backend to support extending. This way the
new dask implementation turned out to be pretty tidy.

There are a couple of features which are not implemented using proper
dask constructs, but rather have a fallback to local execution using
pandas. The most notable are the window functions. The previous dask
implementation supported just a couple of window cases, but this way we
have full coverage at least.

Thanks to the new pandas base we have a wider feature coverage, see the
removed xfails in the test suite.
kszucs added a commit to kszucs/ibis that referenced this pull request Feb 2, 2024
…s-project#8005)

Reimplementation of the dask backend on top of the new pandas executor.
I had to adjust the pandas backend to support extending. This way the
new dask implementation turned out to be pretty tidy.

There are a couple of features which are not implemented using proper
dask constructs, but rather have a fallback to local execution using
pandas. The most notable are the window functions. The previous dask
implementation supported just a couple of window cases, but this way we
have full coverage at least.

Thanks to the new pandas base we have a wider feature coverage, see the
removed xfails in the test suite.
kszucs added a commit to kszucs/ibis that referenced this pull request Feb 2, 2024
…s-project#8005)

Reimplementation of the dask backend on top of the new pandas executor.
I had to adjust the pandas backend to support extending. This way the
new dask implementation turned out to be pretty tidy.

There are a couple of features which are not implemented using proper
dask constructs, but rather have a fallback to local execution using
pandas. The most notable are the window functions. The previous dask
implementation supported just a couple of window cases, but this way we
have full coverage at least.

Thanks to the new pandas base we have a wider feature coverage, see the
removed xfails in the test suite.
cpcloud pushed a commit to cpcloud/ibis that referenced this pull request Feb 4, 2024
…s-project#8005)

Reimplementation of the dask backend on top of the new pandas executor.
I had to adjust the pandas backend to support extending. This way the
new dask implementation turned out to be pretty tidy.

There are a couple of features which are not implemented using proper
dask constructs, but rather have a fallback to local execution using
pandas. The most notable are the window functions. The previous dask
implementation supported just a couple of window cases, but this way we
have full coverage at least.

Thanks to the new pandas base we have a wider feature coverage, see the
removed xfails in the test suite.
cpcloud pushed a commit to cpcloud/ibis that referenced this pull request Feb 5, 2024
…s-project#8005)

Reimplementation of the dask backend on top of the new pandas executor.
I had to adjust the pandas backend to support extending. This way the
new dask implementation turned out to be pretty tidy.

There are a couple of features which are not implemented using proper
dask constructs, but rather have a fallback to local execution using
pandas. The most notable are the window functions. The previous dask
implementation supported just a couple of window cases, but this way we
have full coverage at least.

Thanks to the new pandas base we have a wider feature coverage, see the
removed xfails in the test suite.
kszucs added a commit that referenced this pull request Feb 5, 2024
Reimplementation of the dask backend on top of the new pandas executor.
I had to adjust the pandas backend to support extending. This way the
new dask implementation turned out to be pretty tidy.

There are a couple of features which are not implemented using proper
dask constructs, but rather have a fallback to local execution using
pandas. The most notable are the window functions. The previous dask
implementation supported just a couple of window cases, but this way we
have full coverage at least.

Thanks to the new pandas base we have a wider feature coverage, see the
removed xfails in the test suite.
kszucs added a commit that referenced this pull request Feb 6, 2024
Reimplementation of the dask backend on top of the new pandas executor.
I had to adjust the pandas backend to support extending. This way the
new dask implementation turned out to be pretty tidy.

There are a couple of features which are not implemented using proper
dask constructs, but rather have a fallback to local execution using
pandas. The most notable are the window functions. The previous dask
implementation supported just a couple of window cases, but this way we
have full coverage at least.

Thanks to the new pandas base we have a wider feature coverage, see the
removed xfails in the test suite.
kszucs added a commit that referenced this pull request Feb 6, 2024
Reimplementation of the dask backend on top of the new pandas executor.
I had to adjust the pandas backend to support extending. This way the
new dask implementation turned out to be pretty tidy.

There are a couple of features which are not implemented using proper
dask constructs, but rather have a fallback to local execution using
pandas. The most notable are the window functions. The previous dask
implementation supported just a couple of window cases, but this way we
have full coverage at least.

Thanks to the new pandas base we have a wider feature coverage, see the
removed xfails in the test suite.
cpcloud pushed a commit to cpcloud/ibis that referenced this pull request Feb 12, 2024
…s-project#8005)

Reimplementation of the dask backend on top of the new pandas executor.
I had to adjust the pandas backend to support extending. This way the
new dask implementation turned out to be pretty tidy.

There are a couple of features which are not implemented using proper
dask constructs, but rather have a fallback to local execution using
pandas. The most notable are the window functions. The previous dask
implementation supported just a couple of window cases, but this way we
have full coverage at least.

Thanks to the new pandas base we have a wider feature coverage, see the
removed xfails in the test suite.
cpcloud pushed a commit that referenced this pull request Feb 12, 2024
Reimplementation of the dask backend on top of the new pandas executor.
I had to adjust the pandas backend to support extending. This way the
new dask implementation turned out to be pretty tidy.

There are a couple of features which are not implemented using proper
dask constructs, but rather have a fallback to local execution using
pandas. The most notable are the window functions. The previous dask
implementation supported just a couple of window cases, but this way we
have full coverage at least.

Thanks to the new pandas base we have a wider feature coverage, see the
removed xfails in the test suite.
cpcloud pushed a commit to cpcloud/ibis that referenced this pull request Feb 12, 2024
…s-project#8005)

Reimplementation of the dask backend on top of the new pandas executor.
I had to adjust the pandas backend to support extending. This way the
new dask implementation turned out to be pretty tidy.

There are a couple of features which are not implemented using proper
dask constructs, but rather have a fallback to local execution using
pandas. The most notable are the window functions. The previous dask
implementation supported just a couple of window cases, but this way we
have full coverage at least.

Thanks to the new pandas base we have a wider feature coverage, see the
removed xfails in the test suite.
cpcloud pushed a commit that referenced this pull request Feb 12, 2024
Reimplementation of the dask backend on top of the new pandas executor.
I had to adjust the pandas backend to support extending. This way the
new dask implementation turned out to be pretty tidy.

There are a couple of features which are not implemented using proper
dask constructs, but rather have a fallback to local execution using
pandas. The most notable are the window functions. The previous dask
implementation supported just a couple of window cases, but this way we
have full coverage at least.

Thanks to the new pandas base we have a wider feature coverage, see the
removed xfails in the test suite.
kszucs added a commit that referenced this pull request Feb 12, 2024
Reimplementation of the dask backend on top of the new pandas executor.
I had to adjust the pandas backend to support extending. This way the
new dask implementation turned out to be pretty tidy.

There are a couple of features which are not implemented using proper
dask constructs, but rather have a fallback to local execution using
pandas. The most notable are the window functions. The previous dask
implementation supported just a couple of window cases, but this way we
have full coverage at least.

Thanks to the new pandas base we have a wider feature coverage, see the
removed xfails in the test suite.
ncclementi pushed a commit to ncclementi/ibis that referenced this pull request Feb 21, 2024
…s-project#8005)

Reimplementation of the dask backend on top of the new pandas executor.
I had to adjust the pandas backend to support extending. This way the
new dask implementation turned out to be pretty tidy.

There are a couple of features which are not implemented using proper
dask constructs, but rather have a fallback to local execution using
pandas. The most notable are the window functions. The previous dask
implementation supported just a couple of window cases, but this way we
have full coverage at least.

Thanks to the new pandas base we have a wider feature coverage, see the
removed xfails in the test suite.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dask The Dask backend refactor Issues or PRs related to refactoring the codebase tes-required-for-merge Issues that must addressed before merging the-epic-split branch into main
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants