refactor(pandas): port the pandas backend with an improved execution model #7797

kszucs · 2023-12-18T20:31:24Z

Old Implementation

Since we need to reimplement/port all of the backends for #7752, I attempted
to reimplement the pandas backend using a new execution engine. Previously
the pandas backend used a top-down execution model, and each operation was
executed by a multidispatched function. While it served us well for a long
time, it had a few drawbacks:

- it was often hard to understand what was going on due to the complex
  preparation steps and various execution hooks
- the multidispatched functions were hard to debug; additionally, they
  supported a wide variety of inputs, making the implementation rather bulky
- for the same reason, several input combinations were not supported,
  e.g. value operations with multiple columnar inputs
- the `Scope` object was used to pass around the execution context, which was
  created for each operation separately, so the results were not reusable even
  when the same operation was executed multiple times

New Implementation

The new execution model has changed in several ways:

- there is a rewrite layer before execution which lowers the input expression
  to a form closer to the pandas execution model; this makes it much easier to
  implement the operations and also makes the input "plan" inspectable
- the execution is now topologically sorted and executed in a bottom-up manner;
  the intermediate results are reused, making the execution more efficient, and
  they are aggressively cleaned up as soon as they are no longer needed to
  reduce memory usage
- the execute function is now single-dispatched, making the implementation
  easier to locate and debug
- the inputs are now broadcast to columnar shape so that the same
  implementation can be used for multiple input shape combinations; this
  removes several special cases from the implementation in exchange for a
  negligible performance overhead
- there are helper utilities making it easier to implement compute kernels for
  the various value operations: `rowwise`, `columnwise`, `elementwise`,
  `serieswise`; if there are multiple implementations available for a given
  operation, the most efficient one is selected based on the input shapes
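
To make the execution model above more concrete, here is a minimal sketch of a bottom-up, topologically sorted executor with a single-dispatched `execute` function, result reuse, and early cleanup. It is an illustration only, not the code in this PR: the `Node`, `Column`, `Add`, and `Multiply` classes, the `child_fields` helper, and the `run` function are hypothetical stand-ins, and the sketch relies on the standard library's `functools.singledispatch` and `graphlib.TopologicalSorter`.

```python
from dataclasses import dataclass
from functools import singledispatch
from graphlib import TopologicalSorter

import pandas as pd


class Node:
    """Hypothetical base class for the toy operations below (not an ibis class)."""


@dataclass(frozen=True)
class Column(Node):
    name: str


@dataclass(frozen=True)
class Add(Node):
    left: Node
    right: Node


@dataclass(frozen=True)
class Multiply(Node):
    left: Node
    right: Node


def child_fields(node):
    # Map field name -> child node, skipping non-node fields such as `name`.
    return {k: v for k, v in vars(node).items() if isinstance(v, Node)}


@singledispatch
def execute(node, table, **inputs):
    raise NotImplementedError(type(node).__name__)


@execute.register
def _(node: Column, table):
    return table[node.name]


@execute.register
def _(node: Add, table, left, right):
    return left + right


@execute.register
def _(node: Multiply, table, left, right):
    return left * right


def run(root, table, log=None):
    # Collect the graph as {node: set of nodes it depends on}; equal
    # subexpressions hash to the same key, so they appear only once.
    graph, stack = {}, [root]
    while stack:
        node = stack.pop()
        if node not in graph:
            graph[node] = set(child_fields(node).values())
            stack.extend(graph[node])

    # Number of not-yet-executed parents per node.
    pending = {node: 0 for node in graph}
    for deps in graph.values():
        for dep in deps:
            pending[dep] += 1

    results = {}
    # static_order() yields dependencies before dependents (bottom-up).
    for node in TopologicalSorter(graph).static_order():
        inputs = {k: results[v] for k, v in child_fields(node).items()}
        results[node] = execute(node, table=table, **inputs)
        if log is not None:
            log.append(type(node).__name__)
        # Drop intermediate results once their last consumer has run.
        for dep in graph[node]:
            pending[dep] -= 1
            if pending[dep] == 0:
                del results[dep]
    return results[root]


table = pd.DataFrame({"a": [1, 2], "b": [10, 20]})
log = []
expr = Multiply(Add(Column("a"), Column("b")), Add(Column("a"), Column("b")))
result = run(expr, table, log)
print(result.tolist(), log)
```

In this sketch the repeated `Add` subexpression is executed once because equal nodes share one entry in the graph, and each intermediate result is deleted as soon as its last dependent has been computed.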

The new backend implementation has a higher feature coverage while the
implementation is one third of the size of the previous one.
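
The broadcasting of inputs to columnar shape mentioned in the list above could look roughly like the following sketch. The names `asseries`, `apply_columnwise`, and `between` are hypothetical and chosen for illustration; they are not the helpers added by this PR, and the actual `columnwise` utility may have a different signature.

```python
import pandas as pd


def asseries(value, index):
    # Scalars are repeated over `index`; Series are passed through unchanged.
    if isinstance(value, pd.Series):
        return value
    return pd.Series(value, index=index)


def apply_columnwise(func, **operands):
    # Use the index of the first columnar operand, or a single row if all
    # operands are scalars.
    index = next(
        (v.index for v in operands.values() if isinstance(v, pd.Series)),
        pd.RangeIndex(1),
    )
    frame = pd.DataFrame({k: asseries(v, index) for k, v in operands.items()})
    return func(frame)


def between(df):
    # One kernel, written once against columns only.
    return (df["arg"] >= df["lower"]) & (df["arg"] <= df["upper"])


# Mixed scalar/columnar operands and all-scalar operands use the same kernel.
mixed = apply_columnwise(
    between, arg=pd.Series([1, 5, 9]), lower=2, upper=pd.Series([4, 6, 8])
)
scalars = apply_columnwise(between, arg=3, lower=2, upper=4)
```

The kernel never has to check whether an operand is a scalar or a column, which is the kind of special-casing the broadcasting step is meant to remove, at the cost of materializing scalars as columns.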

BREAKING CHANGE: the timecontext feature is not supported anymore

.github/workflows/ibis-backends.yml

ibis/backends/base/df/timecontext.py

ibis/formats/pandas.py

cpcloud · 2023-12-28T15:49:38Z

ibis/backends/tests/test_vectorized_udf.py

@@ -570,7 +570,8 @@ def test_elementwise_udf_named_destruct(udf_alltypes):
    add_one_struct_udf = create_add_one_struct_udf(
        result_formatter=lambda v1, v2: (v1, v2)
    )
-    with pytest.raises(com.IbisTypeError, match=r"Unable to infer"):
+    msg = "Duplicate column name 'new_struct' in result set"


I think it's time we get rid of this now obsolete destructure API. At this point it's entirely supplanted by lift (column expression) and unpack (tabular expression).

ibis/backends/tests/test_interactive.py

cpcloud · 2023-12-28T16:12:16Z

ibis/backends/pandas/executor/nested.py

+    if left is None or left is pd.NA:
+        return None
+    elif right is None or right is pd.NA:
+        return None


Eventually we may want a decorator like

    import pandas as pd


    def null_safe(f):
        # Return None if any positional argument is None or pd.NA.
        def wrapper(*args):
            if any(arg is None or arg is pd.NA for arg in args):
                return None
            return f(*args)

        return wrapper


    @null_safe
    def merge(left, right):
        return {**left, **right}

ibis/backends/pandas/executor/generic.py

ibis/backends/pandas/executor/nested.py

ibis/backends/pandas/executor/strings.py

ibis/backends/pandas/executor/windows.py

ibis/common/dispatch.py

ibis/expr/operations/relations.py

ibis/expr/types/joins.py

ibis/backends/pandas/tests/test_join.py

ibis/expr/tests/test_newrels.py


kszucs force-pushed the the-epic-split branch from d174f3e to f3ee1a6 Compare December 20, 2023 20:58

kszucs force-pushed the tes-pandas branch 3 times, most recently from 7c345c3 to 1c88b6c Compare December 21, 2023 17:06

kszucs force-pushed the the-epic-split branch from df63f39 to 593ec6a Compare December 22, 2023 22:12

kszucs force-pushed the tes-pandas branch from 208da9d to 351a0d9 Compare December 22, 2023 22:15

cpcloud force-pushed the tes-pandas branch from 351a0d9 to 50ed494 Compare December 23, 2023 15:02

cpcloud force-pushed the the-epic-split branch from 7f0c102 to f30e7ad Compare December 26, 2023 11:59

kszucs force-pushed the the-epic-split branch 2 times, most recently from 6375df9 to 931e546 Compare December 28, 2023 13:07

kszucs force-pushed the tes-pandas branch from 50ed494 to 1ec67a0 Compare December 28, 2023 13:07

cpcloud reviewed Dec 28, 2023


cpcloud added the refactor and pandas labels Dec 28, 2023

kszucs force-pushed the tes-pandas branch 2 times, most recently from 8f113a2 to 6c72e3e Compare December 28, 2023 21:18

kszucs commented Dec 29, 2023

ibis/common/dispatch.py (outdated, resolved)

kszucs commented Dec 29, 2023

ibis/expr/operations/relations.py (outdated, resolved)

kszucs commented Dec 29, 2023

ibis/expr/types/joins.py (outdated, resolved)

kszucs mentioned this pull request Dec 29, 2023

feat(common): add Dispatched base class for convenient visitor pattern implementation #7857

Merged

kszucs changed the title ~~refactor(pandas): rewrite the pandas backend to use the new relational operations with an improved execution model~~ refactor(pandas): port the pandas backend with an improved execution model Dec 29, 2023

This was referenced Dec 29, 2023

feat(common): add a memory efficient Node.map() implementation #7862

Closed

feat(common): add a memory efficient Node.map() implementation #7863

Merged

kszucs force-pushed the tes-pandas branch from cf48cdb to d668ad5 Compare December 29, 2023 18:26

kszucs commented Dec 29, 2023

ibis/backends/pandas/tests/test_join.py (outdated, resolved)

kszucs force-pushed the tes-pandas branch from 23863c4 to 98ac08f Compare December 30, 2023 11:38

kszucs commented Dec 30, 2023

ibis/expr/tests/test_newrels.py (outdated, resolved)

kszucs force-pushed the tes-pandas branch 3 times, most recently from 5e480fd to 1ea8358 Compare January 1, 2024 01:03

cpcloud approved these changes Jan 4, 2024


cpcloud enabled auto-merge (squash) January 4, 2024 11:43

cpcloud merged commit eb31002 into ibis-project:the-epic-split Jan 4, 2024
25 checks passed

kszucs deleted the tes-pandas branch January 4, 2024 11:51

jcrist mentioned this pull request Jan 4, 2024

meta: Port backends to new relational operators/sqlglot #7909

Closed

21 tasks

kszucs mentioned this pull request Jan 18, 2024

refactor(pandas): simplify pandas helpers #8009

Merged

gforsyth mentioned this pull request Jan 23, 2024

test(markers): test the test markers #8077

Merged

kszucs mentioned this pull request Feb 1, 2024

chore: check builds for the-epic-split after rebase #8178

Closed
