refactor(pandas): port the pandas backend with an improved execution model #7797

Merged
merged 1 commit into from
Jan 4, 2024

Conversation

kszucs
Member

@kszucs kszucs commented Dec 18, 2023

Old Implementation

Since we need to reimplement/port all of the backends for #7752, I attempted
to reimplement the pandas backend using a new execution engine.
Previously the pandas backend used a top-down execution model, and each
operation was executed by a multidispatched function. While it served us
well for a long time, it had a few drawbacks:

  • it was often hard to understand what was going on due to the complex
    preparation steps and various execution hooks
  • the multidispatched functions were hard to debug; additionally, they
    supported a wide variety of inputs, making the implementation rather bulky
  • for the previous reason, several input combinations were not supported,
    e.g. value operations with multiple columnar inputs
  • the Scope object was used to pass around the execution context, which was
    created separately for each operation, so results were not reused even
    when the same operation was executed multiple times

New Implementation

The new execution model differs in several ways:

  • a rewrite layer runs before execution and lowers the input expression
    to a form closer to the pandas execution model; this makes the operations
    much easier to implement and also makes the input "plan" inspectable
  • the plan is topologically sorted and executed bottom-up; intermediate
    results are reused, which makes execution more efficient, and they are
    cleaned up aggressively as soon as they are no longer needed, which
    reduces memory usage
  • the execute function is now single-dispatched, making the implementation
    easier to locate and debug
  • the inputs are now broadcast to columnar shape so that the same
    implementation can be used for multiple input shape combinations; this
    removes several special cases from the implementation in exchange for a
    negligible performance overhead
  • there are helper utilities that make it easier to implement compute
    kernels for the various value operations: rowwise, columnwise,
    elementwise, serieswise; if multiple implementations are available for a
    given operation, the most efficient one is selected based on the input
    shapes
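
As a rough illustration of the bottom-up execution with result reuse and eager cleanup described in the list above, here is a minimal sketch. It is not the code from this PR: `execute_bottom_up`, the `graph`/`kernels` dictionaries, and the node names are hypothetical, and plain lists stand in for pandas objects. The standard library's `graphlib.TopologicalSorter` provides the ordering.

```python
from collections import Counter
from graphlib import TopologicalSorter


def execute_bottom_up(graph, kernels, root):
    """Evaluate `root`, where `graph` maps each node to its dependencies.

    `kernels` maps each node to a callable that takes the dependency results.
    (Hypothetical helper for illustration only.)
    """
    # How many consumers still need each intermediate result.
    pending = Counter(dep for deps in graph.values() for dep in deps)
    results = {}
    # static_order() yields every dependency before the nodes that use it.
    for node in TopologicalSorter(graph).static_order():
        deps = graph.get(node, ())
        results[node] = kernels[node](*(results[dep] for dep in deps))
        for dep in deps:
            pending[dep] -= 1
            if pending[dep] == 0:
                del results[dep]  # free memory once the last consumer ran
    return results[root]


calls = []


def load_table():
    calls.append("t")  # record each time the shared input is computed
    return [1, 2, 3]


# "t" feeds two operations, but it is computed only once.
graph = {"t": (), "double": ("t",), "inc": ("t",), "sum": ("double", "inc")}
kernels = {
    "t": load_table,
    "double": lambda t: [x * 2 for x in t],
    "inc": lambda t: [x + 1 for x in t],
    "sum": lambda a, b: [x + y for x, y in zip(a, b)],
}
result = execute_bottom_up(graph, kernels, "sum")
```

Each node is evaluated exactly once, regardless of how many consumers it has, and a result is dropped as soon as its consumer count reaches zero.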

The new backend implementation has higher feature coverage, while being
one third of the size of the previous one.
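
To make the single-dispatch and broadcasting points above more concrete, here is a small sketch under stated assumptions. The node classes (`Literal`, `Column`, `Add`) and the `asseries` helper are hypothetical names invented for this example and do not mirror the actual ibis operations or utilities; only `functools.singledispatch` and the `pandas.Series` constructor are real APIs.

```python
from dataclasses import dataclass
from functools import singledispatch

import pandas as pd


# Hypothetical expression nodes, for illustration only.
@dataclass(frozen=True)
class Literal:
    value: object


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Add:
    left: object
    right: object


def asseries(value, index):
    """Broadcast a scalar to a Series on `index`; pass a Series through."""
    if isinstance(value, pd.Series):
        return value
    return pd.Series(value, index=index)


@singledispatch
def execute(node, df):
    raise NotImplementedError(f"no rule for {type(node).__name__}")


@execute.register
def _(node: Literal, df):
    return node.value


@execute.register
def _(node: Column, df):
    return df[node.name]


@execute.register
def _(node: Add, df):
    # One implementation covers scalar+scalar, scalar+column, column+column.
    left = asseries(execute(node.left, df), df.index)
    right = asseries(execute(node.right, df), df.index)
    return left + right


df = pd.DataFrame({"x": [1, 2, 3]})
col_plus_scalar = execute(Add(Column("x"), Literal(10)), df)
scalar_plus_scalar = execute(Add(Literal(1), Literal(2)), df)
```

Because dispatch happens only on the node type, each rule lives in one easy-to-find place, and broadcasting both operands up front removes the need for separate scalar and column code paths at the cost of materializing a Series for scalar inputs.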

BREAKING CHANGE: the timecontext feature is not supported anymore

.github/workflows/ibis-backends.yml (outdated, resolved)
ibis/backends/base/df/timecontext.py (outdated, resolved)
ibis/formats/pandas.py (resolved)
@@ -570,7 +570,8 @@ def test_elementwise_udf_named_destruct(udf_alltypes):
add_one_struct_udf = create_add_one_struct_udf(
result_formatter=lambda v1, v2: (v1, v2)
)
with pytest.raises(com.IbisTypeError, match=r"Unable to infer"):
msg = "Duplicate column name 'new_struct' in result set"

I think it's time we get rid of this now obsolete destructure API. At this point it's entirely supplanted by lift (column expression) and unpack (tabular expression).

ibis/backends/tests/test_interactive.py (resolved)
if left is None or left is pd.NA:
return None
elif right is None or right is pd.NA:
return None

Eventually we may want a decorator like

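import functools

import pandas as pd


def null_safe(f):
    # Return None if any positional argument is None or pd.NA.
    @functools.wraps(f)
    def wrapper(*args):
        if any(arg is None or arg is pd.NA for arg in args):
            return None
        return f(*args)
    return wrapper


@null_safe
def merge(left, right):
    return {**left, **right}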

ibis/backends/pandas/executor/generic.py (outdated, resolved)
ibis/backends/pandas/executor/nested.py (outdated, resolved)
ibis/backends/pandas/executor/strings.py (outdated, resolved)
ibis/backends/pandas/executor/windows.py (outdated, resolved)
@cpcloud cpcloud added the refactor (Issues or PRs related to refactoring the codebase) and pandas (The pandas backend) labels Dec 28, 2023
@kszucs kszucs force-pushed the tes-pandas branch 2 times, most recently from 8f113a2 to 6c72e3e Compare December 28, 2023 21:18
ibis/common/dispatch.py (outdated, resolved)
ibis/expr/types/joins.py (outdated, resolved)
@kszucs kszucs changed the title from "refactor(pandas): rewrite the pandas backend to use the new relational operations with an improved execution model" to "refactor(pandas): port the pandas backend with an improved execution model" Dec 29, 2023
@kszucs kszucs force-pushed the tes-pandas branch 3 times, most recently from 5e480fd to 1ea8358 Compare January 1, 2024 01:03
@cpcloud cpcloud enabled auto-merge (squash) January 4, 2024 11:43
@cpcloud cpcloud merged commit eb31002 into ibis-project:the-epic-split Jan 4, 2024
25 checks passed
@kszucs kszucs deleted the tes-pandas branch January 4, 2024 11:51
cpcloud pushed a commit that referenced this pull request Jan 4, 2024
cpcloud pushed a commit that referenced this pull request Jan 5, 2024
cpcloud pushed a commit that referenced this pull request Jan 12, 2024
cpcloud pushed a commit that referenced this pull request Jan 13, 2024
cpcloud pushed a commit that referenced this pull request Jan 17, 2024
kszucs added a commit to kszucs/ibis that referenced this pull request Feb 1, 2024
kszucs added a commit to kszucs/ibis that referenced this pull request Feb 1, 2024
kszucs added a commit to kszucs/ibis that referenced this pull request Feb 1, 2024
kszucs added a commit to kszucs/ibis that referenced this pull request Feb 2, 2024
kszucs added a commit to kszucs/ibis that referenced this pull request Feb 2, 2024
kszucs added a commit to kszucs/ibis that referenced this pull request Feb 2, 2024
…model (ibis-project#7797)

Since we need to reimplement/port all of the backends for ibis-project#7752, I took
an
attempt at reimplementing the pandas backend using a new execution
engine.
Previously the pandas backend was implemented using a top-down execution
model
and each operation was executing using a multidispatched function. While
it
served us well for a long time, it had a few drawbacks:
- it was often hard to understand what was going on due to the complex
  preparation steps and various execution hooks
- the multidispatched functions were hard to debug, additionally they
supported
  a wide variety of inputs making the implementation rather bulky
- due to the previous reaon, several inputs combinations were not
supported,
  e.g. value operations with multiple columnar inputs
- the `Scope` object was used to pass around the execution context which
was
created for each operation separately and the results were not reusable
even
  though the same operation was executed multiple times

The new execution model has changed in several ways:
- there is a rewrite layer before execution which lowers the input
  expression to a form closer to the pandas execution model; this makes
  it much easier to implement the operations and also makes the input
  "plan" inspectable
- the execution is now topologically sorted and executed in a bottom-up
  manner; the intermediate results are reused, making the execution more
  efficient, while also being aggressively cleaned up as soon as they
  are no longer needed to reduce memory usage
- the execute function is now single-dispatched, making the
  implementation easier to locate and debug
- the inputs are now broadcast to columnar shape so that the same
  implementation can be used for multiple input shape combinations; this
  removes several special cases from the implementation in exchange for
  a negligible performance overhead
- there are helper utilities making it easier to implement compute
  kernels for the various value operations: `rowwise`, `columnwise`,
  `elementwise`, `serieswise`; if there are multiple implementations
  available for a given operation, the most efficient one is selected
  based on the input shapes
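
The bottom-up model in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the backend's actual code: the node classes `Literal`, `Add` and `Multiply`, the `children` helper and the `run` driver are all hypothetical, and plain integers stand in for pandas objects. It shows a topologically sorted traversal, a single-dispatched `execute` function, reuse of a shared intermediate result, and dropping results once no remaining node needs them.

```python
from dataclasses import dataclass
from functools import singledispatch
from graphlib import TopologicalSorter


# Hypothetical expression nodes; frozen dataclasses are hashable, so
# structurally equal sub-expressions collapse into one graph node.
@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Add:
    left: object
    right: object


@dataclass(frozen=True)
class Multiply:
    left: object
    right: object


def children(node):
    """Return the direct inputs of a node."""
    if isinstance(node, (Add, Multiply)):
        return (node.left, node.right)
    return ()


@singledispatch
def execute(node, *inputs):
    """Dispatch on the node type only; inputs are already computed."""
    raise NotImplementedError(type(node).__name__)


@execute.register
def _(node: Literal):
    return node.value


@execute.register
def _(node: Add, left, right):
    return left + right


@execute.register
def _(node: Multiply, left, right):
    return left * right


def run(root):
    """Execute the expression bottom-up and return (value, executed nodes)."""
    # Map every node to the set of nodes it depends on.
    graph = {}
    stack = [root]
    while stack:
        node = stack.pop()
        if node not in graph:
            graph[node] = set(children(node))
            stack.extend(children(node))

    # Count how many uses of each result are still outstanding.
    pending = {node: 0 for node in graph}
    for node in graph:
        for child in children(node):
            pending[child] += 1

    results = {}
    executed = []
    # static_order() yields dependencies before their dependents.
    for node in TopologicalSorter(graph).static_order():
        args = [results[child] for child in children(node)]
        results[node] = execute(node, *args)
        executed.append(node)
        # Drop intermediate results as soon as nothing else needs them.
        for child in children(node):
            pending[child] -= 1
            if pending[child] == 0:
                del results[child]
    return results[root], executed


shared = Add(Literal(1), Literal(2))
value, executed = run(Multiply(shared, shared))
print(value, len(executed))  # 9 4: the shared Add is computed once
```

Because each node is executed exactly once in dependency order, a sub-expression used in several places is computed a single time, and the reference counts make it clear when its result can be released.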

The new backend implementation has higher feature coverage, while the
implementation is one third of the size of the previous one.

BREAKING CHANGE: the `timecontext` feature is not supported anymore
cpcloud pushed a commit to cpcloud/ibis that referenced this pull request Feb 4, 2024
…model (ibis-project#7797)

cpcloud pushed a commit to cpcloud/ibis that referenced this pull request Feb 5, 2024
…model (ibis-project#7797)

kszucs added a commit that referenced this pull request Feb 5, 2024
…model (#7797)

kszucs added a commit that referenced this pull request Feb 6, 2024
…model (#7797)

kszucs added a commit that referenced this pull request Feb 6, 2024
…model (#7797)

cpcloud pushed a commit to cpcloud/ibis that referenced this pull request Feb 12, 2024
…model (ibis-project#7797)

cpcloud pushed a commit that referenced this pull request Feb 12, 2024
…model (#7797)

cpcloud pushed a commit to cpcloud/ibis that referenced this pull request Feb 12, 2024
…model (ibis-project#7797)

cpcloud pushed a commit that referenced this pull request Feb 12, 2024
…model (#7797)

kszucs added a commit that referenced this pull request Feb 12, 2024
…model (#7797)

ncclementi pushed a commit to ncclementi/ibis that referenced this pull request Feb 21, 2024
…model (ibis-project#7797)

Labels
pandas The pandas backend refactor Issues or PRs related to refactoring the codebase

2 participants