This repository has been archived by the owner on Apr 10, 2024. It is now read-only.

"Predicate pushdown" in group-bys #7

Open
wesm opened this issue Aug 31, 2016 · 2 comments

@wesm
Owner

wesm commented Aug 31, 2016

xref #15

I brought this up at SciPy 2015, but there's a significant performance win available in expressions like:

df[boolean_cond].groupby(grouping_exprs).agg(agg_expr)

If you do this currently, df[boolean_cond] is fully materialized as a new DataFrame (every column, for all selected rows) before the group-by runs, even if the group-by only touches a small portion of the DataFrame. Ideally, we'd have:

df.groupby(grouping_exprs, where=boolean_cond).agg(...)

I put this as a design / pandas2 issue because the boolean bytes / bits will need to get pushed down into the various C-level groupby subroutines.
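To make the idea concrete, here is a minimal sketch that emulates the pushdown in user-level Python rather than in the C-level routines. The helper name `groupby_sum_where` is hypothetical and not part of pandas; the sketch assumes a single key column with no missing keys and a numeric value column with no NaNs. The mask is applied inside the aggregation, so the filtered DataFrame is never built.

```python
import numpy as np
import pandas as pd


def groupby_sum_where(df, key, value, where):
    """Hypothetical helper: per-group sum of `value`, counting only rows where `where` is True.

    Assumes `key` has no missing values and `value` is numeric without NaNs.
    """
    mask = np.asarray(where, dtype=bool)
    # Integer group codes for every row, plus the distinct keys.
    codes, uniques = pd.factorize(df[key])
    values = df[value].to_numpy(dtype="float64")
    n_groups = len(uniques)
    # Masked-out rows contribute zero; only one temporary column is allocated.
    sums = np.bincount(codes, weights=np.where(mask, values, 0.0), minlength=n_groups)
    # Count selected rows per group so groups with no selected rows can be dropped.
    counts = np.bincount(codes, weights=mask.astype("float64"), minlength=n_groups)
    seen = counts > 0
    index = pd.Index(uniques[seen], name=key)
    return pd.Series(sums[seen], index=index, name=value).sort_index()
```

Under those assumptions, `groupby_sum_where(df, "k", "v", cond)` should match `df[cond].groupby("k")["v"].sum()` while only touching the key and value columns.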

@chris-b1

chris-b1 commented Sep 1, 2016

On the mailing list, you mentioned the idea of an "expression VM". This feels like the kind of thing that would be nicely handled by that? Just making up an API, something like this, where a delayed df builds up a dask/numexpr-like graph that can be optimized:

df = pd.read_csv(...)
with pd.delayed(df) as df:
   df['val'] = df['val'] + 100.
   <... several intermediate expressions ... >
   answer = df[cond].groupby(expr).agg(...).compute()

# `df` is unmodified, only `answer` is computed, hopefully very efficiently 

That's really broad, though, so maybe this is a useful enough case to just build directly into groupby ops.
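As a rough illustration of the deferred approach, the sketch below records operations instead of running them and rewrites one pattern at compute time. `LazyFrame` and its methods are made up for this example; they are not a pandas, dask, or numexpr API. The only optimization shown is a filter followed by a group-by, where just the key and value columns are copied.

```python
import pandas as pd


class LazyFrame:
    """Hypothetical deferred wrapper: records steps and runs nothing until compute()."""

    def __init__(self, df, steps=()):
        self._df = df
        self._steps = tuple(steps)

    def filter(self, predicate):
        # `predicate` is a callable taking a DataFrame and returning a boolean Series.
        return LazyFrame(self._df, self._steps + (("filter", predicate),))

    def groupby_agg(self, key, value, how):
        return LazyFrame(self._df, self._steps + (("groupby_agg", key, value, how),))

    def compute(self):
        steps = self._steps
        # Rewrite rule: filter followed by groupby_agg only needs two columns.
        if len(steps) == 2 and steps[0][0] == "filter" and steps[1][0] == "groupby_agg":
            _, predicate = steps[0]
            _, key, value, how = steps[1]
            mask = predicate(self._df)
            narrow = self._df.loc[mask, [key, value]]  # copies two columns, not the whole frame
            return narrow.groupby(key)[value].agg(how)
        # Fallback: run the recorded steps eagerly, in order.
        out = self._df
        for step in steps:
            if step[0] == "filter":
                out = out[step[1](out)]
            else:
                _, key, value, how = step
                out = out.groupby(key)[value].agg(how)
        return out
```

For example, `LazyFrame(df).filter(lambda d: d["v"] > 1.5).groupby_agg("k", "v", "sum").compute()` should give the same result as the eager `df[df["v"] > 1.5].groupby("k")["v"].sum()`, and `df` itself is left unmodified.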

@wesm
Owner Author

wesm commented Sep 1, 2016

Yeah, the idea behind an "expression VM" is similar to the design of APL interpreters. This is a bigger topic than this issue, but normal pandas operations would be implemented through the eager evaluation of operators in pandas's internal set of functions. Once you have multiple operators, you can begin to think about optimizing the evaluation or rearranging the query plan. SFrame (RIP?) notably does this:

https://github.com/turi-code/SFrame/tree/master/oss_src/sframe_query_engine
