BLOG: The "Polars vs pandas" difference nobody is talking about #843
@@ -0,0 +1,167 @@
---
title: 'The "Polars vs pandas" difference nobody is talking about'
authors: [marco-gorelli]
published: May 10, 2024
description: 'A closer look at non-elementary group-by aggregations'
category: [PyData ecosystem]
---

# The "Polars vs pandas" difference nobody is talking about

I attended PyData Berlin 2024 this week, and it was a blast! I met so many colleagues, collaborators, and friends.

Review comment: "this week"?

There was quite a lot of talk about Polars - some people even gathered together for a Polars-themed dinner! It's certainly nice to see
people talking about it, and the focus tends to be on features such as:

- lazy execution;
- Rust;
- consistent handling of null values;
- multithreading;
- query optimisation.

Yet there's one innovation which rarely gets a mention: non-elementary group-by aggregations.

I'll start by introducing the group-by operation. We'll then take a look at elementary aggregations
with both the pandas and Polars syntaxes. Finally, we'll look at non-elementary aggregations, and see
how the Polars syntax enables so much more than the pandas one.

## What's a group-by?

Suppose we start with a dataframe such as:

```python
shape: (6, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 4   ┆ 3   │
│ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 2   ┆ 2   │
│ 2   ┆ 7   ┆ 8   │
│ 2   ┆ 6   ┆ 6   │
│ 2   ┆ 7   ┆ 7   │
└─────┴─────┴─────┘
```

A group-by operation produces a single row per group. In Polars:
```python
df.group_by('a').agg('b')
```
```
shape: (2, 2)
┌─────┬───────────┐
│ a   ┆ b         │
│ --- ┆ ---       │
│ i64 ┆ list[i64] │
╞═════╪═══════════╡
│ 2   ┆ [7, 6, 7] │
│ 1   ┆ [4, 1, 2] │
└─────┴───────────┘
```

If we want a single scalar value per group, we can use a reduction ('mean', 'sum', 'std', ...):
```python
df.group_by('a').agg(pl.sum('b'))
```

Review comment: I'm already a little lost. Is this Polars or pandas?

```python
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 20  │
│ 1   ┆ 7   │
└─────┴─────┘
```

### Group-by in pandas

If you're coming from a pandas-like library, you may be used to writing the above example as

```python
df.groupby('a')['b'].sum()
```

For this task, that's a nice API:

- select which columns you're grouping by;
- select the column(s) you want to aggregate;
- specify an elementary aggregation function.

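
The three steps above map one-to-one onto the pandas call. The sketch below spells them out on the example data, assuming pandas is imported as `pd`; the names `pdf`, `grouped`, `selected` and `sums` are chosen here for illustration.

```python
import pandas as pd

# The same six rows as the example dataframe, this time as a pandas frame.
pdf = pd.DataFrame(
    {
        "a": [1, 1, 1, 2, 2, 2],
        "b": [4, 1, 2, 7, 6, 7],
        "c": [3, 1, 2, 8, 6, 7],
    }
)

grouped = pdf.groupby("a")  # 1. select the column to group by
selected = grouped["b"]     # 2. select the column to aggregate
sums = selected.sum()       # 3. apply an elementary aggregation

# One value per group: 7 for a == 1 and 20 for a == 2.
print(sums)
```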

Let's try something else: "find the maximum value of 'c', where 'b' is greater than its mean, per
group 'a'".

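
To pin down exactly what this asks for, here is a plain-Python sketch of the computation that uses no dataframe library at all. The function name `max_c_where_b_above_group_mean` and the `(a, b, c)` row tuples are illustrative only; they are not part of pandas or Polars.

```python
from collections import defaultdict
from statistics import mean

# The example dataframe as (a, b, c) rows.
rows = [(1, 4, 3), (1, 1, 1), (1, 2, 2), (2, 7, 8), (2, 6, 6), (2, 7, 7)]


def max_c_where_b_above_group_mean(rows):
    # Collect the (b, c) pairs belonging to each value of 'a'.
    groups = defaultdict(list)
    for a, b, c in rows:
        groups[a].append((b, c))

    result = {}
    for a, pairs in groups.items():
        # The mean of 'b' is taken within the group, not over the whole frame.
        b_mean = mean(b for b, _ in pairs)
        kept = [c for b, c in pairs if b > b_mean]
        # A group with no qualifying rows has no maximum.
        result[a] = max(kept) if kept else None
    return result


print(max_c_where_b_above_group_mean(rows))  # {1: 3, 2: 8}
```

In group 1 the mean of 'b' is 7/3, so only the row with b = 4 survives and the answer is 3; in group 2 the mean is 20/3, both rows with b = 7 survive, and the answer is 8.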

Unfortunately, the pandas API has no way to express this as a single group-by aggregation, meaning
that no library which copies the pandas API can truly optimise such an
operation in the general case.

We could do:
```python
df.groupby('a').apply(
    lambda df: df[df['b'] > df['b'].mean()]['c'].max()
)
```

However, that uses a Python `lambda` function, which has to be called once per group, and so is generally going to be inefficient.

Another solution could be to do:
```python
df[df['b'] > df.groupby('a')['b'].transform('mean')].groupby('a')['c'].max()
```
This isn't too bad, but it involves doing two group-bys, and so can be around twice as slow as it needs to
be.

Finally, we can rely on `GroupBy` caching its groups, in-place mutation of the original dataframe, and the
fact that `'max'` skips over missing values, to write:
```python
gb = df.groupby("a")
mask = df["b"] > gb["b"].transform("mean")
df["result"] = np.where(mask, df["c"], np.nan)
gb["result"].max()
```
This works, but it took quite a bit of manual optimisation, and is definitely not a general solution.

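
As a sanity check, the first two pandas approaches above can be run side by side on the example data. This is a sketch assuming pandas is imported as `pd`; the names `pdf`, `via_apply` and `via_transform` are chosen here for illustration. Selecting `[["b", "c"]]` before `apply` is an addition that is not in the snippet above: it sidesteps version-dependent handling of the grouping column.

```python
import pandas as pd

pdf = pd.DataFrame(
    {
        "a": [1, 1, 1, 2, 2, 2],
        "b": [4, 1, 2, 7, 6, 7],
        "c": [3, 1, 2, 8, 6, 7],
    }
)

# Approach 1: a Python lambda, evaluated once per group.
via_apply = pdf.groupby("a")[["b", "c"]].apply(
    lambda g: g[g["b"] > g["b"].mean()]["c"].max()
)

# Approach 2: one group-by to build the mask, a second one to aggregate.
mask = pdf["b"] > pdf.groupby("a")["b"].transform("mean")
via_transform = pdf[mask].groupby("a")["c"].max()

# Both give 3 for a == 1 and 8 for a == 2.
print(via_apply)
print(via_transform)
```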

Surely it's possible to do better?

## Non-elementary group-bys in Polars

The Polars API lets us pass expressions to `GroupBy.agg`. So long as you can express your aggregation as
an expression, you can use it in a group-by setting. In this case, we can express "the maximum value
of 'c' where 'b' is greater than its mean" as
```python
pl.col('c').filter(pl.col('b') > pl.mean('b')).max()
```
Then, all we need to do is pass this expression to `GroupBy.agg`:

```python
df.group_by('a').agg(
    pl.col('c').filter(pl.col('b') > pl.mean('b')).max()
)
```
Wonderful! We could express the operation cleanly using the library's own API, meaning that any implementation
of the Polars API (not necessarily Polars itself) can, in principle, evaluate this efficiently.

## Conclusion, and a plea to future dataframe authors

We've learned about the group-by operation, elementary aggregations in both pandas and Polars, and how
Polars' syntax enables optimisation of non-elementary aggregations.

pandas is a wonderful tool which solves a lot of real problems for a lot of real people.
But please, stop copying its API. No matter how much effort you put into
your implementation, if your API is limited and can't express non-elementary group-by aggregations,
then you're going to run into a wall at some point. Good luck reverse-parsing the bytecode of a user's
Python lambda function.

On the other hand, if you innovate on the API side, you can enable new possibilities. There's a reason
that

> I came for the speed, but stayed for the syntax

is a common refrain among Polars users.

There may be a more general lesson here: if you have the courage to do things differently, you may be rewarded.

Review comment: This line doesn't really connect for me with anything else in the blog post, seems to come out of nowhere.

If you'd like to learn about how to use Polars effectively, or how to solve problems in your organisation
using Polars, Quansight is here to help - you can get in touch [here](https://quansight.com/about-us/#bookacallform).

Review comment: I wonder if there's some way we could or should set this off from the rest of the blog post, something like the following (not saying we should do it exactly like this): And now, a small message from our fellow coworkers at Quansight Consulting... If you'd like to learn about how to use Polars effectively, or how to solve problems in your organisation