-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BLOG: The "Polars vs pandas" difference nobody is talking about #843
Draft
MarcoGorelli
wants to merge
5
commits into
Quansight:main
Choose a base branch
from
MarcoGorelli:groupby
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+176
−0
Draft
Changes from all commits
Commits
Show all changes
5 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,176 @@ | ||
--- | ||
title: 'The Polars vs pandas difference nobody is talking about' | ||
authors: [marco-gorelli] | ||
published: May 10, 2024 | ||
description: 'A closer look at non-elementary group-by aggregations' | ||
category: [PyData ecosystem] | ||
--- | ||
|
||
# The Polars vs pandas difference nobody is talking about | ||
|
||
I attended PyData Berlin 2024 in April, and it was a blast! I met so many colleagues, collaborators, and friends. | ||
There was quite some talk of Polars - some people even gathered together for a Polars-themed dinner! | ||
It's certainly nice to see people talking about it, and the focus tends to be on features such as: | ||
|
||
- lazy execution | ||
- Rust | ||
- consistent handling of null values | ||
- multithreading | ||
- query optimisation | ||
|
||
Yet there's one innovation which barely ever gets a mention: non-elementary group-by aggregations. | ||
|
||
I'll start by introducing the group-by operation. We'll then take a look at elementary aggregations | ||
with both the pandas and Polars APIs. Finally, we'll look at non-elementary aggregations, and see | ||
how the Polars API enables so much more than the pandas one. | ||
|
||
## What's a group-by? | ||
|
||
Suppose we start with a (Polars) dataframe such as: | ||
|
||
```python | ||
shape: (6, 3) | ||
┌─────┬─────┬─────┐ | ||
│ a ┆ b ┆ c │ | ||
│ --- ┆ --- ┆ --- │ | ||
│ i64 ┆ i64 ┆ i64 │ | ||
╞═════╪═════╪═════╡ | ||
│ 1 ┆ 4 ┆ 3 │ | ||
│ 1 ┆ 1 ┆ 1 │ | ||
│ 1 ┆ 2 ┆ 2 │ | ||
│ 2 ┆ 7 ┆ 8 │ | ||
│ 2 ┆ 6 ┆ 6 │ | ||
│ 2 ┆ 7 ┆ 7 │ | ||
└─────┴─────┴─────┘ | ||
``` | ||
|
||
A group-by operation produces a single row per group: | ||
```python | ||
df.group_by('a').agg('b') | ||
``` | ||
``` | ||
shape: (2, 2) | ||
┌─────┬───────────┐ | ||
│ a ┆ b │ | ||
│ --- ┆ --- │ | ||
│ i64 ┆ list[i64] │ | ||
╞═════╪═══════════╡ | ||
│ 2 ┆ [7, 6, 7] │ | ||
│ 1 ┆ [4, 1, 2] │ | ||
└─────┴───────────┘ | ||
``` | ||
|
||
If we want a single scalar value per group, we can use a reduction ('mean', 'sum', 'std', ...): | ||
```python | ||
df.group_by('a').agg(pl.sum('b')) | ||
``` | ||
```python | ||
shape: (2, 2) | ||
┌─────┬─────┐ | ||
│ a ┆ b │ | ||
│ --- ┆ --- │ | ||
│ i64 ┆ i64 │ | ||
╞═════╪═════╡ | ||
│ 2 ┆ 20 │ | ||
│ 1 ┆ 7 │ | ||
└─────┴─────┘ | ||
``` | ||
|
||
### Group-by in pandas | ||
|
||
If you're coming from a pandas-like library, you may have been used to writing the above example as | ||
|
||
```python | ||
df.groupby('a')['b'].sum() | ||
``` | ||
|
||
And indeed, for such a simple task as this one, the pandas API is quite nice. All we had to do was: | ||
|
||
1. select which columns we're grouping by | ||
2. select the column(s) we want to aggregate | ||
3. specify an elementary aggregation function | ||
|
||
Let's try something else: "find the maximum value of 'c', where 'b' is greater than its mean, per | ||
group 'a'". | ||
|
||
Unfortunately, the pandas API has no way to express this, meaning | ||
that no library which copies the pandas API can truly optimise such an | ||
operation in the general case. | ||
|
||
You might be wondering whether we can just do the following: | ||
```python | ||
df.groupby('a').apply( | ||
lambda df: df[df['b'] > df['b'].mean()]['c'].max() | ||
) | ||
``` | ||
|
||
However, that uses a Python `lambda` function and so is generally going to be inefficient. | ||
|
||
Another solution one might come up with is this one: | ||
```python | ||
df[df['b'] > df.groupby('a')['b'].transform('mean')].groupby('a')['c'].max() | ||
``` | ||
It's not as bad as the `apply` solution above, but it still looks overly complicated and requires | ||
two group-bys. | ||
|
||
There's actually a third solution, which: | ||
|
||
- relies on `GroupBy` caching its groups | ||
- performs in-place mutation of the original dataframe | ||
- uses the fact that `'max'` skips over missing values | ||
|
||
Realistically, few users would come up with it (most would jump straight to `apply`), but for | ||
completeness, we present it: | ||
```python | ||
gb = df.groupby("a") | ||
mask = df["b"] > gb["b"].transform("mean") | ||
df["result"] = np.where(mask, df["c"], np.nan) | ||
gb["result"].max() | ||
``` | ||
|
||
Surely it's possible to do better? | ||
|
||
## Non-elementary group-bys in Polars | ||
|
||
The Polars API lets us pass [expressions](https://docs.pola.rs/user-guide/concepts/expressions/) to `GroupBy.agg`. | ||
So long as you can express your aggregation as | ||
an expression, you can use it in a group-by setting. In this case, we can express "the maximum value | ||
of 'c' where 'b' is greater than its mean" as | ||
```python | ||
pl.col('c').filter(pl.col('b') > pl.mean('b')).max() | ||
``` | ||
Then, all we need to do is pass this expression to `GroupBy.agg`: | ||
|
||
```python | ||
df.group_by('a').agg( | ||
pl.col('c').filter(pl.col('b') > pl.mean('b')).max() | ||
) | ||
``` | ||
Wonderful! Like this, we can express the operation cleanly and without hacks, meaning that any dataframe | ||
implementation which follows the Polars API has the possibility to evaluate this efficiently. | ||
|
||
## Conclusion, and a plea to future dataframe authors | ||
|
||
We've learned about the group-by operation, elementary aggregations in both pandas and Polars, and how | ||
Polars' syntax enables users to cleanly express non-elementary aggregations. | ||
|
||
pandas is a wonderful tool which solves a lot of real problems for a lot of real people. | ||
However, by insisting on copying its API, the dataframe landscape is stunting its own potential. | ||
If an API doesn't allow for an operation to be expressed and users end up using `apply` with custom | ||
Python lambda functions, then no amount of acceleration is going to make up for that. | ||
|
||
On the other hand, innovation on the API side can enable new possibilities. There's a reason | ||
that | ||
|
||
> I came for the speed, but stayed for the syntax | ||
|
||
is a common refrain among Polars users. | ||
|
||
For years, the unwritten consensus among dataframe library authors was that if you wanted to | ||
get adoption, then you needed to mimic the pandas API. However, Polars' success calls that into question. | ||
There may be a more general lesson to be learned here: try doing things differently, and then maybe - just | ||
maybe - you might be rewarded. | ||
|
||
If you'd like to learn about how to use Polars effectively, or how to solve problems in your organisation | ||
using Polars, Quansight is here to help - [please get in touch](https://quansight.com/about-us/#bookacallform) - | ||
we'd love to hear from you. |
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm already a little lost. Is this Polars or pandas?