---
title: 'The "Polars vs pandas" difference nobody is talking about'
authors: [marco-gorelli]
published: May 10, 2024
description: 'A closer look at non-elementary group-by aggregations'
category: [PyData ecosystem]
---

# The "Polars vs pandas" difference nobody is talking about

I recently attended PyData Berlin 2024, and it was a blast! I met so many colleagues, collaborators, and friends.

There was quite some talk of Polars - some people even gathered together for a Polars-themed dinner! It's certainly nice to see
people talking about it, and the focus tends to be on features such as:

- lazy execution;
- Rust;
- consistent handling of null values;
- multithreading;
- query optimisation.

Yet there's one innovation which barely ever gets a mention: non-elementary group-by aggregations.

I'll start by introducing the group-by operation. We'll then take a look at elementary aggregations
with both the pandas and Polars syntaxes. Finally, we'll look at non-elementary aggregations, and see
how the Polars syntax enables so much more than the pandas one.

## What's a group-by?

Suppose we start with a dataframe such as the following (shown here with Polars - we'll get to the pandas syntax shortly):

```
shape: (6, 3)
┌─────┬─────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1 ┆ 4 ┆ 3 │
│ 1 ┆ 1 ┆ 1 │
│ 1 ┆ 2 ┆ 2 │
│ 2 ┆ 7 ┆ 8 │
│ 2 ┆ 6 ┆ 6 │
│ 2 ┆ 7 ┆ 7 │
└─────┴─────┴─────┘
```
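
If you'd like to follow along, here's a minimal sketch of how a dataframe like this could be constructed (assuming you have Polars installed; the values are taken from the table above):

```python
import polars as pl

df = pl.DataFrame(
    {
        "a": [1, 1, 1, 2, 2, 2],
        "b": [4, 1, 2, 7, 6, 7],
        "c": [3, 1, 2, 8, 6, 7],
    }
)
```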

A group-by operation produces a single row per group:
```python
df.group_by('a').agg('b')
```
```
shape: (2, 2)
┌─────┬───────────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ list[i64] │
╞═════╪═══════════╡
│ 2 ┆ [7, 6, 7] │
│ 1 ┆ [4, 1, 2] │
└─────┴───────────┘
```

If we want a single scalar value per group, we can use a reduction ('mean', 'sum', 'std', ...):
```python
df.group_by('a').agg(pl.sum('b'))
```
```
shape: (2, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2 ┆ 20 │
│ 1 ┆ 7 │
└─────┴─────┘
```

### Group-by in pandas

If you're coming from a pandas-like library, you may be used to writing the above example as

```python
df.groupby('a')['b'].sum()
```

For this task, that's a nice API:

- select which columns you're grouping by;
- select the column(s) you want to aggregate;
- specify an elementary aggregation function.
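
To follow along with the pandas snippets, here's one way the same data could be set up as a pandas dataframe (a sketch - the values mirror the table above):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1, 1, 1, 2, 2, 2],
        "b": [4, 1, 2, 7, 6, 7],
        "c": [3, 1, 2, 8, 6, 7],
    }
)

df.groupby('a')['b'].sum()  # expected: 7 for group 1, 20 for group 2
```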

Let's try something else: "find the maximum value of 'c', where 'b' is greater than its mean, per
group 'a'".

Unfortunately, the pandas API has no way to express this directly, meaning
that no library which copies pandas can truly optimise such an
operation in the general case.

We could do
```python
df.groupby('a').apply(
lambda df: df[df['b'] > df['b'].mean()]['c'].max()
)
```

However, that uses a Python `lambda` function and so is generally going to be inefficient.

Another solution could be to do:
```python
df[df['b'] > df.groupby('a')['b'].transform('mean')].groupby('a')['c'].max()
```
This isn't too bad, but it involves doing two group-bys, and so is at least twice as slow as it could
be.

Finally, we can rely on `GroupBy` caching its groups, in-place mutation of the original dataframe, and the
fact that `'max'` skips over missing values, to write:
```python
gb = df.groupby("a")
mask = df["b"] > gb["b"].transform("mean")
df["result"] = np.where(mask, df["c"], np.nan)
gb["result"].max()
```
This works, but it requires quite some manual optimisation, and is definitely not a general solution.

Surely it's possible to do better?

## Non-elementary group-bys in Polars

The Polars API lets us pass expressions to `GroupBy.agg`. So long as you can express your aggregation as
an expression, you can use it in a group-by setting. In this case, we can express "the maximum value
of 'c' where 'b' is greater than its mean" as
```python
pl.col('c').filter(pl.col('b') > pl.mean('b')).max()
```
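
A nice property of expressions is that they aren't tied to the group-by context. As a quick sketch (using the example dataframe from above), the very same expression can be evaluated over the whole frame with `select`:

```python
df.select(pl.col('c').filter(pl.col('b') > pl.mean('b')).max())
# over the full frame: mean(b) = 4.5, so the rows with b = 7, 6, 7 remain, and max(c) = 8
```
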
Then, all we need to do is pass this expression to `GroupBy.agg`:

```python
df.group_by('a').agg(
pl.col('c').filter(pl.col('b') > pl.mean('b')).max()
)
```
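
For the example dataframe above, this should produce something like the following (group order isn't guaranteed): in group 1 only the row with b=4 exceeds the group's mean of 'b', so the max of 'c' is 3; in group 2 the two rows with b=7 do, so the max of 'c' is 8.

```
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ c   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 3   │
│ 2   ┆ 8   │
└─────┴─────┘
```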
Wonderful! We could express the operation cleanly using the library's own API, meaning that any implementation
of the Polars API (not necessarily Polars itself) can, at least in principle, evaluate it efficiently.

## Conclusion, and a plea to future dataframe authors

We've learned about the group-by operation, elementary aggregations in both pandas and Polars, and how
Polars' syntax enables optimisation of non-elementary aggregations.

pandas is a wonderful tool which solves a lot of real problems for a lot of real people.
But please, stop copying its API. No matter how much effort you put into
your implementation, if your API is limited and can't express non-elementary group-by aggregations,
then you're going to run into a wall at some point. Good luck reverse-parsing the bytecode of a user's
Python lambda function.

On the other hand, if you innovate on the API side, you can enable new possibilities. There's a reason
that

> I came for the speed, but stayed for the syntax

is a common refrain among Polars users.

There may be a more general lesson here: if, as Polars did with its API, you have the courage to do things differently, you may be rewarded.

If you'd like to learn about how to use Polars effectively, or how to solve problems in your organisation
using Polars, Quansight is here to help - you can get in touch [here](https://quansight.com/about-us/#bookacallform).