---
title: 'The Polars vs pandas difference nobody is talking about'
authors: [marco-gorelli]
published: May 10, 2024
description: 'A closer look at non-elementary group-by aggregations'
category: [PyData ecosystem]
---

# The Polars vs pandas difference nobody is talking about

I attended PyData Berlin 2024 in April, and it was a blast! I met so many colleagues, collaborators, and friends.
There was quite a lot of talk about Polars - some people even gathered for a Polars-themed dinner!
It's certainly nice to see people talking about it, and the focus tends to be on features such as:

- lazy execution
- Rust
- consistent handling of null values
- multithreading
- query optimisation

Yet there's one innovation which barely ever gets a mention: non-elementary group-by aggregations.

I'll start by introducing the group-by operation. We'll then take a look at elementary aggregations
with both the pandas and Polars APIs. Finally, we'll look at non-elementary aggregations, and see
how the Polars API enables so much more than the pandas one.

## What's a group-by?

Suppose we start with a (Polars) dataframe such as:

```
shape: (6, 3)
┌─────┬─────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1 ┆ 4 ┆ 3 │
│ 1 ┆ 1 ┆ 1 │
│ 1 ┆ 2 ┆ 2 │
│ 2 ┆ 7 ┆ 8 │
│ 2 ┆ 6 ┆ 6 │
│ 2 ┆ 7 ┆ 7 │
└─────┴─────┴─────┘
```
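
If you'd like to follow along, here's a minimal sketch that constructs this exact dataframe:

```python
import polars as pl

df = pl.DataFrame(
    {
        "a": [1, 1, 1, 2, 2, 2],
        "b": [4, 1, 2, 7, 6, 7],
        "c": [3, 1, 2, 8, 6, 7],
    }
)
```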

A group-by operation produces a single row per group:
```python
df.group_by('a').agg('b')
```
```
shape: (2, 2)
┌─────┬───────────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ list[i64] │
╞═════╪═══════════╡
│ 2 ┆ [7, 6, 7] │
│ 1 ┆ [4, 1, 2] │
└─────┴───────────┘
```

If we want a single scalar value per group, we can use a reduction ('mean', 'sum', 'std', ...):
```python
df.group_by('a').agg(pl.sum('b'))
```
```
shape: (2, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2 ┆ 20 │
│ 1 ┆ 7 │
└─────┴─────┘
```

### Group-by in pandas

If you're coming from a pandas-like library, you may be used to writing the above example as

```python
df.groupby('a')['b'].sum()
```
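
If you want to run this yourself, here's a minimal sketch of the same data as a pandas dataframe (mirroring the Polars one above):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1, 1, 1, 2, 2, 2],
        "b": [4, 1, 2, 7, 6, 7],
        "c": [3, 1, 2, 8, 6, 7],
    }
)
```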

And indeed, for a task as simple as this, the pandas API is quite nice. All we had to do was:

1. select which columns we're grouping by
2. select the column(s) we want to aggregate
3. specify an elementary aggregation function

Let's try something else: "find the maximum value of 'c', where 'b' is greater than its mean, per
group 'a'".

Unfortunately, the pandas API has no way to express this, meaning
that no library which copies the pandas API can truly optimise such an
operation in the general case.

You might be wondering whether we can just do the following:
```python
df.groupby('a').apply(
    lambda df: df[df['b'] > df['b'].mean()]['c'].max()
)
```

However, that uses a Python `lambda` function and so is generally going to be inefficient.

Another solution one might come up with is this one:
```python
df[df['b'] > df.groupby('a')['b'].transform('mean')].groupby('a')['c'].max()
```
It's not as bad as the `apply` solution above, but it still looks overly complicated and requires
two group-bys.

There's actually a third solution, which:

- relies on `GroupBy` caching its groups
- performs in-place mutation of the original dataframe
- uses the fact that `'max'` skips over missing values

Realistically, few users would come up with it (most would jump straight to `apply`), but for
completeness, we present it:
```python
import numpy as np

gb = df.groupby("a")
mask = df["b"] > gb["b"].transform("mean")
# Rows failing the mask become NaN, which 'max' then skips over.
df["result"] = np.where(mask, df["c"], np.nan)
gb["result"].max()
```
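
On the sample dataframe above, this should produce 3.0 for group 1 and 8.0 for group 2 (floats, because the NaN masking upcasts the column):

```
a
1    3.0
2    8.0
Name: result, dtype: float64
```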

Surely it's possible to do better?

## Non-elementary group-bys in Polars

The Polars API lets us pass [expressions](https://docs.pola.rs/user-guide/concepts/expressions/) to `GroupBy.agg`.
So long as you can express your aggregation as
an expression, you can use it in a group-by setting. In this case, we can express "the maximum value
of 'c' where 'b' is greater than its mean" as
```python
pl.col('c').filter(pl.col('b') > pl.mean('b')).max()
```
Then, all we need to do is pass this expression to `GroupBy.agg`:

```python
df.group_by('a').agg(
    pl.col('c').filter(pl.col('b') > pl.mean('b')).max()
)
```
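On the dataframe above, this should return the following (modulo row order, since `group_by` doesn't guarantee it):
```
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ c   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 3   │
│ 2   ┆ 8   │
└─────┴─────┘
```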
Wonderful! Like this, we can express the operation cleanly and without hacks, meaning that any dataframe
implementation which follows the Polars API can, at least in principle, evaluate it efficiently.

## Conclusion, and a plea to future dataframe authors

We've learned about the group-by operation, elementary aggregations in both pandas and Polars, and how
Polars' syntax enables users to cleanly express non-elementary aggregations.

pandas is a wonderful tool which solves a lot of real problems for a lot of real people.
However, by insisting on copying its API, the dataframe landscape is stunting its own potential.
If an API doesn't allow for an operation to be expressed and users end up using `apply` with custom
Python lambda functions, then no amount of acceleration is going to make up for that.

On the other hand, innovation on the API side can enable new possibilities. There's a reason
that

> I came for the speed, but stayed for the syntax

is a common refrain among Polars users.

For years, the unwritten consensus among dataframe library authors was that if you wanted to
get adoption, then you needed to mimic the pandas API. However, Polars' success calls that into question.
There may be a more general lesson to be learned here: try doing things differently, and then maybe - just
maybe - you might be rewarded.

If you'd like to learn about how to use Polars effectively, or how to solve problems in your organisation
using Polars, Quansight is here to help - [please get in touch](https://quansight.com/about-us/#bookacallform) -
we'd love to hear from you.