improve structure
MarcoGorelli committed May 10, 2024
1 parent 4732c8e commit ff7d28d
Showing 1 changed file with 46 additions and 42 deletions.
88 changes: 46 additions & 42 deletions apps/labs/posts/dataframe-group-by.md
@@ -1,29 +1,28 @@
---
title: 'The Polars innovation nobody is talking about'
title: 'The "Polars vs pandas" difference nobody is talking about'
authors: [marco-gorelli]
published: April 26, 2024
published: May 10, 2024
description: 'A closer look at non-elementary group-by aggregations'
category: [PyData ecosystem]
---

# The Polars innovation nobody is talking about
# The "Polars vs pandas" difference nobody is talking about

I attended PyData Berlin 2024 this week, and was pleased to see so much talk of Polars. Some people
even gathered together for a Polars-themed dinner! There's a lot of advantages people bring up
when talking about Polars:
I attended PyData Berlin 2024 this week, and it was a blast! I met so many colleagues, collaborators, and friends.
There was quite some talk of Polars - some people even gathered together for a Polars-themed dinner! It's certainly nice to see
people talking about it, and the focus tends to be on features such as:

- lazy execution
- Rust
- consistent handling of null values
- multithreading
- query optimisation
- lazy execution;
- Rust;
- consistent handling of null values;
- multithreading;
- query optimisation.

yet there's one innovation which barely ever gets a mention: non-elementary group-by aggregations.
Let's give it the attention it deserves.
Yet there's one innovation which barely ever gets a mention: non-elementary group-by aggregations.

We'll start by introducing the group-by operation. We'll then take a look at elementary aggregations
with both the pandas and Polars syntaxes. Finally, we'll see how the Polars syntax enables
non-elementary group-by aggregations.
I'll start by introducing the group-by operation. We'll then take a look at elementary aggregations
with both the pandas and Polars syntaxes. Finally, we'll look at non-elementary aggregations, and see
how the Polars syntax enables so much more than the pandas one.

## What's a group-by?

@@ -45,11 +44,10 @@ shape: (6, 3)
└─────┴─────┴─────┘
```

A group-by operation results in a single row per group. For example, if we do
A group-by operation produces a single row per group:
```python
df.group_by('a').agg('b')
```
we end up with
```
shape: (2, 2)
┌─────┬───────────┐
@@ -78,26 +76,26 @@ shape: (2, 2)
└─────┴─────┘
```

## Group-by in pandas
### Group-by in pandas

If you're coming from a pandas-like library, you may have been used to writing the above example as

```python
df.groupby('a')['b'].sum()
```

That's a solid API:
For this task, that's a nice API:

- select which columns you're grouping by;
- select the column(s) you want to aggregate;
- specify an elementary aggregation function.

Let's try something else: "find the maximum value of 'c', where 'b' is greater than its mean, per
group 'a'. How do we express this with the pandas API?
group 'a'".
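
To pin down exactly what that query should return, independently of any dataframe library, here is a minimal plain-Python sketch. The sample rows and the helper name `max_c_where_b_above_group_mean` are made up for illustration and are not the data used elsewhere in this post.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample data: one dict per row, with columns 'a', 'b' and 'c'.
rows = [
    {"a": 1, "b": 4, "c": 7},
    {"a": 1, "b": 5, "c": 1},
    {"a": 1, "b": 6, "c": 3},
    {"a": 2, "b": 7, "c": 2},
    {"a": 2, "b": 8, "c": 8},
    {"a": 2, "b": 9, "c": 5},
]


def max_c_where_b_above_group_mean(rows):
    """For each group 'a', return the max of 'c' over rows where 'b' > mean('b')."""
    # Step 1: split the rows into groups keyed by 'a'.
    groups = defaultdict(list)
    for row in rows:
        groups[row["a"]].append(row)

    result = {}
    for key, members in groups.items():
        # Step 2: the mean of 'b' is computed within the group, not globally.
        b_mean = mean(r["b"] for r in members)
        # Step 3: keep 'c' only where 'b' exceeds that per-group mean.
        kept = [r["c"] for r in members if r["b"] > b_mean]
        # Step 4: aggregate; a group with no surviving rows has no maximum.
        result[key] = max(kept) if kept else None
    return result


result = max_c_where_b_above_group_mean(rows)
print(result)
```

Note that the filter and the aggregation both happen inside each group, in a single pass over the groups. That is the property a dataframe API needs to be able to express.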

I have no idea. Here's all I can come up with:

### 1. Use a user-defined function
Unfortunately, the pandas API has no way to express this, meaning
that no library which copies pandas can truly optimise such an
operation in the general case.

We could do
```python
@@ -106,49 +104,52 @@ df.groupby('a').apply(
)
```

However, if there's one lesson you should learn about working with dataframes, it's that any time
you find yourself writing
```python
.apply(lambda
```
you're probably "shooting yourself in the foot". It's only intended as a last-resort, and isn't
something which implementation can easily parse and optimise.

### 2. Perform multiple group-bys
However, that uses a Python `lambda` function and so is generally going to be inefficient.

Another solution I can think of is
Another solution could be to do:
```python
df[df['b'] > df.groupby('a')['b'].transform('mean')].groupby('a')['c'].max()
```
this isn't too bad, but it involves doing two group-bys, and so is at least twice as slow as it could
This isn't too bad, but it involves doing two group-bys, and so is at least twice as slow as it could
be.

Can we do better?
Finally, we can rely on `GroupBy` caching its groups, in-place mutation of the original dataframe, and the
fact that `'max'` skips over missing values, to write:
```python
gb = df.groupby("a")
mask = df["b"] > gb["b"].transform("mean")
df["result"] = np.where(mask, df["c"], np.nan)
gb["result"].max()
```
This works, but it involved quite some manual optimisation, and is definitely not a general solution.

Surely it's possible to do better?

## Non-elementary group-bys in Polars

Yes we can! The Polars API lets us pass expressions to `GroupBy.agg`. We can express "the maximum value
The Polars API lets us pass expressions to `GroupBy.agg`. So long as you can express your aggregation as
an expression, you can use it in a group-by setting. In this case, we can express "the maximum value
of 'c' where 'b' is greater than its mean" as
```python
pl.col('c').filter(pl.col('b') > pl.mean('b')).max()
```
We can then insert it into `GroupBy.agg` and get
Then, all we need to do is pass this expression to `GroupBy.agg`:

```python
df.group_by('a').agg(
    pl.col('c').filter(pl.col('b') > pl.mean('b')).max()
)
```
Wonderful! If there's a syntax which can express this operation, then an implementation can optimise
it.
Wonderful! We could express the operation cleanly using the library's own API, meaning that any implementation
of the Polars API (not necessarily Polars itself) has the possibility to evaluate this efficiently.

## Conclusion, and a plea to future dataframe authors

We've learned about the group-by operation, elementary aggregations in both pandas and Polars, and how
Polars' syntax enables optimisation of non-elementary aggregations.

pandas is a wonderful tool which solves a lot of real problems for a lot of real people.
But please, stop copying its API. Not matter how much effort you put into
But please, stop copying its API. No matter how much effort you put into
your implementation, if your API is limited and can't express non-elementary group-by aggregations,
then you're going to run into a wall at some point. Good luck reverse-parsing the bytecode of a user's
Python lambda function.
@@ -160,4 +161,7 @@ that
is a common refrain among Polars users.

There may be a more general lesson here: have the courage to do things differently.
There may be a more general lesson here: if you have the courage to do things differently, you may be rewarded.

If you'd like to learn about how to use Polars effectively, or how to solve problems in your organisation
using Polars, Quansight is here to help - you can get in touch [here](https://quansight.com/about-us/#bookacallform).
