improve structure

Quansight · May 10, 2024 · ff7d28d
1 parent 4732c8e
commit ff7d28d
Showing 1 changed file with 46 additions and 42 deletions.
diff --git a/apps/labs/posts/dataframe-group-by.md b/apps/labs/posts/dataframe-group-by.md
@@ -1,29 +1,28 @@
 ---
-title: 'The Polars innovation nobody is talking about'
+title: 'The "Polars vs pandas" difference nobody is talking about'
 authors: [marco-gorelli]
-published: April 26, 2024
+published: May 10, 2024
 description: 'A closer look at non-elementary group-by aggregations'
 category: [PyData ecosystem]
 ---
 
-# The Polars innovation nobody is talking about
+# The "Polars vs pandas" difference nobody is talking about
 
-I attended PyData Berlin 2024 this week, and was pleased to see so much talk of Polars. Some people
-even gathered together for a Polars-themed dinner! There's a lot of advantages people bring up
-when talking about Polars:
+I attended PyData Berlin 2024 this week, and it was a blast! I met so many colleagues, collaborators, and friends.
+There was quite some talk of Polars - some people even gathered together for a Polars-themed dinner! It's certainly nice to see
+people talking about it, and the focus tends to be on features such as:
 
-- lazy execution
-- Rust
-- consistent handling of null values
-- multithreading
-- query optimisation
+- lazy execution;
+- Rust;
+- consistent handling of null values;
+- multithreading;
+- query optimisation.
 
-yet there's one innovation which barely ever gets a mention: non-elementary group-by aggregations.
-Let's give it the attention it deserves.
+Yet there's one innovation which barely ever gets a mention: non-elementary group-by aggregations.
 
-We'll start by introducing the group-by operation. We'll then take a look at elementary aggregations
-with both the pandas and Polars syntaxes. Finally, we'll see how the Polars syntax enables
-non-elementary group-by aggregations.
+I'll start by introducing the group-by operation. We'll then take a look at elementary aggregations
+with both the pandas and Polars syntaxes. Finally, we'll look at non-elementary aggregations, and see
+how the Polars syntax enables so much more than the pandas one.
 
 ## What's a group-by?
 
@@ -45,11 +44,10 @@ shape: (6, 3)
 └─────┴─────┴─────┘
 ```
 
-A group-by operation results in a single row per group. For example, if we do
+A group-by operation produces a single row per group:
 ```python
 df.group_by('a').agg('b')
 ```
-we end up with
 ```
 shape: (2, 2)
 ┌─────┬───────────┐
@@ -78,26 +76,26 @@ shape: (2, 2)
 └─────┴─────┘
 ```
 
-## Group-by in pandas
+### Group-by in pandas
 
 If you're coming from a pandas-like library, you may be used to writing the above example as
 
 ```python
 df.groupby('a')['b'].sum()
 ```
 
-That's a solid API:
+For this task, that's a nice API:
 
 - select which columns you're grouping by;
 - select the column(s) you want to aggregate;
 - specify an elementary aggregation function.
 
 Let's try something else: "find the maximum value of 'c', where 'b' is greater than its mean, per
-group 'a'. How do we express this with the pandas API?
+group 'a'".
 
-I have no idea. Here's all I can come up with:
-
-### 1. Use a user-defined function
+Unfortunately, the pandas API has no way to express this, meaning
+that no library which copies pandas can truly optimise such an
+operation in the general case.
 
 We could do
 ```python
@@ -106,49 +104,52 @@ df.groupby('a').apply(
 )
 ```
 
-However, if there's one lesson you should learn about working with dataframes, it's that any time
-you find yourself writing
-```python
-.apply(lambda
-```
-you're probably "shooting yourself in the foot". It's only intended as a last-resort, and isn't
-something which implementation can easily parse and optimise.
-
-### 2. Perform multiple group-bys
+However, that uses a Python `lambda` function and so is generally going to be inefficient.
 
-Another solution I can think of is
+Another solution could be to do:
 ```python
 df[df['b'] > df.groupby('a')['b'].transform('mean')].groupby('a')['c'].max()
 ```
-this isn't too bad, but it involves doing two group-bys, and so is at least twice as slow as it could
+This isn't too bad, but it involves doing two group-bys, and so is at least twice as slow as it could
 be.
 
-Can we do better?
+Finally, we can rely on `GroupBy` caching its groups, in-place mutation of the original dataframe, and the
+fact that `'max'` skips over missing values, to write:
+```python
+gb = df.groupby("a")
+mask = df["b"] > gb["b"].transform("mean")
+df["result"] = np.where(mask, df["c"], np.nan)
+gb["result"].max()
+```
+This works, but it involved quite some manual optimisation, and is definitely not a general solution.
+
+Surely it's possible to do better?
 
 ## Non-elementary group-bys in Polars
 
-Yes we can! The Polars API lets us pass expressions to `GroupBy.agg`. We can express "the maximum value
+The Polars API lets us pass expressions to `GroupBy.agg`. So long as you can express your aggregation as
+an expression, you can use it in a group-by setting. In this case, we can express "the maximum value
 of 'c' where 'b' is greater than its mean" as
 ```python
 pl.col('c').filter(pl.col('b') > pl.mean('b')).max()
 ```
-We can then insert it into `GroupBy.agg` and get
+Then, all we need to do is pass this expression to `GroupBy.agg`:
 
 ```python
 df.group_by('a').agg(
     pl.col('c').filter(pl.col('b') > pl.mean('b')).max()
 )
 ```
-Wonderful! If there's a syntax which can express this operation, then an implementation can optimise
-it.
+Wonderful! We could express the operation cleanly using the library's own API, meaning that any implementation
+of the Polars API (not necessarily Polars itself) has the possibility to evaluate this efficiently.
 
 ## Conclusion, and a plea to future dataframe authors
 
 We've learned about the group-by operation, elementary aggregations in both pandas and Polars, and how
 Polars' syntax enables optimisation of non-elementary aggregations.
 
 pandas is a wonderful tool which solves a lot of real problems for a lot of real people.
-But please, stop copying its API. Not matter how much effort you put into
+But please, stop copying its API. No matter how much effort you put into
 your implementation, if your API is limited and can't express non-elementary group-by aggregations,
 then you're going to run into a wall at some point. Good luck reverse-parsing the bytecode of a user's
 Python lambda function.
@@ -160,4 +161,7 @@ that
 
 is a common refrain among Polars users.
 
-There may be a more general lesson here: have the courage to do things differently.
+There may be a more general lesson here: if you have the courage to do things differently, you may be rewarded.
+
+If you'd like to learn about how to use Polars effectively, or how to solve problems in your organisation
+using Polars, Quansight is here to help - you can get in touch [here](https://quansight.com/about-us/#bookacallform).
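
To pin down what the non-elementary aggregation in this post computes ("the maximum value of 'c', where 'b' is greater than its mean, per group 'a'"), here is a minimal plain-Python reference sketch. It uses neither pandas nor Polars. The rows and the function name `max_c_where_b_above_group_mean` are hypothetical and chosen for illustration, and returning `None` for a group with no qualifying rows is an assumption of this sketch, not a claim about either library's behaviour.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows of (a, b, c); this is not the dataframe from the post.
rows = [
    (1, 4, 7),
    (1, 1, 2),
    (1, 5, 3),
    (2, 2, 9),
    (2, 6, 4),
    (2, 3, 1),
]


def max_c_where_b_above_group_mean(rows):
    """Per group 'a': max of 'c' over rows whose 'b' exceeds that group's mean 'b'."""
    # Collect the (b, c) pairs belonging to each value of 'a'.
    groups = defaultdict(list)
    for a, b, c in rows:
        groups[a].append((b, c))

    result = {}
    for a, pairs in groups.items():
        b_mean = mean(b for b, _ in pairs)
        # Keep only the 'c' values whose 'b' is strictly above the group mean.
        kept = [c for b, c in pairs if b > b_mean]
        # Assumption of this sketch: None marks a group with nothing left.
        result[a] = max(kept) if kept else None
    return result


result = max_c_where_b_above_group_mean(rows)
print(result)  # {1: 7, 2: 4}
```

The point to notice is that the group mean, the filter, and the maximum are all computed in a single pass over each group, which is the work an implementation is free to do when the whole aggregation is given to it as one expression.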