feat: Add `cum_count` expression function #13478

stinodego · 2024-01-06T06:45:20Z

When used without arguments, this is basically just syntactic sugar for int_range(1, count()+1, 1, dtype=IndexDtype).
I added this to the Rust side as we have no notion of IndexType in Python and I think it's nice to have in Rust anyway.

With arguments, it's syntactic sugar for pl.col(...).cum_count(). It makes sense to have this alongside the other 'shortcut' functions we already have.
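
As a rough illustration of why the end bound is count()+1: int_range excludes its end value, so numbering rows from 1 needs one past the row count. The sketch below is a plain-Python model, not the Polars implementation; the helper names int_range and cum_count_no_args are hypothetical stand-ins.

```python
# Plain-Python model only; these helpers are illustrative, not Polars code.

def int_range(start, end, step=1):
    """Half-open integer range [start, end), modeled with Python's range."""
    return list(range(start, end, step))


def cum_count_no_args(n_rows):
    """Model of the argument-free form: row numbers 1..n_rows inclusive."""
    # The end bound is exclusive, hence n_rows + 1.
    return int_range(1, n_rows + 1, 1)


print(cum_count_no_args(3))  # [1, 2, 3]
print(int_range(1, 3))       # [1, 2] (without the +1, the last row is lost)
```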

crates/polars-plan/src/dsl/mod.rs

cmdlineluser · 2024-01-06T13:48:07Z

Perhaps it could also be a good time to deprecate the reverse= named arg.

pl.cum_count(reverse=True)
pl.cum_count().reverse()

(Expr.reverse() did not exist when .cumcount() was initially added - hence the need for the named arg at the time.)

stinodego · 2024-01-06T23:05:18Z

> Perhaps it could also be a good time to deprecate the reverse= named arg.

I'm not entirely sure. All cum_* functions have a reverse option, so we might as well include it in this one.

While the parameter is technically redundant, without it you have to write df.select(pl.col("a").reverse().cum_sum().reverse()), which feels bad.
For the cum_count function, a single reverse will do the trick. But let's be consistent and keep the reverse param for all cumulative functions.
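
To illustrate the point about needing two reversals for cum_sum but only one for the row-number form of cum_count, here is a plain-Python sketch. It models the behaviour with lists and is not the Polars implementation; cum_sum and row_count are hypothetical helper names.

```python
from itertools import accumulate

# Plain-Python model only; these helpers are illustrative, not Polars code.

def cum_sum(values, reverse=False):
    """Running sum; with reverse=True, accumulate from the last element."""
    if reverse:
        # reverse -> accumulate -> reverse back
        return list(accumulate(reversed(values)))[::-1]
    return list(accumulate(values))


def row_count(n_rows, reverse=False):
    """Model of the argument-free cum_count: 1..n, or n..1 when reversed."""
    numbers = list(range(1, n_rows + 1))
    return numbers[::-1] if reverse else numbers


a = [1, 2, 3]
print(cum_sum(a, reverse=True))  # [6, 5, 3]
print(cum_sum(a)[::-1])          # [6, 3, 1] (a single reverse is not enough)
print(row_count(3, reverse=True) == row_count(3)[::-1])  # True
```

Note that the single-reverse shortcut holds for the row-number form; a count that skips nulls does not in general equal the reversed forward count.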

rben01 · 2024-01-12T19:33:59Z

This counts nulls as +1? Seems confusing when df.count() skips over nulls; I would expect the last element in the column represented by pl.col("x").cum_count() to be the same as df.count().get_column("x"). Like this:

┌──────┬───────────┬─────────┐
│ a    ┆ cum_count ┆ cum_len │
│ ---  ┆ ---       ┆ ---     │
│ i64  ┆ i64       ┆ i64     │
╞══════╪═══════════╪═════════╡
│ 1    ┆ 1         ┆ 1       │
│ null ┆ 1         ┆ 2       │
│ 3    ┆ 2         ┆ 3       │
└──────┴───────────┴─────────┘

Not sure if cum_len is a good name — other options include numbered, row_number, row_index1 (or just skip it altogether and use expr.with_row_index() + 1) — but I would definitely consider cum_count misleading.

stinodego · 2024-01-12T20:40:53Z

I don't know what you mean. It functions exactly as you say it does - we made it match df.count(). Both ignore null values now.

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "b": [4, None, 6]})

print(df.count())
print(df.select(pl.col("a", "b").cum_count()))

shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞═════╪═════╡
│ 3   ┆ 2   │
└─────┴─────┘
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞═════╪═════╡
│ 1   ┆ 1   │
│ 2   ┆ 1   │
│ 3   ┆ 2   │
└─────┴─────┘

rben01 · 2024-01-12T21:43:12Z

Oh, sorry, never mind. I think I got confused by the int_range(1, count()+1, 1, dtype=IndexDtype) part unconditionally returning row numbers. (And by the fact that elsewhere in polars, such as group_by, count means number of rows, not number of non-nulls.)
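
For readers following the distinction, the two semantics discussed above can be sketched in plain Python. This is only a model of the behaviour described in the thread, not Polars code; cum_len is the hypothetical name floated earlier, and None stands in for null.

```python
# Plain-Python model only; None stands in for a null value.

def cum_count(values):
    """Running count of non-null values (nulls do not increment the count)."""
    out, seen = [], 0
    for v in values:
        if v is not None:
            seen += 1
        out.append(seen)
    return out


def cum_len(values):
    """Unconditional row numbers 1..n, regardless of nulls (hypothetical name)."""
    return list(range(1, len(values) + 1))


a = [1, None, 3]
print(cum_count(a))  # [1, 1, 2]
print(cum_len(a))    # [1, 2, 3]

# The last cumulative count equals the number of non-null values.
print(cum_count(a)[-1] == sum(v is not None for v in a))  # True
```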

stinodego added 2 commits January 6, 2024 07:17

Add Rust side

4557dc4

Add tests

9afb347

github-actions bot added the enhancement, python, and rust labels Jan 6, 2024

Implement Python side

6da61bd

stinodego force-pushed the cum-count branch from 25a9c33 to 6da61bd Compare January 6, 2024 06:48

stinodego marked this pull request as ready for review January 6, 2024 06:49

stinodego requested review from ritchie46, c-peters, alexander-beedie, MarcoGorelli and orlp as code owners January 6, 2024 06:49

stinodego changed the title ~~feat: Add cum_count expression~~ feat: Add cum_count expression function Jan 6, 2024

stinodego marked this pull request as draft January 6, 2024 07:48

WIP

10ecb0c

ritchie46 reviewed Jan 6, 2024

crates/polars-plan/src/dsl/mod.rs (outdated, resolved)

stinodego force-pushed the cum-count branch from db8b7ce to 2f876b7 Compare January 6, 2024 23:45

Add back reverse param

69c92a6

stinodego force-pushed the cum-count branch from 2f876b7 to 69c92a6 Compare January 6, 2024 23:56

stinodego marked this pull request as ready for review January 7, 2024 16:24

ritchie46 approved these changes Jan 8, 2024

ritchie46 merged commit 5e94252 into main Jan 8, 2024
26 checks passed

ritchie46 deleted the cum-count branch January 8, 2024 07:15

stinodego mentioned this pull request Jan 8, 2024

Add pl.row_num() as syntactic sugar for pl.int_range(0, pl.count()) #12420

Closed

c-peters added the accepted label Jan 14, 2024

c-peters assigned stinodego Jan 14, 2024
