
Add pl.row_num() as syntactic sugar for pl.int_range(0, pl.count()) #12420

Closed
mcrumiller opened this issue Nov 13, 2023 · 31 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@mcrumiller
Contributor

mcrumiller commented Nov 13, 2023

Description

Having a quick expression to grab an integer-based row number is often requested and would be pretty useful. Instead of the usual options:

df.with_row_count("#").select(pl.col("#") < ...).drop("#")  # or:
df.select(pl.int_range(0, pl.count()) < ...)

This would simplify to:

df.select(pl.row_num() < ...)
@mcrumiller mcrumiller added the enhancement New feature or an improvement of an existing feature label Nov 13, 2023
@mcrumiller mcrumiller changed the title Add pl.row() as syntactic sugar for pl.int_range(0, pl.count()) Add pl.row_nbr() as syntactic sugar for pl.int_range(0, pl.count()) Nov 13, 2023
@lyngc

lyngc commented Nov 13, 2023

Prefer pl.row_num() if anything. nbr is not an intuitive abbreviation in this case imo.

@mcrumiller mcrumiller changed the title Add pl.row_nbr() as syntactic sugar for pl.int_range(0, pl.count()) Add pl.row_num() as syntactic sugar for pl.int_range(0, pl.count()) Nov 13, 2023
@cmdlineluser
Contributor

A somewhat related suggestion was for pl.cumcount() to be a top-level function. #7120 (although that is unsigned int)

In trying to see what this is called elsewhere, it seems row_number() is commonly used (SQL, Spark, TidyVerse, ...?) - perhaps it makes sense to line up with those.

@stinodego
Member

stinodego commented Nov 14, 2023

There already is DataFrame.with_row_count:
https://pola-rs.github.io/polars/docs/python/dev/reference/dataframe/api/polars.DataFrame.with_row_count.html

Adding this new expression shortcut seems fine to me. It should either be called pl.row_count() to align with the existing DataFrame method, or both should be renamed to row_number. My preference would be row_number.

The resulting data type should be UInt32, or UInt64 in bigidx Polars.
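
For reference, the unsigned index can already be spelled out today. A minimal sketch, assuming the dtype parameter of int_range is available in the installed version (pl.count() is the spelling at this point in the thread; newer Polars uses pl.len()):

import polars as pl

df = pl.DataFrame({"a": ["x", "y", "z"]})

# Unsigned row index via int_range (UInt32 here; a bigidx build would use UInt64):
df.select(
    pl.int_range(0, pl.count(), dtype=pl.UInt32).alias("row_nr"),
    pl.all(),
)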

@stinodego stinodego added the accepted Ready for implementation label Nov 14, 2023
@deanm0000
Collaborator

I think row_number is "better" but I don't think it's so much better that it's worth renaming with_row_count.

What about naming the expression row_number and deprecating with_row_count? I realize deprecating it is more severe than just renaming it. Personally, I've never wanted with_row_count on a standalone basis. I only ever use it as a precursor to something else in a with_columns. If I can get row_number directly in the context then I, for one, would never use with_row_count.

Apart from how I use it, it does seem odd that it's the only with_* to supplement with_columns. There's no with_summary or with_shifted or anything else, so it may just be good to retire it as out of place.

@cmdlineluser
Contributor

@deanm0000

I think someone pointed out in a previous discussion that with_row_count is useful because it inserts the column at a specified position.

If it were only an expression, you would need to use .select():

df.select(pl.row_number(), pl.all())

(Or, at least, that's what I think the suggestion was.)
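
To make the placement point concrete, a small sketch (pl.row_number() below is the proposed function, not an existing one):

import polars as pl

df = pl.DataFrame({"a": [10, 20], "b": [30, 40]})

# with_row_count puts the index column at the front of the frame:
df.with_row_count("row_nr")  # columns: row_nr, a, b

# An expression-only version would need select() to control placement:
# df.select(pl.row_number(), pl.all())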

@stinodego
Member

stinodego commented Nov 14, 2023

Keep in mind that we don't need any of these methods - they are simply for convenience.

With that in mind, the DataFrame method makes sense because row numbers make a lot of sense in a DataFrame context - people trying to mimic indexes.

I think it's fine to have both.

@deanm0000
Collaborator

Wait a minute. Shouldn't this just be adding pl.cumcount()? Is there any different behavior between, let's say, df.with_columns(i=pl.first().cumcount()) and df.with_columns(i=pl.int_range(0, pl.count()))?

It seems they do the same thing even if pl.first() has some nulls. Of course, if we had pl.cumcount() then having nulls in any column would be moot anyway.

Also, if I do timeit on a 1M row df, then cumcount takes 272 µs ± 30.8 µs per loop whereas int_range takes 1.63 ms ± 30.5 µs per loop.
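
A rough sketch of how that comparison might be reproduced, using the function names current at this point in the thread (cumcount and count were later renamed to cum_count and len):

import timeit
import polars as pl

df = pl.DataFrame({"a": list(range(1_000_000))})

t_cum = timeit.timeit(lambda: df.with_columns(i=pl.first().cumcount()), number=100)
t_rng = timeit.timeit(lambda: df.with_columns(i=pl.int_range(0, pl.count())), number=100)
print(f"cumcount:  {t_cum / 100 * 1e3:.3f} ms/loop")
print(f"int_range: {t_rng / 100 * 1e3:.3f} ms/loop")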

@stinodego stinodego added this to the 1.0.0 milestone Nov 14, 2023
@cmdlineluser
Contributor

They differ in types, .cumcount() is unsigned (but so is .with_row_count())

pl.select(pl.lit('foo').cumcount()).schema
# OrderedDict([('literal', UInt32)])
pl.select(pl.int_range(0, 1)).schema
# OrderedDict([('int', Int64)])
pl.select(pl.lit('foo')).with_row_count().schema
# OrderedDict([('row_nr', UInt32), ('literal', Utf8)])

I guess the idea of wanting Int64 is so you don't have to cast if you want to use it e.g.

.rolling(index_column = pl.foo(), ...

versus

.rolling(index_column = pl.foo().cast(int), ...
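
To make the casting point concrete, a minimal sketch of a rolling aggregation over a row count, assuming rolling wants a signed integer (or temporal) index column, as the .cast(int) above suggests; the column name idx is only for illustration:

import polars as pl

df = pl.DataFrame({"value": [1.0, 2.0, 3.0, 4.0]})

(
    df.with_row_count("idx")                                   # idx: UInt32
    .with_columns(pl.col("idx").cast(pl.Int64).set_sorted())   # cast so rolling accepts it
    .rolling(index_column="idx", period="2i")                  # "2i" = window of 2 rows
    .agg(pl.col("value").sum())
)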

@Wainberg
Contributor

Wainberg commented Jan 6, 2024

+1 for having pl.row_number() and df.with_row_number(), each of which creates a row_number column. Although having pl.row_count() and df.with_row_count() create a row_count column also works, it's most important for all three names to match.

@nameexhaustion
Collaborator

Small nit - if it's called "row number", should it start counting from 1 by default like SQL does instead of 0?

@stinodego
Member

stinodego commented Jan 7, 2024

Small nit - if it's called "row number", should it start counting from 1 by default like SQL does instead of 0?

I considered this but I don't think so. We start counting at 0, right? I find the SQL behavior here strange.

The cum_count function, on the other hand, should start counting at 1.

@lyngc

lyngc commented Jan 7, 2024

I would expect row_number() to behave the same as it does in SQL, so start counting from 1.

@mcrumiller
Contributor Author

If that's the case then I think I would prefer row_index. Having zero-indexed rows is the most common use case.

@stinodego
Member

stinodego commented Jan 7, 2024

I also considered row_index. Maybe that's better indeed...

@orlp
Collaborator

orlp commented Jan 7, 2024

My vote goes to pl.row_index(), starting counting at zero (at least by default). I would also be in support of a rename of with_row_count to with_row_index for consistency.

@mcrumiller
Contributor Author

mcrumiller commented Jan 7, 2024

Would it be too much clutter to include both? A with_row_count for those who want to match SQL, and a with_row_index for those who want zero-indexed behavior?

An alternative is with_row_index(start=0): a single function, 0-indexed by default, that allows easy arbitrary start positions.

Come to think of it, this would be pretty useful in general: if you're building up a series of frames and want the next frame to start at the next index, you could begin at whatever start you want.

@orlp
Collaborator

orlp commented Jan 7, 2024

@mcrumiller with_row_count/with_row_index already support an offset parameter. You can pass offset = 1 and get SQL's behavior.

Thinking about it some more I don't think it'd really be necessary for pl.row_index() since you can just write pl.row_index() + offset.
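
For the "continue where the previous frame left off" case, a rough sketch using the existing offset parameter (the column name idx is only for illustration):

import polars as pl

df1 = pl.DataFrame({"a": [1, 2, 3]})
df2 = pl.DataFrame({"a": [4, 5]})

# Start the second frame's index where the first one ends:
out1 = df1.with_row_count("idx")
out2 = df2.with_row_count("idx", offset=out1.height)
print(pl.concat([out1, out2]))  # idx runs 0..4 across both frames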

@ritchie46
Member


They differ in types, .cumcount() is unsigned (but so is .with_row_count())

pl.select(pl.lit('foo').cumcount()).schema
# OrderedDict([('literal', UInt32)])
pl.select(pl.int_range(0, 1)).schema
# OrderedDict([('int', Int64)])
pl.select(pl.lit('foo')).with_row_count().schema
# OrderedDict([('row_nr', UInt32), ('literal', Utf8)])

I guess the idea of wanting Int64 is so you don't have to cast if you want to use it e.g.

.rolling(index_column = pl.foo(), ...

versus

.rolling(index_column = pl.foo().cast(int), ...

We can accept a dtype in the expression.

I'd like to slow this down a little bit. I don't think we should add an expression for this. We have ranges, a well known construct in many programming languages.

@Wainberg
Contributor

Wainberg commented Jan 7, 2024

We have ranges, a well known construct in many programming languages.

Yes, though I'd argue pl.int_range(0, pl.count()).alias('row_index') is quite a mouthful compared to pl.row_index() for such a common operation!

In practice people would probably just use with_row_count() or whatever it gets renamed to. Though by your argument, shouldn't with_row_count() also be deprecated? After all, we have ranges ;)

@mcrumiller
Contributor Author

mcrumiller commented Jan 7, 2024

Though by your argument, shouldn't with_row_count() also be deprecated? After all, we have ranges ;)

This doesn't really help when trying to bring someone to your point of view....

That being said, I agree that this whole conversation is getting muddled. I think many of us agree that being able to quickly index columns and frames by integer row number is very useful and common enough that it warrants some function. The cleanest solution to me is a single expression pl.row_index(), and alongside that a single DataFrame method df.with_row_index(). I even think pl.index() alone would be nice, but it may cause confusion with pandas indexes.

# default input args shown for clarity
df.select(pl.row_index(start=0, dtype=pl.UInt32))
df.with_row_index(start=0, dtype=pl.UInt32)

@stinodego
Member

I also think it's worth adding a pl.index() function, but if Ritchie isn't on board, we don't need to rush this. I'll add a note to the int_range docs with an example on how to create an index column.

@stinodego stinodego closed this as not planned Jan 8, 2024
@stinodego stinodego removed the accepted Ready for implementation label Jan 8, 2024
@stinodego
Member

stinodego commented Jan 8, 2024

Note that in the next release, you will be able to write pl.int_range(pl.len()) - a shorter syntax omitting the zero (#13530).
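
A quick illustration of the shorter form (requires the release referenced above, where pl.len() and single-argument int_range are available):

import polars as pl

df = pl.DataFrame({"a": ["x", "y", "z"]})

# An index column with the shorter syntax:
df.with_columns(index=pl.int_range(pl.len()))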

@mcrumiller
Contributor Author

I know this is closed, but I really think a pl.row_index() expression would save a lot of grief when people want to deal with row numbers, over having to use pl.int_range(pl.len()). It's a lot clearer what the intent is. I can't tell from the above comments if it was firmly decided "no" we don't want that, or if we wanted to discuss more.

@stinodego
Member

As per Ritchie's comment:

I don't think we should add an expression for this. We have ranges, a well known construct in many programming languages.

So it is marked as "not planned". We may change our tune in the future.

@mcrumiller
Contributor Author

Use pl.int_range(pl.len()):

import polars as pl

df = pl.DataFrame({
    "group": [1, 1, 1, 2, 2],
    "value": [1, None, 2, 3, 4],
})

df.sort('value', descending=True).with_columns(pl.int_range(1, pl.len()+1).over('group').alias('nth'))
shape: (5, 3)
┌───────┬───────┬─────┐
│ group ┆ value ┆ nth │
│ ---   ┆ ---   ┆ --- │
│ i64   ┆ i64   ┆ i64 │
╞═══════╪═══════╪═════╡
│ 1     ┆ null  ┆ 1   │
│ 2     ┆ 4     ┆ 1   │
│ 2     ┆ 3     ┆ 2   │
│ 1     ┆ 2     ┆ 2   │
│ 1     ┆ 1     ┆ 3   │
└───────┴───────┴─────┘

@mkleinbort-ic

Sorry I deleted my comment, I realized I could just use .rank()

@mcrumiller
Contributor Author

Careful though, rank won't give you a rank for nulls.
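
A tiny illustration of that null behaviour:

import polars as pl

# rank() leaves nulls as null, so they never receive a position:
print(pl.Series("value", [3, None, 1]).rank("ordinal"))
# [2, null, 1] -- the null element stays unranked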

@mkleinbort-ic

Thank you.

In my case I don't have nulls so this will be ok.

Can't say I love pl.int_range(1, pl.len()+1), but this is a closed issue and thus not a place to discuss.

@mcrumiller
Contributor Author

@mkleinbort-ic I work around this now by always defining a row_index expression:

import polars as pl

def row_index():
    return pl.int_range(1, pl.len()+1)

df = pl.DataFrame({
    "group": [1, 1, 1, 2, 2],
    "value": [1, None, 2, 3, 4],
})
df.sort('value', descending=True).with_columns(row_index().over('group').alias('nth'))

Doesn't have to be a function but it's more obvious that it's computing something.

@Oreilles

I also really think that pl.row_index() would be a valuable addition to polars.

Using pl.int_range(pl.len()) is more verbose and does not reflect the intent of pl.row_index().

Furthermore, int_range returns pl.Int64 whereas with_row_index() returns pl.UInt32, meaning the two are not even compatible by default if you were to use them as a join key, which is the primary use case for a row index.
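
A small sketch of that dtype mismatch when the two are used as join keys (behaviour varies by Polars version; some versions raise on mismatched key dtypes, so an explicit cast is the safe route):

import polars as pl

left = pl.DataFrame({"a": ["x", "y"]}).with_row_index("idx")   # idx: UInt32
right = pl.DataFrame({"b": [1.0, 2.0]}).with_columns(
    idx=pl.int_range(pl.len())                                 # idx: Int64
)

# Align the key dtypes explicitly before joining:
left.join(right.with_columns(pl.col("idx").cast(pl.UInt32)), on="idx")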

@Oreilles

Oreilles commented Jul 24, 2024

Actually, there is also a significant difference between pl.int_range(pl.len()) and the requested pl.row_index(): int_range doesn't return an elementwise expression, meaning that it cannot be used in a join:

a = pl.DataFrame({"x": ["a","b","c"]})
b = pl.DataFrame({"y": ["d","e"]})

a.join(
    b,
    left_on=pl.int_range(pl.len()).mod(len(b)),
    right_on=pl.int_range(pl.len()),
)
# InvalidOperationError: All join key expressions must be elementwise.
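
One possible workaround under that constraint is to materialize the index columns first and then join on them; a sketch only:

import polars as pl

a = pl.DataFrame({"x": ["a", "b", "c"]})
b = pl.DataFrame({"y": ["d", "e"]})

# Materialize the indices as real columns, then join on them:
a.with_row_index("idx").with_columns(pl.col("idx") % b.height).join(
    b.with_row_index("idx"), on="idx"
)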
