Rationalize with_column & with_columns #6117
Comments
I don't think we can. |
Yep, you are right, that won't fly. Updated the post to reflect that. |
Dropping Note: passing more than one column to Also, let's make the currently-experimental calling pattern allowed by default? It has proven really popular over here, and I shudder to think what would happen if it went away...😅 |
Note that So I think we have:
My hesitation with |
True; it's likely to be one of the most widespread method calls too; that's a lot of search/replace without an obvious gain; probably not what you want right as polars feels like it's getting some real critical mass. |
If one would ever rename |
It's a very vague/ambiguous name though... like: what does it mutate and how? The only people who might stand a chance at guessing will be the R illuminati ;) |
The most used functions should get the shortest names because you type it so often imo. |
I am in favor of keeping one I don't think it's worth a refactor. Once you are at |
And in truth it does actually do a good job of telling you what's going to happen, in a way that |
Glad to see some backings today. Tough day😅 |
I agree that in absence of an equally clear, but shorter name, we should not change. I thought I have raised a draft PR for deprecating |
Why not:
I agree with both keystrokes and readability arguments. Or another route is to give the option to extend the DataFrame methods. (btw, my first comment on that project => huge thanks, polars is amazing) |
@cbzittoun: Aside from "wc" meaning toilet (seriously ;) it doesn't really offer a meaningful advantage, and it adds clutter to the API surface.
Now that is actually supported - as long as you do it in a custom namespace: |
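For readers unfamiliar with the idea, here is a minimal sketch of what a custom namespace buys you. It is a pure-Python model, not polars code: `Frame`, `register_namespace`, `ShortNames` and the `short` namespace are all hypothetical names made up for illustration (polars exposes its own extension point for this in `pl.api`). The point is only that a shorthand such as `wc` can live under a user-defined attribute without being added to the core method list.

```python
class Frame:
    """Toy stand-in for a DataFrame: a dict of column name -> list of values."""

    def __init__(self, columns):
        self.columns = dict(columns)

    def with_columns(self, **named_columns):
        # Return a new Frame; columns with the same name are replaced.
        return Frame({**self.columns, **named_columns})


def register_namespace(name):
    """Hypothetical decorator: expose ns_cls as the attribute Frame.<name>."""
    def decorator(ns_cls):
        setattr(Frame, name, property(lambda frame: ns_cls(frame)))
        return ns_cls
    return decorator


@register_namespace("short")
class ShortNames:
    def __init__(self, frame):
        self._frame = frame

    def wc(self, **named_columns):
        # Shorthand that stays out of the core API surface.
        return self._frame.with_columns(**named_columns)


df = Frame({"a": [1, 2]})
out = df.short.wc(b=[3, 4])
print(out.columns)
```

The shorthand is reachable only as `df.short.wc(...)`, so `Frame` itself gains no new top-level method.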
ahah yes, but there is no reason the bathroom namespace should apply here. Thanks for the other solution, that's useful. |
I've a small suggestion (that could break things, though) about (1) It does not require to explicitly create a list of expressions (and prevents writing extra
can be easily converted to/from:
|
I also like the idea of |
I'm not suggesting to go for a full (and only) kwargs-based api but for a combination of *args and **kwargs. In particular I think it would make more sense to use *args instead of a list. But for consistency, that means that select, filter, col, etc should probably be "converted" to receive *args instead of a list, which may lead to backward incompatibilities... |
@AlexandreDecan: the ergonomic downside with I've designed significant APIs both ways in the past, but that tends to take users a while to really ingrain (which isn't to say I don't like it, as I currently have an API using this style at work - however, it was designed to be that way from the start, and is consistent across all interfaces). Without a compelling advantage (vs some minor niceties), I doubt we'd want to make a breaking change like that across multiple commonly-used methods without something/someone really driving it... |
Thanks for your answer! I understand the rationale. But in that case, I would be in favour, for consistency, not to allow the use of That way, the number of parameters is always exactly 1, and there's a parallel between passing a list of unnamed expressions and passing a dict of named expressions (this parallel could have been achieved by using |
That would be very unergonomic - there is no reason for |
Perhaps less ergonomic, but consistent with your choice for not supporting Apart from this, and going back to your previous message in which you said "the ergonomic downside with *args style is passing in an existing (or programmatically-generated) list of expressions into the method;", what about supporting all these use cases? For example:
|
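One way to picture "supporting all these use cases" is a small normalisation step at the top of the method. The sketch below is an assumption about how that could look, not the proposal's actual implementation: `normalize_exprs` is a hypothetical helper, and plain strings stand in for expressions.

```python
def normalize_exprs(*exprs, **named_exprs):
    """Flatten the accepted calling styles into one list of (name, expr) pairs.

    Hypothetical helper: accepts f(a, b), f([a, b]) and f(a, x=b).
    """
    # A single list/tuple argument is treated as the list-style call.
    if len(exprs) == 1 and isinstance(exprs[0], (list, tuple)):
        exprs = tuple(exprs[0])
    # Positional expressions carry no explicit output name.
    pairs = [(None, expr) for expr in exprs]
    # Keyword arguments carry their output name explicitly.
    pairs.extend(named_exprs.items())
    return pairs


existing = ["expr_a", "expr_b"]  # e.g. a programmatically generated list

# Varargs, a single list, and an unpacked list all normalise identically.
assert normalize_exprs("expr_a", "expr_b") == normalize_exprs(existing)
assert normalize_exprs(existing) == normalize_exprs(*existing)

print(normalize_exprs("expr_a", total="expr_b"))
```

The cost mentioned earlier in the thread shows up in the `isinstance` check: a single argument that is itself a list has to be special-cased, which is the part that is not a free lunch.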
@AlexandreDecan: I admit I still don't see the parallel, but it seems we may be going for it after all! Note that I was pointing out that it isn't a free lunch, not that it isn't worthwhile - see here for details of the proposal :) |
Good news :-) |
Coming back to the naming of |
It doesn't really highlight that the data frame is returned as-is with new columns (ie there is no projection). |
I really don't think we should remove |
This consolidation was completed for the |
Thanks @zundertj for bringing up this topic here. I was the one who recommended It's obvious to me that The latest move from Since there's the extensible api, perhaps another approach that doesn't remove the conveniences is to move
Looking forward to any feedback |
with_columns is not only used to add new columns, but also to replace existing ones, so with pl.all, you would get duplicated column name errors. |
The same applies to with_columns and groupby, so to me this point doesn't dissuade from cleaning up the api. In fact, the SQL standard doesn't disallow using the same alias. Although it should be noted/discussed separately from this thread, one could argue that polars query planning should accommodate aliases created within a select/with_columns/agg expression for reuse within another array member. The semantics of an array of expressions right now is focused on parallelism. This forces users to manually create the 'intermediate nodes' in the query plan by chaining select/with_columns expressions only after the alias is declared, making the expressions longer and thus less concise. |
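The point about aliases can be illustrated with a toy model in which every expression in one call sees only the input frame. This is a sketch under that assumption, not polars code: frames are plain dicts, and each "expression" is a hypothetical function from frame to column values.

```python
def with_columns(frame, **named_exprs):
    """Evaluate every expression against the INPUT frame, then merge the results."""
    # No expression can see a column produced by a sibling in the same call.
    new_columns = {name: fn(frame) for name, fn in named_exprs.items()}
    return {**frame, **new_columns}


def double(frame):
    return [x * 2 for x in frame["a"]]


def quad(frame):
    # Depends on the alias "double", which must already exist in the frame.
    return [x * 2 for x in frame["double"]]


df = {"a": [1, 2, 3]}

# Chaining works: the second call sees the alias created by the first.
chained = with_columns(with_columns(df, double=double), quad=quad)

# A single call does not: "double" is not part of the input frame.
try:
    with_columns(df, double=double, quad=quad)
    missing = None
except KeyError as exc:
    missing = exc.args[0]

print(chained["quad"], missing)
```

Under this model the user has to create the intermediate step by hand, which is the verbosity the comment objects to.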
I like I do want to add a grievance I have currently about df.with_columns(col('a')*col('b')) The default behavior is to modify column when(col('a') < 3).then(col('a')).otherwise(col('b')) I get when(col('b') < 9).then(col('a')).otherwise(col('b')) ...column This is super confusing and not obvious at all. |
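The first part of this grievance can be modelled with a toy that tracks output names only. It assumes the rule that an unaliased binary expression takes the name of its left-hand input; `Expr`, `col` and this `with_columns` are hypothetical stand-ins, and the when/then/otherwise case is deliberately left out because its naming rule is not stated in the thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expr:
    """Toy expression: remembers only its output name, not any computation."""

    name: str

    def __mul__(self, other):
        # Assumed rule: a binary expression keeps the left operand's name.
        return Expr(self.name)

    def alias(self, name):
        # An explicit alias overrides the inherited name.
        return Expr(name)


def col(name):
    return Expr(name)


def with_columns(columns, *exprs):
    """Return the column names after adding or replacing by output name."""
    out = list(columns)
    for expr in exprs:
        if expr.name not in out:
            out.append(expr.name)  # new name: column is appended
        # existing name: column is replaced in place, so the list is unchanged
    return out


print(with_columns(["a", "b"], col("a") * col("b")))
print(with_columns(["a", "b"], (col("a") * col("b")).alias("ab")))
```

Without the alias the product silently replaces an existing column; with it, a new column is appended and the inputs survive.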
Problem description
Initial discussion was on Discord, moving here so everyone can join.
The DataFrame methods with_column and with_columns tend to be used quite often, as these are, together with select, the primary way to run expressions on dataframes. I have a couple of issues with these though:
1. with_columns allows everything with_column does, except checking that only one expression is passed in. Thus it seems to me there is very marginal benefit to having two methods. Marginal because I don't think there is ever confusion on whether you pass in one or multiple expressions. with_column also does not support the (experimental) assignment syntax currently.
2. with_column: Return a new DataFrame with the column added or replaced. with_columns: Add or overwrite multiple columns in a DataFrame. The latter suggests there is no new dataframe created, but there is.
3. with_columns is quite verbose given how often it is used. We have also opted to use pl.col and not pl.column, and pl.lit rather than pl.literal. My initial proposal was assign, but @ritchie46's point is that this may give the impression no new dataframe is being returned, while there is. So my new proposal would be to shorten to with, except that with is a reserved keyword in Python. So maybe with_col, but that seems quite a marginal gain?
Putting it all together:
1. Deprecate with_column
2. Rename with_columns to with_col?
3. Update the docstring of with_columns / with_col to reflect that a new dataframe is being created
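The two claims the proposal rests on (the single-column method is the multi-column method plus one check, and both return a new frame) can be sketched with a toy model. This is not the polars implementation: frames are plain dicts and both functions are hypothetical stand-ins that borrow the method names.

```python
def with_columns(frame, **named_columns):
    """Return a NEW frame with the given columns added or replaced."""
    return {**frame, **named_columns}


def with_column(frame, **named_columns):
    """Identical to with_columns, plus a check that exactly one column is passed."""
    if len(named_columns) != 1:
        raise ValueError("with_column expects exactly one column")
    return with_columns(frame, **named_columns)


df = {"a": [1, 2], "b": [3, 4]}
out = with_columns(df, b=[30, 40], c=[5, 6])  # replaces b, adds c

assert out == {"a": [1, 2], "b": [30, 40], "c": [5, 6]}
assert df == {"a": [1, 2], "b": [3, 4]}  # the input frame is untouched
assert with_column(df, c=[5, 6]) == with_columns(df, c=[5, 6])
```

In this model the only thing with_column adds is the length check, which is the "very marginal benefit" from point 1.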