Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rationalize with_column & with_columns #6117

Closed
zundertj opened this issue Jan 8, 2023 · 33 comments
Closed

Rationalize with_column & with_columns #6117

zundertj opened this issue Jan 8, 2023 · 33 comments
Labels
breaking Change that breaks backwards compatibility enhancement New feature or an improvement of an existing feature python Related to Python Polars

Comments

@zundertj
Copy link
Collaborator

zundertj commented Jan 8, 2023

Problem description

Initial discussion was on Discord, moving here so everyone can join.

The DataFrame methods with_column and with_columns tend to be used quite often, as these are, together with select, the primary way to run expressions on dataframes. I have a couple of issues with these though:

  1. with_columns allows everything with_column does, except checking that only one expression is passed in. Thus it seems to me very marginal benefit to have two methods. Marginal because I dont think there is ever confusion on whether you pass in one or multiple expressions.

  2. with_column also does not support the (experimental) assignment syntax currently

pl.Config.with_columns_kwargs = True
df.with_columns(b=pl.col("a") * 2)  # works
df.with_column(b=pl.col("a") * 2)  # fails
  1. (per Ritchie's comment on Discord that they return a new dataframe): the docstrings are not clear what is copied and what not:
  • with_column: Return a new DataFrame with the column added or replaced.
  • with_columns: Add or overwrite multiple columns in a DataFrame.
    The latter suggests there is no new dataframe created, but there is.
  1. I find with_columns quite verbose given how often it is used. We have also opted to use pl.col and not pl.column, and pl.lit rather than pl.literal. My initial proposal was assign, but @ritchie46 's point is that this may give the impression no new dataframe is being returned, while there is. So my new proposal would be to shorten to with, except that with is a reserved keyword In Python. So maybe with_col, but that seems quite a marginal gain?

Putting it altogether:

  1. deprecate with_column
  2. rename with_columns to with_col?
  3. fix the docstring of with_columns/with_col to reflect that a new dataframe is being created
@zundertj zundertj added python Related to Python Polars enhancement New feature or an improvement of an existing feature labels Jan 8, 2023
@ritchie46
Copy link
Member

I don't think we can. with is a python keyword.

@zundertj
Copy link
Collaborator Author

zundertj commented Jan 8, 2023

Yep, you are right, that wont fly. Updated the post to reflect that.

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Jan 8, 2023

Dropping with_column in favour of with_columns is definitely a good move; in the same vein as the other abbreviations how about shortening to with_cols? Everyone's muscle memory (and autocomplete) will be halfway there already... ;)

Note: passing more than one column to with_column would look really odd, whereas passing one or more to with_columns would look pretty normal, hence favouring deprecation of the singular form.

Also, let's make the currently-experimental calling pattern allowed by default? It has proven really popular over here, and I shudder to think what would happen if it went away...😅

@zundertj zundertj added the breaking Change that breaks backwards compatibility label Jan 8, 2023
@zundertj
Copy link
Collaborator Author

zundertj commented Jan 8, 2023

Note that with_columns already supports the singular form, so it is not just normal, it is the current behaviour.

So I think we have:

  1. deprecate with_column
  2. fix docstring with_columns => fix(python): Update docstring with_columns to reflect a new dataframe is being returned #6122
  3. make experimental calling pattern the default; remove flag.

My hesitation with with_cols or with_col is that it does not seem to be enough of an improvement, first to break with for example Snowflake, and second to do the deprecation. I guess the latter point, given we are 0.x and can easily support this change, is not too big of a problem, but I guess I'm hoping for a better name to come up over time, which would put users through another cycle of updating code.

@alexander-beedie
Copy link
Collaborator

My hesitation with with_cols or with_col is that it does not seem to be enough of an improvement ... would put users through another cycle of updating code.

True; it's likely to be one of the most widespread method calls too; that's a lot of search/replace without an obvious gain; probably not what you want right as polars feels like it's getting some real critical mass.

@pjj
Copy link

pjj commented Jan 8, 2023

If one would ever rename with_columns, then perhaps a reasonable renaming would be mutate, which would bring it in line with tidyverse in R.

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Jan 8, 2023

If one would ever rename with_columns, then perhaps a reasonable renaming would be mutate, which would bring it in line with tidyverse in R.

It's a very vague/ambiguous name though... like: what does it mutate and how? The only people who might stand a chance at guessing will be the R illuminati ;)

@zundertj
Copy link
Collaborator Author

zundertj commented Jan 8, 2023

mutate would have the same issue as assign, that it suggests the dataframe is modified rather than that a new dataframe is being returned. In that case, I would prefer assign over mutate.

@Marmeladenbrot
Copy link

True; it's likely to be one of the most widespread method calls too; that's a lot of search/replace without an obvious gain

The most used functions should get the shortest names because you type it so often imo.

@ritchie46
Copy link
Member

I am in favor of keeping one with_columns. I don't mind a few keystrokes to keep readable english names.

I don't think it's worth a refactor. Once you are at with_c a tab does the rest.

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Jan 8, 2023

And in truth it does actually do a good job of telling you what's going to happen, in a way that assign or mutate just don't. You're going to get the DataFrame... with columns, and those columns are going to be the ones you give it; it's very clear.

@ritchie46
Copy link
Member

Glad to see some backings today. Tough day😅

@zundertj
Copy link
Collaborator Author

zundertj commented Jan 8, 2023

I agree that in absence of an equally clear, but shorter name, we should not change. I thought with would be that one, but that won't work in combination with Python. To be sure, for me shorter names are not about keystrokes (with_c<tab> is easy enough), but about readability. Longer names, if not more informative over their shorter counterpart, make imo code harder to follow, as more reading is required, more line breaks are required, etc.

I have raised a draft PR for deprecating with_column, see #6128.

@cbzittoun
Copy link

cbzittoun commented Jan 11, 2023

Why not:

  • create an alias .wc -> .with_columns (or .wcols or whatever short)
  • expose both in the api
  • make clear in the doc that .with_columns is the primary method?

I agree with both keystrokes and readability arguments.

Or another route is to give the option to extend the DataFrame methods.

(btw, my first comment on that project => huge thanks, polars is amazing)

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Jan 11, 2023

Why not/ create an alias .wc -> .with_columns (or .wcols or whatever short)

@cbzittoun: Aside from "wc" meaning toilet (seriously ;) it doesn't really offer a meaningful advantage, and it adds clutter to the API surface.

Or another route is to give the option to extend the DataFrame methods.

Now that is actually supported - as long as you do it in a custom namespace:
https://pola-rs.github.io/polars/py-polars/html/reference/api.html

@cbzittoun
Copy link

ahah yes but there is no reason the bathroom namespace should apply here

thanks for the the other solution, that's useful

@AlexandreDecan
Copy link

I've a small suggestion (that could break things, though) about with_columns. Currently, this method accepts a single parameter that is either an expression or a list of expressions (i.e., df.with_columns([..., ...]) to create two new columns). I propose to use *args instead so that df.with_columns([..., ...]) can be written df.with_columns(..., ...). This has several advantages:

(1) It does not require to explicitly create a list of expressions (and prevents writing extra [ and ]);
(2) It supports both the creation of a single new column or many at once (so the API for a single column won't change, and adding more columns just requires providing more parameters instead of converting the parameter to a list and extend it);
(3) It's probably more Pythonic ;-)
(4) Support for named expressions is really interesting (and I hope it will be enabled by default soon) since it makes the names of the new columns visible before their definition, as in pandas' .assign (using .alias means that the name can only be found at the end of the expression). By using *args instead of a list in with_columns, we increase the consistency since one can easily go from a **kwargs-based call to an *args-based call and vice-versa:

df.with_columns(
  expr1, 
  expr2, 
  expr3,
)

can be easily converted to/from:

df.with_columns(
   a=expr1, 
   b=expr2, 
   c=expr3,
)

@arturdaraujo
Copy link

I also like the idea of pl.Config.with_columns_kwargs = True being enabled by default. When using black code format the code "expands". I'm not familiar with the reasoning for using lists or if changing it to **kwargs-based would be worth it, though.

@AlexandreDecan
Copy link

I'm not suggesting to go for a full (and only) kwargs-based api but for a combination of *args and **kwargs. In particular I think it would make more sense to use *args instead of a list. But for consistency, that means that select, filter, col, etc should probably be "converted" to receive *args instead of a list, which may lead to backward incompatibilities...

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Jan 23, 2023

@AlexandreDecan: the ergonomic downside with *args style is passing in an existing (or programmatically-generated) list of expressions into the method; you would be amazed at how easy it is for people to forget to splat them in (eg: with_columns(*expressions)) and then wander all over the code wondering where the error is.

I've designed significant APIs both ways in the past, but that tends to take users a while to really ingrain (which isn't to say I don't like it, as I currently have an API using this style at work - however, it was designed to be that way from the start, and is consistent across all interfaces). Without a compelling advantage (vs some minor niceties), I doubt we'd want to make a breaking change like that across multiple commonly-used methods without something/someone really driving it...

@AlexandreDecan
Copy link

Thanks for your answer! I understand the rationale. But in that case, I would be in favour, for consistency, not to allow the use of **kwargs and rather to allow to pass a dictionary instead of a list to support named expressions, i.e., df.with_columns({'a': expr1, 'b': expr2, 'c': expr3}) instead of df.with_coumns(a=expr1, b=expr2, c=expr3).

That way, the number of parameters is always exactly 1, and there's a parallel between passing a list of unnamed expressions and passing a dict of named expressions (this parallel could have been achieved by using *args and **kwargs, but I understand you prefer to avoid them).

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Jan 23, 2023

Thanks for your answer! I understand the rationale. But in that case, I would be in favour, for consistency, not to allow the use of **kwargs and rather to allow to pass a dictionary instead of a list to support named expressions, i.e.,

That would be very unergonomic - there is no reason for **kwargs style calls (an extremely common calling convention in python) to squeeze into a single parameter as a dict; I'd argue that would lose the reason for supporting this mode in the first place, which is primarily about how simple/pythonic it is ;)

@AlexandreDecan
Copy link

Perhaps less ergonomic, but consistent with your choice for not supporting *args. Anyway, I prefer to have **kwargs rather than nothing :-p

Apart from this, and going back to your previous message in which you said "the ergonomic downside with *args style is passing in an existing (or programmatically-generated) list of expressions into the method;", what about supporting all these uses cases? For example:

def with_columns(self, *args, **kwargs):
  if len(args) == 1:
    if isinstance(args[0], list):
      args = args[0]
   ...

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Jan 26, 2023

consistent with your choice for not supporting *args

@AlexandreDecan: I admit I still don't see the parallel, but it seems we may be going for it after all! Note that I was pointing out that it isn't a free lunch, not that it isn't worthwhile - see here for details of the proposal :)

@AlexandreDecan
Copy link

Good news :-)

@zundertj
Copy link
Collaborator Author

zundertj commented Feb 4, 2023

Coming back to the naming of with_columns, on Discord there was the suggestion select_with, which emphasizes that with_columns(new_expr) is equivalent to .select([pl.all(), new_expr])). I also like that it no longer has the columns in the name, we also do not have filter_rows but filter, select rather than select_columns, etc. Downside is that with_columns is somewhat established, but I think overall this is an improvement. Any thoughts?

@AlexandreDecan
Copy link

AlexandreDecan commented Feb 4, 2023

It doesn't really highlight that the data frame is returned as-is with new columns (ie there is no projection).

@ritchie46
Copy link
Member

I really don't think we should remove with_columns. The expressions are added to the DataFrame as new columns. So I think the name is perfect.

@alexander-beedie
Copy link
Collaborator

This consolidation was completed for the 0.16 release, with a deprecation period via transparent @redirect. Prepare to autocomplete that extra "s" , there is now just one method ;) 👍

@shahankhatch
Copy link

Thanks @zundertj for bringing up this topic here. I was the one who recommended select_with on Discord, so I'm adding my comment here.

It's obvious to me that with_columns adds new columns but I think its equivalence to select(pl.all()) doesn't contribute to a simple and consistent terminology.

The latest move from with_column to with_columns shows a desire to add new keywords sparingly rather than for convenience. I think this goal can be applied further to select with an arg to select all (perhaps v messy,), or to use select_with (which becomes a polars-specific keyword but which appears beside the select in the api).

Since there's the extensible api, perhaps another approach that doesn't remove the conveniences is to move with_columns into a utility 'crate' to mimic the myriad functions that motivated the introduction of with_column and with_columns in the first place. It would only be used by those coming from pyspark-ish environments, and I imagine they'd want more than just with_columns.

Seeing the pyspark api for select, selectExpr, withColumn, withColumns, withColumnRenamed, I'm thankful that polars with_column is being obviated by with_columns

Looking forward to any feedback

@ghuls
Copy link
Collaborator

ghuls commented Feb 14, 2023

with_columns is not only used to add new columns, but also to replace existing ones, so with pl.all, you would get duplicated column name errors.

@shahankhatch
Copy link

The same applies to with_columns and groupby, so to me this point doesn't dissuade from cleaning up the api. In fact, the SQL standard doesn't disallow using the same alias.
The following is valid SQL: SELECT designation, hired_on as designation FROM employees
This speaks to me as though polars could evolve to more closely match the robust and familiar SQL specification.

Although it should be noted/discussed separately from this thread, one could argue that polars query planning should accommodate aliases created within a select/with_columns/agg expression for reuse within another array member. The semantics of an array of expressions right now is focused on parallelism. This forces users to manually create the 'intermediate nodes' in the query plan by chaining select/with_columns expressions only after the alias is declared, making the expressions longer and thus less concise.

@mcrumiller
Copy link
Contributor

mcrumiller commented Feb 17, 2023

I like with_columns. It's not a big deviation from what we've been using for the past year when editing one column. Also, with_column (singular) secretly let you operate on multiple columns if you had multiple columns in your pl.col(pl.Float32) (for example), which always seemed wrong/a mistake to me. I didn't create an issue because it seemed like a can of worms which has now thankfully been resolved by the current issue.

I do want to add a grievance I have currently about with_columns that I swore I wrote about elsewhere but I can't find it anywhere: it's not obvious when a column will get replaced and when a new one will be created. If I do:

df.with_columns(col('a')*col('b'))

The default behavior is to modify column a because it happens to come first, despite commutativity. If I call

when(col('a') < 3).then(col('a')).otherwise(col('b'))

I get a modified again. But if I say:

when(col('b') < 9).then(col('a')).otherwise(col('b'))

...column a still gets modified! Is it because it's the first then?

This is super confusing and not obvious at all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking Change that breaks backwards compatibility enhancement New feature or an improvement of an existing feature python Related to Python Polars
Projects
None yet
Development

No branches or pull requests