Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent behaviours due to #6497 #6594

Closed
2 tasks done
AlexandreDecan opened this issue Jan 31, 2023 · 6 comments · Fixed by #6615
Closed
2 tasks done

Inconsistent behaviours due to #6497 #6594

AlexandreDecan opened this issue Jan 31, 2023 · 6 comments · Fixed by #6615
Assignees
Labels
bug Something isn't working python Related to Python Polars

Comments

@AlexandreDecan
Copy link

AlexandreDecan commented Jan 31, 2023

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

Hello. First of all, thanks for this really nice library!! ;)
This "issue" is due to the recent change introduced in #6497 that leads to inconsistent outputs.

Reproducible example

import polars as pl
df = pl.DataFrame({
    'a': [1, 2, 3, 4, 5], 
    'b': [10, 20, 30, 40, 50],
})

# Code A
df.with_columns(
    (pl.col('*') * -1).alias('n')
)

# Code B
df.with_columns(
    n=pl.col('*') * -1
)

# Code C
df.select(
    (pl.col('*') * -1).alias('n')
)

# Code D
df.select(
    n=(pl.col('*') * -1)
)

Expected behavior

The current behaviour is:
Code A has a single new column "n" whose results is based exclusively on "b";
Code B has a single new column "n" whose results combine "a" and "b" in a struct;
Code C results in a DuplicateError;
Code D has a single column "n" whose results combine "a" and "b" in a struct.

B and D are consistent, but A and C aren't.

A, B, C and D are not consistent if I assume that using a keyword-based column is roughly equivalent to using .alias, i.e., that df.select(n=expr) would be equivalent to df.select(expr.alias('n')). Said differently: I expected A and B to have the same output, and C and D to have the same output.

Suggestions:

  1. Have a similar behaviour for df.select(n=expr) and df.select(expr.alias('n')) (and its with_columns counterpart). This could be either a DuplicateError or a single column with a struct (my preference goes for the former, reducing the risk of silently having a potentially unexpected behaviour, since one can explicitly create a struct through the expression if needed);
  2. Raise a DuplicateError both for A and C for consistency.
  3. (Not directly related to the current "issue") Prevent the use of .prefix and .suffix when .alias or a keyword-named column is used (or raise a warning ?). Similarly, prevent the use (or raise a warning) when .aliasis used with a keyword-named column.

To summarize:

  • Have a consistent behaviour between .alias and **kwargs;
  • Have a consistent behaviour between select and with_columns (there are perhaps other candidate methods?);
  • Prevent or warn when using a priori conflicting renamings (see Suggestion 3 above).

Installed versions

---Version info---
Polars: 0.16.1
Index type: UInt32
Platform: Linux-5.19.6-100.fc35.x86_64-x86_64-with-glibc2.34
Python: 3.9.13 (main, Jun  9 2022, 00:00:00) 
[GCC 11.3.1 20220421 (Red Hat 11.3.1-2)]
---Optional dependencies---
pyarrow: 10.0.1
pandas: 1.5.2
numpy: 1.22.3
fsspec: 2023.1.0
connectorx: <not installed>
xlsx2csv: <not installed>
deltalake: <not installed>
matplotlib: 3.5.1
@AlexandreDecan AlexandreDecan added bug Something isn't working python Related to Python Polars labels Jan 31, 2023
@ritchie46
Copy link
Member

I think we must set the structify behavior experimental until we have this figured out. IMO an alias should always do the same as keyword assignment, but I am not convinced we must add this complexity in rust yet. There are more pressing issues.

@alexander-beedie I chime you in on this one.

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Jan 31, 2023

Hmm... tricky one; I feel that the intent is stronger when explicitly assigning a multi-output expression to a keyword arg (vs an alias), so I could be argued one way or the other here, though there are obviously clear merits to consistency.

Irrespective of that, I agree that if we don't auto-structify multi-output expressions then we should definitely raise an error; I can't think of many (any?) cases where you would really want to silently eat an expression by masking it with a duplicative name.

Would suggest we do the following:

  • Make structify behaviour consistent between kwarg and alias calling patterns but hide it behind an opt-in option on Config (as we previously did with kwargs) so we can think about it a bit more.
  • Raise an error on multi-output expressions resolving to a single alias (when the Config option is not active).

@AlexandreDecan
Copy link
Author

Hmm... tricky one

Sorry ;-)

I feel that the intent is stronger when explicitly assigning a multi-output expression to a keyword arg (vs an alias)

I do not really share this feeling, since .alias serves to explicitly rename the (one-dimensional) output of an expression.
I initially thought that **kwargs arguments were syntactic sugar for .alias (I even thought it was implemented that way, hence the surprise when playing with both ;-)). Was there a specific reason for not doing so?

Irrespective of that, I agree that if we don't auto-structify multi-output expressions then we should definitely raise an error; I can't think of many (any?) cases where you would really want to silently eat an expression by masking it with a duplicative name.

Perhaps in the context of an expression that has side-effect, but I don't know whether such a use case exists, and I agree with you that it's probably safer to raise in such a case and to require users to explicitly structify the output ("Explicit is better than implicit.") :-)

Would suggest we do the following:
* Make structify behaviour consistent between kwarg and alias calling patterns but hide it behind an opt-in option on Config (as we previously did with kwargs) so we can think about it a bit more.
* Raise an error on multi-output expressions resolving to a single alias (when the Config option is not active).

That looks good.

And what about warning/preventing users to combine **kwargs, alias, suffix and prefix? (I personally prefer a warning than an exception)

@alexander-beedie alexander-beedie self-assigned this Jan 31, 2023
@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Jan 31, 2023

I do not really share this feeling, since .alias serves to explicitly rename the (one-dimensional) output of an expression.

It's not a strongly-held opinion, and it's hard to argue with consistency ;)

@ritchie46: I'll assign myself this one and start moving structify behaviour behind a Config option tonight/tomorrow. Let me know if you want any tweaks to the suggested approach above.

@ritchie46
Copy link
Member

I think that's a good idea @alexander-beedie. This behavior needs some landing. I like the idea of it I must say. It really fits nicely in our expression expansion magic.

But I want to add this behavior on the rust side and I am not yet ready for the potential bugs adding this brings. First want to finish other stuff.

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Feb 1, 2023

But I want to add this behavior on the rust side and I am not yet ready for the potential bugs adding this brings. First want to finish other stuff.

Yup, shouldn't be Python-only. I'll make the Config option an environment var rather than a local attribute so that it can be available on the Rust side later, for dev/staging. Can allow Python opt-in in the meantime, to get a better handle on the behaviour 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working python Related to Python Polars
Projects
None yet
3 participants