feat(python): automagically upconvert with_columns kwarg expressions with multiple output names to struct; extend **named_kwargs support to select #6497

Merged

Conversation

alexander-beedie
Collaborator

@alexander-beedie alexander-beedie commented Jan 27, 2023

Closes #6486, removes the "experimental" status from with_columns kwargs, and adds the same capability to select.

Example:

df = pl.DataFrame(
    data = {
        "x1": [1,2,6],
        "x2": [1,2,3],
    }
).with_columns(
    pct_change = pl.col(["x1","x2"]).pct_change(),
)
# ┌─────┬─────┬────────────────┐
# │ x1  ┆ x2  ┆ pct_change     │
# │ --- ┆ --- ┆ ---            │
# │ i64 ┆ i64 ┆ struct[2]      │
# ╞═════╪═════╪════════════════╡
# │ 1   ┆ 1   ┆ {null,null}    │
# │ 2   ┆ 2   ┆ {1.0,1.0}      │
# │ 6   ┆ 3   ┆ {2.0,0.5}      │
# └─────┴─────┴────────────────┘
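Editorial sketch: the "upconvert to struct" behaviour shown above can be modelled in plain Python, without Polars. The helper names `pct_change` and `with_columns_structified` are hypothetical and exist only for this illustration. A frame is assumed to be a dict of lists, and an "expression" a function returning a dict of output columns. This is not the actual implementation in the PR.

```python
def pct_change(values):
    """Fractional change versus the previous element; the first row has no predecessor."""
    out = [None]
    for prev, cur in zip(values, values[1:]):
        out.append((cur - prev) / prev)  # assumes prev != 0 in this toy example
    return out


def with_columns_structified(frame, **named_exprs):
    """Toy model: a kwarg whose expression yields several output columns
    is folded into a single column of per-row dicts (a "struct")."""
    result = dict(frame)
    for name, expr in named_exprs.items():
        outputs = expr(frame)  # {output_name: column}
        if len(outputs) == 1:
            # Single output: store the column directly under the kwarg name.
            result[name] = next(iter(outputs.values()))
        else:
            # Multiple outputs: zip the columns row-wise into dicts.
            keys = list(outputs)
            rows = zip(*(outputs[k] for k in keys))
            result[name] = [dict(zip(keys, row)) for row in rows]
    return result


frame = {"x1": [1, 2, 6], "x2": [1, 2, 3]}
out = with_columns_structified(
    frame,
    pct_change=lambda f: {c: pct_change(f[c]) for c in ("x1", "x2")},
)
print(out["pct_change"])
```

The printed rows correspond to the struct values in the table above: one dict per row, with one field per original output name.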

@github-actions github-actions bot added the enhancement (New feature or an improvement of an existing feature) and python (Related to Python Polars) labels Jan 27, 2023
@alexander-beedie
Collaborator Author

@ritchie46: this is actually really nice behaviour... :)
Time to bring with_columns kwargs into the light for 0.16 and remove the experimental flag?

@ritchie46
Member

ritchie46 commented Jan 27, 2023

Does this work with pl.col(DATA_TYPE) and pl.col(REGEX)? How does this work out with pl.some_expanding_expr().suffix() and keep_name?

I think we need to add a function on the Rust side of py-polars that is called when **kwargs are used and checks whether the expression is eligible for this. I must agree, it's a nice feature.

We should also add some extra tests here that use lazy + projection / struct field expansion. Lazy must know the schema at all times.

@alexander-beedie
Collaborator Author

alexander-beedie commented Jan 27, 2023

Does this work with pl.col(DATA_TYPE) and pl.col(REGEX)? How does this work out with pl.some_expanding_expr().suffix() and keep_name?

Works like a charm for DATA_TYPE ...

df = pl.DataFrame( {"x1":[1,2,6], "x2":[1,2,3]} ).with_columns( ints=pl.col(pl.Int64) )
# ┌─────┬─────┬───────────┐
# │ x1  ┆ x2  ┆ ints      │
# │ --- ┆ --- ┆ ---       │
# │ i64 ┆ i64 ┆ struct[2] │
# ╞═════╪═════╪═══════════╡
# │ 1   ┆ 1   ┆ {1,1}     │
# │ 2   ┆ 2   ┆ {2,2}     │
# │ 6   ┆ 3   ┆ {6,3}     │
# └─────┴─────┴───────────┘

... and for "expanding" calls (with suffix or keep_name) ...

df = pl.DataFrame( {"x1":[1,2,6], "x2":[1,2,3]} ).with_columns( mins=pl.all().min().suffix("_min") )
# ┌─────┬─────┬───────────┐
# │ x1  ┆ x2  ┆ mins      │
# │ --- ┆ --- ┆ ---       │
# │ i64 ┆ i64 ┆ struct[2] │
# ╞═════╪═════╪═══════════╡
# │ 1   ┆ 1   ┆ {1,1}     │
# │ 2   ┆ 2   ┆ {1,1}     │
# │ 6   ┆ 3   ┆ {1,1}     │
# └─────┴─────┴───────────┘
df.unnest('mins')
# ┌─────┬─────┬────────┬────────┐
# │ x1  ┆ x2  ┆ x1_min ┆ x2_min │
# │ --- ┆ --- ┆ ---    ┆ ---    │
# │ i64 ┆ i64 ┆ i64    ┆ i64    │
# ╞═════╪═════╪════════╪════════╡
# │ 1   ┆ 1   ┆ 1      ┆ 1      │
# │ 2   ┆ 2   ┆ 1      ┆ 1      │
# │ 6   ┆ 3   ┆ 1      ┆ 1      │
# └─────┴─────┴────────┴────────┘

... but doesn't for REGEX (as the expr doesn't know if it will match one or more cols until evaluated).

@alexander-beedie
Collaborator Author

alexander-beedie commented Jan 27, 2023

I think we need to add a function on the rust side of py-polars

You read my mind ;) I don't like relying on the ComputeError to detect multiple output names - being able to use something like expr.meta.has_multiple_outputs() would feel much more solid.
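Editorial sketch: the kind of check discussed here can be illustrated in plain Python. The function `has_multiple_outputs` and the `(kind, value)` selector encoding below are hypothetical stand-ins, not the Polars API. The point is that name lists and dtype selectors can be resolved against a known schema, and that a regex selector also needs the schema's column names before the number of outputs is known.

```python
import re


def has_multiple_outputs(kind, value, schema):
    """Toy check: does a selector expand to more than one column?

    kind   -- "names", "dtype" or "regex" (hypothetical encoding)
    schema -- mapping of column name to a dtype label
    """
    if kind == "names":
        matched = [n for n in value if n in schema]
    elif kind == "dtype":
        matched = [n for n, dt in schema.items() if dt == value]
    elif kind == "regex":
        # A regex can only be resolved once the column names are known.
        matched = [n for n in schema if re.fullmatch(value, n)]
    else:
        raise ValueError(f"unknown selector kind: {kind!r}")
    return len(matched) > 1


schema = {"x1": "i64", "x2": "i64", "label": "str"}
print(has_multiple_outputs("names", ["x1", "x2"], schema))
print(has_multiple_outputs("dtype", "i64", schema))
print(has_multiple_outputs("regex", r"^x\d$", schema))
print(has_multiple_outputs("regex", r"^x1$", schema))
```

An explicit check like this avoids inferring the multi-output case from a raised error, which is the concern voiced in the comment above.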

@ritchie46
Member

... but doesn't for REGEX (as the expr doesn't know if it will match one or more cols until evaluated).

Then I think we must follow up with that checking function, and after that we can make it the default.

@alexander-beedie
Collaborator Author

alexander-beedie commented Jan 27, 2023

Side-note: how does one select column names that look like regexes? 🤣

df = pl.DataFrame( {"^x.z$": [1,2,3]} )
# shape: (3, 1)
# ┌───────┐
# │ ^x.z$ │
# │ ---   │
# │ i64   │
# ╞═══════╡
# │ 1     │
# │ 2     │
# │ 3     │
# └───────┘

df.select("^x.z$")
# shape: (0, 0)
# ┌┐
# ╞╡
# └┘

@ritchie46
Member

Side-note: how does one select column names that look like regexes? 🤣

df = pl.DataFrame( {"^x.z$":[1, 2, 3]} )
# shape: (3, 1)
# ┌───────┐
# │ ^x.z$ │
# │ ---   │
# │ i64   │
# ╞═══════╡
# │ 1     │
# │ 2     │
# │ 3     │
# └───────┘

df.select("^x.z$")
# shape: (0, 0)
# ┌┐
# ╞╡
# └┘

Those column names are not allowed in polars. ;) Or you must use a regex escape as fallback.

The meta utilities are coming up btw.
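Editorial sketch: the regex-escape fallback mentioned above, modelled with the standard library only. The `resolve` helper is hypothetical. It assumes that a selector string wrapped in `^...$` is interpreted as a regex and anything else as a literal column name, which is consistent with the empty result shown in the example.

```python
import re


def resolve(selector, columns):
    """Toy column resolution: '^...$' strings are regexes, others are literal names."""
    if selector.startswith("^") and selector.endswith("$"):
        return [c for c in columns if re.fullmatch(selector, c)]
    return [c for c in columns if c == selector]


columns = ["^x.z$", "xyz"]

# Interpreted as a regex, the selector matches "xyz" but not the literal name.
print(resolve("^x.z$", columns))

# Escaping the literal name and re-anchoring it selects the intended column.
escaped = "^" + re.escape("^x.z$") + "$"
print(escaped)
print(resolve(escaped, columns))
```

With only the literal `^x.z$` column present, the unescaped selector matches nothing, which mirrors the `shape: (0, 0)` result in the thread.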

@alexander-beedie
Collaborator Author

alexander-beedie commented Jan 27, 2023

Those column names are not allowed in polars. ;) Or you must use a regex escape as fallback.

Shall we actively prevent them? Could easily detect/raise inside col.
Probably a good idea, given the above, heh.

The meta utilities are coming up btw.

Fantastic :)

@ritchie46
Member

Those column names are not allowed in polars. ;) Or you must use a regex escape as fallback.

Shall we actively prevent them? Could easily detect/raise inside col. Probably a good idea, given the above, heh.

The meta utilities are coming up btw.

Fantastic :)

Yeah, I could add an assert there.

@gab23r
Contributor

gab23r commented Jan 27, 2023

Why not automatically convert an expression with multiple output names into a struct when alias is used, so that:

df = pl.DataFrame( {"x1":[1,2,6], "x2":[1,2,3]} ).with_columns(pl.col(pl.Int64).alias('ints'))

would work as well?

I think this is a nice feature. It could be extended to other functions, like select for example.

@alexander-beedie alexander-beedie force-pushed the structify-multioutput-kwargs branch 3 times, most recently from 5533bc7 to d6451d5 Compare January 28, 2023 01:18
@alexander-beedie
Collaborator Author

alexander-beedie commented Jan 28, 2023

Why not automatically convert an expression with multiple output names into a struct when alias is used, so that:

Hmm... It doesn't feel as intentional?
(Though perhaps my feeling there will change as more people discover this method ;)

I think this is a nice feature. It could be extended to other functions, like select for example.

Update: Done

Member

@ritchie46 ritchie46 left a comment

One small remark.

Can we also add a structifying example to with_columns? This behavior deserves some documentation. :)

py-polars/polars/internals/lazyframe/frame.py (resolved)
@alexander-beedie
Collaborator Author

Can we also add a structifying example to with_columns? This behavior deserves some documentation. :)

A very good point :)
Will take care of that in a few hours when I'm back.

@alexander-beedie alexander-beedie force-pushed the structify-multioutput-kwargs branch 3 times, most recently from b3cab2e to bef9da0 Compare January 28, 2023 16:25
@alexander-beedie
Collaborator Author

Done; added an extra docstring example/explanation, squashed, rebased...👌

@ritchie46
Member

Almost there @alexander-beedie. Maybe I missed it, but I believe we haven't yet added this functionality to select?

@alexander-beedie
Collaborator Author

Ahh, you didn't miss it - apparently I missed that we were extending it there (though it makes perfect sense to do so!).

I'll be at the airport shortly - will see if there's time to do it before boarding - and, if not, I'm still hoping it might be possible to do a commit from 10km up 😂

@ritchie46
Member

Ahh, you didn't miss it - apparently I missed that we were extending it there (though it makes perfect sense to do so!).

Yeah, I want to keep those two consistent.

There is no hurry. The 0.16 release will still take a few days. Have a good flight!

@alexander-beedie alexander-beedie changed the title feat(python): automagically upconvert with_columns kwarg expressions with multiple output names to struct feat(python): automagically upconvert with_columns kwarg expressions with multiple output names to struct; extend **named_kwargs support to select Jan 29, 2023
@ritchie46
Member

Alright. Here goes! 💯

@ritchie46 ritchie46 merged commit 45667c1 into pola-rs:master Jan 29, 2023
cojmeister pushed a commit to cojmeister/polars that referenced this pull request Jan 30, 2023
…s with multiple output names to struct; extend `**named_kwargs` support to `select` (pola-rs#6497)

Co-authored-by: Ritchie Vink <ritchie46@gmail.com>
@alexander-beedie alexander-beedie deleted the structify-multioutput-kwargs branch January 30, 2023 17:46
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars
Development

Successfully merging this pull request may close these issues.

Confusing (& wrong) behavior when using with_columns incorrectly
3 participants