Documentation for map_batches / map_elements / map_groups / map_rows is confusing #14521
Comments
Good points @soerenwolfers - I'd be happy to help if you want to make a PR |
I would need to be certain about what's going on first, and I think beyond playing with the functions to see what's currently possible I would also need some guidance on what the intentions are. For example, I don't currently understand why there'd be more than
(To avoid having to pack multiple columns into a single struct column, both of these could also be allowed to be applied on entire dataframes (e.g., In particular, I would only have two names instead of four. |
I can't speak for the devs, but I don't think there's any drive to change the names or API, I think improving the documentation would be enough |
Yeah I get that, but I for one cannot improve the documentation because I don't know what's going on. |
There is also the top-level
import json
import polars as pl
df = pl.DataFrame(dict(
group=[1, 2, 3, 1, 2],
value=[1, 2, 3, 4, 5]
))
df.with_columns(
elements = pl.col("value").map_elements(lambda x: json.dumps(list(x))).over("group"),
groups = pl.map_groups("value", lambda x: json.dumps(list(x[0]))).over("group")
)
# shape: (5, 4)
# ┌───────┬───────┬──────────┬────────┐
# │ group ┆ value ┆ elements ┆ groups │
# │ ---   ┆ ---   ┆ ---      ┆ ---    │
# │ i64   ┆ i64   ┆ str      ┆ str    │
# ╞═══════╪═══════╪══════════╪════════╡
# │ 1     ┆ 1     ┆ [1, 4]   ┆ [1, 4] │
# │ 2     ┆ 2     ┆ [2, 5]   ┆ [2, 5] │
# │ 3     ┆ 3     ┆ [3]      ┆ [3]    │
# │ 1     ┆ 4     ┆ [1, 4]   ┆ [1, 4] │
# │ 2     ┆ 5     ┆ [2, 5]   ┆ [2, 5] │
# └───────┴───────┴──────────┴────────┘
Should there not just be |
On top of this, there seem to be 3 flavors of Why Why isn't there According to documentation, Also, why isn't there I would concur with the OP: |
Description
Summary
The functionality split between map_batches/map_elements/map_groups/map_rows is complicated enough that it warrants completely precise, example-rich documentation. (Personally, I feel the API behind the polars functions itself might use some improvement as well, but that's another ticket, and not one that I as a novice should make claims about.)
Below, I list some specific gripes with the current documentation of the individual functions.
However, I think beyond fixing these, it'd be useful to have some joint documentation of the differences and use-cases of each of the various UDF functions. For example, I really like numpy's documentation of (generalized) ufuncs at https://numpy.org/doc/stable/reference/c-api/generalized-ufuncs.html in comparison.
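As a starting point for such joint documentation, here is a minimal plain-Python sketch of the four calling conventions that the function names and signatures seem to suggest. It is not polars code and says nothing about how polars behaves in group-by contexts, which is exactly the open question in this issue. The `model_*` helper names are hypothetical and exist only for illustration; the toy data mirrors the `group`/`value` example from the comments.

```python
from collections import defaultdict

# Toy data as plain-Python "columns" (an assumption for illustration only).
group = [1, 2, 3, 1, 2]
value = [1, 2, 3, 4, 5]


def model_map_elements(column, f):
    """Elementwise contract: f receives one scalar per call."""
    return [f(x) for x in column]


def model_map_batches(column, f):
    """Whole-column contract: f receives the entire column in a single call."""
    return f(column)


def model_map_groups(keys, columns, f):
    """Per-group contract: f receives a list of column slices, once per group."""
    rows_by_key = defaultdict(list)
    for i, key in enumerate(keys):
        rows_by_key[key].append(i)
    return {
        key: f([[col[i] for i in idx] for col in columns])
        for key, idx in rows_by_key.items()
    }


def model_map_rows(columns, f):
    """Per-row contract: f receives one tuple per row."""
    return [f(row) for row in zip(*columns)]


doubled = model_map_elements(value, lambda x: x * 2)
centered = model_map_batches(value, lambda col: [x - sum(col) / len(col) for x in col])
group_sums = model_map_groups(group, [value], lambda cols: sum(cols[0]))
row_sums = model_map_rows([group, value], lambda row: row[0] + row[1])
```

The point of the model is that each function differs mainly in what the UDF is handed (a scalar, a whole column, a list of per-group columns, or a row tuple), and that is the contract the documentation would ideally state explicitly for each context.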
map_elements
Documentation for map_elements
Confusingly, map_elements has a name that very clearly suggests that the function here is applied pointwise, or at least should operate in a pointwise fashion, yet the documentation indicates that it can be used to emulate grouped folds, such as group_by().cum_sum(), given that according to the documentation:
I think it should be more clearly highlighted that map_elements is allowed in a non-pointwise fashion in group-by contexts (or that it is not, in case this is an implementation detail that may change)
map_groups
Documentation for map_groups
map_elements can also apply a UDF in a GroupBy context as per its own documentation
"function" parameter except that it contradicts what's in the signature, i.e., that it is fed a Sequence of Series.
map_batches
Documentation for map_batches
This is the worst.
which is confusing because it sounds like the same thing that map_groups and map_elements do in a group-by context, and the author says at #12941 that
which seems to contradict the claim that the function is applied to "whole Series".
Being new to polars, I don't fully understand what this means (that's not a complaint, issues don't have to use language for beginners), but confusingly in another place he seems to contradict (but this may just be my misunderstanding) even that, saying that map_batches doesn't even get fed complete "batches":
yet the description of the agg_list parameter says
and half the documentation is spent discussing how to use that parameter in a group by context.
map_batches imposes on its function arguments, and whether that contract depends on the context in which it is used. That seems much more important to know than implementation details about what the function is applied to in the end. Same goes for all three really.
Link
No response