Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Documentation for map_batches / map_elements / map_groups / map_rows is confusing #14521

Open
soerenwolfers opened this issue Feb 15, 2024 · 6 comments
Labels
documentation Improvements or additions to documentation

Comments

@soerenwolfers
Copy link

soerenwolfers commented Feb 15, 2024

Description

Summary

The functionality split between map_batches / map_elements / map_groups / map_rows is complicated enough that it warrants having completely precise and rich-of-examples documentation.
(Personally, I feel the API behind the polars functions itself might use some improvement as well, but that's another ticket, and not one that I as a novice should make claims about.)

Below, I list some specific gripes with the current documentation of the individual functions.

However, I think beyond fixing these, it'd be useful to have some joint documentation of the differences and use-cases of each of the various UDF functions. For example, I really like numpy's documentation of (generalized) ufuncs at https://numpy.org/doc/stable/reference/c-api/generalized-ufuncs.html in comparison.

map_elements

Documentation for map_elements

Confusingly, map_elements has a name that very clearly suggests that the function here is applied pointwise, or at least should operate in a pointwise fashion, yet the documentation indicates that it can be used to emulate grouped folds, such as group_by().cum_sum(), given that
according to the documentation:

GroupBy
Expects function to be of type Callable[[Series], Any]. For each group, applies a Python function to the slice of the column corresponding to that group.

I think it should be more clearly highlighted that map_elements is allowed in a non-pointwise fashion in group-by contexts (or that it is not, in case this is an implementation detail that may change)

map_groups

Documentation for map_groups

  • The description says nothing but "Apply a custom/user-defined function (UDF) in a GroupBy context." which leaves the question why this function exists, given that map_elements can also apply a UDF in a GroupBy context as per it's own documentation
  • The documentation says nothing about the "function" parameter except that it contradicts what's in the signature, i.e., that it is fed a Sequence of Series.

map_batches

Documentation for map_batches

This is the worst.

  • First, nowhere in the documentation is it even hinted at what a "batch" is. Instead, the documentation starts by saying
Apply a custom python function to a whole Series or sequence of Series.

which is confusing because it sounds like the same that map_groups and map_elements do in a group by context, and the author says at #12941 that

A batch by definition doesn't include a whole group. E.g. if you take a batch from a data set and you sort that batch, you shouldn't assume the whole dataset is sorted.

which seems to contradict the claim that the function is applied to "whole Series".
Being new to polars, I don't fully understand what this means (that's not a complaint, issues don't have to use language for beginners), but confusingly in another place he seems to contradict (but this may just be my misunderstanding) even that, saying that map_batches doesn't even get fed complete "batches":

> Methods are applied to the object they are called on. So if I call map on a Series, I expect to map the whole Series.

This is not guaranteed. Expr.map may get an incomplete batch. This happens for instance in the streaming engine.
  • Next, the documentation says
Warning

If you are looking to map a function over a window function or group_by context, refer to `map_elements()` instead.

yet the description of the agg_list parameter says

Aggregate the values of the expression into a list before applying the function. This parameter only works in a group-by context

and half the documentation is spent discussing how to use that parameter in a group by context.

  • Finally, I have absolutely no idea what contract map_batches imposes on its function arguments, and whether that contract depends on the context in which it is used. That seems much more important to know than implementation details about what the function is applied to in the end. Same goes for all three really.

Link

No response

@soerenwolfers soerenwolfers added the documentation Improvements or additions to documentation label Feb 15, 2024
@braaannigan
Copy link
Collaborator

Good points @soerenwolfers - I'd be happy to help if you want to make a PR

@soerenwolfers
Copy link
Author

soerenwolfers commented Feb 16, 2024

I would need to be certain about what's going first, and I think beyond playing with the functions to see what's currently possible I would also need some guidance on what the intentions are.

For example, I don't currently understand why there'd be more than

  • On the one hand, "elementwise mappings", the functions supplied to which would have to operate element-/row-wise but are allowed to set a vectorized parameter that determines whether the supplied function is () -> () (accepts scalar, returns scalar) or (n) -> (n) (accepts series, returns series).
    That's what I would call map_rows.
    On expressions, the inputs would either be individual values of a single series (e.g. pl.col('a').map_rows(f, vectorized=False)) or arbitrarily sized (decided by implementation details) chunks of series (e.g. pl.col('a').map_rows(f, vectorized=True)).
  • On the other hand, "groupwise mappings", the functions supplied to which would not need to satisfy any contract except needing to be (n)->(n)(accepts series, returns series) in a window context and (n)->() (accepts series, returns scalar) in a groupby context. That's what I would call map_groups. On expressions (e.g., pol.col('a').map_groups(f).over('b') or df.group_by('b').agg(pl.col('a').map_groups(f))) the inputs would be chunks (corresponding to the entire respective window partition or group, or the entire series if applied without context) of a series.

(To avoid having to pack multiple columns into a single struct column, both of these could also be allowed to be applied on entire dataframes (e.g., df.map_rows(f, vectorized=False) or df.map_rows(f, vectorized=True) or df.group_by('b').map_groups(f) or df.map_groups(f).over('b')), but there is a choice to be made then whether to supply dataframes or series-of-row-type in that case, the best decision for which which will depend on polars philosophy/implementation details.)

In particular, I would only have two names instead of four.

@braaannigan
Copy link
Collaborator

I can't speak for the devs, but I don't think there's any drive to change the names or API, I think improving the documentation would be enough

@soerenwolfers
Copy link
Author

Yeah I get that, but I for one cannot improve the documentation because I don't know what's going on.

@cmdlineluser
Copy link
Contributor

There is also the top-level pl.map_groups (and map_batches, but no "elements")

.map_elements changes behaviour depending on the context i.e. it is "map_rows" in select context but inside a groupby context it is the same as .map_groups (without the outer [])?

import json
import polars as pl

df = pl.DataFrame(dict(
    group=[1, 2, 3, 1, 2], 
    value=[1, 2, 3, 4, 5]
))

df.with_columns(
    elements = pl.col("value").map_elements(lambda x: json.dumps(list(x))).over("group"),
    groups   = pl.map_groups("value", lambda x: json.dumps(list(x[0]))).over("group")
)
# shape: (5, 4)
# ┌───────┬───────┬──────────┬────────┐
# │ group ┆ value ┆ elements ┆ groups │
# │ ---   ┆ ---   ┆ ---      ┆ ---    │
# │ i64   ┆ i64   ┆ str      ┆ str    │
# ╞═══════╪═══════╪══════════╪════════╡
# │ 1     ┆ 1     ┆ [1, 4]   ┆ [1, 4] │
# │ 2     ┆ 2     ┆ [2, 5]   ┆ [2, 5] │
# │ 3     ┆ 3     ┆ [3]      ┆ [3]    │
# │ 1     ┆ 4     ┆ [1, 4]   ┆ [1, 4] │
# │ 2     ┆ 5     ┆ [2, 5]   ┆ [2, 5] │
# └───────┴───────┴──────────┴────────┘

Should there not just be Expr.map_rows / Expr.map_groups and disallow .map_rows in a groupby context?

@soerenwolfers soerenwolfers changed the title Documentation for map_batches / map_elements / map_groups is confusing Documentation for map_batches / map_elements / map_groups / map_rows is confusing Feb 19, 2024
@vadimcn
Copy link

vadimcn commented Jul 2, 2024

On top of this, there seem to be 3 flavors of map_batches, all slightly different:

Why LazyFrame.map_batches takes a DataFrame, while others take a Series? Nevertheless, it seems that the DataFrame that the function is called with, always contains only one column?

Why isn't there streamable of is_elementwise parameter for polars.map_batches?

According to documentation, polars.map_batches may be applied to multiple expr's at once, but will the function be called with each series separately, or all at once (as multiple arguments)?

Also, why isn't there DataFrame.map_batches?

I would concur with the OP: map_... APIs are in a serious need of rationalization.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation
Projects
None yet
Development

No branches or pull requests

4 participants