"Inefficient map_*" warnings (tracking issue) #9968

Open
10 of 15 tasks
MarcoGorelli opened this issue Jul 19, 2023 · 20 comments
Assignees
Labels
accepted Ready for implementation enhancement New feature or an improvement of an existing feature python Related to Python Polars

Comments

@MarcoGorelli
Collaborator

MarcoGorelli commented Jul 19, 2023

@alexander-beedie let's start a new issue to keep track of ideas in this space

  • simple binary operations (e.g. lambda x: x+1)
  • logical conjunctions (and/or chains)
  • map_dict (e.g. lambda x: my_dict[x])
  • json.loads (e.g. lambda x: json.loads(x) or bare json.loads)
  • numpy functions which have expr equivalents (e.g. lambda x: np.sin(x) or bare np.sin)
  • string methods: uppercase, lowercase and title case
  • conditional logic (if/else)
  • list methods: lambda lst: ' '.join([str(x) for x in lst])
  • to_datetime: lambda x: dt.datetime.strptime(x, fmt), if possible (see this issue. Perhaps the warning could even suggest the pyarrow workaround - better than nothing, and much better than apply)
  • list lookups: lambda x: my_list[x] -> .map_dict({idx: val for idx, val in enumerate(my_list)})
  • .dt.month and other stdlib datetime functions
  • .str.strip_chars and other stdlib string functions
  • pl.col("step").map_elements(lambda x: range(x - step_size, x + step_size, 12))
  • sigmoid: lambda a: 1 / (1 + np.exp(-a))
  • str.replace
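
For the list-lookup item above, the suggested rewrite relies on a list and its enumerate-built dict agreeing on every valid index. A minimal plain-Python sketch of that equivalence (`my_list`, `lookup`, and `indices` are illustrative names, not part of Polars); note the assumption that indices are non-negative, since a negative index works on the list but is absent from the dict:

```python
# Illustrative data; any list would do.
my_list = ["zero", "one", "two"]

# The mapping suggested in the task list: position -> value.
lookup = {idx: val for idx, val in enumerate(my_list)}

indices = [2, 0, 1]
via_list = [my_list[i] for i in indices]
via_dict = [lookup[i] for i in indices]
print(via_list == via_dict)

# Caveat: negative indices are valid for the list but not keys of the dict.
print(my_list[-1], -1 in lookup)
```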
@MarcoGorelli MarcoGorelli added the python Related to Python Polars label Jul 19, 2023
@stinodego stinodego added the enhancement New feature or an improvement of an existing feature label Jul 19, 2023
@lucazanna

Can we also include uppercase, lowercase and title case? I have seen the Python string methods used with apply for those instead of the Polars expressions.

@alexander-beedie alexander-beedie changed the title from Inefficient apply warnings: tracker to "Inefficient apply" warnings (tracking issue) Jul 19, 2023
@alexander-beedie
Collaborator

Aha! Got a super-clean approach for handling the numpy, string, and json identification/mapping now, though may take a day or two to get around to it... :)

@lucazanna

lucazanna commented Jul 23, 2023

I just read an article with a code snippet using apply for a simple if condition. I wonder if ternary operators (value_if_true if condition else value_if_false) could also be included?

@alexander-beedie
Collaborator

alexander-beedie commented Jul 24, 2023

I just read an article with a code snippet using apply for a simple if condition. I wonder if ternary operators (value_if_true if condition else value_if_false) could also be included?

Got a sample? I need to rework/extend the current handling of and/or logic (which is represented by various *JUMP* control flow ops in the bytecode) so it can also handle if/else, which is represented similarly...

@lucazanna

Got a sample? I need to rework/extend the current handling of and/or logic (which is represented by various *JUMP* control flow ops in the bytecode) so it can also handle if/else, which is represented similarly...

Yes, here is an example:

df.select(
    pl.col('a').apply(lambda x: x*2 if x>=5 else x)
)

https://towardsdatascience.com/manipulating-values-in-polars-dataframes-1087d88dd436

Unrelated to this - will the recommendation engine also work on small non-lambda Python functions?

@MarcoGorelli
Collaborator Author

will the recommendation engine also work on small non-lambda Python functions?

you mean like this?

In [2]: def func(value):
   ...:     return value **2
   ...:

In [3]: df.select(pl.col('a').apply(func))
<ipython-input-3-53ff52cbccd5>:1: PolarsInefficientApplyWarning:
Expr.apply is significantly slower than the native expressions API.
Only use if you absolutely CANNOT implement your logic otherwise.
In this case, you can replace your `apply` with an expression:
-  pl.col("a").apply(func)
+  pl.col("a") ** 2

  df.select(pl.col('a').apply(func))

If so, yup!

@lucazanna

lucazanna commented Jul 24, 2023

@MarcoGorelli nice!

what about:

def func(value):
   if value > 10:
      return "a"
   elif value > 0:
      return "b"
   else:
      return "c"

I imagine it will also work when if and else are added?

@MarcoGorelli
Collaborator Author

MarcoGorelli commented Jul 24, 2023

I expect that if the respective lambda function were to work, then that one should work as well - but we'll make sure to test for it explicitly, thanks!

EDIT: this would actually require a little extra work, as the corresponding lambda would be equivalent to

def func(x):
    return 'a' if x>10 else 'b' if x>0 else 'c'

so thanks for having brought it up
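
A quick plain-Python check that the multi-return function and the nested ternary agree. The function names and sample values are made up for illustration:

```python
def func_multi_return(value):
    # The if/elif/else form from the earlier comment.
    if value > 10:
        return "a"
    elif value > 0:
        return "b"
    else:
        return "c"


def func_single_return(x):
    # The equivalent single-return form.
    return "a" if x > 10 else "b" if x > 0 else "c"


samples = [-3, 0, 1, 10, 11]
multi = [func_multi_return(v) for v in samples]
single = [func_single_return(v) for v in samples]
print(multi == single)
```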

@ritchie46
Member

ritchie46 commented Jul 24, 2023

Note that polars speculatively evaluates branches in when -> then -> otherwise, so if one of the branches can fail, apply is the correct way to deal with that.
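
To illustrate why a branch that can fail matters, here is a plain-Python analogy. It is not Polars code and says nothing about how Polars is implemented: `eager_select` is a hypothetical helper that computes both branches for every row before the condition picks between them, while `per_element` mirrors what a Python function passed to apply does.

```python
def per_element(values):
    # Python's ternary evaluates only the branch selected for each value,
    # so the division is never attempted for v == 0.
    return [10 / v if v != 0 else 0.0 for v in values]


def eager_select(values):
    # Analogy for speculative evaluation: both branches are computed for
    # every row first, then the mask chooses between the results.
    mask = [v != 0 for v in values]
    then_branch = [10 / v for v in values]  # raises if any v == 0
    else_branch = [0.0 for _ in values]
    return [t if m else e for m, t, e in zip(mask, then_branch, else_branch)]


values = [5, 0, 2]
print(per_element(values))

try:
    eager_select(values)
    eager_failed = False
except ZeroDivisionError:
    eager_failed = True
print(eager_failed)
```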

@alexander-beedie
Collaborator

alexander-beedie commented Jul 24, 2023

will the recommendation engine also work on small non-lambda Python functions?

Yes; it doesn't matter if you pass a lambda, function, or method on a class - they will all be disassembled down into the same primitive ops. However, I'm only considering single-return functions/lambdas, so multiple-return functions won't work (as they aren't quite the same thing as lambdas).
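
A small sketch of that point using only the standard-library dis module. `opnames` is a hypothetical helper written for this example, not part of the Polars implementation; it lists the opcode names of a callable so the three forms can be compared:

```python
import dis


def opnames(fn):
    # Names of the bytecode instructions CPython generated for `fn`.
    return [ins.opname for ins in dis.get_instructions(fn)]


def func(value):
    return value ** 2


as_lambda = lambda value: value ** 2


class Squarer:
    def square(self, value):
        return value ** 2


print(opnames(func))
print(opnames(func) == opnames(as_lambda) == opnames(Squarer().square))
```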

when if and else are added?

I'm liking the vote of confidence here 🤣 Control flow from bytecode can be tricky (and/or and if/else look quite similar since they both use flavours of *JUMP* ops and have to be disambiguated; starting to look into that, though will need some care to get it right) ;)
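
To see the similarity for yourself, the sketch below lists only the jump instructions for an and-chain and for a ternary. `jump_ops` is a hypothetical helper for this example; the exact opcode names vary between Python versions, so no particular output is assumed:

```python
import dis


def jump_ops(fn):
    # Keep only the control-flow instructions whose name contains "JUMP".
    return [ins.opname for ins in dis.get_instructions(fn) if "JUMP" in ins.opname]


and_chain = lambda x: x > 0 and x < 10
ternary = lambda x: x * 2 if x >= 5 else x

print("and-chain:", jump_ops(and_chain))
print("ternary:  ", jump_ops(ternary))
```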

@lucazanna

I'm liking the vote of confidence here 🤣

if there is one person who can do it, I know it's you @alexander-beedie

@henryharbeck
Contributor

Not sure if this is pushing it too far / asking for too much (apologies in advance if it is), but I did think of some potential niceties around conditionals. Will leave it to you all in terms of whether you think it is reasonable and/or feasible. Just putting thoughts out there.

If a function only has checks for equality with an optional else clause, then that could be translated to map_dict with a default argument (and leave off the default if there is no else)
E.g.,

df = pl.DataFrame({"gender": ["M", "F", "M", "X"]})

def long_gender(row):
    if row == "M":
        rv = "Male"
    elif row == "F":
        rv = "Female"
    else:
        rv = "Unknown"
    return rv

df.with_columns(
    pl.col("gender").apply(long_gender).alias("bad_way"),
    pl.col("gender").map_dict({"M": "Male", "F": "Female"}, default="Unknown").alias("good_way")
)
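
The equivalence this rewrite depends on can be checked in plain Python, with dict.get standing in for a mapping with a default (an analogy only, not the Polars call):

```python
def long_gender(row):
    if row == "M":
        rv = "Male"
    elif row == "F":
        rv = "Female"
    else:
        rv = "Unknown"
    return rv


# Equality checks become keys; the else clause becomes the default.
mapping = {"M": "Male", "F": "Female"}

genders = ["M", "F", "M", "X"]
via_function = [long_gender(g) for g in genders]
via_mapping = [mapping.get(g, "Unknown") for g in genders]
print(via_function == via_mapping)
```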

If a function only has a single numerical comparison operator (i.e. only one of <, <=, >, >=) with an else clause, then that could be translated to cut. Some processing of the operators and inputs would probably be required.
E.g.,

df = pl.DataFrame({"score": range(1, 11)})

def grade_score_le(row):
    if row <= 5:
        rv = "Fail"
    elif row <= 7:
        rv = "Pass"
    else:
        rv = "Distinction"
    return rv

def grade_score_ge(row):
    if row >= 8:
        rv = "Distinction"
    elif row >= 6:
        rv = "Pass"
    else:
        rv = "Fail"
    return rv

# other operators omitted for brevity

df.with_columns(
    # both functions return the same thing, so can be translated into the same `cut`
    *(pl.col("score").apply(fn).alias(f"bad_way_{fn.__name__}") for fn in [grade_score_le, grade_score_ge]),
    (
        pl.when(pl.col("score") <= 5).then("Fail")
        .when(pl.col("score") <= 7).then("Pass")
        .otherwise("Distinction")
    ).alias("good_way"),
    pl.col("score").cut([5, 7], ["Fail", "Pass", "Distinction"]).alias("better_way")
)
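
A plain-Python sketch of the bucketing idea behind this, using the standard-library bisect module rather than Polars. `grade_by_breaks`, `BREAKS`, and `LABELS` are illustrative names. It also shows one assumption worth keeping in mind: the two functions only agree for integer scores.

```python
from bisect import bisect_left

BREAKS = [5, 7]
LABELS = ["Fail", "Pass", "Distinction"]


def grade_score_le(row):
    if row <= 5:
        return "Fail"
    elif row <= 7:
        return "Pass"
    return "Distinction"


def grade_score_ge(row):
    if row >= 8:
        return "Distinction"
    elif row >= 6:
        return "Pass"
    return "Fail"


def grade_by_breaks(score):
    # bisect_left places a score equal to a break in the lower bucket,
    # matching the `<=` comparisons in grade_score_le.
    return LABELS[bisect_left(BREAKS, score)]


scores = range(1, 11)
agree = all(
    grade_score_le(s) == grade_score_ge(s) == grade_by_breaks(s) for s in scores
)
print(agree)

# With a non-integer score the two hand-written functions diverge.
print(grade_score_le(7.5), grade_score_ge(7.5))
```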

Unsure of any other optimisations, but I'm guessing the general rule would be that conditionals would be translated to when/then/otherwise?

@lucazanna

Added an issue here: #10210 for the if/else recommendation

@cmdlineluser
Contributor

https://stackoverflow.com/questions/76822683/polars-apply-lambda-alternative

Could be an example/test-case for list lookups.

@henryharbeck
Contributor

Hi @alexander-beedie, @MarcoGorelli,

I notice that the issue description mentions (and has ticked) both bare numpy functions and those used with a lambda

  • numpy functions which have expr equivalents (e.g. lambda x: np.sin(x) or bare np.sin)

At the moment, the lambda does not seem to warn, but the bare call does.

Example:

df = pl.DataFrame({"a": [1, 4]})
df.select(pl.col("a").apply(lambda x: np.sin(x))) # no warning raised
df.select(pl.col("a").apply(np.sin)) # warning is raised

Flagging here as I'm unsure if this is an issue, or just hasn't been implemented yet.

@MarcoGorelli
Collaborator Author

thanks for the report - this warns for me:

In [4]: df = pl.DataFrame({"a": [1, 4]})
   ...: df.select(pl.col("a").apply(lambda x: np.sin(x))) # no warning raised
<ipython-input-4-e464a21bac84>:2: PolarsInefficientApplyWarning:
Expr.apply is significantly slower than the native expressions API.
Only use if you absolutely CANNOT implement your logic otherwise.
In this case, you can replace your `apply` with the following:
  - pl.col("a").apply(lambda x: ...)
  + pl.col("a").sin()

  df.select(pl.col("a").apply(lambda x: np.sin(x))) # no warning raised
Out[4]:
shape: (2, 1)
┌───────────┐
│ a         │
│ ---       │
│ f64       │
╞═══════════╡
│ 0.841471  │
│ -0.756802 │
└───────────┘

In [5]: pl.__version__
Out[5]: '0.18.11'

could you give your polars and python versions please?

@henryharbeck
Contributor

Thanks for the quick response
Python: 3.11.4
Polars: 0.18.12
and if it makes any difference at all
numpy: 1.24.3

@MarcoGorelli
Collaborator Author

thanks! can reproduce, fix (and failing test) incoming!

@MarcoGorelli
Collaborator Author

@henryharbeck are you running this in IPython / Jupyter?

I think it's that they apply some modifications and end up producing slightly different bytecode

If you make a Python script with just the following:

import numpy as np
import polars as pl

df = pl.DataFrame({"a": [1, 4]})
df.select(pl.col("a").apply(lambda x: np.sin(x)))

, do you get the warning?

I do, but don't when running via IPython (in Python 3.11)
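
One way to compare the two environments is to dump the bytecode of the same lambda in a script and in an IPython cell and diff the output. A minimal sketch with the standard-library dis module; it makes no claim about what IPython actually changes, and `fn` is just an illustrative name:

```python
import dis

# `np` does not need to be imported here: the lambda is only disassembled,
# never called, so the global name is never looked up.
fn = lambda x: np.sin(x)

# Global and attribute names referenced by the lambda.
print(fn.__code__.co_names)

# One line per instruction; run this in both environments and compare.
for ins in dis.get_instructions(fn):
    print(ins.opname, ins.argrepr)
```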

@henryharbeck
Contributor

@MarcoGorelli, I was running it in Jupyter. Great stuff on figuring that out!

As a python script, the warning is produced. When running it as the first cell in a Jupyter notebook, no warning is produced.
Both are using the same venv with python 3.11

Projects
Status: Ready
Development

No branches or pull requests

7 participants