Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(python): Add map_dict expression. #5899

Merged
merged 2 commits into from
Feb 10, 2023

Conversation

ghuls
Copy link
Collaborator

@ghuls ghuls commented Dec 23, 2022

No description provided.

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars labels Dec 23, 2022
@ghuls
Copy link
Collaborator Author

ghuls commented Dec 23, 2022

WIP.

Not sure about the name remap
Maybe map_dictor something like that, might be better.
Suggestions for better method argument names are welcome too.

TODO:

  • Add tests
  • Dispatch to Series
  • Add methods to documentation
  • Add fast path if no default argument is set (or is None).

#2377 would have been useful to have.
Now I had to use:

.with_column(pli.lit(True).alias(is_remapped_column)),
...
pli.when(pli.col(is_remapped_column).is_not_null())
```

@zundertj
Copy link
Collaborator

zundertj commented Dec 23, 2022

I think map_dict is a better name, as it tells you that you are doing a map, with a dict as input, in contrast to the current map which takes in any callable. remap could also be about updating column names or something.

The nicest feature, but that is probably much more work, is if we could get this as an expression as well. Currently, you have four parameters, only two would be needed for that:

        column: str,
        remapping: dict[Any, Any],
        default: Any = None,
        new_column: str | None = None,
pl.col("input_col").map_dict({"hello": 1, "world", 2}, default="default text here").alias("output_col")

As for the naming:

@ghuls
Copy link
Collaborator Author

ghuls commented Dec 23, 2022

I also think map_dict might be better.
I think, dict is a reserved keyword, so overriding it is not great. mapping might be already better.

fill_value might indeed be better than default. Altought with an pl.col(X) value as a default value, fill_value might look weird.

Instead of new_column (recycled from read_csv) , alias would be another option. Although the use of alias in polars syntax was always slightly weird for me, but would be more consistent.

I like the following syntax more, but as we need join, I don't know how to get the whole dataframe from the column expression. I assume over needs access to the whole dataframe too, so it might be possible.

pl.col("input_col").map_dict({"hello": 1, "world", 2}, default="default text here").alias("output_col")

@gab23r
Copy link
Contributor

gab23r commented Dec 23, 2022

Is it reasonable to say that map could take either a Callable or a dictionary? Then the default value could be handle via fill_null and new column via alias

@Julian-J-S
Copy link
Contributor

First of all thanks for working on this <3

Naming:
I think map_dict + mapping + default is fine (map would be great but that is already taken and I guess overloading is not possible?)

Implementation:
I don't know the implementation details but I wonder two things:

  1. why is this implemented on the DataFrame? Not sure if/why this is needed but as an end-user I see use cases for expressions (pl.col("a").map_dict(d)) and for series (s.map_dict(d)). Pandas also only has this on Series and Index and not on the DataFrame
  2. If this is implemented on DataFrame I think we should remove column and new_column parameters and apply the mapping to all columns. This would be consistent with other methods on DataFrame like map or max that apply to all columns. (On DataFrame: all columns, On Series/Expression: only specified columns)

What do you think?

@zundertj
Copy link
Collaborator

zundertj commented Dec 24, 2022

So we have various map methods & functions currently:

  1. LazyFrame.map => take in callable, return lazy frame
  2. Expr.map => takes in callable, return expression
  3. pl.map => like, Expr.map, but with the expression as first argument (rather than object), and no agg_list parameter

DataFrame.map has not even been taken yet (to my surprise). If we are going with map_dict, actually overriding map by passing allowing one to pass in a dict rather instead of a callable seems not such a large step. Operating on a dict is basically operating on the callable dict.get, but ideally in an optimized way.

@ritchie46
Copy link
Member

@ghuls could we do this with a map_dict expression? I don't feel this should be a method an the DataFrame/LazyFrame.

The function that gets the Series could turn this into a DataFrame and do the join on that internally. Or am I missing a complicating factor?

@ghuls
Copy link
Collaborator Author

ghuls commented Dec 27, 2022

@ghuls could we do this with a map_dict expression? I don't feel this should be a method an the DataFrame/LazyFrame.

The function that gets the Series could turn this into a DataFrame and do the join on that internally. Or am I missing a complicating factor?

I forgot about the possibility of making a DataFrame from a Series. It should indeed work.

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Jan 2, 2023

Definitely feels like an expression; also, rather than continuing to overload the name map, how about a clean break and call the expression switch, to declare what is done (switching values), rather map_dict which (at least to me 😅) feels more like it declares how it is done? (with a dict).

This could be my C++ past coming back to haunt me, but (structurally) it follows exactly same pattern, so there is a history/precedent for the name...

#include <iostream>
using namespace std;

// trivial example 
int main () {
   char country_code = 'FR';

   switch( country_code ) {
      case 'FR' :
         cout << "France" << endl; 
         break;
      case 'ES' :
         cout << "Spain" << endl;
         break;
      default :
         cout << "Unknown" << endl;
   }
   return 0;
}

...and as a polars expression:

pl.col('country_code').switch( 
  {'FR':'France', 'ES':'Spain'}, default='Unknown'
)

If not, I think pl.col('xyz').remap probably does looks cleaner than pl.col('xyz').map_dict, but that's not a hill I'd die on... ;)

@ghuls
Copy link
Collaborator Author

ghuls commented Jan 2, 2023

Yes, it should be an expression. At the time I couldn't figure out how to do it due the use of join (forgot that you can transform a series to a dataframe).

I just looked for some synonyms:
https://www.wordhippo.com/what-is/another-word-for/switch.html

swap might be a nice and clear one.
I also likesubstitute, but it is a bit long and substitution feels to closely to substring subsitution replacement .

@thomasfrederikhoeck
Copy link
Contributor

Yes, it should be an expression. At the time I couldn't figure out how to do it due the use of join (forgot that you can transform a series to a dataframe).

I just looked for some synonyms: https://www.wordhippo.com/what-is/another-word-for/switch.html

swap might be a nice and clear one. I also likesubstitute, but it is a bit long and substitution feels to closely to substring subsitution replacement .

I agree that swap is very clear for the once were there is a key but for for the ones with no match you would expect swap to keep the original. So remap might be better.

@ritchie46
Copy link
Member

@ghuls any updates on this one? I can take it from here if you like.

@ghuls
Copy link
Collaborator Author

ghuls commented Feb 9, 2023

@ghuls any updates on this one? I can take it from here if you like.

Looking at it again.

@ghuls
Copy link
Collaborator Author

ghuls commented Feb 9, 2023

@ritchie46 I was looking back in Discord to check how to implement remap as an expression instead of on the dataframe, but it looks like currently it is not possible according to your comment at the time:

Getting the df in an expression is a bit harder.
It would need a dedicated expression.
In the rust physical plan

Or is there another way? (I tried pl.select(self), but that does not seem to work).

@ritchie46
Copy link
Member

Create a map function that accepts a Series. The Series can convert to a DataFrame when the mapped function is executed.

@ritchie46
Copy link
Member

Thanks @ghuls. I took your code and converted it to an expression. I don't think it should be on the frames. This also removes the need for an output column name.

@ritchie46 ritchie46 changed the title feat(python): Add remap method. feat(python): Add map_dict expression. Feb 10, 2023
@ritchie46 ritchie46 merged commit 843a153 into pola-rs:master Feb 10, 2023
@ghuls
Copy link
Collaborator Author

ghuls commented Feb 10, 2023

Ah, thanks. I was just working on the same. I also managed to get it to work if you select another column as default.

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Feb 10, 2023

Thanks @ghuls. I took your code and converted it to an expression. I don't think it should be on the frames. This also removes the need for an output column name.

@ritchie46: Can we change the name before release? I think remap was looking like the favourite (though I'd still vote for switch, heh)

@ritchie46
Copy link
Member

I think map_dict is most explicit. I would need to lookup remap and be wondering how it relates to map itself.

@alexander-beedie
Copy link
Collaborator

Fair enough; I do not feel strongly about this, I'm just happy the feature exists :)

(Thx again @ghuls!)

@ghuls
Copy link
Collaborator Author

ghuls commented Feb 10, 2023

Ah, thanks. I was just working on the same. I also managed to get it to work if you select another column as default.

Using other column expressions than the column that is map_dict-ed for the default value, is possible with:
#6781

@sm-Fifteen
Copy link

sm-Fifteen commented Feb 10, 2023

Regarding the name, I had suggested lookup (possibly dict_lookup) in #3789, but IMO map_dict says everything it needs to say. It really is just a map operation with a fixed dict instead of an arbitrary function, after all. I like it.

Being able to throw on missing value instead of defaulting could be practical in some cases, but this is already merged, so I guess it would have to be for a separate feature PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants