
Change default behavior map_dict #10755

Closed
ion-elgreco opened this issue Aug 28, 2023 · 16 comments · Fixed by #12599
Labels: accepted (Ready for implementation), enhancement (New feature or an improvement of an existing feature)
@ion-elgreco
Contributor

ion-elgreco commented Aug 28, 2023

Problem description

The current default behavior of map_dict is to replace only the matching values; anything not in the dict maps to null. With the possible renaming to replace_values #10744 (comment), setting the default parameter to pl.first() would make more sense: default=pl.first(). Then, if you want to replace only the values in the dict and null out the rest, you can explicitly pass default=None.
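For illustration, the two defaulting behaviors being compared can be sketched in plain Python (no Polars needed; `remap` is a hypothetical helper standing in for `map_dict`, not the actual implementation):

```python
def remap(values, mapping, default="null"):
    """Toy stand-in for map_dict over a plain list.

    default="null": unmatched values become None (the current behavior).
    default="keep": unmatched values pass through unchanged (the proposed default).
    """
    if default == "keep":
        return [mapping.get(v, v) for v in values]
    return [mapping.get(v) for v in values]

codes = ["NL", "US", "??"]
names = {"NL": "Netherlands", "US": "United States"}

remap(codes, names)                  # ['Netherlands', 'United States', None]
remap(codes, names, default="keep")  # ['Netherlands', 'United States', '??']
```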

@ion-elgreco ion-elgreco added the enhancement New feature or an improvement of an existing feature label Aug 28, 2023
@mcrumiller
Contributor

Agree here. An alternative would be to allow passing a defaultdict object, although I agree that pl.first() as the default behavior makes the most sense, as it's the use case 99% of the time.

@lyngc

lyngc commented Aug 28, 2023

Disagree. If you didn't specify what a value maps to, the result is undefined and should default to None instead of implicitly doing something.

@zundertj
Collaborator

I also dislike the implicitness of the default value being pl.first(); or maybe I just always end up in the 1% use case where I want values left unmapped when there is no match (with pl.first() or a Python defaultdict as the ways to opt into other behaviour).

Moreover, I think the naming is also quite poor in this context. I get that it refers to "first" as in "current value comes first, mapped value second" to keep the current value (i.e. not map), but it could alternatively be read as "grab the first value in the column" or "first value in the dict" in case of no mapping. From that perspective, I like the more explicit nature of doing this (per the docstring):

df.with_columns(
    pl.col("country_code")
    .map_dict(country_code_dict, default=pl.col("country_code"))
)

over:

df.with_columns(
    pl.col("country_code")
    .map_dict(country_code_dict, default=pl.first())  # don't think here "first country in df" or "first country in the dict"
)

So I would even vote for removing pl.first() from the docstring; Polars expressions are powerful enough that no "magic" (again: in this context, not in terms of the underlying representation) expression is needed.

@mcrumiller
Contributor

I view map_values as "map these values I provided in the dict, and don't map the other values." I usually use it when I want to rename certain items to a common naming schema, but that's my most common use case and not necessarily everyone else's.

I think also the naming is quite poor in this context

I agree here completely. I understand that algorithmically it picks the first value, but semantically it's a little odd, since we're essentially mapping one set to another, so the first x and the last x are the same element. I think default="keep" would be better, or maybe a map_values(subset={...}) to show that you only want to map a subset of the values.

@ion-elgreco
Contributor Author

I made a small error when opening the issue. It was actually suggested in the other issue on renaming apply and map to change this to replace_values, hence the reasoning to not return nulls when there is no match, since the name implies a replace.

@Wainberg
Contributor

Yes please! This is a common source of bugs in my experience.

@cmdlineluser
Contributor

Even though they are closely related, is there an argument for just having a separate "replace" function?

The most recent example from SO: "Replace inf with null in numeric columns".

https://stackoverflow.com/questions/77019754/

import numpy as np
import polars as pl
import polars.selectors as cs

df = pl.DataFrame({"label": ["a", "b", "c"],
                   "ratio_a": [1.0, 2.0, np.inf],
                   "ratio_b": [10.0, np.inf, 20.0],
                   "ratio_c": [3.0, 200, 300],
                   "ratio_d": [-1.0, -2.0, np.inf],
                   })
# explicit when/then/otherwise
df.with_columns(
    pl.when(cs.numeric() == float('inf')).then(None).otherwise(cs.numeric()).keep_name()
)

# .map_dict / default=pl.first()
df.with_columns(
    cs.numeric().map_dict({float('inf'): None}, default=pl.first())
)

As the replacement value is null in this case, we could use the implicit .otherwise(None), which simplifies things:

# implicit otherwise(None)
df.with_columns(
   pl.when(cs.numeric() != float('inf')).then(cs.numeric())
)

It seems like a common enough operation that it would be nice to be able to just:

df.with_columns(
   cs.numeric().replace(float('inf'), None)
)

i.e. being able to specify a single replacement, or pass a dict mapping, and the function can call .map_dict internally.

def replace(self, old_or_mapping, new=None):
    if not isinstance(old_or_mapping, dict):
        old_or_mapping = {old_or_mapping: new}
    return self.map_dict(old_or_mapping, default=pl.first())
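The same dispatch idea (single old/new pair or a full mapping, with unmatched values kept) can be shown end-to-end in plain Python; `replace_list` below is a hypothetical list-based demo, not the Polars implementation:

```python
import math

def replace_list(values, old_or_mapping, new=None):
    # Accept either a single (old, new) pair or a full mapping,
    # then leave unmatched values intact (the default=pl.first() behavior).
    if not isinstance(old_or_mapping, dict):
        old_or_mapping = {old_or_mapping: new}
    return [old_or_mapping.get(v, v) for v in values]

replace_list([1.0, 2.0, math.inf], math.inf, None)  # [1.0, 2.0, None]
replace_list(["a", "b"], {"a": "x"})                # ['x', 'b']
```

Note that `math.inf` works fine as a dict key here because inf compares equal to itself (unlike NaN, discussed further below in the thread).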

@aleewen

aleewen commented Sep 4, 2023

Yes, I agree with the others that we need a Polars equivalent to Pandas' .replace() function.

@sm-Fifteen

Ok, so what are the use cases we're trying to cover here? Because I think that might help clear up what the function should be named and whether there should be any alternate ones.

  • Personally, I'm using this to map enum-type fields (finite set of valid values) to alternate representations (like atomic_element_col.map_dict(ATOMIC_MASS_DICT)), where
    • it wouldn't make sense to use a function, since there's no mathematical relationship between the input and the output
    • it would be invalid to have a value that's not in the dictionary, and I would rather have an exception thrown than a default value of any kind
    • it can currently be done with a second-pass check, but that's an extra step for what's basically just an assertion, and requires a sentinel value that's valid for the output dtype to work (which could be tricky if None is also in your lookup dictionary)
    • outputting the starting value on failed lookups would be especially undesirable
  • The current behavior, where failed lookups are expected to map to a single fallback value (like gender_col.map_dict({'M': 'him', 'F': 'her'}, default='them')), so that
    • even if there were no default argument available, the same behavior could be achieved with a defaultdict
    • this is similar to Pandas' Series.map (when used with a dict argument), except using null values by default instead of outputting NaNs
  • What @cmdlineluser is suggesting would be a shorthand for replacing sets of undesirable values in one operation (such as num_col.map_dict({float('inf'): None, float('-inf'): None, float('nan'): None})), meaning
    • all values not in the dict should be left as-is
    • it can currently be done (in a somewhat unintuitive manner) with default=pl.first(), by the looks of things
    • it intentionally disregards dtype and "mathematical comparison", such that NaN and the Infinities are equal to themselves (which is completely unlike how float-keyed dicts work in Python, and also would create weirdness around -0)
    • it would be akin to Pandas' Series.replace()
    • there might be some weirdness if the input and output dtypes are different, like if your input is a string and the output a number, but one failed lookup causes the resulting column to become a string or something
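The three use cases above can be contrasted with a small plain-Python sketch (the helper names are hypothetical, chosen just to pin down the semantics under discussion):

```python
def strict_lookup(values, mapping):
    # Use case 1: every value must be a key; unknown values raise.
    return [mapping[v] for v in values]  # raises KeyError on a missing key

def lookup_with_default(values, mapping, default=None):
    # Use case 2: failed lookups map to a single fallback value.
    return [mapping.get(v, default) for v in values]

def replace_subset(values, mapping):
    # Use case 3: values not in the mapping are left as-is.
    return [mapping.get(v, v) for v in values]

genders = {"M": "him", "F": "her"}
lookup_with_default(["M", "F", "X"], genders, default="them")  # ['him', 'her', 'them']
replace_subset(["M", "F", "X"], genders)                       # ['him', 'her', 'X']
```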

Is there anything I missed, and can we agree on any single behavior, or should there be separate functions to cover these use cases? I think the expected default behavior is going to heavily affect our perceptions of what the function should and shouldn't be named.

@ion-elgreco
Contributor Author

@sm-Fifteen I think you covered all of them. I personally use the last one the most, since in many cases I just need to replace some erroneous values with None or another value; mapping existing values to new ones (behaviors 1 and 2) is less frequent for me.

@Wainberg
Contributor

Wainberg commented Sep 7, 2023

Yeah, the third one is definitely the most common use case.

@sm-Fifteen

@sm-Fifteen I think you covered all of them. I personally use the last one the most, since in many cases I just need to replace some erroneous values with None or another value; mapping existing values to new ones (behaviors 1 and 2) is less frequent for me.

@ion-elgreco: It might make sense to split them, then. Having "dict_lookup" (cases 1 and 2) and "replace_values" (case 3) be separate functions, given the separate concerns and semantics. Personally I only use the first case, and that was the use case I was trying to cover when I opened issue #3789.

The value replacement behavior may also need to be clarified, in case it works differently in Rust than it does in Python:

>>> mydict = {float('-0'): False, float('nan'): ..., float('inf'): ..., float('-inf'): ...}
>>> mydict
{-0.0: False, nan: Ellipsis, inf: Ellipsis, -inf: Ellipsis}

>>> mydict[0]
False

>>> mydict[float('nan')]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: nan

>>> mydict[float('inf')]
Ellipsis

(TIL IEEE 754 floats actually define Infinity as being equal to itself, unlike NaN)
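One subtlety behind the traceback above: Python dict lookups check object identity before equality, so the *same* NaN object can still be found even though NaN != NaN, while a freshly constructed NaN cannot:

```python
nan = float("nan")
d = {nan: "sentinel"}

d[nan]            # "sentinel": the identity check short-circuits the equality test
# d[float("nan")] # would raise KeyError: a distinct NaN object never compares equal

# inf, by contrast, is equal to itself, so any inf key works:
d2 = {float("inf"): None}
d2[float("inf")]  # None
```

This is worth keeping in mind when specifying how a Rust-side implementation should compare float keys, since it need not mirror Python's identity shortcut.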

@ion-elgreco
Contributor Author

@ion-elgreco: It might make sense to split them, then. Having "dict_lookup" (cases 1 and 2) and "replace_values" (case 3) be separate functions, given the separate concerns and semantics.

That's an option. Then replace_values can just be some syntactic sugar for map_dict. @stinodego

@cmdlineluser
Contributor

While trying to understand a schema mismatch issue with .map_dict yesterday, I added a version on the Rust side.

I needed a non-clashing name, so I used .map_keys, as in "map all these values and use them as keys".

.map_dict() already turns the dict into a DataFrame:

remap_frame = pl.LazyFrame(data=[remap_key_s, remap_value_s]).with_columns(

So I set it up so that it checks whether the mapping is already a 2-column dataframe; if not, it creates one, so we can also do pl.col("foo").map_keys(df_with_2_columns).

Is there a reason why the mapping is tied to being a dictionary, even though it gets turned into a dataframe internally?
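The normalization step described above can be sketched in plain Python; `normalize_mapping` is a hypothetical helper, with an iterable of (key, value) rows standing in for a two-column frame:

```python
def normalize_mapping(mapping):
    # Accept a dict as-is, or normalize an iterable of (key, value) rows
    # into a dict, mirroring the idea that .map_dict builds a two-column
    # frame internally anyway.
    if isinstance(mapping, dict):
        return mapping
    return dict(mapping)

normalize_mapping({"a": 1})              # {'a': 1}
normalize_mapping([("a", 1), ("b", 2)])  # {'a': 1, 'b': 2}
```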

@stinodego
Member

Regarding the use cases specified by @sm-Fifteen:

I'd like to address the first use case separately. We will be expanding the Categorical type (perhaps adding a dedicated Enum type). There can be a separate namespace method for mapping those types, e.g. cat.map. This can introduce the desired behavior (constraints on the mapping).

Use cases 2 and 3 do feel similar enough to be covered by the same function, as they are now.

Name

Whatever we do, we have to rename map_dict as the name just doesn't make sense. I guess it is meant to indicate "map values according to the given dictionary". But even so, the input shouldn't need to be a dictionary, any mapping type should work. So really it's named map_map which sounds like something an extraterrestrial creature might say.

So map would make sense, since we're mapping values. However, it's taken. map in Polars indicates use of UDFs.

An alternative which I like is replace. We're replacing one value by another. So that's the one I think we should go for.

Behavior

Regarding the default behavior of values not in the specified map: I strongly believe the default behaviour should be to keep unspecified values intact.

If I have values 1 and 2, and I replace value 2 by 3, I end up with values 1 and 3. Not null and 3.

So the default will become "keep existing value" (behind the scenes this is pl.first()). If any other value is desired, it can be passed to the default argument.
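The agreed default ("keep existing value") boils down to the 1-and-2 example above; as a plain-Python sketch with a hypothetical helper:

```python
def replace_keep(values, mapping):
    # Values absent from the mapping are kept intact, not nulled.
    return [mapping.get(v, v) for v in values]

replace_keep([1, 2], {2: 3})  # [1, 3] -- not [None, 3]
```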

@sm-Fifteen

I'd like to address the first use case separately. We will be expanding the Categorical type (perhaps adding a dedicated Enum type). There can be a separate namespace method for mapping those types, e.g. cat.map. This can introduce the desired behavior (constraints on the mapping).

That seems reasonable. The first case is indeed operating under the assumption of a finite set of possible values, where all accepted values are keys in the dictionary and the mapping is complete, which would make sense as an operation on categorical columns.

As for the other 2 cases, mimicking Pandas' replace API with a default argument (and/or possibly accepting a DefaultDict) seems like a pretty good way of covering everything.
