
Change default behavior map_dict #10755

Closed
ion-elgreco opened this issue Aug 28, 2023 · 16 comments · Fixed by #12599
Labels: accepted (Ready for implementation), enhancement (New feature or an improvement of an existing feature)
@ion-elgreco
Contributor

ion-elgreco commented Aug 28, 2023

Problem description

The current default behavior of map_dict is to replace only the matching values; anything not in the dict maps to null. With the possible renaming to replace_values #10744 (comment), setting the default parameter to pl.first() would make more sense: default=pl.first(). Then, if you want to replace only the values in the dict and null out the rest, you can explicitly pass default=None.
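For illustration, the two defaulting behaviors being compared can be sketched in plain Python (no Polars needed; `remap` is a hypothetical helper standing in for `map_dict`, not the actual implementation):

```python
def remap(values, mapping, default="null"):
    """Toy stand-in for map_dict over a plain list.

    default="null": unmatched values become None (the current behavior).
    default="keep": unmatched values pass through unchanged (the proposed default).
    """
    if default == "keep":
        return [mapping.get(v, v) for v in values]
    return [mapping.get(v) for v in values]

codes = ["NL", "US", "??"]
names = {"NL": "Netherlands", "US": "United States"}

remap(codes, names)                  # ['Netherlands', 'United States', None]
remap(codes, names, default="keep")  # ['Netherlands', 'United States', '??']
```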

@ion-elgreco ion-elgreco added the enhancement New feature or an improvement of an existing feature label Aug 28, 2023
@mcrumiller
Contributor

Agree here. An alternative would be to allow passing a defaultdict object, although I agree that pl.first() as the default behavior makes the most sense, as it's the use case 99% of the time.

@lyngc

lyngc commented Aug 28, 2023

Disagree. If you didn't specify what a value maps to, the result is undefined and should default to None instead of implicitly doing something.

@zundertj
Collaborator

I also dislike the implicitness of the default value being pl.first(); or maybe I just always end up in the 1% use case where I want values left unmapped when there is no match (with pl.first() or a Python defaultdict as the ways to opt into other behaviour).

Moreover, I think the naming is also quite poor in this context. I get that it refers to "first" as in "current value comes first, mapped value second" to keep the current value (i.e. not map), but it could alternatively be read as "grab the first value in the column" or "first value in the dict" in case of no mapping. From that perspective, I like the more explicit nature of doing this (per the docstring):

df.with_columns(
    pl.col("country_code")
    .map_dict(country_code_dict, default=pl.col("country_code"))
)

over:

df.with_columns(
    pl.col("country_code")
    .map_dict(country_code_dict, default=pl.first())  # don't think here "first country in df" or "first country in the dict"
)

So I would even vote for removing pl.first() from the docstring; Polars expressions are powerful enough that no "magic" (again: in this context, not in terms of the underlying representation) expression is needed.

@mcrumiller
Contributor

I view map_values as "map these values I provided in the dict, and don't map the other values." I usually use it when I want to rename certain items to a common naming schema, but that's my most common use case and not necessarily everyone else's.

I think also the naming is quite poor in this context

I agree here completely. I understand that algorithmically it picks the first value, but semantically it's a little odd, since we're essentially mapping one set to another, so the first x and the last x are the same element. I think default="keep" would be better, or maybe a map_values(subset={...}) to show that you only want to map a subset of the values.

@ion-elgreco
Contributor Author

I made a small error when opening the issue. It was actually suggested in the other issue on renaming apply and map to change this to replace_values, hence the reasoning to not return nulls when there is no match, since the name implies a replace.

@Wainberg
Contributor

Yes please! This is a common source of bugs in my experience.

@cmdlineluser
Contributor

Even though they are closely related, is there an argument for just having a separate "replace" function?

The most recent example from SO: "Replace inf with null in numeric columns".

https://stackoverflow.com/questions/77019754/

import numpy as np
import polars as pl
import polars.selectors as cs

df = pl.DataFrame({"label": ["a", "b", "c"],
                   "ratio_a": [1.0, 2.0, np.inf],
                   "ratio_b": [10.0, np.inf, 20.0],
                   "ratio_c": [3.0, 200, 300],
                   "ratio_d": [-1.0, -2.0, np.inf],
                   })
# explicit when/then/otherwise
df.with_columns(
    pl.when(cs.numeric() == float('inf')).then(None).otherwise(cs.numeric()).keep_name()
)

# .map_dict / default=pl.first()
df.with_columns(
    cs.numeric().map_dict({float('inf'): None}, default=pl.first())
)

As the replacement value is null in this case, we could use the implicit .otherwise(None), which simplifies things:

# implicit otherwise(None)
df.with_columns(
   pl.when(cs.numeric() != float('inf')).then(cs.numeric())
)

It seems like a common enough operation that it would be nice to be able to just:

df.with_columns(
   cs.numeric().replace(float('inf'), None)
)

i.e. being able to specify a single replacement, or pass a dict mapping, and the function can call .map_dict internally.

def replace(self, old_or_mapping, new=None):
    if not isinstance(old_or_mapping, dict):
        old_or_mapping = {old_or_mapping: new}
    return self.map_dict(old_or_mapping, default=pl.first())
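The same dispatch idea (single old/new pair or a full mapping, with unmatched values kept) can be shown end-to-end in plain Python; `replace_list` below is a hypothetical list-based demo, not the Polars implementation:

```python
import math

def replace_list(values, old_or_mapping, new=None):
    # Accept either a single (old, new) pair or a full mapping,
    # then leave unmatched values intact (the default=pl.first() behavior).
    if not isinstance(old_or_mapping, dict):
        old_or_mapping = {old_or_mapping: new}
    return [old_or_mapping.get(v, v) for v in values]

replace_list([1.0, 2.0, math.inf], math.inf, None)  # [1.0, 2.0, None]
replace_list(["a", "b"], {"a": "x"})                # ['x', 'b']
```

Note that `math.inf` works fine as a dict key here because inf compares equal to itself (unlike NaN, discussed further below in the thread).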

@aleewen

aleewen commented Sep 4, 2023

Yes, I agree with the others that we need a Polars equivalent to Pandas' .replace() function.

@sm-Fifteen

Ok, so what are the use cases we're trying to cover here? Because I think that might help clear up what the function should be named and whether there should be any alternate ones.

  • Personally, I'm using this to map enum-type fields (finite set of valid values) to alternate representations (like atomic_element_col.map_dict(ATOMIC_MASS_DICT)), where
    • it wouldn't make sense to use a function, since there's no mathematical relationship between the input and the output
    • it would be invalid to have a value that's not in the dictionary, and I would rather have an exception thrown than a default value of any kind
    • it can currently be done with a second-pass check, but that's an extra step for what's basically just an assertion, and requires a sentinel value that's valid for the output dtype to work (which could be tricky if None is also in your lookup dictionary)
    • outputting the starting value on failed lookups would be especially undesirable
  • The current behavior, where failed lookups are expected to map to a single fallback value (like gender_col.map_dict({'M': 'him', 'F': 'her'}, default='them')), so that
    • even if there were no default argument available, the same behavior could be achieved with a defaultdict
    • this is similar to Pandas' Series.map (when used with a dict argument), except using null values by default instead of outputting NaNs
  • What @cmdlineluser is suggesting would be a shorthand for replacing sets of undesirable values in one operation (such as num_col.map_dict({float('inf'): None, float('-inf'): None, float('nan'): None})), meaning
    • all values not in the dict should be left as-is
    • it can currently be done (in a somewhat unintuitive manner) with default=pl.first(), by the looks of things
    • it intentionally disregards dtype and "mathematical comparison", such that NaN and the Infinities are equal to themselves (which is completely unlike how float-keyed dicts work in Python, and also would create weirdness around -0)
    • it would be akin to Pandas' Series.replace()
    • there might be some weirdness if the input and output dtypes are different, like if your input is a string and the output a number, but one failed lookup causes the resulting column to become a string or something
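The three use cases above can be contrasted with a small plain-Python sketch (the helper names are hypothetical, chosen just to pin down the semantics under discussion):

```python
def strict_lookup(values, mapping):
    # Use case 1: every value must be a key; unknown values raise.
    return [mapping[v] for v in values]  # raises KeyError on a missing key

def lookup_with_default(values, mapping, default=None):
    # Use case 2: failed lookups map to a single fallback value.
    return [mapping.get(v, default) for v in values]

def replace_subset(values, mapping):
    # Use case 3: values not in the mapping are left as-is.
    return [mapping.get(v, v) for v in values]

genders = {"M": "him", "F": "her"}
lookup_with_default(["M", "F", "X"], genders, default="them")  # ['him', 'her', 'them']
replace_subset(["M", "F", "X"], genders)                       # ['him', 'her', 'X']
```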

Is there anything I missed, and can we agree on any single behavior, or should there be separate functions to cover these use cases? I think the expected default behavior is going to heavily affect our perceptions of what the function should and shouldn't be named.

@ion-elgreco
Contributor Author

@sm-Fifteen I think you covered all of them. I personally use the last one the most, since in many cases I just need to replace some erroneous values with None or another value; mapping existing values to new ones (behaviors 1 and 2) is less frequent for me.

@Wainberg
Contributor

Wainberg commented Sep 7, 2023

Yeah, the third one is definitely the most common use case.

@sm-Fifteen

@sm-Fifteen I think you covered all of them. I personally use the last one the most, since in many cases I just need to replace some erroneous values with None or another value; mapping existing values to new ones (behaviors 1 and 2) is less frequent for me.

@ion-elgreco: It might make sense to split them, then. Having "dict_lookup" (cases 1 and 2) and "replace_values" (case 3) be separate functions, given the separate concerns and semantics. Personally I only use the first case, and that was the use case I was trying to cover when I opened issue #3789.

The value replacement behavior may also need to be clarified, in case it works differently in Rust than it does in Python:

>>> mydict = {float('-0'): False, float('nan'): ..., float('inf'): ..., float('-inf'): ...}
>>> mydict
{-0.0: False, nan: Ellipsis, inf: Ellipsis, -inf: Ellipsis}

>>> mydict[0]
False

>>> mydict[float('nan')]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: nan

>>> mydict[float('inf')]
Ellipsis

(TIL IEEE 754 floats actually define Infinity as being equal to itself, unlike NaN)
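One subtlety behind the traceback above: Python dict lookups check object identity before equality, so the *same* NaN object can still be found even though NaN != NaN, while a freshly constructed NaN cannot:

```python
nan = float("nan")
d = {nan: "sentinel"}

d[nan]            # "sentinel": the identity check short-circuits the equality test
# d[float("nan")] # would raise KeyError: a distinct NaN object never compares equal

# inf, by contrast, is equal to itself, so any inf key works:
d2 = {float("inf"): None}
d2[float("inf")]  # None
```

This is worth keeping in mind when specifying how a Rust-side implementation should compare float keys, since it need not mirror Python's identity shortcut.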

@ion-elgreco
Contributor Author

@ion-elgreco: It might make sense to split them, then. Having "dict_lookup" (cases 1 and 2) and "replace_values" (case 3) be separate functions, given the separate concerns and semantics.

That's an option. Then replace_values can just be some syntactic sugar for map_dict. @stinodego

@cmdlineluser
Contributor

While trying to understand a schema mismatch issue with .map_dict yesterday, I added a version on the Rust side.

I needed a non-clashing name, so I used .map_keys, as in "map all these values and use them as keys".

.map_dict() already turns the dict into a DataFrame:

remap_frame = pl.LazyFrame(data=[remap_key_s, remap_value_s]).with_columns(

So I set it up so that it checks whether the mapping is already a 2-column dataframe; if not, it creates one, so we can also do pl.col("foo").map_keys(df_with_2_columns).

Is there a reason why the mapping is tied to being a dictionary, even though it gets turned into a dataframe internally?
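The normalization step described above can be sketched in plain Python; `normalize_mapping` is a hypothetical helper, with an iterable of (key, value) rows standing in for a two-column frame:

```python
def normalize_mapping(mapping):
    # Accept a dict as-is, or normalize an iterable of (key, value) rows
    # into a dict, mirroring the idea that .map_dict builds a two-column
    # frame internally anyway.
    if isinstance(mapping, dict):
        return mapping
    return dict(mapping)

normalize_mapping({"a": 1})              # {'a': 1}
normalize_mapping([("a", 1), ("b", 2)])  # {'a': 1, 'b': 2}
```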

@stinodego
Member

Regarding the use cases specified by @sm-Fifteen:

I'd like to address the first use case separately. We will be expanding the Categorical type (perhaps adding a dedicated Enum type). There can be a separate namespace method for mapping those types, e.g. cat.map. This can introduce the desired behavior (constraints on the mapping).

Use cases 2 and 3 do feel similar enough to be covered by the same function, as they are now.

Name

Whatever we do, we have to rename map_dict as the name just doesn't make sense. I guess it is meant to indicate "map values according to the given dictionary". But even so, the input shouldn't need to be a dictionary, any mapping type should work. So really it's named map_map which sounds like something an extraterrestrial creature might say.

So map would make sense, since we're mapping values. However, it's taken. map in Polars indicates use of UDFs.

An alternative which I like is replace. We're replacing one value by another. So that's the one I think we should go for.

Behavior

Regarding the default behavior of values not in the specified map: I strongly believe the default behaviour should be to keep unspecified values intact.

If I have values 1 and 2, and I replace value 2 by 3, I end up with values 1 and 3. Not null and 3.

So the default will become "keep existing value" (behind the scenes this is pl.first()). If any other value is desired, it can be passed to the default argument.
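The agreed default ("keep existing value") boils down to the 1-and-2 example above; as a plain-Python sketch with a hypothetical helper:

```python
def replace_keep(values, mapping):
    # Values absent from the mapping are kept intact, not nulled.
    return [mapping.get(v, v) for v in values]

replace_keep([1, 2], {2: 3})  # [1, 3] -- not [None, 3]
```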

@sm-Fifteen

I'd like to address the first use case separately. We will be expanding the Categorical type (perhaps adding a dedicated Enum type). There can be a separate namespace method for mapping those types, e.g. cat.map. This can introduce the desired behavior (constraints on the mapping).

That seems reasonable. The first case is indeed operating under the assumption of a finite set of possible values, where all accepted values are keys in the dictionary and the mapping is complete, which would make sense as an operation on categorical columns.

As for the other 2 cases, mimicking Pandas' replace API with a default argument (and/or possibly accepting a DefaultDict) seems like a pretty good way of covering everything.
