Change default behavior map_dict
#10755
Comments
Agree here. An alternative would be to allow passing of a |
Disagree. If you didn’t specify what to map, it’s undefined and should default to None instead of implicitly doing something |
I also dislike the implicitness of default value to be Moreover, I think also the naming is quite poor in this context. I get that it refers to the first as in "current value comes first, mapped value second" to keep the current value (i.e. not map), but alternatively, it could also mean "grab the first value in the column" or "first value in the dict" in case of no mapping. From that perspective, I like the more explicit nature of doing this (per the docstring):
df.with_columns(
    pl.col("country_code")
    .map_dict(country_code_dict, default=pl.col("country_code"))
)
over:
df.with_columns(
    pl.col("country_code")
    .map_dict(country_code_dict, default=pl.first())  # don't think here "first country in df" or "first country in the dict"
)
So I would even vote for removing |
I view
I agree here completely, I understand that algorithmically it picks the first value, but semantically it's a little odd, since we're essentially mapping one set to another so the first |
I made a small error in opening the issue. It was actually suggested in the other issue on renaming apply and map, to change this to |
Yes please! This is a common source of bugs in my experience. |
Even though they are closely related, is there an argument for just having a separate "replace" function? The most recent example from SO: "Replace https://stackoverflow.com/questions/77019754/
# imports assumed by the snippet below
import numpy as np
import polars as pl
import polars.selectors as cs

df = pl.DataFrame({"label": ["a", "b", "c"],
                   "ratio_a": [1.0, 2.0, np.inf],
                   "ratio_b": [10.0, np.inf, 20.0],
                   "ratio_c": [3.0, 200, 300],
                   "ratio_d": [-1.0, -2.0, np.inf],
                   })
# explicit when/then/otherwise
df.with_columns(
    pl.when(cs.numeric() == float('inf')).then(None).otherwise(cs.numeric()).keep_name()
)
# .map_dict / default=pl.first()
df.with_columns(
    cs.numeric().map_dict({float('inf'): None}, default=pl.first())
)
As the replacement value is
# implicit otherwise(None)
df.with_columns(
    pl.when(cs.numeric() != float('inf')).then(cs.numeric())
)
It seems like a common enough operation where it would be nice to be able to just:
df.with_columns(
    cs.numeric().replace(float('inf'), None)
)
i.e. being able to specify a single replacement, or pass a dict mapping, and the function can call
def replace(self, old_or_mapping, new=None):
    if not isinstance(old_or_mapping, dict):
        ...
    self.map_dict(..., default=pl.first()) |
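The dispatch suggested above can be sketched in plain Python, outside of Polars. This is only an illustration under stated assumptions: `replace_values` is a hypothetical helper that works on a list, not a Polars API, and it keeps any value without a match unchanged. A private sentinel is used so that `None` can itself be passed as the replacement value.

```python
from typing import Any, Mapping, Sequence

_MISSING = object()  # sentinel so that `new=None` can be a real replacement value


def replace_values(values: Sequence[Any], old_or_mapping: Any, new: Any = _MISSING) -> list:
    """Hypothetical helper: replace one value, or several via a mapping."""
    if isinstance(old_or_mapping, Mapping):
        mapping = old_or_mapping
    else:
        if new is _MISSING:
            raise TypeError("`new` is required when `old_or_mapping` is not a mapping")
        # A single old/new pair is just a one-entry mapping.
        mapping = {old_or_mapping: new}
    # Values without a match are kept as they are.
    return [mapping.get(v, v) for v in values]


ratios = [1.0, 2.0, float("inf")]
single = replace_values(ratios, float("inf"), None)
multi = replace_values(ratios, {1.0: 10.0, float("inf"): None})
```

Here `single` replaces only the infinite entry, while `multi` applies two replacements at once; in both cases `2.0` is left alone.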
Yes I agree with the others in that we need a Polars-equivalent to Pandas' .replace() function |
Ok, so what are the use cases we're trying to cover, here? Because I think that might help to clear up what the function should be named and if there should be any alternate ones.
Is there anything I missed, and can we agree on any single behavior, or should there be separate functions to cover these use cases? I think the expected default behavior is going to heavily affect our perceptions of what the function should and shouldn't be named. |
@sm-Fifteen I think you covered all of them. I personally use the last one the most, since in many cases I just need to replace some erroneous values with None or another value. Mapping existing values to new ones (behaviors 1 and 2) is less frequent. |
Yeah, the third one is definitely the most common use case. |
@ion-elgreco: It might make sense to split them, then. Having " The value replacement behavior may also need to be clarified, in case it works differently in Rust than it does in Python:
>>> mydict = {float('-0'): False, float('nan'): ..., float('inf'): ..., float('-inf'): ...}
>>> mydict
{-0.0: False, nan: Ellipsis, inf: Ellipsis, -inf: Ellipsis}
>>> mydict[0]
False
>>> mydict[float('nan')]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: nan
>>> mydict[float('inf')]
Ellipsis
(TIL IEEE 754 floats actually define Infinity as being equal to itself, unlike NaN) |
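Part of the behaviour in that session comes from Python's dict lookup, which tries identity before equality, so a NaN key can only be found through the very same object. A minimal sketch of one way to work around it, assuming a hypothetical `lookup` helper (not part of Polars) that treats every NaN as the same key:

```python
import math


def lookup(mapping: dict, key, default=None):
    """Hypothetical helper: dict lookup that matches any NaN against a NaN key."""
    if isinstance(key, float) and math.isnan(key):
        for k, v in mapping.items():
            if isinstance(k, float) and math.isnan(k):
                return v
        return default
    return mapping.get(key, default)


nan = float("nan")
mydict = {-0.0: "zero", nan: "nan", float("inf"): "inf"}

same_object_found = nan in mydict          # True: the identical object is the key
fresh_nan_found = float("nan") in mydict   # False: a new NaN object, and NaN != NaN
via_helper = lookup(mydict, float("nan"))  # matched through math.isnan
zero_hit = lookup(mydict, 0)               # 0 == -0.0 and both hash the same
inf_hit = lookup(mydict, float("inf"))     # inf == inf, so a plain lookup works
```

Whether the Rust side compares NaN and signed zero the same way is exactly the open question raised in the comment; the sketch only documents the Python side.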
That's an option. Then |
While trying to understand a schema mismatch issue with I needed a non-clashing name, so I used
polars/py-polars/polars/expr/expr.py Line 9316 in 23c367b
So I set it up that it checks if the mapping is already a 2-column dataframe, if not create one so we can also do Is there a reason why the mapping is tied to being a dictionary even though it gets turned into a dataframe internally? |
Regarding the use cases specified by @sm-Fifteen: I'd like to address the first use case separately. We will be expanding the Categorical type (perhaps adding a dedicated Enum type). There can be a separate namespace method for mapping those types, e.g. Use cases 2 and 3 do feel similar enough to be covered by the same function, as they are now.
Name
Whatever we do, we have to rename So An alternative which I like is
Behavior
Regarding the default behavior of values not in the specified map: I strongly believe the default behaviour should be to keep unspecified values intact. If I have values 1 and 2, and I replace value 2 by 3, I end up with values 1 and 3. Not null and 3. So the default will become "keep existing value" (behind the scenes this is |
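The two candidate defaults can be compared with a small plain-Python model. This is a sketch under assumptions, not Polars code: `map_values` and the `KEEP` sentinel are hypothetical names chosen for illustration.

```python
KEEP = object()  # sentinel meaning "leave unmapped values untouched"


def map_values(values, mapping, default=KEEP):
    """Hypothetical model of the two behaviours discussed in this thread."""
    if default is KEEP:
        # Proposed default: a value with no entry in the mapping is kept.
        return [mapping.get(v, v) for v in values]
    # Explicit default: a value with no entry becomes `default` (e.g. None).
    return [mapping.get(v, default) for v in values]


values = [1, 2]
kept = map_values(values, {2: 3})                  # unmapped 1 survives
nulled = map_values(values, {2: 3}, default=None)  # unmapped 1 becomes None
```

With the keep-by-default variant, replacing 2 by 3 yields 1 and 3; passing `default=None` reproduces the behaviour in which the unmapped value turns into a null.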
That seems reasonable. The first case is indeed operating with the assumption of a finite set of possible values, with the assumption that all accepted values are keys in the dictionary and the mapping is complete, which would make sense as an operation on categorical columns. As for the other 2 cases, mimicking Pandas' |
Problem description
The current default behavior of map_dict is to replace only the matching values; a value that is not in the dict becomes None. With the possible renaming to replace_values (#10744 (comment)), setting the default parameter to pl.first(), i.e. default=pl.first(), would make more sense. Then, if you want to explicitly replace only the ones in the dict, you can pass None to default.