Add fix to raise error when category value 'x' is not predefined but is assigned through df.loc[..]=x #34011
To be honest, I'm not quite happy with my solution. Do you think there's a nicer way to include this fix in the categorical module specifically? It feels like that would require some major refactoring, though. Any opinions?
see #25383 (comment) from #25383, the expected behaviour is
Along with the test added here to confirm that an error is raised, you could also add a test to confirm that unused categories are retained.
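A minimal sketch of what a "unused categories are retained" check could look like. This is not the test added in this PR; the data and variable names are assumptions chosen for illustration, and it only covers assignment to an existing label with an existing category.

```python
import pandas as pd
from pandas import CategoricalDtype

# Illustrative data (assumption): "c" is declared as a category but never used.
dtype = CategoricalDtype(list("abc"))
ser = pd.Series(["a", "b"], dtype=dtype)

# Assign a value that is already a valid category to an existing label.
ser.loc[0] = "b"

# The dtype, including the unused category "c", should be unchanged.
assert ser.dtype == dtype
assert list(ser.cat.categories) == ["a", "b", "c"]
```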
…x-category-index-issue-33952
I've included the test case you described. What are we looking to achieve with this PR in the end? My current changes resolve the issue of assigning a non-existing categorical value to a Categorical series, but of course that doesn't resolve the core issue mentioned in #25383.
Thinking some more, I'm not sure raising should be the expected behaviour. The error happens at the array level, which is fine for the array: the dtype holds the categories, and assigning values that are not in the categories directly to the array should raise. For a Series, however, if the dtype cannot hold an element, a new array is created.
The performant operation that does not create a new array, but updates the array in place and errors if the underlying array cannot hold the element, is Series.at. I would therefore expect Series.at to raise ValueError: Cannot setitem on a Categorical with a new... but I would expect Series.loc to create a new array that can hold the new element with the correct categories. This issue needs further discussion IMO.
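To make the array-level behaviour concrete, here is a small sketch (not code from this PR). It assumes only that setting a value outside the categories on a `Categorical` raises; the exception type is caught broadly because it may differ between pandas versions.

```python
import pandas as pd

cat = pd.Categorical(["a", "b", "c"])

# Setting a value that is not one of the categories should raise at the
# array level. Catch both types since the exact class may vary by version.
try:
    cat[0] = "d"
    raised = False
except (TypeError, ValueError):
    raised = True

# Declaring the category first makes the same assignment valid.
cat = cat.add_categories(["d"])
cat[0] = "d"
```

After `add_categories`, the array holds `"d"` at position 0 and its categories are `a, b, c, d`.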
…x-category-index-issue-33952
I'm a bit confused about the expected behaviour. In the current version this piece of code has the following result:
>>> import pandas as pd
>>> from pandas import CategoricalDtype
>>> s_data = list("abcd")
>>> s_dtype = CategoricalDtype(list("abc"))
>>> s = pd.Series(s_data, dtype=s_dtype)
>>> s
0 a
1 b
2 c
3 NaN
dtype: category
Categories (3, object): [a, b, c]
>>> # This should raise an exception after the PR
>>> s.loc[4] = "d"
>>> s
0 a
1 b
2 c
3 NaN
4 d
dtype: object
Given that the
pandas/core/dtypes/concat.py
Outdated
    # when an array of valid values is given (GH#25383)
    if (
        isinstance(to_concat[0], ExtensionArray)
        and all(x.shape[0] == 1 for x in to_concat[1:])
I couldn't find a better way to detect when the `concat_compat` function is called through index expansion, so in cases like this:
ser = pd.Series(Categorical(["a", "b", "c"]))
ser.loc[3] = "c"
With the latest commit we raise a `ValueError` when an invalid value is added to the categorical through index expansion. It also enables index expansion of a categorical of any `dtype`.
pandas/core/dtypes/concat.py
Outdated
    if (
        isinstance(to_concat[0], ExtensionArray)
        and all(x.shape[0] == 1 for x in to_concat[1:])
        and _can_cast_to_categorical(to_concat)
This is a very complicated implementation. This should all be in `find_common_type`, but it should be much simpler than this: either the dtypes are the same or they are not. Changing them is not in scope for this issue.
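As an illustration of the "either the dtypes are the same or they are not" check, here is a rough sketch. The helper name `all_same_dtype` is hypothetical and is not part of pandas or of this PR; it relies only on ordinary dtype equality.

```python
import pandas as pd
from pandas import CategoricalDtype


def all_same_dtype(arrays):
    """Hypothetical helper: True if every array shares the first array's dtype."""
    first = arrays[0].dtype
    return all(arr.dtype == first for arr in arrays[1:])


abc = CategoricalDtype(list("abc"))
left = pd.Categorical(["a", "b"], dtype=abc)
same = pd.Categorical(["c"], dtype=abc)             # same categories as `left`
other = pd.Categorical(["d"], categories=["d"])     # different categories

same_result = all_same_dtype([left, same])
other_result = all_same_dtype([left, other])
```

Here `same_result` is `True` and `other_result` is `False`, so a caller could keep the categorical dtype in the first case and raise or fall back in the second.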
OK, the scope is clearer to me now. I will revert to the previous approach and adapt it to raise on `dtype` mismatch.
…x-category-index-issue-33952
Hi @jreback. I did change my approach, but it does feel like I'm going in circles in terms of complexity, so I may need some further guidance. The thing is that the
For example, this should raise an error:
ser = pd.Series(Categorical(["a", "b", "c"]))
ser.loc[3] = "d"
# ValueError "Cannot setitem on a Categorical with a new category, set the categories first"
While this should be fine:
ser = pd.Series(Categorical(["a", "b", "c"]))
ser.loc[3] = "a"
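The two cases above could be expressed as a standalone validation step, sketched below. The function `check_enlargement_value` is hypothetical (it is not the PR's implementation and not a pandas API); it only shows the intended rule that a new value must already be a category, with missing values allowed.

```python
import pandas as pd


def check_enlargement_value(ser, value):
    """Hypothetical check: reject values that are not existing categories."""
    if pd.isna(value):
        return  # missing values are always representable in a categorical
    if value not in ser.cat.categories:
        raise ValueError(
            "Cannot setitem on a Categorical with a new category, "
            "set the categories first"
        )


ser = pd.Series(pd.Categorical(["a", "b", "c"]))

check_enlargement_value(ser, "a")  # existing category: passes silently

try:
    check_enlargement_value(ser, "d")  # new category: should be rejected
    rejected = False
except ValueError:
    rejected = True
```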
Closing as stale. If you want to continue working on this, please ping.
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff