Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add fix to raise error when category value 'x' is not predefined but is assigned through df.loc[..]=x #34011

Closed

Conversation

chrispe
Copy link
Contributor

@chrispe chrispe commented May 5, 2020

@chrispe
Copy link
Contributor Author

chrispe commented May 5, 2020

To be honest I'm not quite happy with my solution. Do you think there's a nicer way to include this fix specifically to the categorical module instead? It feels like that would require some major refactoring though. Any opinion?

@chrispe chrispe changed the title Add fix to raise error when category value is not predefined Add fix to raise error when category value 'x' is not predefined but is assigned to a DF when df.loc[len(df)+1] = x May 7, 2020
@chrispe chrispe changed the title Add fix to raise error when category value 'x' is not predefined but is assigned to a DF when df.loc[len(df)+1] = x Add fix to raise error when category value 'x' is not predefined but is assigned through df.loc May 7, 2020
@chrispe chrispe changed the title Add fix to raise error when category value 'x' is not predefined but is assigned through df.loc Add fix to raise error when category value 'x' is not predefined but is assigned through df.loc[..]=x May 7, 2020
@simonjayhawkins
Copy link
Member

To be honest I'm not quite happy with my solution. Do you think there's a nicer way to include this fix specifically to the categorical module instead? It feels like that would require some major refactoring though. Any opinion?

see #25383 (comment)

from #25383, the expected behaviour is

Keep the categorical dtype if the added value is in the list of categories, throw an error/warning otherwise.

along with the test added here to confirm the an error raised, could also add test to confirm that unused categories are retained.

@simonjayhawkins simonjayhawkins added Categorical Categorical Data Type Indexing Related to indexing on series/frames, not to indexes themselves labels May 8, 2020
@pep8speaks
Copy link

pep8speaks commented May 16, 2020

Hello @chrispe! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-03-13 20:30:26 UTC

@chrispe
Copy link
Contributor Author

chrispe commented May 16, 2020

To be honest I'm not quite happy with my solution. Do you think there's a nicer way to include this fix specifically to the categorical module instead? It feels like that would require some major refactoring though. Any opinion?

see #25383 (comment)

from #25383, the expected behaviour is

Keep the categorical dtype if the added value is in the list of categories, throw an error/warning otherwise.

along with the test added here to confirm the an error raised, could also add test to confirm that unused categories are retained.

I've included the test case you described. What are we looking to achieve with this PR at the end? My current changes resolve the issue of assigning a non-existing categorical value to a Categorical series, but then of course that doesn't resolve the core issue mentioned at #25383.

@simonjayhawkins
Copy link
Member

I've included the test case you described. What are we looking to achieve with this PR at the end? My current changes resolve the issue of assigning a non-existing categorical value to a Categorical series, but then of course that doesn't resolve the core issue mentioned at #25383.

Thinking some more, I'm not sure if raising should be the expected behaviour. This happens at the array level, which is fine for the array as the dtype holds the categories and assigning values directly to the array that are not in the categories should raise.

however, for Series, if the dtype cannot hold an element a new array is created.

>>> ser = pd.Series([1, 2])
>>> ser
0    1
1    2
dtype: int64
>>>
>>> ser.values
array([1, 2], dtype=int64)
>>>
>>> id1 = id(ser.values)
>>> id1
2687613117264
>>>
>>> ser.loc[1] = 42
>>> ser
0     1
1    42
dtype: int64
>>>
>>> id(ser.values) == id1
True
>>>
>>> ser.loc[1] = "42"
>>> ser
0     1
1    42
dtype: object
>>>
>>> id(ser.values) == id1
False
>>>

For a performant operation, that does not create a new array, but updates the array in place and errors if the underlying array cannot hold the elements is Series.at.

I would therefore expect Series.at to raise ValueError: Cannot setitem on a Categorical with a new...

but I would expect Series.loc to create a new array that can hold the new element with the correct categories. This issue needs further discussion IMO.

@chrispe
Copy link
Contributor Author

chrispe commented May 24, 2020

I'm a bit confused about the expected behaviour. In the current version this piece of code has the following result:

>>> import pandas as pd
>>> from pandas import CategoricalDtype
>>> s_data = list("abcd")
>>> s_dtype = CategoricalDtype(list("abc"))
>>> s = pd.Series(s_data, dtype=s_dtype)
>>> s
0      a
1      b
2      c
3    NaN
dtype: category
Categories (3, object): [a, b, c]

>>> # This should raise an exception after the PR
>>> s.loc[4] = "d"
>>> s
0      a
1      b
2      c
3    NaN
4      d
dtype: object

Given that the s expects only values of [a,b,c] because of it's dtype, should we instead raise the same ValueError about non-invalid category value? If not, what's the logic behind the current behaviour? It's a bit contradictory to what we are trying to achieve with raising an exception (instead of replacing the new value with nan) when a non-existing category is inserted through index expansion.

# when an array of valid values is given (GH#25383)
if (
isinstance(to_concat[0], ExtensionArray)
and all(x.shape[0] == 1 for x in to_concat[1:])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find a better way to detect when the concat_compat function is called through index expansion, so in cases like this:

ser = pd.Series(Categorical(["a", "b", "c"]))
ser.loc[3] = "c"

With the latest commit we are raising a ValueError when an invalid value is added to the categorical through index expansion. it also enables the index expansion of a categorical of any dtype.

if (
isinstance(to_concat[0], ExtensionArray)
and all(x.shape[0] == 1 for x in to_concat[1:])
and _can_cast_to_categorical(to_concat)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this a very complicated implementation. This should all be in `find_common_type`` , but should be much simpler that this. either the dtypes are the same or they are not. changing them is not in scope for this issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, scope is more clear to me now. I will revert back to the previous approach and adapt it to raise on dtype mismatch.

@chrispe
Copy link
Contributor Author

chrispe commented Mar 13, 2021

Hi @jreback. I did change my approach, but I does feel like I'm going in circles in terms of complexity, so I may need some further guidance. The thing is that the find_common_type does not take into account the values of the arrays, so we cannot conclude if we should raise an error given only their dtypes.

For example, this should raise an error:

ser = pd.Series(Categorical(["a", "b", "c"]))
ser.loc[3] = "d"
# ValueError "Cannot setitem on a Categorical with a new category, set the categories first"

While this should be fine:

ser = pd.Series(Categorical(["a", "b", "c"]))
ser.loc[3] = "a"

@jreback
Copy link
Contributor

jreback commented Oct 4, 2021

closing as stale, if you want to continue working, please ping.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Setting with enlargement on categorical data
6 participants