Add fix to raise error when category value 'x' is not predefined but is assigned through df.loc[..]=x #34011

chrispe · 2020-05-05T20:36:34Z

closes Setting with enlargement on categorical data #25383
tests passed, new test added
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff

chrispe · 2020-05-05T20:40:24Z

To be honest I'm not quite happy with my solution. Do you think there's a nicer way to include this fix specifically to the categorical module instead? It feels like that would require some major refactoring though. Any opinion?

simonjayhawkins · 2020-05-08T11:01:50Z

> To be honest I'm not quite happy with my solution. Do you think there's a nicer way to include this fix specifically to the categorical module instead? It feels like that would require some major refactoring though. Any opinion?

see #25383 (comment)

from #25383, the expected behaviour is

> Keep the categorical dtype if the added value is in the list of categories, throw an error/warning otherwise.

Along with the test added here to confirm that an error is raised, you could also add a test to confirm that unused categories are retained.

pep8speaks · 2020-05-16T16:06:58Z

Hello @chrispe! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-03-13 20:30:26 UTC

chrispe · 2020-05-16T20:03:24Z

> > To be honest I'm not quite happy with my solution. Do you think there's a nicer way to include this fix specifically to the categorical module instead? It feels like that would require some major refactoring though. Any opinion?
>
> see #25383 (comment)
>
> from #25383, the expected behaviour is
>
> > Keep the categorical dtype if the added value is in the list of categories, throw an error/warning otherwise.
>
> Along with the test added here to confirm that an error is raised, you could also add a test to confirm that unused categories are retained.

I've included the test case you described. What are we looking to achieve with this PR at the end? My current changes resolve the issue of assigning a non-existing categorical value to a Categorical series, but then of course that doesn't resolve the core issue mentioned at #25383.

simonjayhawkins · 2020-05-21T15:28:21Z

> I've included the test case you described. What are we looking to achieve with this PR at the end? My current changes resolve the issue of assigning a non-existing categorical value to a Categorical series, but then of course that doesn't resolve the core issue mentioned at #25383.

Thinking some more, I'm not sure if raising should be the expected behaviour. This happens at the array level, which is fine for the array as the dtype holds the categories and assigning values directly to the array that are not in the categories should raise.

However, for a Series, if the dtype cannot hold an element, a new array is created.

>>> ser = pd.Series([1, 2])
>>> ser
0    1
1    2
dtype: int64
>>>
>>> ser.values
array([1, 2], dtype=int64)
>>>
>>> id1 = id(ser.values)
>>> id1
2687613117264
>>>
>>> ser.loc[1] = 42
>>> ser
0     1
1    42
dtype: int64
>>>
>>> id(ser.values) == id1
True
>>>
>>> ser.loc[1] = "42"
>>> ser
0     1
1    42
dtype: object
>>>
>>> id(ser.values) == id1
False
>>>

The performant operation, one that does not create a new array but updates the array in place and errors if the underlying array cannot hold the element, is Series.at.

I would therefore expect Series.at to raise ValueError: Cannot setitem on a Categorical with a new...

but I would expect Series.loc to create a new array that can hold the new element with the correct categories. This issue needs further discussion IMO.
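The array-level behaviour described above can be reproduced with a short sketch. It assumes only the public `pd.Categorical` constructor; the exception type raised for a new category has differed between pandas versions, so both `ValueError` and `TypeError` are caught here rather than asserting one of them.

```python
import pandas as pd

# A Categorical array carries its own categories via its dtype.
cat = pd.Categorical(["a", "b", "c"])

# Assigning a value that is already a category updates the array in place.
cat[0] = "c"

# Assigning a value outside the categories is rejected by the array itself.
# The exception type is version-dependent, so both candidates are caught.
try:
    cat[1] = "d"
    raised = False
except (ValueError, TypeError) as err:
    raised = True
    message = str(err)

print(raised)
print(list(cat))
```

The open question in this thread is only what `Series.loc` should do on top of this: propagate the error, or build a new array that can hold the value.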

pandas/core/generic.py

chrispe · 2020-05-24T15:18:08Z

I'm a bit confused about the expected behaviour. In the current version this piece of code has the following result:

>>> import pandas as pd
>>> from pandas import CategoricalDtype
>>> s_data = list("abcd")
>>> s_dtype = CategoricalDtype(list("abc"))
>>> s = pd.Series(s_data, dtype=s_dtype)
>>> s
0      a
1      b
2      c
3    NaN
dtype: category
Categories (3, object): [a, b, c]

>>> # This should raise an exception after the PR
>>> s.loc[4] = "d"
>>> s
0      a
1      b
2      c
3    NaN
4      d
dtype: object

Given that s expects only the values [a, b, c] because of its dtype, should we instead raise the same ValueError about an invalid category value? If not, what's the logic behind the current behaviour? It seems to contradict what we are trying to achieve by raising an exception (instead of replacing the new value with NaN) when a non-existing category is inserted through index expansion.
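For comparison, here is a minimal sketch of the route that the existing error message ("set the categories first") points to. It deliberately assigns to an existing label rather than enlarging the Series, because the enlargement behaviour is exactly what is under discussion here and may differ between versions.

```python
import pandas as pd
from pandas import CategoricalDtype

s_dtype = CategoricalDtype(list("abc"))

# "d" is not a category, so it becomes NaN at construction time.
s = pd.Series(list("abcd"), dtype=s_dtype)
missing = s.isna().tolist()

# Register the new category first, then assign to the existing label 3.
s = s.cat.add_categories(["d"])
s.loc[3] = "d"

print(missing)
print(s.dtype)
```

With the category added up front, the assignment keeps the categorical dtype instead of raising or falling back to another dtype.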

chrispe · 2021-02-15T10:27:00Z

pandas/core/dtypes/concat.py

+            # when an array of valid values is given (GH#25383)
+            if (
+                isinstance(to_concat[0], ExtensionArray)
+                and all(x.shape[0] == 1 for x in to_concat[1:])


I couldn't find a better way to detect when the concat_compat function is called through index expansion, so in cases like this:

ser = pd.Series(Categorical(["a", "b", "c"]))
ser.loc[3] = "c"

With the latest commit we raise a ValueError when an invalid value is added to the categorical through index expansion. It also enables index expansion of a categorical of any dtype.

jreback · 2021-02-15T22:11:45Z

pandas/core/dtypes/concat.py

+            if (
+                isinstance(to_concat[0], ExtensionArray)
+                and all(x.shape[0] == 1 for x in to_concat[1:])
+                and _can_cast_to_categorical(to_concat)


This is a very complicated implementation. This should all be in `find_common_type`, but should be much simpler than this. Either the dtypes are the same or they are not; changing them is not in scope for this issue.

chrispe · 2021-02-16

Ok, the scope is clearer to me now. I will revert to the previous approach and adapt it to raise on dtype mismatch.
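The "either the dtypes are the same or they are not" rule can be illustrated with the public API alone. This is a sketch using `pd.concat`, not the internal `find_common_type` code path touched by this PR, and it only checks whether the result is still categorical, since the exact fallback dtype may depend on the pandas version.

```python
import pandas as pd
from pandas import CategoricalDtype

dtype_abc = CategoricalDtype(["a", "b", "c"])

# Dtype equality depends on the categories, not on the values present.
print(dtype_abc == CategoricalDtype(["a", "b", "c"]))
print(dtype_abc == CategoricalDtype(["a", "b", "c", "d"]))

left = pd.Series(["a", "b"], dtype=dtype_abc)

# Same dtype on both sides: the result stays categorical.
same = pd.concat([left, pd.Series(["c"], dtype=dtype_abc)], ignore_index=True)

# Different categories: the result is no longer categorical.
mixed = pd.concat(
    [left, pd.Series(["d"], dtype=CategoricalDtype(["d"]))],
    ignore_index=True,
)

print(same.dtype)
print(mixed.dtype)
```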

chrispe · 2021-03-13T21:51:13Z

Hi @jreback. I did change my approach, but it does feel like I'm going in circles in terms of complexity, so I may need some further guidance. The thing is that find_common_type does not take into account the values of the arrays, so we cannot conclude whether we should raise an error given only their dtypes.

For example, this should raise an error:

ser = pd.Series(Categorical(["a", "b", "c"]))
ser.loc[3] = "d"
# ValueError "Cannot setitem on a Categorical with a new category, set the categories first"

While this should be fine:

ser = pd.Series(Categorical(["a", "b", "c"]))
ser.loc[3] = "a"
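The distinction between these two cases depends on the value, not the dtype. A minimal sketch of such a value-aware check follows; `fits_categories` is a hypothetical helper written for illustration, not a pandas function and not the code in this PR, and it handles scalars only.

```python
import pandas as pd
from pandas import CategoricalDtype


def fits_categories(value, dtype: CategoricalDtype) -> bool:
    """Hypothetical helper (not part of pandas).

    Return True if the scalar `value` can be stored under `dtype`
    without adding a new category. Missing values always fit.
    """
    return bool(pd.isna(value)) or value in dtype.categories


ser = pd.Series(pd.Categorical(["a", "b", "c"]))

print(fits_categories("a", ser.dtype))   # existing category
print(fits_categories("d", ser.dtype))   # would need a new category
print(fits_categories(None, ser.dtype))  # missing value
```

A check of this kind would have to run wherever the new value is still available, which is why a dtype-only function cannot make the decision on its own.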

jreback · 2021-10-04T00:11:42Z

closing as stale, if you want to continue working, please ping.

Add fix to raise error when category value is not predefined

4487bec

chrispe added 2 commits May 5, 2020 23:02

Fix linting

10098ab

Added new test

cb34580

chrispe changed the title ~~Add fix to raise error when category value is not predefined~~ Add fix to raise error when category value 'x' is not predefined but is assigned to a DF when df.loc[len(df)+1] = x May 7, 2020

chrispe changed the title ~~Add fix to raise error when category value 'x' is not predefined but is assigned to a DF when df.loc[len(df)+1] = x~~ Add fix to raise error when category value 'x' is not predefined but is assigned through df.loc May 7, 2020

chrispe changed the title ~~Add fix to raise error when category value 'x' is not predefined but is assigned through df.loc~~ Add fix to raise error when category value 'x' is not predefined but is assigned through df.loc[..]=x May 7, 2020

simonjayhawkins added Categorical Categorical Data Type Indexing Related to indexing on series/frames, not to indexes themselves labels May 8, 2020

chrispe added 2 commits May 16, 2020 17:34

Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…

1622663

…x-category-index-issue-33952

Add test case for unused categories

c627fa6

chrispe added 2 commits May 16, 2020 18:08

Remove trailing whitespace

ba3a751

Fix linting

51dcdfe

Fix linting

9057b26

simonjayhawkins reviewed May 21, 2020

pandas/core/generic.py

chrispe added 2 commits May 23, 2020 21:02

Remove temporary fix from generic.py

06fdc3e

Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…

8c8f794

…x-category-index-issue-33952

chrispe added 8 commits May 24, 2020 18:50

First fix try through indexing.py

582c023

Fix lint

730fc2b

Fix import ordering

c275eb9

Fix Update

944ae24

Fix lint

8372bdb

Include more related test cases

0e5e418

Fix linting

eea359a

Update test_indexing.py

5f72d4e

chrispe added 3 commits February 15, 2021 11:07

Add new version with raise

af5e141

Add format fixes

6d45570

Update test_categorical.py

31612ed

chrispe commented Feb 15, 2021

jreback reviewed Feb 15, 2021

chrispe added 21 commits February 17, 2021 15:24

Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…

4a2a8e8

…x-category-index-issue-33952

Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…

dd7e3ca

…x-category-index-issue-33952

Update

6d9e667

Use prio_cat_dtype only for EAs

e0da655

Revert usage of first_ea

92d1f14

Fix mypy errors

9b9b382

Use unique1d in _cast_to_common_type

d3df994

Fix isort error

41aa9e3

Renamed input variable for find_common_type

ca0eb1f

Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…

e2cfb79

…x-category-index-issue-33952

Remove new argument in find_common_type

931d6c8

Add check to _get_common_dtype

8065ddb

Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…

5d533dd

…x-category-index-issue-33952

Update dtypes.py

b21326b

Update dtypes.py

335fc06

Update dtypes.py

950dcc4

Update dtypes.py

2ee1df8

Test

17120f0

Add flag in get_common_type

439b49f

Revert

c6e3435

Update dtypes.py

fc40817

jreback closed this Oct 4, 2021

simonjayhawkins mentioned this pull request Jul 9, 2022

BUG: Setting incompatible values into ea column raises instead of casting to object #47577

Closed

3 tasks
