CategoricalDtype can be lost if addressing several values #349

Open
hagenw opened this issue Jan 18, 2023 · 1 comment

hagenw commented Jan 18, 2023

This is similar to #324, but the underlying problem seems to be a pandas issue.

Let's start with a filewise and a segmented table, both using a 'spk' scheme with the labels 'a' and 'b', and both containing two entries labeled as 'a'.

import audformat


db = audformat.Database('db')

db.schemes['spk'] = audformat.Scheme('str', labels=['a', 'b'])
index = audformat.filewise_index(['f1', 'f2'])
db['files'] = audformat.Table(index)
db['files']['spk'] = audformat.Column(scheme_id='spk')
db['files']['spk'].set(['a', 'a'])

db.schemes['label'] = audformat.Scheme('int')
index = audformat.segmented_index(['f1', 'f1'], [0, 1], [1, 2])
db['segments'] = audformat.Table(index)
db['segments']['spk'] = audformat.Column(scheme_id='spk')
db['segments']['spk'].set(['a', 'a'])

Assigning a single value with a label that is not part of the scheme fails as expected:

>>> df = db['files'].get()
>>> df.spk.cat.categories
Index(['a', 'b'], dtype='object')
>>> df.loc['f1', 'spk'] = 'c'
...
TypeError: Cannot setitem on a Categorical with a new category (c), set the categories first
>>> df.iloc[0, 0] = 'c'
...
TypeError: Cannot setitem on a Categorical with a new category (c), set the categories first
>>> df = db['segments'].get()
>>> df.spk.cat.categories
Index(['a', 'b'], dtype='object')
>>> df.loc[audformat.segmented_index(['f1'], [0], [1]), 'spk'] = 'c'
...
TypeError: Cannot setitem on a Categorical with a new category (c), set the categories first
>>> df.iloc[0, 0] = 'c'
...
TypeError: Cannot setitem on a Categorical with a new category (c), set the categories first

But when addressing several values at once, we can still set a forbidden label, which silently removes the CategoricalDtype:

>>> df = db['files'].get()
>>> df.loc[:, 'spk'] = 'c'
>>> df.spk.cat.categories
...
AttributeError: Can only use .cat accessor with a 'category' dtype
>>> df
     spk
file    
f1     c
f2     c
>>> df = db['segments'].get()
>>> df.loc[:, 'spk'] = 'c'
>>> df.spk.cat.categories
...
AttributeError: Can only use .cat accessor with a 'category' dtype
>>> df
                                     spk
file start           end                
f1   0 days 00:00:00 0 days 00:00:01   c
     0 days 00:00:01 0 days 00:00:02   c

I'm not sure yet if this is considered a feature or a bug in pandas.
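
The effect can also be probed without audformat. The following is a minimal sketch using plain pandas; the helper name `assign_outcome` is made up for illustration. Because the result of the whole-column assignment may differ between pandas versions, the code reports what happened instead of assuming it.

```python
import pandas as pd


def assign_outcome(indexer, value):
    """Assign value via .loc and report the outcome (hypothetical helper).

    Returns 'TypeError' if pandas rejects the assignment,
    otherwise the name of the resulting dtype of the 'spk' column.
    """
    dtype = pd.CategoricalDtype(categories=['a', 'b'])
    df = pd.DataFrame(
        {'spk': pd.Series(['a', 'a'], index=['f1', 'f2'], dtype=dtype)}
    )
    try:
        df.loc[indexer, 'spk'] = value
    except TypeError:
        return 'TypeError'
    return str(df['spk'].dtype)


# Single value with a label outside the categories
print(assign_outcome('f1', 'c'))
# Whole column with a label outside the categories
print(assign_outcome(slice(None), 'c'))
```

If the second call prints a dtype other than `category`, the installed pandas version shows the behavior described above; if it prints `TypeError`, that version rejects the assignment.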

No upstream issue matches this directly, but there are related ones: pandas-dev/pandas#46820, pandas-dev/pandas#40080


hagenw commented Apr 13, 2023

It might also be considered a feature, as it allows overwriting a whole column.

I checked: the two related pandas issues are still open, and pandas still behaves the same way.
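
If overwriting a whole column is the goal, one possible way to keep the dtype is to build the replacement as a Categorical that reuses the original dtype. This is only a sketch with plain pandas; `overwrite_column` is a hypothetical helper and not part of audformat or pandas.

```python
import pandas as pd


def overwrite_column(df, column, value):
    """Set every row of a categorical column to value, keeping its dtype.

    Hypothetical helper: raises ValueError for labels outside the categories
    instead of silently falling back to another dtype.
    """
    dtype = df[column].dtype
    if not isinstance(dtype, pd.CategoricalDtype):
        raise TypeError(f"column '{column}' is not categorical")
    if value not in dtype.categories:
        raise ValueError(f"'{value}' is not in {list(dtype.categories)}")
    # Replace the column with a new Categorical that reuses the original dtype
    df[column] = pd.Categorical([value] * len(df), dtype=dtype)
    return df


dtype = pd.CategoricalDtype(categories=['a', 'b'])
df = pd.DataFrame(
    {'spk': pd.Series(['a', 'a'], index=['f1', 'f2'], dtype=dtype)}
)
overwrite_column(df, 'spk', 'b')
print(df['spk'].dtype)                  # category
print(list(df['spk'].cat.categories))   # ['a', 'b']
```

Checking the label up front makes a forbidden value such as 'c' fail with an explicit error, regardless of how `.loc` treats whole-column assignments.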
