Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

audformat.Table.update() too restrictive #61

Closed
hagenw opened this issue Apr 15, 2021 · 20 comments · Fixed by #276
Closed

audformat.Table.update() too restrictive #61

hagenw opened this issue Apr 15, 2021 · 20 comments · Fixed by #276

Comments

@hagenw
Copy link
Member

hagenw commented Apr 15, 2021

At the moment we require that the scheme of a table that is use to update an existing one has to match, but there are certain scenarios where it makes totally sense that it is only included, but does not match.

E.g. consider the following scenario that matches speaker ID labels:

import audformat.testing


db = audformat.testing.create_db(minimal=True)
db.schemes['s'] = audformat.Scheme(str, labels=['1', '2'])
db['t'] = audformat.Table(audformat.filewise_index(['a', 'b']))
db['t']['s'] = audformat.Column(scheme_id='s')
db['t']['s'].set(['1', '2'])

db_new = audformat.testing.create_db(minimal=True)
db_new.schemes['s'] = audformat.Scheme(str, labels=['1'])
db_new['t'] = audformat.Table(audformat.filewise_index(['c']))
db_new['t']['s'] = audformat.Column(scheme_id='s')
db_new['t']['s'].set(['1'])

db.update(db_new)

this fails with

...
ValueError: Cannot update database, found different value for 'db.schemes['s']':
dtype: str
labels: ['1', '2']
!=
dtype: str
labels: ['1']
@hagenw
Copy link
Member Author

hagenw commented Apr 15, 2021

There is also no easy workaround as we cannot simply replace the new scheme with the original one:

db_new.schemes['s'] = db.schemes['s']
db.update(db_new)

fails with

...
ValueError: Found two columns with name 's' buf different dtypes:
CategoricalDtype(categories=['1', '2'], ordered=False) != CategoricalDtype(categories=['1'], ordered=False).

@frankenjoe
Copy link
Collaborator

frankenjoe commented Apr 15, 2021

So what you suggest is that we do not raise an error as long as the scheme in db contains all categories from the scheme in db_new?

@hagenw hagenw changed the title audformat.Table.update() to restrictive audformat.Table.update() too restrictive Apr 15, 2021
@hagenw
Copy link
Member Author

hagenw commented Apr 15, 2021

Yes, I don't see a reason to raise an error as long as the new scheme is a subset of the old one.

@frankenjoe
Copy link
Collaborator

Yes, I don't see a reason to raise an error as long as the new scheme is a subset of the old one.

Is this the only corner case you see?

@hagenw
Copy link
Member Author

hagenw commented Apr 15, 2021

If you disagree it would be nice to show what we can do instead ;)

@hagenw
Copy link
Member Author

hagenw commented Apr 15, 2021

Yes, I don't see a reason to raise an error as long as the new scheme is a subset of the old one.

Is this the only corner case you see?

Don't know yet, I guess we will discover more when using it.

@frankenjoe
Copy link
Collaborator

There is also no easy workaround as we cannot simply replace the new scheme with the original one:

db_new.schemes['s'] = db.schemes['s']
db.update(db_new)

No that does not work, since this will not update the dtypes in the table.

@hagenw
Copy link
Member Author

hagenw commented Apr 15, 2021

We also might have the case where the new scheme contains a subset, but adds a new entry as well as we now have a new person. How would you handle such a case?

@frankenjoe
Copy link
Collaborator

frankenjoe commented Apr 15, 2021

We also might have the case where the new scheme contains a subset, but adds a new entry as well as we now have a new person. How would you handle such a case?

Good question. Generally, we have to be careful. Otherwise it can happen that you merge schemes / columns that are completely unrelated. The current approach might be restrictive, but it is also secure :)

@frankenjoe
Copy link
Collaborator

Another problem is that a scheme might be referenced by several columns and updating one column can have unwanted side-effects on other columns.

@hagenw
Copy link
Member Author

hagenw commented Apr 15, 2021

We also might have the case where the new scheme contains a subset, but adds a new entry as well as we now have a new person. How would you handle such a case?

Good question. Generally, we have to be careful. Otherwise it can happen that you merge schemes / columns that are completely unrelated. The current approach might be restrictive, but it is also secure :)

Maybe you are right and we should not change the current behavior, but then we need some explanation what to do instead. E.g. for me it's not clear how I can manage to join the two databases at the moment.

@frankenjoe
Copy link
Collaborator

Maybe you are right and we should not change the current behavior, but then we need some explanation what to do instead. E.g. for me it's not clear how I can manage to join the two databases at the moment.

I guess you can rename the scheme and column in the new database before merging. But that's probably not what you want :)

Note that currently the update will also fail if you have the same scheme with a different description.

@hagenw
Copy link
Member Author

hagenw commented Apr 15, 2021

Note that currently the update will also fail if you have the same scheme with a different description.

That's fine with me and would be what I expect.

@hagenw
Copy link
Member Author

hagenw commented Apr 15, 2021

The following workaround let's you update the database I provided in my example:

db_new.schemes['s'] = db.schemes['s']
db_new['t'].df['s'] = db_new['t'].df['s'].astype(db['t'].df['s'].dtype)
db.update(db_new)
db['t'].df

results in

      s
file   
a     1
b     2
c     1

So it's doable, but it doesn't look like we should expect from a user to do it manually as this also adds a burden when databases should be updated and published automatically.

@hagenw
Copy link
Member Author

hagenw commented Apr 15, 2021

Next we investigate what happening if the new database has an overlap with the old scheme labels, but adds a new one as well:

import audformat.testing


db = audformat.testing.create_db(minimal=True)
db.schemes['s'] = audformat.Scheme(str, labels=['1', '2'])
db['t'] = audformat.Table(audformat.filewise_index(['a', 'b']))
db['t']['s'] = audformat.Column(scheme_id='s')
db['t']['s'].set(['1', '2'])

db_new = audformat.testing.create_db(minimal=True)
db_new.schemes['s'] = audformat.Scheme(str, labels=['1', '3'])
db_new['t'] = audformat.Table(audformat.filewise_index(['c', 'd']))
db_new['t']['s'] = audformat.Column(scheme_id='s')
db_new['t']['s'].set(['1', '3'])

@hagenw
Copy link
Member Author

hagenw commented Apr 15, 2021

Here we can update with:

import pandas as pd


common_schemes = list(set(db.schemes['s'].labels + db_new.schemes['s'].labels))
db.schemes['s'] = audformat.Scheme(str, labels=common_schemes)
db_new.schemes['s'] = audformat.Scheme(str, labels=common_schemes)
dtype = pd.api.types.CategoricalDtype(categories=common_schemes, ordered=False)
db['t'].df['s'] = db['t'].df['s'].astype(dtype)
db_new['t'].df['s'] = db_new['t'].df['s'].astype(dtype)
db.update(db_new)
db['t'].df

will result in

      s
file   
a     1
b     2
c     1
d     3

@hagenw
Copy link
Member Author

hagenw commented Apr 27, 2021

There is also an easy solution to the discussed problem, but I'm not sure if you can use this in all cases.
Basically, you can argue that your scheme shouldn't specify the labels if they can change:

import audformat.testing


db = audformat.testing.create_db(minimal=True)
db.schemes['s'] = audformat.Scheme('str')
db['t'] = audformat.Table(audformat.filewise_index(['a', 'b']))
db['t']['s'] = audformat.Column(scheme_id='s')
db['t']['s'].set(['1', '2'])

db_new = audformat.testing.create_db(minimal=True)
db_new.schemes['s'] = audformat.Scheme('str')
db_new['t'] = audformat.Table(audformat.filewise_index(['c']))
db_new['t']['s'] = audformat.Column(scheme_id='s')
db_new['t']['s'].set(['1'])

db.update(db_new)

works without any errors.

The problem is that you might want to use labels to store extra information, e.g. age of a speaker. In the proposed solution above this would no longer be possible and you would have to add the extra information as columns.

@hagenw
Copy link
Member Author

hagenw commented Jan 5, 2022

We now have the audformat.Scheme.replace_labels() method which can be used to solve my examples.
@frankenjoe if you agree that this is sufficient, I think we could add an example to the docs and afterwards close this issue.


Example 1

db = audformat.testing.create_db(minimal=True)
db.schemes['s'] = audformat.Scheme(str, labels=['1', '2'])
db['t'] = audformat.Table(audformat.filewise_index(['a', 'b']))
db['t']['s'] = audformat.Column(scheme_id='s')
db['t']['s'].set(['1', '2'])

db_new = audformat.testing.create_db(minimal=True)
db_new.schemes['s'] = audformat.Scheme(str, labels=['1'])
db_new['t'] = audformat.Table(audformat.filewise_index(['c']))
db_new['t']['s'] = audformat.Column(scheme_id='s')
db_new['t']['s'].set(['1'])

to update we can do:

db_new.schemes['s'].replace_labels(db.schemes['s'].labels)
db.update(db_new)

Example 2

db = audformat.testing.create_db(minimal=True)
db.schemes['s'] = audformat.Scheme(str, labels=['1', '2'])
db['t'] = audformat.Table(audformat.filewise_index(['a', 'b']))
db['t']['s'] = audformat.Column(scheme_id='s')
db['t']['s'].set(['1', '2'])

db_new = audformat.testing.create_db(minimal=True)
db_new.schemes['s'] = audformat.Scheme(str, labels=['1', '3'])
db_new['t'] = audformat.Table(audformat.filewise_index(['c', 'd']))
db_new['t']['s'] = audformat.Column(scheme_id='s')
db_new['t']['s'].set(['1', '3'])

to update we can do:

common_labels = list(set(db.schemes['s'].labels + db_new.schemes['s'].labels))
db.schemes['s'].replace_labels(common_labels)
db_new.schemes['s'].replace_labels(common_labels)
db.update(db_new)

@hagenw
Copy link
Member Author

hagenw commented Jan 5, 2022

The solution also works with dictionary as labels, e.g.

Example 3

db = audformat.testing.create_db(minimal=True)
db.schemes['s'] = audformat.Scheme(str, labels={'1': 'speaker1', '2': 'speaker2'})
db['t'] = audformat.Table(audformat.filewise_index(['a', 'b']))
db['t']['s'] = audformat.Column(scheme_id='s')
db['t']['s'].set(['1', '2'])

db_new = audformat.testing.create_db(minimal=True)
db_new.schemes['s'] = audformat.Scheme(str, labels={'1': 'speaker1', '3': 'speaker3'})
db_new['t'] = audformat.Table(audformat.filewise_index(['c', 'd']))
db_new['t']['s'] = audformat.Column(scheme_id='s')
db_new['t']['s'].set(['1', '3'])

to update we can do:

common_labels = {**db.schemes['s'].labels, **db_new.schemes['s'].labels}
db.schemes['s'].replace_labels(common_labels)
db_new.schemes['s'].replace_labels(common_labels)
db.update(db_new)

@hagenw
Copy link
Member Author

hagenw commented Jan 6, 2022

I just realized that we have functions in audformat.utils to help with the desired task (audformat.utils.join_labels(), audformat.utils.join_schemes()). This means all the examples from above can be solved by running:

audformat.utils.join_schemes([db, db_new], 's')
db.update(db_new)

which is straightforward, you just need to find it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants