audformat.Table.update() too restrictive #61

hagenw · 2021-04-15T08:06:03Z

At the moment we require that the scheme of a table that is use to update an existing one has to match, but there are certain scenarios where it makes totally sense that it is only included, but does not match.

E.g. consider the following scenario that matches speaker ID labels:

import audformat.testing


db = audformat.testing.create_db(minimal=True)
db.schemes['s'] = audformat.Scheme(str, labels=['1', '2'])
db['t'] = audformat.Table(audformat.filewise_index(['a', 'b']))
db['t']['s'] = audformat.Column(scheme_id='s')
db['t']['s'].set(['1', '2'])

db_new = audformat.testing.create_db(minimal=True)
db_new.schemes['s'] = audformat.Scheme(str, labels=['1'])
db_new['t'] = audformat.Table(audformat.filewise_index(['c']))
db_new['t']['s'] = audformat.Column(scheme_id='s')
db_new['t']['s'].set(['1'])

db.update(db_new)

this fails with

...
ValueError: Cannot update database, found different value for 'db.schemes['s']':
dtype: str
labels: ['1', '2']
!=
dtype: str
labels: ['1']

The text was updated successfully, but these errors were encountered:

hagenw · 2021-04-15T08:08:23Z

There is also no easy workaround as we cannot simply replace the new scheme with the original one:

db_new.schemes['s'] = db.schemes['s']
db.update(db_new)

fails with

...
ValueError: Found two columns with name 's' buf different dtypes:
CategoricalDtype(categories=['1', '2'], ordered=False) != CategoricalDtype(categories=['1'], ordered=False).

frankenjoe · 2021-04-15T08:12:46Z

So what you suggest is that we do not raise an error as long as the scheme in db contains all categories from the scheme in db_new?

hagenw · 2021-04-15T08:14:45Z

Yes, I don't see a reason to raise an error as long as the new scheme is a subset of the old one.

frankenjoe · 2021-04-15T08:15:20Z

Yes, I don't see a reason to raise an error as long as the new scheme is a subset of the old one.

Is this the only corner case you see?

hagenw · 2021-04-15T08:15:23Z

If you disagree it would be nice to show what we can do instead ;)

hagenw · 2021-04-15T08:15:46Z

Yes, I don't see a reason to raise an error as long as the new scheme is a subset of the old one.

Is this the only corner case you see?

Don't know yet, I guess we will discover more when using it.

frankenjoe · 2021-04-15T08:15:59Z

There is also no easy workaround as we cannot simply replace the new scheme with the original one:
db_new.schemes['s'] = db.schemes['s']
db.update(db_new)

No that does not work, since this will not update the dtypes in the table.

hagenw · 2021-04-15T08:18:29Z

We also might have the case where the new scheme contains a subset, but adds a new entry as well as we now have a new person. How would you handle such a case?

frankenjoe · 2021-04-15T08:22:13Z

We also might have the case where the new scheme contains a subset, but adds a new entry as well as we now have a new person. How would you handle such a case?

Good question. Generally, we have to be careful. Otherwise it can happen that you merge schemes / columns that are completely unrelated. The current approach might be restrictive, but it is also secure :)

frankenjoe · 2021-04-15T08:24:17Z

Another problem is that a scheme might be referenced by several columns and updating one column can have unwanted side-effects on other columns.

hagenw · 2021-04-15T08:26:48Z

We also might have the case where the new scheme contains a subset, but adds a new entry as well as we now have a new person. How would you handle such a case?

Good question. Generally, we have to be careful. Otherwise it can happen that you merge schemes / columns that are completely unrelated. The current approach might be restrictive, but it is also secure :)

Maybe you are right and we should not change the current behavior, but then we need some explanation what to do instead. E.g. for me it's not clear how I can manage to join the two databases at the moment.

frankenjoe · 2021-04-15T08:30:22Z

Maybe you are right and we should not change the current behavior, but then we need some explanation what to do instead. E.g. for me it's not clear how I can manage to join the two databases at the moment.

I guess you can rename the scheme and column in the new database before merging. But that's probably not what you want :)

Note that currently the update will also fail if you have the same scheme with a different description.

hagenw · 2021-04-15T08:31:29Z

Note that currently the update will also fail if you have the same scheme with a different description.

That's fine with me and would be what I expect.

hagenw · 2021-04-15T09:23:05Z

The following workaround let's you update the database I provided in my example:

db_new.schemes['s'] = db.schemes['s']
db_new['t'].df['s'] = db_new['t'].df['s'].astype(db['t'].df['s'].dtype)
db.update(db_new)
db['t'].df

results in

      s
file   
a     1
b     2
c     1

So it's doable, but it doesn't look like we should expect from a user to do it manually as this also adds a burden when databases should be updated and published automatically.

hagenw · 2021-04-15T09:26:10Z

Next we investigate what happening if the new database has an overlap with the old scheme labels, but adds a new one as well:

import audformat.testing


db = audformat.testing.create_db(minimal=True)
db.schemes['s'] = audformat.Scheme(str, labels=['1', '2'])
db['t'] = audformat.Table(audformat.filewise_index(['a', 'b']))
db['t']['s'] = audformat.Column(scheme_id='s')
db['t']['s'].set(['1', '2'])

db_new = audformat.testing.create_db(minimal=True)
db_new.schemes['s'] = audformat.Scheme(str, labels=['1', '3'])
db_new['t'] = audformat.Table(audformat.filewise_index(['c', 'd']))
db_new['t']['s'] = audformat.Column(scheme_id='s')
db_new['t']['s'].set(['1', '3'])

hagenw · 2021-04-15T09:31:21Z

Here we can update with:

import pandas as pd


common_schemes = list(set(db.schemes['s'].labels + db_new.schemes['s'].labels))
db.schemes['s'] = audformat.Scheme(str, labels=common_schemes)
db_new.schemes['s'] = audformat.Scheme(str, labels=common_schemes)
dtype = pd.api.types.CategoricalDtype(categories=common_schemes, ordered=False)
db['t'].df['s'] = db['t'].df['s'].astype(dtype)
db_new['t'].df['s'] = db_new['t'].df['s'].astype(dtype)
db.update(db_new)
db['t'].df

will result in

hagenw · 2021-04-27T06:00:56Z

There is also an easy solution to the discussed problem, but I'm not sure if you can use this in all cases.
Basically, you can argue that your scheme shouldn't specify the labels if they can change:

import audformat.testing


db = audformat.testing.create_db(minimal=True)
db.schemes['s'] = audformat.Scheme('str')
db['t'] = audformat.Table(audformat.filewise_index(['a', 'b']))
db['t']['s'] = audformat.Column(scheme_id='s')
db['t']['s'].set(['1', '2'])

db_new = audformat.testing.create_db(minimal=True)
db_new.schemes['s'] = audformat.Scheme('str')
db_new['t'] = audformat.Table(audformat.filewise_index(['c']))
db_new['t']['s'] = audformat.Column(scheme_id='s')
db_new['t']['s'].set(['1'])

db.update(db_new)

works without any errors.

The problem is that you might want to use labels to store extra information, e.g. age of a speaker. In the proposed solution above this would no longer be possible and you would have to add the extra information as columns.

hagenw · 2022-01-05T09:31:49Z

We now have the audformat.Scheme.replace_labels() method which can be used to solve my examples.
@frankenjoe if you agree that this is sufficient, I think we could add an example to the docs and afterwards close this issue.

Example 1

db = audformat.testing.create_db(minimal=True)
db.schemes['s'] = audformat.Scheme(str, labels=['1', '2'])
db['t'] = audformat.Table(audformat.filewise_index(['a', 'b']))
db['t']['s'] = audformat.Column(scheme_id='s')
db['t']['s'].set(['1', '2'])

db_new = audformat.testing.create_db(minimal=True)
db_new.schemes['s'] = audformat.Scheme(str, labels=['1'])
db_new['t'] = audformat.Table(audformat.filewise_index(['c']))
db_new['t']['s'] = audformat.Column(scheme_id='s')
db_new['t']['s'].set(['1'])

to update we can do:

db_new.schemes['s'].replace_labels(db.schemes['s'].labels)
db.update(db_new)

Example 2

db = audformat.testing.create_db(minimal=True)
db.schemes['s'] = audformat.Scheme(str, labels=['1', '2'])
db['t'] = audformat.Table(audformat.filewise_index(['a', 'b']))
db['t']['s'] = audformat.Column(scheme_id='s')
db['t']['s'].set(['1', '2'])

db_new = audformat.testing.create_db(minimal=True)
db_new.schemes['s'] = audformat.Scheme(str, labels=['1', '3'])
db_new['t'] = audformat.Table(audformat.filewise_index(['c', 'd']))
db_new['t']['s'] = audformat.Column(scheme_id='s')
db_new['t']['s'].set(['1', '3'])

to update we can do:

common_labels = list(set(db.schemes['s'].labels + db_new.schemes['s'].labels))
db.schemes['s'].replace_labels(common_labels)
db_new.schemes['s'].replace_labels(common_labels)
db.update(db_new)

hagenw · 2022-01-05T09:36:33Z

The solution also works with dictionary as labels, e.g.

Example 3

db = audformat.testing.create_db(minimal=True)
db.schemes['s'] = audformat.Scheme(str, labels={'1': 'speaker1', '2': 'speaker2'})
db['t'] = audformat.Table(audformat.filewise_index(['a', 'b']))
db['t']['s'] = audformat.Column(scheme_id='s')
db['t']['s'].set(['1', '2'])

db_new = audformat.testing.create_db(minimal=True)
db_new.schemes['s'] = audformat.Scheme(str, labels={'1': 'speaker1', '3': 'speaker3'})
db_new['t'] = audformat.Table(audformat.filewise_index(['c', 'd']))
db_new['t']['s'] = audformat.Column(scheme_id='s')
db_new['t']['s'].set(['1', '3'])

to update we can do:

common_labels = {**db.schemes['s'].labels, **db_new.schemes['s'].labels}
db.schemes['s'].replace_labels(common_labels)
db_new.schemes['s'].replace_labels(common_labels)
db.update(db_new)

hagenw · 2022-01-06T06:43:51Z

I just realized that we have functions in audformat.utils to help with the desired task (audformat.utils.join_labels(), audformat.utils.join_schemes()). This means all the examples from above can be solved by running:

audformat.utils.join_schemes([db, db_new], 's')
db.update(db_new)

which is straightforward, you just need to find it.

hagenw changed the title ~~audformat.Table.update() to restrictive~~ audformat.Table.update() too restrictive Apr 15, 2021

frankenjoe mentioned this issue Apr 26, 2021

Add audformat.Scheme.replace_labels() #62

Merged

hagenw mentioned this issue Jan 5, 2022

Add method to move files between tables #83

Open

This was referenced Aug 4, 2022

Make sure misc tables are updated in the correct order in Database.update() #258

Closed

Raise error for misc table in join_labels() + join_schemes() #275

Merged

frankenjoe mentioned this issue Aug 4, 2022

Join schemes in Database.update() #276

Merged

hagenw closed this as completed in #276 Aug 4, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

audformat.Table.update() too restrictive #61

audformat.Table.update() too restrictive #61

hagenw commented Apr 15, 2021

hagenw commented Apr 15, 2021

frankenjoe commented Apr 15, 2021 •

edited

Loading

hagenw commented Apr 15, 2021

frankenjoe commented Apr 15, 2021

hagenw commented Apr 15, 2021

hagenw commented Apr 15, 2021

frankenjoe commented Apr 15, 2021

hagenw commented Apr 15, 2021

frankenjoe commented Apr 15, 2021 •

edited

Loading

frankenjoe commented Apr 15, 2021

hagenw commented Apr 15, 2021 •

edited

Loading

frankenjoe commented Apr 15, 2021

hagenw commented Apr 15, 2021

hagenw commented Apr 15, 2021

hagenw commented Apr 15, 2021

hagenw commented Apr 15, 2021

hagenw commented Apr 27, 2021

hagenw commented Jan 5, 2022

hagenw commented Jan 5, 2022

hagenw commented Jan 6, 2022

audformat.Table.update() too restrictive #61

audformat.Table.update() too restrictive #61

Comments

hagenw commented Apr 15, 2021

hagenw commented Apr 15, 2021

frankenjoe commented Apr 15, 2021 • edited Loading

hagenw commented Apr 15, 2021

frankenjoe commented Apr 15, 2021

hagenw commented Apr 15, 2021

hagenw commented Apr 15, 2021

frankenjoe commented Apr 15, 2021

hagenw commented Apr 15, 2021

frankenjoe commented Apr 15, 2021 • edited Loading

frankenjoe commented Apr 15, 2021

hagenw commented Apr 15, 2021 • edited Loading

frankenjoe commented Apr 15, 2021

hagenw commented Apr 15, 2021

hagenw commented Apr 15, 2021

hagenw commented Apr 15, 2021

hagenw commented Apr 15, 2021

hagenw commented Apr 27, 2021

hagenw commented Jan 5, 2022

hagenw commented Jan 5, 2022

hagenw commented Jan 6, 2022

frankenjoe commented Apr 15, 2021 •

edited

Loading

frankenjoe commented Apr 15, 2021 •

edited

Loading

hagenw commented Apr 15, 2021 •

edited

Loading