Fix: deduplicate subjects on works and list items on editions #8663

scottbarnes · 2023-12-21T23:03:04Z

Closes #8661.

This PR de-duplicates, using casefold(), the subjects field on Work items, and field values of list items that are added to Edition items on re-import via load().

This does not affect edits made via the UI as they go through DocSaveHelper() and process_work() from openlibrary/plugins/upstream/addbook.py, which de-duplicates via a slightly different strategy. Rather than merging two lists (existing subjects on a matched item, and new ones from an import item), it takes form data with every subject as a CSV and dedupes that string.
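To make the difference concrete, here is a rough sketch of the two strategies. The function names `merge_subjects` and `dedupe_csv_subjects` are hypothetical and are not the actual Open Library code. The naive comma split is also an assumption about the form data, not a statement of how process_work() parses it.

```python
def merge_subjects(existing: list[str], incoming: list[str]) -> list[str]:
    """Hypothetical import-side merge: combine two lists and keep the
    first spelling seen for each casefolded value."""
    seen: set[str] = set()
    merged: list[str] = []
    for subject in [*existing, *incoming]:
        key = subject.casefold()
        if key not in seen:
            seen.add(key)
            merged.append(subject)
    return merged


def dedupe_csv_subjects(raw: str) -> list[str]:
    """Hypothetical form-side dedupe: split one comma-separated string
    and drop empty entries and casefold duplicates."""
    seen: set[str] = set()
    result: list[str] = []
    for part in raw.split(","):
        subject = part.strip()
        if subject and subject.casefold() not in seen:
            seen.add(subject.casefold())
            result.append(subject)
    return result
```

Both helpers keep the first occurrence, so the order of the inputs decides which capitalization survives.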

Though I had hoped to unify the de-duping logic, I think that is beyond the scope of this particular issue.

Technical

I added a small amount of logic to the fields that looked as if they could produce duplicates because matching was not case-insensitive.

Testing

The unit tests should cover this, but the real test is whether it's possible, via load(), to add a duplicated subject or value in a list field on reimport after accounting for Python's str.casefold() method. That is to say, for the purposes of this PR, if "Straße" exists as a subject on a Work, "strasse" should not be added via load().

NOTE: Because the web UI (e.g. the Work and Edition edit pages) uses process_work(), which currently de-duplicates via str.lower(), this PR changes the behavior there to use str.casefold() so it matches the load() de-duping.
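The difference between the two methods can be checked directly in Python. The `is_duplicate` helper below is a hypothetical illustration, not a function from the codebase.

```python
existing = "Straße"
incoming = "strasse"

# lower() keeps the sharp s, so the two strings still differ.
assert existing.lower() != incoming.lower()

# casefold() expands "ß" to "ss", so the two strings compare equal.
assert existing.casefold() == incoming.casefold()


def is_duplicate(candidate: str, subjects: list[str]) -> bool:
    """Hypothetical check: True if candidate matches any subject
    after casefolding both sides."""
    folded = candidate.casefold()
    return any(folded == subject.casefold() for subject in subjects)
```

With a check like this, "strasse" would be treated as already present on a Work that has "Straße".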

Screenshot

Stakeholders

This PR de-duplicates, using `casefold()`, the `subjects` field on `Work` items, and field values of list items that are added to `Edition` items on import via `load()`. This does not affect edits made via the UI as they go through `process_work()` from `openlibrary/plugins/upstream/addbook.py`, which de-duplicates via a slightly different strategy. Rather than merging two lists (existing subjects on a matched item, new ones from an import item), it takes form data with every subject and dedupes that list. Though I had hoped to unify the de-duping logic, I think that is beyond the scope of this particular issue. For more, see: internetarchive#8661

tfmorris · 2023-12-22T17:03:00Z

openlibrary/catalog/add_book/tests/test_add_book.py

+    def get_casefold_sort(item_list: list[str]):
+        return sorted([item.casefold() for item in item_list])
+
+    expected = ['granite', 'Straße', 'ΠΑΡΆΔΕΙΣΟΣ', 'sandstone']


Isn't "Granite" the preferred capitalization here?

Trying to determine which casing is better would be non-trivial; currently it's just choosing to use the first item in the list. That's good enough for now. Things like making subjects consistent (e.g. preferring the title case variant) might also be best handled at another layer.
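For illustration, a first-item-wins dedupe keyed on casefold() might look like the sketch below. `uniq_by` is a hypothetical stand-in, not the helper the PR actually uses, and the input list is invented so that the result matches the `expected` value quoted above.

```python
from collections.abc import Callable, Iterable


def uniq_by(
    values: Iterable[str], key: Callable[[str], str] = str.casefold
) -> list[str]:
    """Hypothetical helper: keep the first value seen for each key."""
    seen: set[str] = set()
    result: list[str] = []
    for value in values:
        k = key(value)
        if k not in seen:
            seen.add(k)
            result.append(value)
    return result


# Invented input: "granite" comes before "Granite", so lowercase wins.
subjects = ["granite", "Granite", "Straße", "ΠΑΡΆΔΕΙΣΟΣ", "sandstone", "STRASSE"]
deduped = uniq_by(subjects)
```

Swapping the first two items would keep "Granite" instead, which is why picking a preferred casing is a separate concern from removing duplicates.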

openlibrary/catalog/add_book/__init__.py

openlibrary/plugins/upstream/addbook.py

cdrini

Lgtm! One small change then merge at your discretion.

- fix comparison
- remove pointless sorting
- ensure tests catch case where 1 existing duplicate is removed and 1 new item is added, resulting in a final list the same length as the original.

scottbarnes added 2 commits December 21, 2023 15:09

Make the web UI / DocSaveHelper() dedupe using casefold()

fa1845e

scottbarnes force-pushed the 8661/fix/make-import-and-share-deduplication-logic branch from 4c81801 to fa1845e Compare December 21, 2023 23:09

tfmorris reviewed Dec 22, 2023

cdrini reviewed Dec 22, 2023

cdrini self-assigned this Dec 25, 2023

Use uniq to deduplicate subjects

363fe1c

cdrini reviewed Dec 27, 2023

cdrini reviewed Dec 27, 2023

cdrini approved these changes Dec 27, 2023

PR review fixes

8b9bffb

scottbarnes force-pushed the 8661/fix/make-import-and-share-deduplication-logic branch from 2815144 to 8b9bffb Compare December 27, 2023 21:51

scottbarnes merged commit 2e54ee9 into internetarchive:master Dec 27, 2023
3 checks passed

scottbarnes deleted the 8661/fix/make-import-and-share-deduplication-logic branch December 27, 2023 21:59
