-
-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix: deduplicate subjects on works and list items on editions #8663
Fix: deduplicate subjects on works and list items on editions #8663
Conversation
This PR de-duplicates, using `casefold()`, the `subjects` field on `Work` items, and field values of list items that are added to `Edition` items on import via `load()`. This does not affect edits made via the UI as they go through `process_work()` from `openlibrary/plugins/upstream/addbook.py`, which de-duplicates via a slightly differest strategy. Rather than merging two lists (existing subjects on a matched item, new ones from an import item), it takes form data with every subject and dedupes that list. Though I had hoped to unify the de-duping logic, I think that is beyond the scope of this particular issue. For more, see: internetarchive#8661
4c81801
to
fa1845e
Compare
def get_casefold_sort(item_list: list[str]): | ||
return sorted([item.casefold() for item in item_list]) | ||
|
||
expected = ['granite', 'Straße', 'ΠΑΡΆΔΕΙΣΟΣ', 'sandstone'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't "Granite" the preferred capitalization here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trying to determine which casing is better would be non-trivial ; currently it's just choosing to use the first item in the list. That's good enough for now. Things like making subjects consistent (eg preferring the title case variant) might also be best handled at another layer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lgtm! One small change then merge at your discretion.
- fix comparison - remove pointless sorting - ensure tests catch case where 1 existing duplicate is removed and 1 new item is added, resulting in a final list the same length as the original.
2815144
to
8b9bffb
Compare
Closes #8661.
This PR de-duplicates, using
casefold()
, thesubjects
field onWork
items, and field values of list items that are added toEdition
items on re-import viaload()
.This does not affect edits made via the UI as they go through
DocSaveHelper()
andprocess_work()
fromopenlibrary/plugins/upstream/addbook.py
, which de-duplicates via a slightly differest strategy. Rather than merging two lists (existing subjects on a matched item, and new ones from an import item), it takes form data with every subject as a CSV and dedupes that string.Though I had hoped to unify the de-duping logic, I think that is beyond the scope of this particular issue.
Technical
I added a slight bit of logic to the fields that looked as if they might possibly result in duplicates because of a lack of case-insensitive matching.
Testing
The unit tests should cover this, but the test is whether it's possible, via
load()
, to add a duplicated subject or value in a list field on reimport after accounting for Python'sstring.casefold()
method. That is to say, for the purposes of this PR, if"Straße"
exists as a subject on aWork
,"strasse"
should not be added viaload()
.NOTE: Because the web UI (e.g. the
Work
andEdition
edit pages) usesprocess_work()
, which currently de-duplicates viastring.lower()
, this PR changes the behavior there to usestring.casefold()
so it matches theload()
de-duping.Screenshot
Stakeholders