Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: deduplicate subjects on works and list items on editions #8663

Conversation

scottbarnes
Copy link
Collaborator

@scottbarnes scottbarnes commented Dec 21, 2023

Closes #8661.

This PR de-duplicates, using casefold(), the subjects field on Work items, and field values of list items that are added to Edition items on re-import via load().

This does not affect edits made via the UI as they go through DocSaveHelper() and process_work() from openlibrary/plugins/upstream/addbook.py, which de-duplicates via a slightly differest strategy. Rather than merging two lists (existing subjects on a matched item, and new ones from an import item), it takes form data with every subject as a CSV and dedupes that string.

Though I had hoped to unify the de-duping logic, I think that is beyond the scope of this particular issue.

Technical

I added a slight bit of logic to the fields that looked as if they might possibly result in duplicates because of a lack of case-insensitive matching.

Testing

The unit tests should cover this, but the test is whether it's possible, via load(), to add a duplicated subject or value in a list field on reimport after accounting for Python's string.casefold() method. That is to say, for the purposes of this PR, if "Straße" exists as a subject on a Work, "strasse" should not be added via load().

NOTE: Because the web UI (e.g. the Work and Edition edit pages) uses process_work(), which currently de-duplicates via string.lower(), this PR changes the behavior there to use string.casefold() so it matches the load() de-duping.

Screenshot

Stakeholders

This PR de-duplicates, using `casefold()`, the `subjects` field on
`Work` items, and field values of list items that are added to `Edition`
items on import via `load()`.

This does not affect edits made via the UI as they go through
`process_work()` from `openlibrary/plugins/upstream/addbook.py`, which
de-duplicates via a slightly differest strategy. Rather than merging
two lists (existing subjects on a matched item, new ones from an import
item), it takes form data with every subject and dedupes that list.

Though I had hoped to unify the de-duping logic, I think that is beyond
the scope of this particular issue.

For more, see:
internetarchive#8661
@scottbarnes scottbarnes force-pushed the 8661/fix/make-import-and-share-deduplication-logic branch from 4c81801 to fa1845e Compare December 21, 2023 23:09
def get_casefold_sort(item_list: list[str]):
return sorted([item.casefold() for item in item_list])

expected = ['granite', 'Straße', 'ΠΑΡΆΔΕΙΣΟΣ', 'sandstone']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't "Granite" the preferred capitalization here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to determine which casing is better would be non-trivial ; currently it's just choosing to use the first item in the list. That's good enough for now. Things like making subjects consistent (eg preferring the title case variant) might also be best handled at another layer.

@cdrini cdrini self-assigned this Dec 25, 2023
Copy link
Collaborator

@cdrini cdrini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm! One small change then merge at your discretion.

- fix comparison
- remove pointless sorting
- ensure tests catch case where 1 existing duplicate is removed and 1
  new item is added, resulting in a final list the same length as the
  original.
@scottbarnes scottbarnes force-pushed the 8661/fix/make-import-and-share-deduplication-logic branch from 2815144 to 8b9bffb Compare December 27, 2023 21:51
@scottbarnes scottbarnes merged commit 2e54ee9 into internetarchive:master Dec 27, 2023
3 checks passed
@scottbarnes scottbarnes deleted the 8661/fix/make-import-and-share-deduplication-logic branch December 27, 2023 21:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Imports should de-duplicate fields on import
3 participants