Refactor and move old 'catalog.merge' naming to 'catalog.add_book.match' for import record matching #8296
Conversation
resolved openlibrary/tests/catalog/test_utils.py:236: AssertionError
The input to this is:
Not sure what the input to
resolved openlibrary/openlibrary/catalog/add_book/__init__.py Lines 827 to 842 in c561294
subtitle splitting: openlibrary/openlibrary/catalog/add_book/__init__.py Lines 182 to 201 in c561294
Looks like there are some naming inconsistencies around titles -- I'm not seeing that the tests take this into account with the test data provided, which may not be realistic or possible.
resolved in a slightly different way -- the lines have been removed.
DRY: openlibrary/openlibrary/catalog/add_book/match.py Lines 42 to 44 in c561294
and openlibrary/openlibrary/catalog/add_book/__init__.py Lines 836 to 838 in c561294
Move to:
@@ -991,9 +977,11 @@ def test_find_match_is_used_when_looking_for_edition_matches(mock_site) -> None:
    This also indirectly tests `merge_marc.editions_match()` (even though it's
    not a MARC record.)
    """
    # Unfortunately this Work level author is totally irrelevant to the matching
Is this just a test issue? Do live Edition objects inherit authors from their Works?
I know this is still a draft, but I'm not sure how long it'll stay open for review, so I wanted to add some thoughts now. Doing a reasonable job for Amazon (or worse, BWB) is likely to be an impossible task. Things a title could be polluted with include, but are not limited to: author names, series names, words like "by", "trans.", "ed.", etc. Author strings might contain the names of multiple authors mushed together, or role names (e.g. trans.).
Something which should be added to the test cases (in addition to the above, if you decide to attempt to address Amazon data, which I don't recommend) are some examples which include things like ": a novel", "un roman", "ein Roman", etc. (and equivalents in other languages), plus "Volume", "Vol."
Lastly, a rule-based approach is likely to be very difficult to get right. Have you considered training a classifier for this task?
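To make the suggested test cases concrete, here is a minimal sketch of the kind of suffix-stripping being proposed. The helper `strip_subtitle_noise` and its suffix list are hypothetical and illustrative only, not part of this PR or the existing codebase:

```python
import re

# Illustrative, non-exhaustive list of subtitle/volume noise patterns
# (": a novel", "un roman", "ein Roman", "Vol. N", etc.).
NOISE = re.compile(
    r'(:\s*(a novel|un roman|ein roman)|,?\s*(volume|vol\.?)\s*\d+)\s*$',
    re.IGNORECASE,
)

def strip_subtitle_noise(title: str) -> str:
    """Drop common trailing subtitle/volume noise before matching."""
    return NOISE.sub('', title).strip()

print(strip_subtitle_noise('The Sea: A Novel'))     # 'The Sea'
print(strip_subtitle_noise('Remembrance, Vol. 2'))  # 'Remembrance'
```

Each of these noise phrases would make a good parametrized test case, asserting that two titles differing only in such a suffix are treated as the same work.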
norm = norm[4:]
elif norm.startswith('a '):
    norm = norm[2:]
norm = match.normalize(s).replace(' and ', '')
return norm.replace(' ', '')
Word boundaries are important information. It seems odd to remove them.
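To illustrate the concern, here is a minimal sketch of a normalizer in the spirit of the snippet above (the `squash` function is hypothetical, not the actual `mk_norm()` implementation): once all spaces are stripped, genuinely different titles can collapse to the same key.

```python
import re

def squash(title: str) -> str:
    # Hypothetical normalizer mirroring the behaviour questioned above:
    # lowercase, drop punctuation, remove ' and ', then strip ALL spaces.
    norm = re.sub(r'[^a-z0-9 ]', '', title.lower())
    norm = norm.replace(' and ', '')
    return norm.replace(' ', '')

# Two different titles collapse to the same match key once
# word boundaries are discarded:
print(squash('Now here'))  # 'nowhere'
print(squash('Nowhere'))   # 'nowhere'
```

Keeping a single space between words (rather than deleting them) would preserve the boundary information while still normalizing case and punctuation.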
for more information, see https://pre-commit.ci
Merging + monitoring -- spoke w/ @hornc today, mostly a refactor w/ more tests
Closes #2410
This started as an attempt to figure out why some of the title processing seemed inconsistent with some title string normalization code. Trailing dots increased the number of variations in a way that did not seem deliberate, and contained duplicates.
It has turned into a clean up of all the old `catalog.merge` code, which seems only to be concerned with matching existing records when a new record is imported (add_book). The title normalisation code, `mk_norm()` and `normalize()`, is only used for matching normalisation, not display, so I have moved it out of the general `catalog.utils` and other `catalog.add_book` modules and grouped it all together in `catalog.add_book.match`. The normalisations they perform are a bit arbitrary and are very tightly bound to the specific matching algorithms (now) contained in `catalog.match`.
Now all that code should be grouped together, and all existing tests live in the same place. I've tried to uncouple all the match-specific methods out of `catalog.add_book` to reduce confusion, and added a handful more test cases, type hints, comments, and some renaming to make it all a bit clearer. I'm trying not to change the behavior at all. There was a fair bit of ineffectual complexity that I've stripped out, and it looked like some of these methods were doing more than they actually are. If I have changed anything in a minor way, it's unlikely the previous behavior was that deliberate.
I still think there is more work to do in this area. This PR is just refactoring for clarity. I have doubts that the matching algorithms as they stand are performing effectively. Having said that, they do appear to match sometimes and create new records in a way we've all been coping with for some time, so it's hard to objectively measure the effectiveness. This has been "good enough" for some time.
For lower quality imports, it's not clear what's worse, finding a match and appending junk to an existing good record, or creating a new record that doesn't interfere with existing linked MARC records.
For most of the MARC imports I have done in the past, for the use cases I'm aware of, it doesn't matter too much whether a match is found or not. In practice I know matches are frequently found, and one MARC record is generally equivalent to another, with a preference for newer imports, which have done a better job of extracting data. In the worst case, a match is missed and a brand new record is created, which has all the metadata and linking info needed. Match or no-match effectively has the same outcome: the latest good metadata is written somewhere, and it will generally be the data retrieved when needed.
I'd expect MARC imports to have a better chance of creating correctly linked and grouped Author, Work, and Edition records, but I don't think this is really being evaluated or tested directly, or even indirectly. The fallback from any gaps is relying on the wiki-like nature of OL and the community to patch things up, which is an ongoing process, and the worst problems do get patched up in this way over time.
TODO:
- `normalize()`
- `build_titles()`, it's not clear which of the variations are useful or why.
- Move `match` into `catalog.add_book`, since that is the only place matching is used: when adding a book

Technical
Testing
Screenshot
Stakeholders