Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor and move old 'catalog.merge' naming to 'catalog.add_book.match' for import record matching #8296

Merged
merged 32 commits into from
Dec 5, 2023

Conversation

hornc
Copy link
Collaborator

@hornc hornc commented Sep 16, 2023

Closes #2410

This started as an attempt to figure why some of the title processing seemed inconsistent with some title string normalization code. Trailing dots increased the number of variations in a way that did not seem deliberate, and contained duplicates.

It has turned into a clean up of all the old catalog.merge code which seems only to be concerned with matching existing records when a new record is imported (add_book).

The title normalisation code, mk_norm() and normalize() are only used for matching normalisation, not display, so I have moved them out of general catalog.utils and other catalog.add_book modules and grouped them all together in catalog.add_book.match. The normalisations they perform are a bit arbitrary and are very tightly bound to the specific matching algorithms (now) contained in catalog.match.

Now all that code should be grouped together, and all existing tests live in the same place. I've tried to uncouple all the match specific methods out of catalog.add_book to reduce confusion, and added a handful more test cases, typehints, comments, and some renaming to make it all a bit clearer. I'm trying not to change the behavior at all. There was a fair bit of ineffectual complexity that I've stripped out, and it looked like some of these methods were doing more than the actually are. If I have changed anything in a minor way, it's unlikely whatever the previous behavior was that deliberate.

I still think there is more work to do in this area. This PR is just refactoring for clarity. I have doubts that the matching algorithms as they stand are performing effectively. Having said that, they do appear to match sometimes and create new records in a way we've all been coping with for some time, so it's hard to objectively measure the effectiveness. This has been "good enough" for some time.

For lower quality imports, it's not clear what's worse, finding a match and appending junk to an existing good record, or creating a new record that doesn't interfere with existing linked MARC records.
For most of the MARC imports I have done in the past, for the usecases I'm aware of, it doesn't matter too much whether a match is found or not (in practice I know they are frequently found, and one MARC is generally equivalent to another, with a preference for newer imports which have done a better job of extracting data), but in the worst case, a match will be missed and a brand new record is created, which has all the metadata and linking info needed. Match or no-match effectively has the same outcome; the latest good metadata is written somewhere, and it will generally be the data retrieved when needed.

I'd expect MARC imports to have a better chance of creating correctly linked and grouped Author, Work, and Edition records, but I don't think this is really being evaluated or tested directly, or even indirectly. The fallback from any gaps is relying on the wiki-like nature of OL and the community to patch things up, which is an ongoing process, and the worst problems do get patched up in this way over time.

TODO:

  • add some more tests to normalize()
    • 'A Title.'
    • 'A Title. '
    • 'A Title .'
  • Try to document the intent of build_titles(), it's not clear which of the variations are useful or why.
  • rename catalog.merge to catlog.match / merge_marc -> match, because it is all about matching edition records, not MARC, and not merging
  • move all former catalog.merge code re-named to match into catalog.add_book, since that is the only place matching is used: when adding a book

Technical

Testing

Screenshot

Stakeholders

@hornc hornc marked this pull request as draft September 16, 2023 06:34
@hornc
Copy link
Collaborator Author

hornc commented Sep 16, 2023

resolved
Failing test output:

openlibrary/tests/catalog/test_utils.py:236: AssertionError

    def test_expand_record():
        # used in openlibrary.catalog.add_book.load()
        # when trying to find an existing edition match
        edition = valid_edition.copy()
        expanded_record = expand_record(edition)
        assert isinstance(expanded_record['titles'], list)
>       assert expanded_record['titles'] == [
            'A test full title : subtitle (parens).',
            'a test full title subtitle (parens)',
            'test full title : subtitle (parens).',
            'test full title subtitle (parens)',
        ]
E       AssertionError: assert ['test full t...tle (parens)'] == ['A test full...tle (parens)']
E         At index 0 diff: 'test full title subtitle (parens)' != 'A test full title : subtitle (parens).'
E         Left contains one more item: 'a test full title subtitle (parens)'
E         Use -v to get more diff

The input to this is:

valid_edition = {
    'title': 'A test title (parens)',
    'full_title': 'A test full title : subtitle (parens).',  # required, and set by add_book.load()
    'source_records': ['ia:test-source'],
}

Not sure what the input to add_book.load() to was to get this full_title -- presumably the trailing period was from the original supplied title, and title is the shortened form? Is this how add_book.load() works?

@hornc
Copy link
Collaborator Author

hornc commented Sep 16, 2023

resolved
Looks like full_title is a constructed field and is set here

def find_match(rec, edition_pool) -> str | None:
"""Use rec to try to find an existing edition key that matches."""
match = find_quick_match(rec)
if not match:
match = find_exact_match(rec, edition_pool)
if not match:
# Add 'full_title' to the rec by conjoining 'title' and 'subtitle'.
# expand_record() uses this for matching.
rec['full_title'] = rec['title']
if subtitle := rec.get('subtitle'):
rec['full_title'] += ' ' + subtitle
match = find_enriched_match(rec, edition_pool)
return match

subtitle may be provided as input to the import OR be split out from a title containing a colon.

full_title re-assembles title + subtitle in a standard way.

subtitle splitting:

def split_subtitle(full_title):
"""
Splits a title into (title, subtitle),
strips parenthetical tags. Used for bookseller
catalogs which do not pre-separate subtitles.
:param str full_title:
:rtype: (str, str | None)
:return: (title, subtitle | None)
"""
# strip parenthetical blocks wherever they occur
# can handle 1 level of nesting
re_parens_strip = re.compile(r'\(([^\)\(]*|[^\(]*\([^\)]*\)[^\)]*)\)')
clean_title = re.sub(re_parens_strip, '', full_title)
titles = clean_title.split(':')
subtitle = titles.pop().strip() if len(titles) > 1 else None
title = ISBD_UNIT_PUNCT.join([unit.strip() for unit in titles])
return (title, subtitle)

Looks like there are some naming inconsistencies around titles -- title is sometimes the original "full title" and sometimes the split fore-title .

I'm not seeing that the tests take this into account with the test data provided, which may not be realistic or possible.

@hornc
Copy link
Collaborator Author

hornc commented Sep 18, 2023

resolved in a slightly different way -- the lines have been removed

DRY:

rec2['full_title'] = existing.title
if existing.subtitle:
rec2['full_title'] += ' ' + existing.subtitle

and

rec['full_title'] = rec['title']
if subtitle := rec.get('subtitle'):
rec['full_title'] += ' ' + subtitle

Move to:

def expand_record(rec: dict) -> dict[str, str | list[str]]:

@@ -991,9 +977,11 @@ def test_find_match_is_used_when_looking_for_edition_matches(mock_site) -> None:
This also indirectly tests `merge_marc.editions_match()` (even though it's
not a MARC record.
"""
# Unfortunately this Work level author is totally irrelevant to the matching
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this just a test issue? Do live Edition objects inherit authors from their Works?

@hornc hornc added Type: Refactor/Clean-up Issues related to reorganization/clean-up of data or code (e.g. for maintainability). [managed] Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] labels Sep 18, 2023
Copy link
Contributor

@tfmorris tfmorris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is still a draft, but I'm not sure how long it'll stay open for review, so wanted to add some thoughts now. Doing a reasonable job for Amazon (or worse BWB) is likely to be an impossible task. Things that a title could be polluted with include, but are not limited to, author names, series name, words like "by" "trans." "ed.", etc. Author strings might contain the names of multiple authors mushed together, role names (e.g. trans.).

Something which should be added to the test cases (in addition to the above if you decide to attempt to address Amazon data, which I don't recommend) are some examples which include things like, ": a novel", "un roman", "ein Roman", etc for all the other languages), "Volume", "Vol."

Lastly, a rule based approach is likely to be very difficult to get right. Have you considered training a classifier for this task?

openlibrary/catalog/utils/__init__.py Outdated Show resolved Hide resolved
openlibrary/catalog/utils/__init__.py Show resolved Hide resolved
openlibrary/catalog/utils/__init__.py Show resolved Hide resolved
norm = norm[4:]
elif norm.startswith('a '):
norm = norm[2:]
norm = match.normalize(s).replace(' and ', '')
return norm.replace(' ', '')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Word boundaries are important information. It seems odd to remove them.

@hornc hornc changed the title Improve title normalization consistency Refactor and move old 'catalog.merge' naming to 'catalog.add_book.match' for import record matching Sep 19, 2023
@hornc hornc marked this pull request as ready for review October 25, 2023 00:11
@mekarpeles mekarpeles self-assigned this Oct 30, 2023
@mekarpeles
Copy link
Member

Merging + monitoring -- spoke w/ @hornc today, mostly a refactor w/ more tests

@mekarpeles mekarpeles merged commit f6a69b6 into internetarchive:master Dec 5, 2023
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] Type: Refactor/Clean-up Issues related to reorganization/clean-up of data or code (e.g. for maintainability). [managed]
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Inconsistent title processing by catalog.merge.merge_marc.build_titles()
4 participants