Refactor and move old 'catalog.merge' naming to 'catalog.add_book.match' for import record matching #8296

hornc · 2023-09-16T05:58:02Z

This started as an attempt to figure why some of the title processing seemed inconsistent with some title string normalization code. Trailing dots increased the number of variations in a way that did not seem deliberate, and contained duplicates.

It has turned into a clean up of all the old catalog.merge code which seems only to be concerned with matching existing records when a new record is imported (add_book).

The title normalisation code, mk_norm() and normalize() are only used for matching normalisation, not display, so I have moved them out of general catalog.utils and other catalog.add_book modules and grouped them all together in catalog.add_book.match. The normalisations they perform are a bit arbitrary and are very tightly bound to the specific matching algorithms (now) contained in catalog.match.

Now all that code should be grouped together, and all existing tests live in the same place. I've tried to uncouple all the match specific methods out of catalog.add_book to reduce confusion, and added a handful more test cases, typehints, comments, and some renaming to make it all a bit clearer. I'm trying not to change the behavior at all. There was a fair bit of ineffectual complexity that I've stripped out, and it looked like some of these methods were doing more than the actually are. If I have changed anything in a minor way, it's unlikely whatever the previous behavior was that deliberate.

I still think there is more work to do in this area. This PR is just refactoring for clarity. I have doubts that the matching algorithms as they stand are performing effectively. Having said that, they do appear to match sometimes and create new records in a way we've all been coping with for some time, so it's hard to objectively measure the effectiveness. This has been "good enough" for some time.

For lower quality imports, it's not clear what's worse, finding a match and appending junk to an existing good record, or creating a new record that doesn't interfere with existing linked MARC records.
For most of the MARC imports I have done in the past, for the usecases I'm aware of, it doesn't matter too much whether a match is found or not (in practice I know they are frequently found, and one MARC is generally equivalent to another, with a preference for newer imports which have done a better job of extracting data), but in the worst case, a match will be missed and a brand new record is created, which has all the metadata and linking info needed. Match or no-match effectively has the same outcome; the latest good metadata is written somewhere, and it will generally be the data retrieved when needed.

I'd expect MARC imports to have a better chance of creating correctly linked and grouped Author, Work, and Edition records, but I don't think this is really being evaluated or tested directly, or even indirectly. The fallback from any gaps is relying on the wiki-like nature of OL and the community to patch things up, which is an ongoing process, and the worst problems do get patched up in this way over time.

TODO:

add some more tests to normalize()
- 'A Title.'
- 'A Title. '
- 'A Title .'
Try to document the intent of build_titles(), it's not clear which of the variations are useful or why.
rename catalog.merge to catlog.match / merge_marc -> match, because it is all about matching edition records, not MARC, and not merging
move all former catalog.merge code re-named to match into catalog.add_book, since that is the only place matching is used: when adding a book

Technical

Testing

Screenshot

Stakeholders

hornc · 2023-09-16T06:42:47Z

resolved
Failing test output:

openlibrary/tests/catalog/test_utils.py:236: AssertionError

    def test_expand_record():
        # used in openlibrary.catalog.add_book.load()
        # when trying to find an existing edition match
        edition = valid_edition.copy()
        expanded_record = expand_record(edition)
        assert isinstance(expanded_record['titles'], list)
>       assert expanded_record['titles'] == [
            'A test full title : subtitle (parens).',
            'a test full title subtitle (parens)',
            'test full title : subtitle (parens).',
            'test full title subtitle (parens)',
        ]
E       AssertionError: assert ['test full t...tle (parens)'] == ['A test full...tle (parens)']
E         At index 0 diff: 'test full title subtitle (parens)' != 'A test full title : subtitle (parens).'
E         Left contains one more item: 'a test full title subtitle (parens)'
E         Use -v to get more diff

The input to this is:

valid_edition = {
    'title': 'A test title (parens)',
    'full_title': 'A test full title : subtitle (parens).',  # required, and set by add_book.load()
    'source_records': ['ia:test-source'],
}

Not sure what the input to add_book.load() to was to get this full_title -- presumably the trailing period was from the original supplied title, and title is the shortened form? Is this how add_book.load() works?

hornc · 2023-09-16T06:59:04Z

resolved
Looks like full_title is a constructed field and is set here

openlibrary/openlibrary/catalog/add_book/__init__.py

Lines 827 to 842 in c561294

    
           def find_match(rec, edition_pool) -> str | None: 
        
               """Use rec to try to find an existing edition key that matches.""" 
        
               match = find_quick_match(rec) 
        
               if not match: 
        
                   match = find_exact_match(rec, edition_pool) 
        
               if not match: 
        
                   # Add 'full_title' to the rec by conjoining 'title' and 'subtitle'. 
        
                   # expand_record() uses this for matching. 
        
                   rec['full_title'] = rec['title'] 
        
                   if subtitle := rec.get('subtitle'): 
        
                       rec['full_title'] += ' ' + subtitle 
        
                   match = find_enriched_match(rec, edition_pool) 
        
               return match

subtitle may be provided as input to the import OR be split out from a title containing a colon.

full_title re-assembles title + subtitle in a standard way.

subtitle splitting:

openlibrary/openlibrary/catalog/add_book/__init__.py

Lines 182 to 201 in c561294

    
           def split_subtitle(full_title): 
        
               """ 
        
               Splits a title into (title, subtitle), 
        
               strips parenthetical tags. Used for bookseller 
        
               catalogs which do not pre-separate subtitles. 
        
               :param str full_title: 
        
               :rtype: (str, str | None) 
        
               :return: (title, subtitle | None) 
        
               """ 
        
               # strip parenthetical blocks wherever they occur 
        
               # can handle 1 level of nesting 
        
               re_parens_strip = re.compile(r'\(([^\)\(]*|[^\(]*\([^\)]*\)[^\)]*)\)') 
        
               clean_title = re.sub(re_parens_strip, '', full_title) 
        
               titles = clean_title.split(':') 
        
               subtitle = titles.pop().strip() if len(titles) > 1 else None 
        
               title = ISBD_UNIT_PUNCT.join([unit.strip() for unit in titles]) 
        
               return (title, subtitle)

Looks like there are some naming inconsistencies around titles -- title is sometimes the original "full title" and sometimes the split fore-title .

I'm not seeing that the tests take this into account with the test data provided, which may not be realistic or possible.

hornc · 2023-09-18T06:32:14Z

resolved in a slightly different way -- the lines have been removed

DRY:

openlibrary/openlibrary/catalog/add_book/match.py

Lines 42 to 44 in c561294

    
           rec2['full_title'] = existing.title 
        
           if existing.subtitle: 
        
               rec2['full_title'] += ' ' + existing.subtitle

and

openlibrary/openlibrary/catalog/add_book/__init__.py

Lines 836 to 838 in c561294

    
           rec['full_title'] = rec['title'] 
        
           if subtitle := rec.get('subtitle'): 
        
               rec['full_title'] += ' ' + subtitle

Move to:

openlibrary/openlibrary/catalog/utils/__init__.py

Line 295 in c561294

def expand_record(rec: dict) -> dict[str, str | list[str]]:

hornc · 2023-09-18T09:38:05Z

openlibrary/catalog/add_book/tests/test_add_book.py

@@ -991,9 +977,11 @@ def test_find_match_is_used_when_looking_for_edition_matches(mock_site) -> None:
    This also indirectly tests `merge_marc.editions_match()` (even though it's
    not a MARC record.
    """
+    # Unfortunately this Work level author is totally irrelevant to the matching


Is this just a test issue? Do live Edition objects inherit authors from their Works?

openlibrary/catalog/utils/__init__.py

tfmorris

I know this is still a draft, but I'm not sure how long it'll stay open for review, so wanted to add some thoughts now. Doing a reasonable job for Amazon (or worse BWB) is likely to be an impossible task. Things that a title could be polluted with include, but are not limited to, author names, series name, words like "by" "trans." "ed.", etc. Author strings might contain the names of multiple authors mushed together, role names (e.g. trans.).

Something which should be added to the test cases (in addition to the above if you decide to attempt to address Amazon data, which I don't recommend) are some examples which include things like, ": a novel", "un roman", "ein Roman", etc for all the other languages), "Volume", "Vol."

Lastly, a rule based approach is likely to be very difficult to get right. Have you considered training a classifier for this task?

openlibrary/catalog/utils/__init__.py

tfmorris · 2023-09-18T17:44:09Z

openlibrary/catalog/utils/__init__.py

-        norm = norm[4:]
-    elif norm.startswith('a '):
-        norm = norm[2:]
+    norm = match.normalize(s).replace(' and ', '')
    return norm.replace(' ', '')


Word boundaries are important information. It seems odd to remove them.

openlibrary/catalog/add_book/match.py

closes internetarchive#2410

for more information, see https://pre-commit.ci

mekarpeles · 2023-12-05T21:38:48Z

Merging + monitoring -- spoke w/ @hornc today, mostly a refactor w/ more tests

hornc force-pushed the title_tests branch from a4e0980 to 0b01b62 Compare September 16, 2023 06:03

hornc marked this pull request as draft September 16, 2023 06:34

hornc force-pushed the title_tests branch from 18d7342 to c5fed82 Compare September 18, 2023 04:22