Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix MARC records incorrectly matching ISBN promise item records #9839

Merged
merged 14 commits into from
Sep 9, 2024

Conversation

hornc
Copy link
Collaborator

@hornc hornc commented Sep 2, 2024

Closes #9808
Closes #9794

After some investigation, this PR:

  • removes the exact match check that was matching existing light records with only title + ISBN against MARC records only on the title field, and skipped the more intelligent threshold match method that at least tries to score matches.
  • Fixes a longstanding bug where authors were only being compared on edition level authors, even though in general authors only exist on Work records in OL. This problem was masked by most matches occurring on direct ISBN, or title string only matches.
  • Renames the method to find_threshold_match() and makes sure it is used for performing edition matches
  • Adds a specific test for MARC imports w/o ISBN should never match light Title + ISBN, undated and no-author bookseller sourced records #9808
  • Fixes existing test data which had incorrect author data formats, and did not link editions to works. These issues were masked by the previous title only matching.

Investigation notes

The THRESHOLD checking code seems to do the right thing, but from following through the add_book load code based on your initial investigation in #9794, that threshold checking isn't used for this import flow. If there is a title match, it looks like it counts as an absolute match, and there is also specific code to trigger promise item records getting overwritten by MARC records.

At this stage I'm not even sure that that special case is being triggered here; the record might get added to anyway?

I'm not sure where the 'don't count the match if existing record has an ISBN' code should go. I'm still trying to make sense out of why the threshold code, which appears to be the right thing, isn't being used here.

Feel free to review and let me know if you have any ideas on how to improve this.

Technical

Testing

Screenshot

Stakeholders

@scottbarnes

unfortunately this test demopnstrates the correct result
of not matching the light record, so it's still unclear why a dated
record matched an undated record with an ISBN.
More investigation is required.
@hornc hornc marked this pull request as draft September 2, 2024 23:52
@hornc
Copy link
Collaborator Author

hornc commented Sep 5, 2024

Following through how and where the threshold match code is used:

  • add_book/match.py editions_match() uses threshold_match() (Only 1 usage)
  • add_book/init.py find_enriched_match() uses editions_match() (the only use)
  • find_enriched_match() is used in add_book/init.py find_match()

in a three stage check:

find_quick_match(rec)
else:
find_exact_match(rec, edition_pool)
else:
find_enriched_match(rec, edition_pool)

def find_match(rec, edition_pool) -> str | None:
"""Use rec to try to find an existing edition key that matches."""
match = find_quick_match(rec)
if not match:
match = find_exact_match(rec, edition_pool)
if not match:
match = find_enriched_match(rec, edition_pool)
return match

which is effectively:

def find_match(rec: dict, edition_pool: dict) -> str | None:
    return find_quick_match(rec) or \ 
           find_exact_match(rec, edition_pool) or \
           find_enriched_match(rec, edition_pool)

changing the return type to str | None rather than str | bool
because previously find_match() -> str | None | bool
to confirm the refactor hasn't broken anything (tests should pass again)
it's failing my tests on whitespace, tests which pass
locally, and makes less than ideal 'suggestions' as commits
@hornc hornc marked this pull request as ready for review September 6, 2024 02:30
@hornc hornc changed the title Investigation for MARC record incorrectly matching ISBN promise item records Fix MARC records incorrectly matching ISBN promise item records Sep 6, 2024
@hornc hornc requested a review from scottbarnes September 6, 2024 02:47
Copy link
Collaborator

@scottbarnes scottbarnes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. I also double checked by running the original import from #9794 locally, and the the correct edition is matched with the marc_columbia import.

@scottbarnes scottbarnes merged commit 1894cb4 into internetarchive:master Sep 9, 2024
4 checks passed
@hornc hornc deleted the ISSUE9808 branch September 9, 2024 23:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants