MARC imports w/o ISBN should never match light Title + ISBN, undated and no-author bookseller sourced records #9808

Closed
hornc opened this issue Aug 26, 2024 · 0 comments · Fixed by #9839
Assignees
Labels
Affects: Data Issues that affect book/author metadata or user/account data. [managed] Lead: @hornc Issues overseen by Charles (Staff: Data Engineering Lead) [managed] Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] Theme: MARC records Type: Feature Request Issue describes a feature or enhancement we'd like to implement. [managed]

Comments

hornc commented Aug 26, 2024

There are cases where the Edition Matching Process for Open Library results in matches that we don't want.

  • For instance, when performing a MARC import of a record w/ no ISBN, we should not approve a title-only match (where there is no author available or matched, and no date available or matched) with an edition that has an ISBN. If the titles are similar but the books are different, we're likely to incorrectly associate that MARC info with the wrong edition & ISBN.

Proposal

From Slack conversation between @hornc and @seabelis :

“Perhaps pre-ISBN dated imports should never match existing records with ISBNs,”
Yes! They absolutely should not be matched.

It looks like this relates to the recent #9794 issue / question.

Background

examples:

The problem of light ISBN-only records is meant to be addressed in #9440, but we still have to deal with these coming in, and many are already in place, so we need to keep these light records from being conflated with fuller but incorrect metadata.

The import system hasn't really been designed to expect missing dates and authors as standard, and does not have any logic to deal with ISBN-only records.

Impact

The main problem is that pre-ISBN / pre-copyright dates are being added to ISBN records, and we want to prevent this from happening.

We can assume that a book with an ISBN was published later than a book without an ISBN.

Proposal

There are probably a number of ways we can do this, and keeping it simple would be better. As a starting proposal:

When matching for existing editions,

A found existing match with no date, but an ISBN, should be REJECTED if the import record does not have an ISBN.

In this situation, the found record will be a 'title only match',
so 'title only matches' with ISBN should never match an import record without an ISBN.

(There could be a way to implement this with date ranges, but then we'd be making heuristic date range guesses.
We may want to add an overly cautious date sanity check, but I'm not completely sure what to use. I think there are books published in the 1950s that were assigned SBNs / ISBNs in the late 1960s / 1970s on subsequent reprints, and whether these are always treated as different editions on OL seems flexible.)

Contrived simple example

  • Existing: "Just a Title", unset date, unset author, unset publisher, ISBN: 978-000000000-2
  • Import: "Just a Title", 1913, Bob Smith, Publisher: Madeup Editions

Should not match.
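
The proposed rule, applied to this contrived pair, can be sketched as follows. This is a minimal illustration and not Open Library's actual matching code: records are plain dicts, and the field names (`isbn`, `publish_date`, `authors`, `publishers`) and the function `reject_light_isbn_match` are hypothetical names chosen for the example.

```python
def has_value(record: dict, field: str) -> bool:
    """Return True if the field is present and non-empty."""
    return bool(record.get(field))


def reject_light_isbn_match(import_rec: dict, existing: dict) -> bool:
    """Hypothetical check for the proposed rule.

    Reject a candidate when the existing record has an ISBN but no date,
    and the import record has no ISBN.
    """
    existing_has_isbn = has_value(existing, "isbn")
    existing_has_date = has_value(existing, "publish_date")
    import_has_isbn = has_value(import_rec, "isbn")
    return existing_has_isbn and not existing_has_date and not import_has_isbn


# The contrived pair: a light existing record and a fuller, ISBN-less import.
existing = {"title": "Just a Title", "isbn": ["978-000000000-2"]}
import_rec = {
    "title": "Just a Title",
    "publish_date": "1913",
    "authors": ["Bob Smith"],
    "publishers": ["Madeup Editions"],
}

print(reject_light_isbn_match(import_rec, existing))  # True: candidate rejected
```

The check only looks at which fields are present, so it makes no date range guesses. An import that carries its own ISBN, or an existing record that has a date, falls through to whatever matching logic already applies.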

Currently it appears that they do, or could, match.

The current code already has threshold calculations, so there may be a way to get this effect by tweaking existing parameters without adding specific checks for this case.
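
To illustrate how a threshold alone could produce this effect, here is a toy scoring sketch. The weights, the threshold value, and the function names are all invented for illustration and are not taken from the Open Library code. The only point is that if agreement on title alone scores below the threshold, a light record cannot match without at least one corroborating field.

```python
# Hypothetical weights: points awarded when a field is present in both
# records and agrees. None of these values come from Open Library.
WEIGHTS = {"title": 600, "authors": 200, "publish_date": 200, "isbn": 400}
THRESHOLD = 700  # hypothetical: deliberately above the title-only score


def score(import_rec: dict, existing: dict) -> int:
    """Sum the weights of fields that are set in both records and equal."""
    total = 0
    for field, weight in WEIGHTS.items():
        a, b = import_rec.get(field), existing.get(field)
        if a and b and a == b:
            total += weight
    return total


def is_match(import_rec: dict, existing: dict) -> bool:
    """A candidate matches only if its score reaches the threshold."""
    return score(import_rec, existing) >= THRESHOLD


import_rec = {"title": "Just a Title", "authors": ["Bob Smith"], "publish_date": "1913"}
light_existing = {"title": "Just a Title", "isbn": "978-000000000-2"}
full_existing = {"title": "Just a Title", "authors": ["Bob Smith"], "publish_date": "1913"}

print(is_match(import_rec, light_existing))  # False: title-only score is 600
print(is_match(import_rec, full_existing))   # True: score is 1000
```

A threshold of this kind is blunter than an explicit check: it rejects every title-only match, not just those where the existing record has an ISBN.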

@hornc hornc added Type: Feature Request Issue describes a feature or enhancement we'd like to implement. [managed] Theme: MARC records Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] Needs: Breakdown This big issue needs a checklist or subissues to describe a breakdown of work. [managed] Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] Affects: Data Issues that affect book/author metadata or user/account data. [managed] Lead: @hornc Issues overseen by Charles (Staff: Data Engineering Lead) [managed] labels Aug 26, 2024
@mekarpeles mekarpeles changed the title MARC imports w/o ISBN should never match light Title + ISBN, undated and no-author bookseller sourced records Improve Edition Matching Process Aug 28, 2024
@hornc hornc changed the title Improve Edition Matching Process MARC imports w/o ISBN should never match light Title + ISBN, undated and no-author bookseller sourced records Aug 28, 2024
@hornc hornc removed the Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] label Aug 29, 2024
@hornc hornc self-assigned this Sep 2, 2024
hornc added a commit to hornc/openlibrary-1 that referenced this issue Sep 2, 2024
unfortunately this test demonstrates the correct result
of not matching the light record, so it's still unclear why a dated
record matched an undated record with an ISBN.
More investigation is required.
hornc added a commit to hornc/openlibrary-1 that referenced this issue Sep 2, 2024
scottbarnes pushed a commit that referenced this issue Sep 9, 2024
* add a test for #9808 and #9794
* add failing find_match() test for #9808 and #9794
* simplify find_match() 
* rename find_enriched_match() to find_threshold_match()
@scottbarnes scottbarnes removed the Needs: Breakdown This big issue needs a checklist or subissues to describe a breakdown of work. [managed] label Sep 9, 2024