MARC imports w/o ISBN should never match light Title + ISBN, undated and no-author bookseller sourced records #9808
Labels
Affects: Data
Issues that affect book/author metadata or user/account data. [managed]
Lead: @hornc
Issues overseen by Charles (Staff: Data Engineering Lead) [managed]
Module: Import
Issues related to the configuration or use of importbot and other bulk import systems. [managed]
Theme: MARC records
Type: Feature Request
Issue describes a feature or enhancement we'd like to implement. [managed]
There are cases where the Edition Matching Process for Open Library results in matches that we don't want.
Proposal
From a Slack conversation between @hornc and @seabelis:
It looks like this relates to the recent #9794 issue / question.
Background
examples:
The problem of light ISBN-only records is meant to be addressed in #9440, but we still have to deal with new ones coming in, and many are already in place, so it seems we need to stop these light records from being conflated with fuller but incorrect metadata.
The import system was not designed to expect missing dates and authors as standard, and it has no logic for dealing with ISBN-only records.
Impact
The main problem is that pre-ISBN / pre-copyright dates are being added to ISBN records, and we want to prevent this from happening.
We can assume that a book with an ISBN was published later than a book without an ISBN.
Proposal
There are probably a number of ways we can do this, and keeping it simple would be better. As a starting proposal:
When matching for existing editions,
A found existing match with no date, but an ISBN, should be REJECTED if the import record does not have an ISBN.
In this situation, the found record will be a 'title only match',
so 'title only matches' with ISBN should never match an import record without an ISBN.
(There could be a way to implement this with date ranges, but then we'd be making heuristic date range guesses.
We may want to add an overly cautious date sanity check, but I'm not completely sure what to use. I think there are books published in the 1950s that were assigned SBNs / ISBNs in the late 1960s / 1970s on subsequent reprints, and whether these are always treated as different editions on OL seems flexible.)
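The proposed rule could be sketched roughly as below. This is only an illustration under assumptions, not the existing import code: the function names `has_isbn` and `should_reject_match` are hypothetical, and the record shape (plain dicts with `isbn_10`, `isbn_13` and `publish_date` keys) is assumed for the sake of the example.

```python
from typing import Any, Mapping

# Assumed field names for ISBNs on both import records and existing editions.
ISBN_FIELDS = ("isbn_10", "isbn_13")


def has_isbn(record: Mapping[str, Any]) -> bool:
    """Return True if the record carries at least one non-empty ISBN field."""
    return any(record.get(field) for field in ISBN_FIELDS)


def should_reject_match(import_rec: Mapping[str, Any],
                        existing: Mapping[str, Any]) -> bool:
    """Hypothetical check for the rule proposed above.

    Reject when the existing record has an ISBN but no date
    (a light 'title only match') and the import record has no ISBN.
    """
    existing_is_light = has_isbn(existing) and not existing.get("publish_date")
    return existing_is_light and not has_isbn(import_rec)
```

A check like this would run before any threshold scoring, so an undated ISBN record is never considered as a candidate for an ISBN-less import.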
Contrived simple example
Should not match.
Currently it appears that they do, or could match.
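For illustration, a pair of records of the kind described might look like the following. The titles, ISBN, author and date are made up placeholders, and the dict shape and field names are assumptions rather than the actual records from the discussion.

```python
# Hypothetical existing edition: light bookseller-sourced record,
# title + ISBN only, with no date and no author.
existing_edition = {
    "title": "Collected Poems",
    "isbn_13": ["9780000000002"],
}

# Hypothetical incoming MARC-derived record: fuller metadata,
# a pre-ISBN publication date, and no ISBN.
import_record = {
    "title": "Collected Poems",
    "authors": [{"name": "Example Author"}],
    "publish_date": "1899",
}

# The only thing the two records share is the title.
titles_agree = existing_edition["title"] == import_record["title"]

# The condition the proposal wants to treat as a non-match:
# existing record has an ISBN but no date, import record has no ISBN.
isbn_mismatch = (
    bool(existing_edition.get("isbn_13"))
    and not existing_edition.get("publish_date")
    and not import_record.get("isbn_13")
)
```

If these two were merged, the 1899 date and author would be attached to the ISBN record, which is the outcome described under Impact.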
The current code already has threshold calculations, so there may be a way to get this effect by tweaking existing parameters without adding a specific check for this case.