Data Cleanup: Populate blank publisher fields from ISBN prefix when known #2119

Closed
LeadSongDog opened this issue May 9, 2019 · 14 comments
Labels
Affects: Data Issues that affect book/author metadata or user/account data. [managed] Lead: @scottbarnes Issues overseen by Scott (Community Imports) Priority: 3 Issues that we can consider at our leisure. [managed] Theme: Identifiers Issues related to ISBN's or other identifiers in metadata. [managed] Theme: Publishers Type: Refactor/Clean-up Issues related to reorganization/clean-up of data or code (e.g. for maintainability). [managed]

Comments

@LeadSongDog

Many edition records have no publisher shown, but do have an ISBN. Previous discussion at #895 shows how to get an official spelling of the publisher from the ISBN.

@brad2014 brad2014 added the Theme: Identifiers Issues related to ISBN's or other identifiers in metadata. [managed] label May 10, 2019
@brad2014 brad2014 added Affects: Data Issues that affect book/author metadata or user/account data. [managed] Type: Feature Request Issue describes a feature or enhancement we'd like to implement. [managed] Theme: Identifiers Issues related to ISBN's or other identifiers in metadata. [managed] Type: Refactor/Clean-up Issues related to reorganization/clean-up of data or code (e.g. for maintainability). [managed] and removed Theme: Identifiers Issues related to ISBN's or other identifiers in metadata. [managed] Type: Feature Request Issue describes a feature or enhancement we'd like to implement. [managed] labels May 10, 2019
@hornc
Collaborator

hornc commented Jun 26, 2019

To give an idea of the scope of this:

In the May 2019 edition dump there are

grep -cv '"publishers":' ol_dump_editions_2019-05-31.txt
1,189,309
editions without publishers. 169,759 of those have ISBNs.
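
The same two counts can be sketched in Python instead of grep. This is a minimal sketch, assuming the dump is tab-separated with the edition JSON in the last column, and that the relevant fields are named `publishers`, `isbn_10`, and `isbn_13`. The function name and the sample rows are hypothetical. Unlike the grep above, it also treats an empty `publishers` list as missing.

```python
import json


def count_missing_publishers(lines):
    """Return (editions without publishers, how many of those have an ISBN)."""
    no_publisher = 0
    no_publisher_with_isbn = 0
    for line in lines:
        # Assumption: five tab-separated columns, with the JSON record last.
        record = json.loads(line.rstrip("\n").split("\t", 4)[-1])
        if record.get("publishers"):
            continue  # a publisher is already recorded
        no_publisher += 1
        if record.get("isbn_13") or record.get("isbn_10"):
            no_publisher_with_isbn += 1
    return no_publisher, no_publisher_with_isbn


# Made-up rows in the assumed layout, for illustration only.
sample = [
    '/type/edition\t/books/OL1M\t1\t2019-05-01\t{"title": "A", "publishers": ["X"]}',
    '/type/edition\t/books/OL2M\t1\t2019-05-01\t{"title": "B", "isbn_13": ["9780618399413"]}',
    '/type/edition\t/books/OL3M\t1\t2019-05-01\t{"title": "C"}',
]
print(count_missing_publishers(sample))
```

On a real dump, the file can be opened and passed in directly, since the function only iterates over lines.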

@xayhewalo xayhewalo added Priority: 3 Issues that we can consider at our leisure. [managed] State: Backlogged Theme: Publishers labels Nov 15, 2019
@mekarpeles mekarpeles added the Lead: @hornc Issues overseen by Charles (Staff: Data Engineering Lead) [managed] label Dec 18, 2019
@hornc hornc removed their assignment Jan 14, 2020
@LeadSongDog
Author

LeadSongDog commented Aug 13, 2024

So #895 just closed without changing code, but the underlying idea here, to exploit ISBN prefixes to fill in blank publisher fields, still has potential to quickly improve data. Can we revisit it? I see no reason why the prefixes could not become readily searchable, yielding a standard spelling for the publisher, and even an approximate year of publication (as adjacent ISBNs will usually be assigned in the same year).

@github-actions github-actions bot added the Needs: Response Issues which require feedback from lead label Aug 14, 2024
@mekarpeles mekarpeles added Lead: @scottbarnes Issues overseen by Scott (Community Imports) and removed Lead: @hornc Issues overseen by Charles (Staff: Data Engineering Lead) [managed] labels Aug 20, 2024
@scottbarnes
Collaborator

@LeadSongDog, can you tell me a bit more about the strategy here?

I tried the links in #895, but they no longer seem to work.

Using https://openlibrary.org/books/OL3697910M/General_chemistry as an example, it has ISBN 13 9780618399413, but, to ask a stupid question, how do I determine the prefix? I tried querying for 978-0-618 and saw a lot of publishers, some of which match the already listed publisher, and I see that not every prefix is the same length: https://grp.isbn-international.org/search/piid_solr?keys=978-0-618+%28ISBNPrefix%29.

My goal is to understand the process so the issue can be better broken down into steps that can be used to close the issue.

@scottbarnes scottbarnes removed the Needs: Response Issues which require feedback from lead label Aug 27, 2024
@tfmorris
Contributor

To update the 5 year old numbers above #2119 (comment), there are currently 2281048 (2.3M) editions without publishers, double the number from 5 years ago, and 513197 (0.5M) of those have ISBNs.

@LeadSongDog
Author

Only the vaguest ideas on implementation, but…

One might start from an edition record with isbn but not a publisher identified.
Searching on minimally truncated versions of an isbn returns something like these:

https://openlibrary.org/search?q=isbn%3A+97806183994*&mode=everything or https://openlibrary.org/search?q=isbn%3A+9780618399*&mode=everything or https://openlibrary.org/search?q=isbn%3A+978061839*&mode=everything

A quick comparison of the results shows that the closest ISBNs had the most similarities, even revealing variant spellings for the publisher and authors. The shortest (first) list above includes several spellings for Houghton Mifflin seen at Q390074.

It might be simpler to start from a dump of editions, then sort on isbn?
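
The "start from a dump and group by ISBN" idea might look roughly like the sketch below. It is only an illustration, assuming edition records are already parsed into dicts with `isbn_13` and `publishers` lists. The function names, the fixed nine-digit truncation, and all sample ISBNs except 9780618399413 are hypothetical. A fixed-length truncation is not the same thing as a registered ISBN prefix, so any suggestion it produces would still need review.

```python
from collections import Counter, defaultdict


def publishers_by_isbn_stem(editions, stem_length=9):
    """Map each truncated ISBN-13 to a tally of publisher spellings seen under it."""
    tallies = defaultdict(Counter)
    for edition in editions:
        for isbn in edition.get("isbn_13", []):
            for name in edition.get("publishers", []):
                tallies[isbn[:stem_length]][name] += 1
    return tallies


def suggest_publisher(isbn, tallies, stem_length=9):
    """Return the most common spelling under the same stem, or None if unseen."""
    counter = tallies.get(isbn[:stem_length])
    if not counter:
        return None
    return counter.most_common(1)[0][0]


# Made-up neighbouring editions that share the stem 978061839.
editions = [
    {"isbn_13": ["9780618390001"], "publishers": ["Houghton Mifflin"]},
    {"isbn_13": ["9780618390002"], "publishers": ["Houghton Mifflin"]},
    {"isbn_13": ["9780618390003"], "publishers": ["Houghton Mifflin Co."]},
]
tallies = publishers_by_isbn_stem(editions)
print(suggest_publisher("9780618399413", tallies))
```

Keeping the full tally, rather than only the winner, preserves the variant spellings mentioned above for a human to compare.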

@github-actions github-actions bot added the Needs: Response Issues which require feedback from lead label Aug 28, 2024
@scottbarnes
Collaborator

I know I am being a bit slow here, but the part that isn't fully clear to me is how to get from the ISBN to the publisher, or at least to the prefix. I think we might need to be able to do that on a large scale to do it on the data dump.

@scottbarnes scottbarnes removed the Needs: Response Issues which require feedback from lead label Aug 29, 2024
@hornc
Collaborator

hornc commented Aug 29, 2024

@scottbarnes to answer the 'how to get the prefix', you can use the isbnlib in Python:

>>> import isbnlib
>>> isbnlib.mask('9780618399413')
'978-0-618-39941-3'

and take all but the last two groups (the publication number and the check digit), which leaves the prefix 978-0-618.

The prefixes are assigned by a registry, and the ranges are updated every so often. isbnlib keeps these relatively up-to-date.
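
Dropping the last two groups can be sketched as a small helper. To stay self-contained it takes the hyphenated string that `isbnlib.mask` produced (as in the session above) rather than calling isbnlib itself; the function name is hypothetical.

```python
def registrant_prefix(masked_isbn):
    """Drop the last two hyphen-separated groups (publication number, check digit)."""
    groups = masked_isbn.split("-")
    if len(groups) < 3:
        raise ValueError(f"expected a hyphenated ISBN, got {masked_isbn!r}")
    return "-".join(groups[:-2])


print(registrant_prefix("978-0-618-39941-3"))  # 978-0-618
```

The length check is there so that an unhyphenated or empty string fails loudly instead of silently yielding an empty prefix.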

Your example for 978-0-618 is interesting. I'm surprised that it returns "Clarion Books" (a children's book publisher) as well as the more correct-looking Houghton Mifflin.

I was going to say that doesn't make sense to me, unless it is different imprint levels that are owned by the same parent company, but that might be what is happening here. Clarion Books is owned by Harper Collins, and Harper Collins has bought Houghton Mifflin at some point, so maybe that range has been transferred? This might make it more difficult to extract the original publisher (because they keep eating each other).

It seemed like a reasonable approach to extract publishers from ISBN prefixes, but your example seems to show that this can change over time. Even without this complicating factor, an ISBN prefix lookup might give results at a different imprint level, which may not be that useful for someone searching for bibliographic publisher metadata.

That leads to the question: what use-case does populating publisher from ISBN serve?

OL doesn't use publisher to disambiguate between books, primarily because publisher is just an uncontrolled string, and there are so many variations in forms and imprints that it doesn't add much. Any item with an ISBN already has that as a less ambiguous ID for any kind of lookup.

The publisher string determined by an ISBN lookup is quite likely not to appear on the book in that form.

@LeadSongDog
Author

LeadSongDog commented Aug 29, 2024

Particularly for textbooks, some work titles, such as « Chemistry » or « Calculus », are reused by multiple works. To merge the work records, they must be disambiguated too. By identifying the publisher of an edition, it is often possible to (A) determine more completely the author(s) for each work (as cataloguers and online merchants often list only author surnames) and (B) determine which synonymous work the edition is from. Work-merging will still require the merged work records to agree on linked authors.

@tfmorris
Contributor

Like @hornc, I'm suspicious of this approach. It doesn't seem like a reliable way to source metadata. The original problem (no publisher stated) is created by using poor quality metadata to start with, so let's not compound it.

To take a random example from early in a recent edition dump https://openlibrary.org/books/OL11812091M
It was originally imported from a threadbare Amazon page, which is where the trouble started https://www.amazon.com/gp/product/0976511037
The ISBN prefix is registered to "University Book Exchange" https://grp.isbn-international.org/search/piid_solr?keys=978-0-9765110
but the copyright page lists "Independent Press," https://archive.org/details/whengreekgoatssi0000ceru/page/n3/mode/1up
the same thing that WorldCat has: https://search.worldcat.org/title/437125764?tab=details
and which appears in the MARC record that IA has associated with it: https://ia803401.us.archive.org/fetchmarc.php?path=/0/items/whengreekgoatssi0000ceru/whengreekgoatssi0000ceru_marc.xml

I'm not sure why the IA MARC record wasn't used to populate the publisher, but it strikes me that even if it weren't available, using WorldCat would be better than trying to guess from the ISBN.
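
The example above moves between the ISBN-10 in the Amazon link (0976511037) and a registry search keyed on the 978-prefixed form. A minimal sketch of that conversion, using the standard ISBN-13 check digit rule (alternating weights of 1 and 3); the function name is hypothetical:

```python
def isbn10_to_isbn13(isbn10):
    """Convert an ISBN-10 to its 978-prefixed ISBN-13 equivalent."""
    digits = isbn10.replace("-", "")
    if len(digits) != 10:
        raise ValueError(f"expected 10 characters, got {isbn10!r}")
    # Drop the old check digit and prepend the 978 prefix element.
    core = "978" + digits[:9]
    # Weights alternate 1, 3, 1, 3, ... across the twelve digits.
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(core))
    return core + str((10 - total % 10) % 10)


print(isbn10_to_isbn13("0976511037"))  # 9780976511038
```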

@LeadSongDog
Author

@tfmorris That’s an interesting example.

Of course I agree that low quality metadata sources (Goodreads, AMZ, BwB) should not be amplified, but that still seems to be accepted OL practice. I’d prefer to simply delete what can’t be verified, but I can’t. Would you prefer to have us just leave the mess rather than clean it up?

The Promise record was attached to the edition long after the book had been scanned into IA: https://openlibrary.org/books/OL11812091M/When_Greek_Goats_Sing_Sad_Songs?_compare=Compare&b=5&a=4&m=diff
While the scan correctly showed « Independent Press » as the publisher, the Promise record incorrectly showed the ECU bookstore « University Book Exchange ».

The problem was aggravated in that the low quality source (Promise) was allowed to overwrite the high quality source (scan). There ought to be logic preventing this.
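
One shape such logic could take is sketched below. It is purely illustrative and is not how Open Library's importer works: the source names, their ranking, and the function name are all assumptions made for this example.

```python
# Hypothetical trust ranking: a lower number means a more trusted source.
# This is an assumption for illustration, not Open Library's actual policy.
SOURCE_RANK = {"ia": 0, "marc": 1, "amazon": 2, "promise": 3}


def merge_field(existing_value, existing_source, new_value, new_source):
    """Return the (value, source) pair to keep for a single field."""
    if not new_value:
        return existing_value, existing_source  # nothing to add
    if not existing_value:
        return new_value, new_source  # filling a blank is always allowed
    unknown = len(SOURCE_RANK)  # unlisted sources rank below every listed one
    if SOURCE_RANK.get(new_source, unknown) < SOURCE_RANK.get(existing_source, unknown):
        return new_value, new_source  # only a more trusted source may overwrite
    return existing_value, existing_source


kept = merge_field(["Independent Press"], "ia", ["University Book Exchange"], "promise")
print(kept)
```

With this ranking, the scan-derived publisher survives a later Promise import, while a blank field can still be filled from any source.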

@github-actions github-actions bot added the Needs: Response Issues which require feedback from lead label Aug 30, 2024
@scottbarnes
Collaborator

scottbarnes commented Aug 30, 2024

For my part, I am not convinced we can, with confidence, get from the ISBN to the correct publisher. But I do agree that low quality imports should not overwrite higher quality metadata, though actual overwriting of populated fields shouldn't currently be happening, and if it is, I think that is likely a bug to be addressed.

In the case of https://openlibrary.org/books/OL11812091M/When_Greek_Goats_Sing_Sad_Songs?_compare=Compare&b=5&a=4&m=diff, again I apologize for being slow, @LeadSongDog, but I should be upfront about my ignorance to save everyone time: can you help explain the possible harm from adding promise:bwb_daily_pallets_2021-01-26 to source_records, and urn:bwbsku:W1-AUZ-082 to local_id?

We may be getting far afield from the specific issue of using the ISBN prefix to populate the publisher. Unless there is a way to do this with confidence, I am inclined to close this specific issue.

However, there is still more to do in terms of improving quality. Hopefully the changes in #9587 and #9574, along with the forthcoming changes in #9753 and PRs to address #9808 and #9831 will help limit the light records. It may also be the case that the suggestion in #9808 not to match MARC imports without an ISBN to an existing edition with only title + ISBN should be extended to all imports, but that may be a discussion for elsewhere.

@LeadSongDog
Author

@scottbarnes
No apology needed. We all have learning to do, me more than most. I wonder if recording the Promise pallet associated with the (then unscanned) edition has a point when the IA record from the subsequent scan shows a different “Old_pallet IA-NS-0000662”.

@scottbarnes
Collaborator

I see that IA-NS-0000662 is listed as one of the pallets at https://archive.org/details/bwb_daily_pallets_2021-01-26, so there seems to be at least some connection. I am unsure, however, how that exact pallet, out of the several listed there, came to be associated with https://archive.org/details/whengreekgoatssi0000ceru.

I have found it useful at times to use the source record, e.g. bwb_daily_pallets_2021-01-26, to go look at the metadata from the promise item itself to try to understand a bit more about what happened with a particular import.

@scottbarnes scottbarnes removed the Needs: Response Issues which require feedback from lead label Sep 2, 2024
@scottbarnes
Collaborator

Pending a way to confidently add publisher records from the ISBN prefix, I am going to close this as not (currently) planned.

@scottbarnes scottbarnes closed this as not planned Sep 2, 2024
7 participants