Data Cleanup: Populate blank publisher fields from ISBN prefix when known #2119

LeadSongDog · 2019-05-09T18:36:47Z

Many edition records have no publisher shown, but do have an ISBN. Previous discussion at #895 shows how to get an official spelling of the publisher from the ISBN.

hornc · 2019-06-26T08:44:12Z

To give an idea of the scope of this:

In the May 2019 editions dump:

grep -cv '"publishers":' ol_dump_editions_2019-05-31.txt
1,189,309

That is 1,189,309 editions without publishers. 169,759 of those have ISBNs.
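
For anyone who wants to reproduce both numbers in one pass, here is a rough Python sketch of the same count. It assumes each dump line is tab-separated with the record's JSON in the last column, and that ISBNs are stored under `isbn_10` / `isbn_13`. The function name is made up for illustration.

```python
import json
from typing import Iterable, Tuple


def count_missing_publishers(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (editions without publishers, how many of those have an ISBN)."""
    missing = 0
    missing_with_isbn = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Assumption: the JSON document is the last tab-separated column.
        record = json.loads(line.split("\t")[-1])
        # Unlike the grep, an empty "publishers" list also counts as missing.
        if record.get("publishers"):
            continue
        missing += 1
        if record.get("isbn_10") or record.get("isbn_13"):
            missing_with_isbn += 1
    return missing, missing_with_isbn
```

Passing an open file handle on the dump streams it line by line, so the whole file never has to fit in memory.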

LeadSongDog · 2024-08-13T14:38:21Z

So #895 just closed without changing code, but the underlying idea here, to exploit ISBN prefixes to fill in blank publisher fields, still has potential to quickly improve data. Can we revisit it? I see no reason why the prefixes could not become readily searchable, yielding a standard spelling for the publisher, and even an approximate year of publication (as adjacent ISBNs will usually be assigned in the same year).

scottbarnes · 2024-08-27T19:55:17Z

@LeadSongDog, can you tell me a bit more about the strategy here?

I tried the links in #895, but they no longer seem to work.

Using https://openlibrary.org/books/OL3697910M/General_chemistry as an example, it has ISBN 13 9780618399413, but, to ask a stupid question, how do I determine the prefix? I tried querying for 978-0-618 and saw a lot of publishers, some of which match the already listed publisher, and I see that not every prefix is the same length: https://grp.isbn-international.org/search/piid_solr?keys=978-0-618+%28ISBNPrefix%29.

My goal is to understand the process so the issue can be better broken down into steps that can be used to close the issue.

tfmorris · 2024-08-28T04:27:14Z

To update the 5 year old numbers above #2119 (comment), there are currently 2281048 (2.3M) editions without publishers, double the number from 5 years ago, and 513197 (0.5M) of those have ISBNs.

LeadSongDog · 2024-08-28T06:01:29Z

Only the vaguest ideas on implementation, but…

One might start from an edition record with isbn but not a publisher identified.
Searching on minimally truncated versions of an isbn returns something like these:

https://openlibrary.org/search?q=isbn%3A+97806183994*&mode=everything or https://openlibrary.org/search?q=isbn%3A+9780618399*&mode=everything or https://openlibrary.org/search?q=isbn%3A+978061839*&mode=everything

A quick comparison of the results shows that the closest ISBNs had the most similarities, even revealing variant spellings for the publisher and authors. The shortest (first) list above includes several spellings for Houghton Mifflin seen at Q390074.

It might be simpler to start from a dump of editions, then sort on isbn?
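
A minimal sketch of that "sort and compare neighbours" idea, mirroring the three truncated searches above (11, 10 and 9 leading digits). It assumes the dump has already been reduced to `(isbn13, publisher)` pairs. The function name and the `min_prefix_len` parameter are invented here, and the result is only a tally of candidate spellings, not a confirmed publisher.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def suggest_publishers(
    target_isbn: str,
    editions: Sequence[Tuple[str, Optional[str]]],
    min_prefix_len: int = 9,
) -> Tuple[Optional[str], List[Tuple[str, int]]]:
    """Tally publisher strings on editions sharing the longest ISBN stem."""
    # Try the longest stem first (all but the last two digits), then shorten.
    for cut in range(len(target_isbn) - 2, min_prefix_len - 1, -1):
        stem = target_isbn[:cut]
        tally = Counter(
            publisher
            for isbn, publisher in editions
            if publisher and isbn != target_isbn and isbn.startswith(stem)
        )
        if tally:
            return stem, tally.most_common()
    return None, []
```

Returning the whole tally rather than a single winner keeps the variant spellings visible, which is where a human or a later normalisation step would have to decide.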

scottbarnes · 2024-08-29T05:07:50Z

I know I am being a bit slow here, but the part that isn't fully clear to me is how to get from the ISBN to the publisher, or at least to the prefix. I think we might need to be able to do that on a large scale to do it on the data dump.

hornc · 2024-08-29T06:42:49Z

@scottbarnes to answer 'how to get the prefix': you can use isbnlib in Python:

>>> import isbnlib
>>> isbnlib.mask('9780618399413')
'978-0-618-39941-3'

and take all but the last two groups (the publication element and the check digit), which here leaves 978-0-618.

The prefixes are assigned by a registry, and the ranges are updated every so often. isbnlib keeps these relatively up-to-date.
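
A small sketch of the "all but the last two groups" step, working on an already hyphenated string such as the one `isbnlib.mask` returned above. It deliberately does not call isbnlib itself, so it makes no claim about what the library does for ranges it does not know. `registrant_prefix` is a made-up helper name.

```python
from typing import Optional


def registrant_prefix(masked_isbn: str) -> Optional[str]:
    """Drop the last two hyphen-separated groups of a hyphenated ISBN."""
    groups = masked_isbn.strip().split("-")
    # A hyphenated ISBN-13 has 5 groups, an ISBN-10 has 4 (no 978/979 element).
    # Anything else (e.g. an unhyphenated ISBN) is rejected rather than guessed.
    if len(groups) not in (4, 5) or not all(groups):
        return None
    return "-".join(groups[:-2])


print(registrant_prefix("978-0-618-39941-3"))  # 978-0-618
```

The prefix is only as good as the hyphenation it is given, so a stale range table upstream would silently produce a wrong split.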

Your example for 978-0-618 is interesting. I'm surprised that it returns "Clarion Books" (a children's book publisher) and the more correct-looking Houghton Mifflin.

I was going to say that doesn't make sense to me, unless it is different imprint levels that are owned by the same parent company, but that might be what is happening here. Clarion Books is owned by Harper Collins, and Harper Collins has bought Houghton Mifflin at some point, so maybe that range has been transferred? This might make it more difficult to extract the original publisher (because they keep eating each other).

It seemed like a reasonable approach to extract publishers from ISBN prefixes, but your example seems to show that this can change over time. Even without this complicating factor, an ISBN prefix lookup might give results at a different imprint level, which may not be that useful for someone searching for bibliographic publisher metadata.

That leads to the question: what use-case does populating publisher from ISBN serve?

OL doesn't use publisher to disambiguate between books, primarily because publisher is just an uncontrolled string, and there is so much variation in forms and imprints that it doesn't add much. Any item with an ISBN already has that as a less ambiguous id for any kind of lookup.

The publisher string determined by an ISBN lookup is quite likely not to appear on the book in that form.

LeadSongDog · 2024-08-29T16:27:05Z

Particularly for textbooks, some work titles, such as « Chemistry » or « Calculus », are reused by multiple works. To merge the work records, they must be disambiguated too. By identifying the publisher of an edition, it is often possible to (A) determine more completely the author(s) for each work (as cataloguers and online merchants often list only author surnames) and (B) determine which synonymous work the edition is from. Work-merging will still require the merged work records to agree on linked authors.

tfmorris · 2024-08-29T18:56:58Z

Like @hornc, I'm suspicious of this approach. It doesn't seem like a reliable way to source metadata. The original problem (no publisher stated) is created by using poor quality metadata to start with, so let's not compound it.

To take a random example from early in a recent edition dump https://openlibrary.org/books/OL11812091M
It was originally imported from a threadbare Amazon page, which is where the trouble started https://www.amazon.com/gp/product/0976511037
The ISBN prefix is registered to "University Book Exchange" https://grp.isbn-international.org/search/piid_solr?keys=978-0-9765110
but the copyright page lists "Independent Press," https://archive.org/details/whengreekgoatssi0000ceru/page/n3/mode/1up
the same thing that WorldCat has: https://search.worldcat.org/title/437125764?tab=details
and which appears in the MARC record that IA has associated with it: https://ia803401.us.archive.org/fetchmarc.php?path=/0/items/whengreekgoatssi0000ceru/whengreekgoatssi0000ceru_marc.xml

I'm not sure why the IA MARC record wasn't used to populate the publisher, but it strikes me that even if it weren't available, using WorldCat would be better than trying to guess from the ISBN.
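
As a rough illustration of reading the publisher from a MARC record instead of guessing from the ISBN, here is a stdlib-only sketch. It assumes the record is MARCXML in the standard `http://www.loc.gov/MARC21/slim` namespace and takes the publisher from field 260 subfield b, or from 264 subfield b when the second indicator is 1 (publication). The function name is invented, and this is not how Open Library's importer is known to do it.

```python
import xml.etree.ElementTree as ET
from typing import List

# Assumption: the record uses the standard MARCXML namespace.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}


def publishers_from_marcxml(xml_text: str) -> List[str]:
    """Collect publisher names from 260$b and 264$b (ind2 == '1')."""
    root = ET.fromstring(xml_text)
    names: List[str] = []
    for field in root.iterfind(".//marc:datafield", MARC_NS):
        tag = field.get("tag")
        is_publication = tag == "260" or (tag == "264" and field.get("ind2") == "1")
        if not is_publication:
            continue
        for subfield in field.iterfind("marc:subfield[@code='b']", MARC_NS):
            # Strip the trailing punctuation cataloguers append, e.g. "Name,".
            name = (subfield.text or "").strip().rstrip(" ,;:")
            if name and name not in names:
                names.append(name)
    return names
```

The punctuation stripping matters here because catalogue records carry the separator with the name, as in the "Independent Press," string quoted above.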

LeadSongDog · 2024-08-30T14:08:04Z

@tfmorris That’s an interesting example.

Of course I agree that low quality metadata sources (Goodreads, AMZ, BwB) should not be amplified, but that still seems to be accepted OL practice. I’d prefer to simply delete what can’t be verified, but I can’t. Would you prefer to have us just leave the mess rather than clean it up?

The Promise record was attached to the edition long after the book had been scanned into IA: https://openlibrary.org/books/OL11812091M/When_Greek_Goats_Sing_Sad_Songs?_compare=Compare&b=5&a=4&m=diff
While the scan correctly showed « Independent Press » as the publisher, the Promise record incorrectly showed the ECU bookstore « University Book Exchange ».

The problem was aggravated in that the low quality source (Promise) was allowed to overwrite the high quality source (scan). There ought to be logic preventing this.
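
One way such logic could look, as a sketch only: rank sources by the prefix of their source record and let an incoming value replace a populated one only when its source ranks strictly higher. The ranking table and function names are hypothetical, and it assumes each value's source is known, which the thread does not confirm is recorded per field.

```python
from typing import Optional, Tuple

# Hypothetical trust ranking keyed on the source record prefix; higher wins.
SOURCE_RANK = {"marc": 3, "ia": 3, "promise": 1, "amazon": 1, "bwb": 1}


def source_rank(source_record: str) -> int:
    """Rank a source record such as 'promise:bwb_daily_pallets_2021-01-26'."""
    return SOURCE_RANK.get(source_record.split(":", 1)[0], 0)


def pick_publisher(
    current: Optional[Tuple[str, str]],
    incoming: Tuple[str, str],
) -> Tuple[str, str]:
    """Each argument is (publisher, source_record).

    Keep the current value unless it is blank or the incoming source
    ranks strictly higher.
    """
    if current is None or not current[0]:
        return incoming
    if source_rank(incoming[1]) > source_rank(current[1]):
        return incoming
    return current
```

With this ordering a Promise value can still fill a blank field, but a later Promise import cannot displace a value that came from a scan or MARC record.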

scottbarnes · 2024-08-30T16:51:37Z

For my part, I am not convinced we can, with confidence, get from the ISBN to the correct publisher. But I do agree that low quality imports should not overwrite higher quality metadata, though actual overwriting of populated fields shouldn't currently be happening, and if it is, I think that is likely a bug to be addressed.

In the case of https://openlibrary.org/books/OL11812091M/When_Greek_Goats_Sing_Sad_Songs?_compare=Compare&b=5&a=4&m=diff, again I apologize for being slow, @LeadSongDog, but I should be upfront about my ignorance to save everyone time: can you help explain the possible harm from adding promise:bwb_daily_pallets_2021-01-26 to source_records, and urn:bwbsku:W1-AUZ-082 to local_id?

We may be getting far afield from the specific issue of using the ISBN prefix to populate the publisher. Unless there is a way to do this with confidence, I am inclined to close this specific issue.

However, there is still more to do in terms of improving quality. Hopefully the changes in #9587 and #9574, along with the forthcoming changes in #9753 and PRs to address #9808 and #9831 will help limit the light records. It may also be the case that the suggestion in #9808 not to match MARC imports without an ISBN to an existing edition with only title + ISBN should be extended to all imports, but that may be a discussion for elsewhere.

LeadSongDog · 2024-08-31T01:44:55Z

@scottbarnes
No apology needed. We all have learning to do, me more than most. I wonder if recording the Promise pallet associated with the (then unscanned) edition has a point when the IA record from the subsequent scan shows a different “Old_pallet IA-NS-0000662”.

scottbarnes · 2024-08-31T13:22:07Z

I see that IA-NS-0000662 is listed as one of the pallets at https://archive.org/details/bwb_daily_pallets_2021-01-26, so there seems to be at least some connection. I am unsure, however, how that particular pallet, out of the several listed there, came to be associated with https://archive.org/details/whengreekgoatssi0000ceru.

I have found it useful at times to use the source record, e.g. bwb_daily_pallets_2021-01-26, to go look at the metadata from the promise item itself to try to understand a bit more about what happened with a particular import.

scottbarnes · 2024-09-02T13:36:28Z

Pending a way to confidently add publisher records from the ISBN prefix, I am going to close this as not (currently) planned.

brad2014 added the Theme: Identifiers Issues related to ISBN's or other identifiers in metadata. [managed] label May 10, 2019

brad2014 assigned hornc May 10, 2019

xayhewalo added Priority: 3 Issues that we can consider at our leisure. [managed] State: Backlogged Theme: Publishers labels Nov 15, 2019

mekarpeles added the Lead: @hornc Issues overseen by Charles (Staff: Data Engineering Lead) [managed] label Dec 18, 2019

hornc removed their assignment Jan 14, 2020

xayhewalo removed the State: Backlogged label Mar 17, 2020

LeadSongDog mentioned this issue May 21, 2020, in "Add an option to add synonyms to publishers, authors and places" #3470 (Open)

github-actions bot added the Needs: Response Issues which require feedback from lead label Aug 14, 2024

mekarpeles added Lead: @scottbarnes Issues overseen by Scott (Community Imports) and removed Lead: @hornc Issues overseen by Charles (Staff: Data Engineering Lead) [managed] labels Aug 20, 2024

scottbarnes removed the Needs: Response Issues which require feedback from lead label Aug 27, 2024

github-actions bot added the Needs: Response Issues which require feedback from lead label Aug 28, 2024

scottbarnes removed the Needs: Response Issues which require feedback from lead label Aug 29, 2024

tfmorris mentioned this issue Aug 30, 2024, in "MARC records listed as source records not being used (or used fully?)" #9831 (Open, 3 tasks)

github-actions bot added the Needs: Response Issues which require feedback from lead label Aug 30, 2024

scottbarnes removed the Needs: Response Issues which require feedback from lead label Sep 2, 2024

scottbarnes closed this as not planned Sep 2, 2024
