Data Cleanup: Populate blank publisher fields from ISBN prefix when known #2119
Comments
To give an idea of the scope of this: in the May 2019 edition dump there are
So #895 just closed without changing code, but the underlying idea here, to exploit ISBN prefixes to fill in blank publisher fields, still has potential to quickly improve data. Can we revisit it? I see no reason why the prefixes could not become readily searchable, yielding a standard spelling for the publisher, and even an approximate year of publication (as adjacent ISBNs will usually be assigned in the same year). |
@LeadSongDog, can you tell me a bit more about the strategy here? I tried the links in #895, but they no longer seem to work. Using https://openlibrary.org/books/OL3697910M/General_chemistry as an example, it has ISBN 13 9780618399413, but, to ask a stupid question, how do I determine the prefix? I tried querying for My goal is to understand the process so the issue can be better broken down into steps that can be used to close the issue. |
To update the 5-year-old numbers above (#2119 (comment)): there are currently 2,281,048 (2.3M) editions without publishers, double the number from 5 years ago, and 513,197 (0.5M) of those have ISBNs.
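For anyone wanting to reproduce counts like these, here is a minimal sketch. It assumes the editions dump is tab-separated with the JSON record in the fifth column, and that editions keep publishers under `publishers` and ISBNs under `isbn_10` / `isbn_13`. The function name `count_blank_publishers` is hypothetical.

```python
import json
from typing import Iterable


def count_blank_publishers(lines: Iterable[str]) -> tuple[int, int]:
    """Return (editions without a publisher, how many of those have an ISBN)."""
    no_publisher = 0
    with_isbn = 0
    for line in lines:
        # Assumed dump layout: type, key, revision, last_modified, JSON.
        fields = line.rstrip("\n").split("\t", 4)
        if len(fields) != 5 or fields[0] != "/type/edition":
            continue
        record = json.loads(fields[4])
        if record.get("publishers"):
            continue  # publisher already populated
        no_publisher += 1
        if record.get("isbn_13") or record.get("isbn_10"):
            with_isbn += 1
    return no_publisher, with_isbn
```

In practice the lines would come from the (decompressed) dump file opened in text mode, passed straight in as the iterable.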
Only the vaguest ideas on implementation, but… One might start from an edition record with isbn but not a publisher identified. https://openlibrary.org/search?q=isbn%3A+97806183994*&mode=everything or https://openlibrary.org/search?q=isbn%3A+9780618399*&mode=everything or https://openlibrary.org/search?q=isbn%3A+978061839*&mode=everything A quick comparison of the results shows that the closest ISBNs had the most similarities, even revealing variant spellings for the publisher and authors. The shortest (first) list above includes several spellings for Houghton Mifflin seen at Q390074. It might be simpler to start from a dump of editions, then sort on isbn? |
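The "dump, then sort on ISBN" idea could be sketched roughly as below: group editions by their leading ISBN digits and tally the publisher spellings seen in each group. This is only an illustration. The function name `publishers_by_prefix` and the fixed-length digit prefix are assumptions (real registrant prefixes vary in length), and apart from the first ISBN, the sample records are invented.

```python
from collections import Counter, defaultdict


def publishers_by_prefix(editions, prefix_len):
    """Group (isbn13, publisher) pairs by the first `prefix_len` digits.

    Returns a dict mapping each digit prefix to a Counter of publisher
    spellings. Editions with a blank publisher are skipped, since they
    are the ones we would later try to fill in.
    """
    groups = defaultdict(Counter)
    for isbn, publisher in editions:
        digits = isbn.replace("-", "")
        if len(digits) == 13 and publisher:
            groups[digits[:prefix_len]][publisher] += 1
    return groups


# Invented sample data, mirroring the 10-digit wildcard search above.
sample = [
    ("9780618399413", "Houghton Mifflin"),
    ("9780618399420", "Houghton Mifflin Co."),
    ("9780618399437", ""),  # blank publisher: a candidate for filling
]
for prefix, spellings in publishers_by_prefix(sample, 10).items():
    print(prefix, spellings.most_common())
```

Shortening `prefix_len` widens each group, much like the three searches above, at the cost of mixing in less closely related editions.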
I know I am being a bit slow here, but the part that isn't fully clear to me is how to get from the ISBN to the publisher, or at least to the prefix. I think we might need to be able to do that on a large scale to do it on the data dump. |
@scottbarnes to answer the 'how to get the prefix', you can use the
and get all but the last two groups. The prefixes are assigned by a registry, and the ranges are updated every so often.

Your example for I was going to say that doesn't make sense to me, unless it is different imprint levels that are owned by the same parent company, but that might be what is happening here. Clarion Books is owned by Harper Collins, and Harper Collins bought Houghton Mifflin at some point, so maybe that range has been transferred? This might make it more difficult to extract the original publisher (because they keep eating each other).

It seemed like a reasonable approach to extract publishers from ISBN prefixes, but your example seems to show that the assignment can change over time. Even without this complicating factor, an ISBN prefix lookup might give results at a different imprint level, which may not be that useful for someone searching for bibliographic publisher metadata.

That leads to the question: what use case does populating publisher from ISBN serve? OL doesn't use publisher to disambiguate between books, primarily because publisher is just an uncontrolled string, and there is so much variation in forms and imprints that it doesn't add much. Any item with an ISBN already has that as a less ambiguous ID for any kind of lookup. The publisher string determined by an ISBN lookup is quite likely not to appear on the book in that form.
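To make "all but the last two groups" concrete, here is a minimal sketch that starts from an ISBN that has already been hyphenated. Splitting an unhyphenated ISBN into groups requires the registry's range data, which this sketch does not include. The function name `registrant_prefix` is hypothetical, and the hyphenation of the example ISBN from earlier in the thread is assumed.

```python
def registrant_prefix(hyphenated_isbn: str) -> str:
    """Drop the last two groups (publication element and check digit).

    Expects an already hyphenated ISBN-13 (five groups) or ISBN-10
    (four groups); anything else raises ValueError.
    """
    groups = hyphenated_isbn.strip().split("-")
    if len(groups) not in (4, 5) or not all(groups):
        raise ValueError(f"not a hyphenated ISBN: {hyphenated_isbn!r}")
    return "-".join(groups[:-2])


# Assumed hyphenation of the 9780618399413 example.
print(registrant_prefix("978-0-618-39941-3"))  # 978-0-618
```

As noted above, the prefix identifies whoever the registry assigned the range to, which may be a parent company or a later owner rather than the imprint printed on the book.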
Particularly for textbooks, some work titles, such as « Chemistry » or « Calculus », are reused by multiple works. To merge the work records, they must be disambiguated too. By identifying the publisher of an edition, it is often possible to (A) determine more completely the author(s) for each work (as cataloguers and online merchants often list only author surnames) and (B) determine which of the identically titled works the edition belongs to. Work-merging will still require the merged work records to agree on linked authors.
Like @hornc, I'm suspicious of this approach. It doesn't seem like a reliable way to source metadata. The original problem (no publisher stated) is created by using poor quality metadata to start with, so let's not compound it. To take a random example from early in a recent edition dump https://openlibrary.org/books/OL11812091M I'm not sure why the IA MARC record wasn't used to populate the publisher, but it strikes me that even if it weren't available, using WorldCat would be better than trying to guess from the ISBN. |
@tfmorris That’s an interesting example. Of course I agree that low quality metadata sources (Goodreads, AMZ, BwB) should not be amplified, but that still seems to be accepted OL practice. I’d prefer to simply delete what can’t be verified, but I can’t. Would you prefer to have us just leave the mess rather than clean it up?

The Promise record was attached to the edition long after the book had been scanned into IA: https://openlibrary.org/books/OL11812091M/When_Greek_Goats_Sing_Sad_Songs?_compare=Compare&b=5&a=4&m=diff The problem was aggravated in that the low quality source (Promise) was allowed to overwrite the high quality source (scan). There ought to be logic preventing this.
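The kind of guard being asked for might look something like the sketch below. It is purely illustrative: the `SOURCE_RANK` table, the source labels, and `merge_record` are all hypothetical and do not describe Open Library's actual import logic.

```python
# Hypothetical trust ranking (higher is more trusted); not OL's real policy.
SOURCE_RANK = {"scan": 2, "promise": 1}


def merge_record(existing: dict, incoming: dict,
                 existing_source: str, incoming_source: str) -> dict:
    """Merge `incoming` into `existing` without losing trusted data.

    Blank fields are always filled. Populated fields are replaced only
    when the incoming source outranks the existing one.
    """
    merged = dict(existing)
    can_overwrite = (SOURCE_RANK.get(incoming_source, 0)
                     > SOURCE_RANK.get(existing_source, 0))
    for field, value in incoming.items():
        if not value:
            continue  # never overwrite with an empty value
        if not merged.get(field) or can_overwrite:
            merged[field] = value
    return merged
```

With a rule like this, a Promise import could still add a missing page count to a scanned edition, but could not replace the publisher taken from the scan.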
For my part, I am not convinced we can, with confidence, get from the ISBN to the correct publisher. But I do agree that low quality imports should not overwrite higher quality metadata, though actual overwriting of populated fields shouldn't currently be happening, and if it is, I think that is likely a bug to be addressed. In the case of https://openlibrary.org/books/OL11812091M/When_Greek_Goats_Sing_Sad_Songs?_compare=Compare&b=5&a=4&m=diff, again I apologize for being slow, @LeadSongDog, but I should be upfront about my ignorance to save everyone time: can you help explain the possible harm from adding

We may be getting far afield from the specific issue of using the ISBN prefix to populate the publisher. Unless there is a way to do this with confidence, I am inclined to close this specific issue. However, there is still more to do in terms of improving quality. Hopefully the changes in #9587 and #9574, along with the forthcoming changes in #9753 and PRs to address #9808 and #9831, will help limit the light records. It may also be the case that the suggestion in #9808 not to match MARC imports without an ISBN to an existing edition with only title + ISBN should be extended to all imports, but that may be a discussion for elsewhere.
@scottbarnes |
I see that I have found it useful at times to use the source record, e.g. |
Pending a way to confidently add publisher records from the ISBN prefix, I am going to close this as not (currently) planned. |
Many edition records have no publisher shown, but do have an ISBN. Previous discussion at #895 shows how to get an official spelling of the publisher from the ISBN.