
Create ONIX Ingestion Pipeline #860

Closed
3 of 4 tasks
mekarpeles opened this issue Mar 13, 2018 · 11 comments
Labels: Affects: Data · Module: Import · Priority: 3 · Theme: Identifiers · Type: Epic · Type: Feature Request

@mekarpeles
Member

Bibliometa has ONIX feeds which we can import into Open Library:

Issues related to ONIX parsing: #2, #3

  • Let's normalize/homogenize the ONIX results across publishers so the syntax and format are predictable
  • Upload full dumps of normalized ONIX XML feeds to the onix-for-books item using the IA mek+onix@archive.org account
  • Upload .tar.gz book covers to the onix-for-bookcovers item
  • We will process each dump and...
    • Upload each ONIX record to Open Library
      • Add the book cover
      • Cherry-pick the metadata
      • Add a link to the metadata on the publisher's website
      • Save the edits with a reference to ONIX, Bibliometa, and the publisher
    • Update our archive.org digitized book items accordingly
  • Let's publish a blog post together which highlights all the changed records + the publishers
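The per-record step above ("upload each ONIX record, cherry-pick the metadata") could start with something like the sketch below, which pulls ISBN-13 and title out of a normalized feed using the ONIX 2.1 reference tag names (`Product`, `ProductIdentifier`, `ProductIDType` 15 = ISBN-13). The exact shape of the normalized dumps is an assumption.

```python
# Sketch: walk a normalized ONIX 2.1 message and pull out the fields we
# plan to cherry-pick. Tag names follow the ONIX 2.1 reference schema;
# the post-normalization feed layout is an assumption.
import xml.etree.ElementTree as ET

def extract_products(onix_xml):
    """Yield (isbn13, title) pairs from an ONIX message string."""
    root = ET.fromstring(onix_xml)
    for product in root.iter("Product"):
        isbn13 = None
        for ident in product.iter("ProductIdentifier"):
            # ProductIDType 15 = ISBN-13 in the ONIX code lists
            if ident.findtext("ProductIDType") == "15":
                isbn13 = ident.findtext("IDValue")
        title = product.findtext("./Title/TitleText")
        if isbn13:
            yield isbn13, title

sample = """<ONIXMessage>
  <Product>
    <RecordReference>cup-0001</RecordReference>
    <ProductIdentifier>
      <ProductIDType>15</ProductIDType>
      <IDValue>9780521000000</IDValue>
    </ProductIdentifier>
    <Title><TitleText>Example Title</TitleText></Title>
  </Product>
</ONIXMessage>"""

print(list(extract_products(sample)))
```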

The items containing the files to import into Open Library are:

cc: @salman-bhai, @hornc

@sbshah97
Contributor

I'm all in to get started on this! Do we have any documentation on how to import them from the perspective of a developer, or should that be added to the to-do list as well?

@tfmorris
Contributor

The example CUP feed doesn't appear to have any strong identifiers for authors. What's the proposal for reconciling author name and affiliation (which it appears is all that is there) with the Open Library author records?

I am very leery of making the messy OpenLibrary data even messier.

@mekarpeles
Member Author

As a first pass, we could identify which books we already have on OL and add edition/work data to those (without creating any new authors)
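Matching against existing OL editions would hinge on comparing ISBNs in a single form. A minimal sketch of that normalization step, using the standard ISBN-10 to ISBN-13 conversion (prefix 978, recompute the EAN-13 check digit):

```python
# Sketch for the "first pass" above: normalize incoming ONIX ISBNs to
# ISBN-13 so they can be matched against editions already on OL before
# touching any author records.
def to_isbn13(isbn):
    """Return the ISBN-13 form of an ISBN-10 or ISBN-13 string."""
    digits = isbn.replace("-", "").replace(" ", "")
    if len(digits) == 13:
        return digits
    if len(digits) != 10:
        raise ValueError("not an ISBN: %r" % isbn)
    core = "978" + digits[:9]
    # EAN-13 check digit: alternate weights 1 and 3 over the 12 digits
    total = sum((1 if i % 2 == 0 else 3) * int(d) for i, d in enumerate(core))
    return core + str((10 - total % 10) % 10)

print(to_isbn13("0-306-40615-2"))  # -> 9780306406157
```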

@mekarpeles
Member Author

Cory (from bibliometa) writes:

Indeed, contributor disambiguation is an understandable issue. The ONIX standard can accommodate ISNI identifiers (ISNI.org) for authors and contributors in general. However, very few publishers currently integrate ISNIs into their ONIX feeds, even though, when you check, many (if not most) contributors already have ISNI numbers.

Ken and I have explored the ISNI API in order to imagine offering ONIX enrichment services through our own platform, but we haven't found the right partners yet who would be motivated to use them.

What are your thoughts on ISNI, or other author identifiers?

I like your idea of augmenting existing OL book metadata with cherry-picked fields like Main Description as a first step of using the ONIX to improve records.

@tfmorris
Contributor

Only adding editions would definitely be a lower risk option and would allow starting to get familiar with the ONIX standard, but it wouldn't achieve what I understood to be the primary goal of expanding and modernizing the corpus of cataloged books.

@mekarpeles
Member Author

@tfmorris you're right -- though if we can get a parser in place and figure out a solution for author authority IDs, there are ~1M records with ISBNs that Cory can see about getting us. Yes, it does raise the question: how do we get/ensure the author identities?

@mekarpeles mekarpeles added this to the 2018 Q2 milestone Mar 21, 2018
@tfmorris
Contributor

tfmorris commented Jun 6, 2018

In addition to the two bug fixes mentioned up top, there is a whole stack of other things that need to be cleaned up, since this code hasn't been touched in nine years. It may even be better to use the current code as a specification and reimplement.

Some of the things which I notice at a glance:

  • PEP 8: spaces instead of tabs
  • Replace SAX with ElementTree or another modern XML access layer
  • Replace xmltramp.py with a more modern off-the-shelf XML library
  • Replace urlcache.py with a requests cache? It may be built into the XML library, making it unnecessary
  • Replace thread_utils.py with modern built-ins
  • Define a custom exception to raise rather than using bare Exception
  • Names are converted to ASCII (ick! perhaps one of the sources of all our broken names)
  • The importer works directly against the database instead of using the API
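Two of the items above (a custom exception instead of bare `Exception`, and keeping names as Unicode rather than folding to ASCII) could look roughly like this; the names `OnixImportError` and `parse_contributor` are illustrative, not from the existing code:

```python
# Sketch: custom exception type plus Unicode-preserving parsing,
# using ElementTree in place of the old SAX/xmltramp stack.
import xml.etree.ElementTree as ET

class OnixImportError(Exception):
    """Raised when an ONIX record cannot be parsed or imported."""

def parse_contributor(xml_fragment):
    """Return the contributor's PersonName, preserving Unicode."""
    elem = ET.fromstring(xml_fragment)
    name = elem.findtext("PersonName")
    if not name:
        raise OnixImportError("contributor has no PersonName")
    return name  # no .encode('ascii', ...) -- diacritics survive intact

print(parse_contributor(
    "<Contributor><PersonName>José Saramago</PersonName></Contributor>"))
```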

@LeadSongDog

LeadSongDog commented Jun 6, 2018

@mekarpeles It's long past time to put in place a simple principle: no new author record should be machine-created without links to an established authority record. When there's neither VIAF nor ISNI found, it is extremely likely that the author name is in error. Let's not further pollute the commons. Aside from just name, there should be at least one date (none conflicting), or else a matching coauthor, work title, or publisher at a minimum. Simply matching on name is not enough.
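The principle argued here can be stated as a guard function: accept a match on a strong identifier outright, and otherwise require a name match corroborated by at least one more signal. The record layout (dict keys) below is hypothetical:

```python
# Sketch of the matching principle above: never link (or create) an
# author on name alone. The dict keys used here are illustrative.
def author_match_allowed(candidate, existing):
    """Return True if candidate may be linked to an existing author record."""
    # Strong authority identifiers (VIAF, ISNI) win outright.
    for key in ("viaf", "isni"):
        if candidate.get(key) and candidate.get(key) == existing.get(key):
            return True
    if candidate.get("name") != existing.get("name"):
        return False
    # Name alone is not enough: require a non-conflicting date
    # or a shared work (stand-ins for date/coauthor/publisher signals).
    dates_ok = (candidate.get("birth_date")
                and candidate.get("birth_date") == existing.get("birth_date"))
    shared_works = set(candidate.get("works", [])) & set(existing.get("works", []))
    return bool(dates_ok or shared_works)
```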

@tfmorris
Contributor

tfmorris commented Jun 7, 2018

@LeadSongDog I don't understand the relevance of your comment. The notes from Mar 20 explicitly say no new authors at all.

@LeadSongDog

Yes @tfmorris, it's true that @mekarpeles said that in the "as a first pass" context, but I'm arguing for a more general principle. Getting the urine out of the swimming pool is rather more work than getting it in.

@hornc added the Module: Import label and removed the ONIX label on May 5, 2019
@xayhewalo added the Affects: Data, State: Backlogged, Theme: Identifiers, Type: Epic, Type: Feature Request, and Priority: 3 labels on Nov 7, 2019
@mekarpeles
Member Author

Closing this for now; @hornc is driving MARC and amz imports.
