Fix ImportBot to import Archive.org works w/o MARCs #459

mekarpeles · 2017-04-04T19:16:47Z

Currently, many works digitized by Internet Archive are not making it into Open Library. The root cause is an overly restrictive policy around repub-status values and the requirement for the archive.org item to have a MARC.

1. ImportBot is run via openlibrary/scripts/manage-imports.py:

   sudo -u openlibrary /olsystem/bin/olenv HOME=. OPENLIBRARY_RCFILE=/olsystem/etc/olrc-importbot python scripts/manage-imports.py --config /olsystem/etc/openlibrary.yml import-all

2. This queries for new IA items (from the last day), which must have MARCs and a repub-status of 4 (which is too strict). As part of my fix, I have removed the repub-status check and also the requirement for a MARC to be present.
3. A batch is created for all these items (for efficiency's sake), and then the items are enumerated and "processed".
4. Processing entails delegating to openlibrary/core/ia.py, which uses get_item_status() to check that the IA item meets all criteria. This is currently failing at bad-repub-state. As part of my in-progress fix, this check is removed because the query in step 2 has been relaxed.
5. During processing, the script makes a POST to openlibrary.org to log in and then a POST to the /api/import/ia API endpoint (which under the hood routes to openlibrary/plugins/importapi/code.py, namely ia_importapi).
6. Within the ia_importapi POST, the metadata for the item (to create an OL work/edition) is requested from ia.get_metadata(key). This is currently failing because no MARC exists (see "case 4" in code.py's ia_importapi).
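
To make the proposed relaxation concrete, here is a minimal sketch of the two checks discussed above. It is not the actual get_item_status() from openlibrary/core/ia.py: the function name item_status, its parameters, the "repub_state" metadata key, and the "no-marc" and "ok" return values are all assumptions for illustration. Only the "bad-repub-state" status comes from the failure described above.

```python
# Hypothetical sketch only; the real check lives in openlibrary/core/ia.py
# and may use different names, arguments, and status strings.

def item_status(metadata, has_marc, require_marc=True, require_repub_state=True):
    """Return a status string saying whether an IA item may be imported.

    metadata: dict of item metadata (the "repub_state" key is an assumption).
    has_marc: whether a MARC record was found for the item.
    The two require_* flags model the strict policy (both True) and the
    relaxed policy proposed in this issue (both False).
    """
    if require_repub_state and str(metadata.get("repub_state")) != "4":
        return "bad-repub-state"
    if require_marc and not has_marc:
        return "no-marc"  # hypothetical status name
    return "ok"  # hypothetical status name


if __name__ == "__main__":
    item = {"repub_state": "19"}  # made-up example item without a MARC
    # Strict policy: rejected before the MARC is even considered.
    print(item_status(item, has_marc=False))
    # Relaxed policy: neither check applies, so the item goes through.
    print(item_status(item, has_marc=False,
                      require_marc=False, require_repub_state=False))
```

Keeping both checks behind flags, rather than deleting them outright, is just one way to express the change; it would let the strict behavior be restored if MARC-less imports turn out to produce poor records.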


tfmorris · 2017-04-05T01:31:18Z

How is author identification done if there's no MARC? Doesn't IA just think author is just a name string?

LeadSongDog · 2017-11-03T16:10:57Z

@mekarpeles I notice the recent ImportBot creation of OL26393155M and then OL17803367W from https://openlibrary.org/show-records/ia:guidetoocaseyspl00orio left off a great deal of the information in that MARC record, including the Author name ;-(
What it missed is at least: https://openlibrary.org/books/OL26393155M/A_guide_to_O'Casey's_plays?b=2&a=1&_compare=Compare&m=diff
There may be more identifiers hidden in there too that I didn't capture, e.g. I missed the ISBN 0312353006:
https://openlibrary.org/books/OL26393155M/A_guide_to_O'Casey's_plays?b=3&a=2&_compare=Compare&m=diff

Any idea what's going wrong?

mekarpeles · 2017-11-03T16:51:17Z

This is timely; I will look into this today. I will make and link a separate issue for ImportBot missing fields. The original thread is becoming a blocker for @JeffKaplan

mekarpeles · 2018-01-08T22:06:50Z

Related to #688

mekarpeles self-assigned this Apr 26, 2017

mekarpeles added Priority: 1 Do this week, receiving emails, time sensitive, . [managed] importbot labels Jan 8, 2018

mekarpeles closed this as completed Mar 13, 2018

brad2014 added the Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] label May 9, 2019
