Fix ImportBot to import Archive.org works w/o MARCs #459

Closed
mekarpeles opened this issue Apr 4, 2017 · 4 comments
Assignees
Labels
Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] Priority: 1 Do this week, receiving emails, time sensitive, . [managed]

Comments

@mekarpeles
Member

Currently, many works digitized by Internet Archive are not making it into Open Library. The root cause is an overly restrictive policy around repub-status values and the requirement for the archive.org item to have a MARC.

  1. ImportBot is run via openlibrary/scripts/manage-imports.py:

sudo -u openlibrary /olsystem/bin/olenv HOME=. OPENLIBRARY_RCFILE=/olsystem/etc/olrc-importbot python scripts/manage-imports.py --config /olsystem/etc/openlibrary.yml import-all

  2. This would query for new IA items (in the last day), which must have MARCs and a repub-status of 4 (which is too strict). As part of my fix, I have removed the repub-status check and also removed the requirement for a MARC to be present.

  3. A batch is created for all these items (for efficiency's sake), and then the items are enumerated and "processed".

  4. Processing entails delegating to openlibrary/core/ia.py, which uses get_item_status() as a check to ensure the IA item meets all criteria. This is currently failing at bad-repub-state. As part of my in-progress fix, this check is removed because the query in step 2 has been relaxed.

  5. During processing, the script makes a POST to openlibrary.org to log in and then a POST to the /api/import/ia API endpoint (which under the hood routes to openlibrary/plugins/importapi/code.py, namely ia_importapi).

  6. Within the ia_importapi POST, the metadata for the item (to create an OL work/edition) is requested from ia.get_metadata(key). This is currently failing because no MARC exists in (see "case 4" in code.py's ia_importapi)
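
For illustration only, here is a rough sketch of what the relaxed check and a MARC-less fallback record could look like. This is not the code in openlibrary/core/ia.py or code.py: the function names, the metadata keys (repub_state, has_marc, creator), the "ok" and "no-marc" status strings, and the shape of the returned record are all assumptions. Only "bad-repub-state" and the "ia:" record prefix appear in this thread.

```python
def sketch_item_status(metadata, strict=False):
    """Return "ok" or a rejection reason for an IA item (hypothetical helper).

    strict=True mimics the old policy described above: repub-state must be 4
    and a MARC must be present. strict=False is the relaxed policy.
    """
    if strict:
        if str(metadata.get("repub_state")) != "4":
            return "bad-repub-state"
        if not metadata.get("has_marc"):
            return "no-marc"  # assumed status name
    return "ok"


def sketch_edition_from_ia(metadata):
    """Build a minimal edition-like dict from IA metadata alone (hypothetical).

    Without a MARC, authors are only name strings; nothing here resolves
    them to existing author records.
    """
    creators = metadata.get("creator", [])
    if isinstance(creators, str):
        # Treat a single creator string as a one-element list.
        creators = [creators]
    return {
        "title": metadata.get("title"),
        "authors": [{"name": name} for name in creators],
        "source_records": ["ia:" + metadata["identifier"]],
    }


# Made-up item, not a real archive.org identifier.
item = {
    "identifier": "exampleitem00test",
    "title": "An Example",
    "creator": "Jane Doe",
    "repub_state": "19",
}
old_status = sketch_item_status(item, strict=True)
new_status = sketch_item_status(item)
record = sketch_edition_from_ia(item)
```

Under the old policy this made-up item is rejected before import; under the relaxed one it passes and yields a record whose only author information is a bare name.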

@tfmorris
Contributor

tfmorris commented Apr 5, 2017

How is author identification done if there's no MARC? Doesn't IA treat the author as just a name string?

@mekarpeles mekarpeles self-assigned this Apr 26, 2017
@LeadSongDog

LeadSongDog commented Nov 3, 2017

@mekarpeles I notice the recent ImportBot creation of OL26393155M and then OL17803367W from https://openlibrary.org/show-records/ia:guidetoocaseyspl00orio left off a great deal of the information in that MARC record, including the Author name ;-(
What it missed is at least: https://openlibrary.org/books/OL26393155M/A_guide_to_O'Casey's_plays?b=2&a=1&_compare=Compare&m=diff
There may be more identifiers hidden in there too that I didn't capture, e.g. I missed the ISBN 0312353006:
https://openlibrary.org/books/OL26393155M/A_guide_to_O'Casey's_plays?b=3&a=2&_compare=Compare&m=diff

Any idea what's going wrong?

@mekarpeles
Member Author

This is timely; I will look into this today and will create and link a separate issue for ImportBot missing fields. The original thread is becoming a blocker for @JeffKaplan

@mekarpeles mekarpeles added Priority: 1 Do this week, receiving emails, time sensitive, . [managed] importbot labels Jan 8, 2018
@mekarpeles
Member Author

Related to #688

@brad2014 brad2014 added the Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] label May 9, 2019
4 participants