Refactor get_ia.py to use requests instead of urllib.urlopen #4436
Branch updated from a94dbc3 to 3567c1c.
@dherbst thanks for redoing this. It took me a while to update my Docker images, but I have been able to check this out locally and test. Here are the commands and URLs I used.
For bulk imports I used the bulk import bot script: https://github.com/internetarchive/openlibrary-bots/blob/master/ia-bulkmarc-bot/bulk-import.py
(-l is to point to localhost:8080, -o is the MARC offset, from file BooksAll.2016.part43.utf8 on item https://archive.org/download/marc_loc_2016)
This imports the first 5 records from file dnb_all_dnbmarc_20200615-2.mrc on item marc_dnb_202006. You can use the OL client to test the import URLs directly, although logging in locally is a bit of a fiddle (I can't remember if the client provides a convenient way to target localhost yet):
The above tests a single IA MARC import. I received the (good + expected) response:
The wiki documentation of these endpoints is at https://github.com/internetarchive/openlibrary/wiki/Endpoints#importing
I also tested an IA import with a 404 response:
Which is as expected. And an import attempt of an IA record which only has MARC XML:
Also good. 👍
Testing local SHA: version 3567c1c
This PR looks good, I've tested the IA individual binary MARC and XML only imports, as well as the bulk MARC path, and the imports behave as expected with requests -- great work!
The line
return '<!--' in urlopen_keep_trying(IA_DOWNLOAD_URL + loc).content
needs fixing (see comment above), but that is deprecated code, and I'm not sure how often it triggers. The fix is to compare against b'<!--' (or use .text, which is possibly better).
Once that is tidied, this can be merged.
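For context on why that line needs fixing: in Python 3, `.content` on a requests response is `bytes`, and testing a `str` for membership in `bytes` raises `TypeError`. The following is only a minimal sketch of the two fixes suggested above; the helper names and the sample body are hypothetical and are not taken from get_ia.py.

```python
# Hypothetical stand-in for a response body; not real archive.org output.
body_bytes = b"<!-- item not available -->"
body_text = body_bytes.decode("utf-8")  # roughly what .text would hold


def has_comment_marker_bytes(content: bytes) -> bool:
    """Fix 1: compare bytes with bytes (the b'<!--' suggestion)."""
    return b"<!--" in content


def has_comment_marker_text(text: str) -> bool:
    """Fix 2: compare str with str (the .text suggestion)."""
    return "<!--" in text


# The original pattern mixes str and bytes, which Python 3 rejects.
try:
    "<!--" in body_bytes
    mixed_comparison_works = True
except TypeError:
    mixed_comparison_works = False
```

Either fix keeps both sides of the `in` test the same type; using `.text` also avoids sprinkling bytes literals through code that otherwise deals in strings.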
LGTM
Changed from requests.get().content to requests.get().text
…tarchive#4436)
* Move from urllib to requests for 2852
* Use string instead file object.
* Use named headers in call for documentation.
* Use the raw HTTPRequest when you need to read the response like a file.
* Use bytes if reading binary.
* Return bytes when needed.
* Remove TODO, and mention where this is called.
* Correct to use .text for string comparison.
Addresses #2852
This PR replaces #4388
Removes urlopen and uses requests instead.
Technical
The MARC functions require a mixture of binary data and XML. The existing code expects file-like objects from the helper function, but requests returns a Response rather than a file-like object, so I've refactored the helpers to provide bytes instead.
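To illustrate the difference between the two styles, here is a rough sketch of reading one record from a file-like object versus slicing it out of bytes. It relies only on the MARC convention that the first five bytes of a record give its total length in ASCII digits. The function names and the sample data are hypothetical (the sample is not valid MARC), and this is not the code in get_ia.py.

```python
import io


def record_length(leader: bytes) -> int:
    """Parse the record length from the first five bytes of a MARC leader."""
    return int(leader[:5].decode("ascii"))


def first_record_from_file(f) -> bytes:
    """Old style: read from a file-like object, as urlopen() returned."""
    head = f.read(5)
    return head + f.read(record_length(head) - 5)


def first_record_from_bytes(data: bytes) -> bytes:
    """New style: slice the record directly out of bytes."""
    return data[:record_length(data)]


# Hypothetical sample: two fake "records" of 10 and 7 bytes.
sample = b"00010ABCDE" + b"00007XY"

from_file = first_record_from_file(io.BytesIO(sample))
from_bytes = first_record_from_bytes(sample)
```

If a caller still needs a file-like object, wrapping the bytes in `io.BytesIO` is one way to bridge the two without changing the callee.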
Testing
It would be good to find out which URLs are used, so that this can be tested for both the XML and binary cases. I haven't been able to get those yet.
For instance, what are example locators for a single MARC record, and a bulk locator?
What is ia_base_url set to?
Screenshot
N/A
Stakeholders
@cclauss @hornc