Refactor get_ia.py to use requests instead of urllib.urlopen #4436

dherbst · 2021-01-19T17:42:32Z

Addresses #2852
This PR replaces #4388

Removes urlopen and uses requests instead.

Technical

The MARC functions require a mixture of binary data and xml. The code expects file type objects out of the helper function, but requests does not provide file type objects, so I've refactored to provide bytes instead.

Testing

It would be good to find out the urls that are used to be able to test this for the XML and binary cases - I haven't been able to get those yet.
For instance, what are example locators for a single MARC record, and a bulk locator?
What is ia_base_url set to?

Screenshot

N/A

Stakeholders

@cclauss @hornc

openlibrary/catalog/get_ia.py

hornc · 2021-02-01T04:42:46Z

@dherbst thanks for redoing this. Took me a while to update my Docker images, but I have been able to check this out locally and test. Here are the commands + URLs I used

For bulk imports I used the bulk import bot script https://github.com/internetarchive/openlibrary-bots/blob/master/ia-bulkmarc-bot/bulk-import.py

./bulk-import.py -l -o9068136 -n1 -f BooksAll.2016.part43.utf8  marc_loc_2016

(-l is to point to localhost:8080, -o is the MARC offset, from file BooksAll.2016.part43.utf8 on item https://archive.org/download/marc_loc_2016)

./bulk-import.py -l -o0 -n5 -f dnb_all_dnbmarc_20200615-2.mrc  marc_dnb_202006

This imports the first 5 records from file dnb_all_dnbmarc_20200615-2.mrc on item marc_dnb_202006.

You can use the OL client to test the import URLs directly, although logging in locally is a bit of a fiddle (I can't remember if the client provides a convenient way to target localhost yet):

from olclient.openlibrary import OpenLibrary
ol = OpenLibrary()
from collections import namedtuple
Credentials = namedtuple('Credentials', ['username', 'password'])
local_dev = 'http://localhost:8080'
c = Credentials('openlibrary@example.com', 'admin123')
ol = OpenLibrary(base_url=local_dev, credentials=c)

r = ol.session.post(url="%s/api/import/ia" % ol.base_url, data={'identifier': 'youngardizzoneau0000ardi'})

The above tests a single IA MARC import.

I received the (good + expected) response:

>>> r = ol.session.post(url="%s/api/import/ia" % ol.base_url, data={'identifier': 'youngardizzoneau0000ardi'})
>>> r
<Response [200]>
>>> r.text
'{"authors": [{"key": "/authors/OL15A", "name": "Edward Ardizzone", "status": "created"}], "success": true, "edition": {"key": "/books/OL15M", "status": "created"}, "work": {"key": "/works/OL7W", "status": "created"}}'
>>>

The wiki documentation of these endpoints is at https://github.com/internetarchive/openlibrary/wiki/Endpoints#importing

Also tested an IA import with a 404 response:

>>> r = ol.session.post(url="%s/api/import/ia" % ol.base_url, data={'identifier': 'not-there'}) 
>>> r
<Response [400]>
>>> r.text
'{"success": false, "error_code": "invalid-ia-identifier", "error": "not-there not found"}'

Which is as expected.

And an import attempt of an IA record which only has MARC XML:

>>> r = ol.session.post(url="%s/api/import/ia" % ol.base_url, data={'identifier': 'bdrc-W8LS18730'})
>>> r
<Response [200]>
>>> r.text
'{"success": true, "edition": {"key": "/books/OL16M", "status": "created"}, "work": {"key": "/works/OL8W", "status": "created"}}'

Also good. 👍

Testing Local SHA: version 3567c1c

openlibrary/catalog/get_ia.py

hornc

This PR looks good, I've tested the IA individual binary MARC and XML only imports, as well as the bulk MARC path, and the imports behave as expected with requests -- great work!

The return '<!--' in urlopen_keep_trying(IA_DOWNLOAD_URL + loc).content line needs fixing (see comment above), but that is deprecated code, and I'm not sure how often it triggers. => b'<!--' (or use .text which is possibly better)

Once that is tidied, this can be merged.

cclauss

LGTM

Changed from requests.get().content to requests.get().text

cclauss · 2021-02-01T14:56:41Z

@dherbst and @hornc Thanks much for this!

Can you please keep the collaboration going in an effort to use requests in these remaining marc-related files?

openlibrary/catalog/marc/download.py
openlibrary/views/showmarc.py
scripts/z3950_view.py

A subtask of #2852

…tarchive#4436) * Move from urllib to requests for 2852 * Use string instead file object. * Use named headers in call for documentation. * Use the raw HTTPRequest when you need to read the response like a file. * Use bytes if reading binary. * Return bytes when needed. * Remove TODO, and mention where this is called. * Correct to use .text for string comparison.

dherbst mentioned this pull request Jan 19, 2021

Move from urllib to requests in openlibrary/catalog/get_ia.py #4388

Closed

mekarpeles assigned cclauss Jan 19, 2021

cclauss assigned hornc Jan 19, 2021

cclauss added the Python 3: bytes vs. str label Jan 19, 2021

mekarpeles self-requested a review January 19, 2021 19:55

cclauss reviewed Jan 21, 2021

View reviewed changes

openlibrary/catalog/get_ia.py Outdated Show resolved Hide resolved

dherbst added 7 commits January 22, 2021 08:56

Move from urllib to requests for 2852

10ac1ab

Use string instead file object.

8ee234a

Use named headers in call for documentation.

49b7ce4

Use the raw HTTPRequest when you need to read the response like a file.

91b7bb3

Use bytes if reading binary.

7581bc3

Return bytes when needed.

ab76bef

Remove TODO, and mention where this is called.

3567c1c

dherbst force-pushed the 2852/refactor/get_ia-requests branch from a94dbc3 to 3567c1c Compare January 22, 2021 13:57

hornc reviewed Feb 1, 2021

View reviewed changes

openlibrary/catalog/get_ia.py Outdated Show resolved Hide resolved

hornc previously requested changes Feb 1, 2021

View reviewed changes

Correct to use .text for string comparison.

9984bc2

dherbst requested review from hornc and cclauss February 1, 2021 12:42

cclauss approved these changes Feb 1, 2021

View reviewed changes

cclauss merged commit fad4a40 into internetarchive:master Feb 1, 2021

hornc mentioned this pull request Feb 2, 2021

remove unused and deprecated code #4520

Merged

mekarpeles added python Pull requests that update Python code and removed Python 3: bytes vs. str labels Sep 7, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor get_ia.py to use requests instead of urllib.urlopen #4436

Refactor get_ia.py to use requests instead of urllib.urlopen #4436

dherbst commented Jan 19, 2021 •

edited

Loading

hornc commented Feb 1, 2021

hornc left a comment •

edited

Loading

cclauss left a comment

cclauss commented Feb 1, 2021 •

edited

Loading

Refactor get_ia.py to use requests instead of urllib.urlopen #4436

Refactor get_ia.py to use requests instead of urllib.urlopen #4436

Conversation

dherbst commented Jan 19, 2021 • edited Loading

Technical

Testing

Screenshot

Stakeholders

hornc commented Feb 1, 2021

hornc left a comment • edited Loading

Choose a reason for hiding this comment

cclauss left a comment

Choose a reason for hiding this comment

cclauss commented Feb 1, 2021 • edited Loading

dherbst commented Jan 19, 2021 •

edited

Loading

hornc left a comment •

edited

Loading

cclauss commented Feb 1, 2021 •

edited

Loading