Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor get_ia.py to use requests instead of urllib.urlopen #4436

Merged

Conversation

dherbst
Copy link
Contributor

@dherbst dherbst commented Jan 19, 2021

Addresses #2852
This PR replaces #4388

Removes urlopen and uses requests instead.

Technical

The MARC functions require a mixture of binary data and xml. The code expects file type objects out of the helper function, but requests does not provide file type objects, so I've refactored to provide bytes instead.

Testing

It would be good to find out the urls that are used to be able to test this for the XML and binary cases - I haven't been able to get those yet.
For instance, what are example locators for a single MARC record, and a bulk locator?
What is ia_base_url set to?

Screenshot

N/A

Stakeholders

@cclauss @hornc

@dherbst dherbst force-pushed the 2852/refactor/get_ia-requests branch from a94dbc3 to 3567c1c Compare January 22, 2021 13:57
@hornc
Copy link
Collaborator

hornc commented Feb 1, 2021

@dherbst thanks for redoing this. Took me a while to update my Docker images, but I have been able to check this out locally and test. Here are the commands + URLs I used

For bulk imports I used the bulk import bot script https://github.com/internetarchive/openlibrary-bots/blob/master/ia-bulkmarc-bot/bulk-import.py

./bulk-import.py -l -o9068136 -n1 -f BooksAll.2016.part43.utf8  marc_loc_2016

(-l is to point to localhost:8080, -o is the MARC offset, from file BooksAll.2016.part43.utf8 on item https://archive.org/download/marc_loc_2016)

./bulk-import.py -l -o0 -n5 -f dnb_all_dnbmarc_20200615-2.mrc  marc_dnb_202006

This imports the first 5 records from file dnb_all_dnbmarc_20200615-2.mrc on item marc_dnb_202006.

You can use the OL client to test the import URLs directly, although logging in locally is a bit of a fiddle (I can't remember if the client provides a convenient way to target localhost yet):

from olclient.openlibrary import OpenLibrary
ol = OpenLibrary()
from collections import namedtuple
Credentials = namedtuple('Credentials', ['username', 'password'])
local_dev = 'http://localhost:8080'
c = Credentials('openlibrary@example.com', 'admin123')
ol = OpenLibrary(base_url=local_dev, credentials=c)

r = ol.session.post(url="%s/api/import/ia" % ol.base_url, data={'identifier': 'youngardizzoneau0000ardi'})

The above tests a single IA MARC import.

I received the (good + expected) response:

>>> r = ol.session.post(url="%s/api/import/ia" % ol.base_url, data={'identifier': 'youngardizzoneau0000ardi'})
>>> r
<Response [200]>
>>> r.text
'{"authors": [{"key": "/authors/OL15A", "name": "Edward Ardizzone", "status": "created"}], "success": true, "edition": {"key": "/books/OL15M", "status": "created"}, "work": {"key": "/works/OL7W", "status": "created"}}'
>>> 

The wiki documentation of these endpoints is at https://github.com/internetarchive/openlibrary/wiki/Endpoints#importing

Also tested an IA import with a 404 response:

>>> r = ol.session.post(url="%s/api/import/ia" % ol.base_url, data={'identifier': 'not-there'}) 
>>> r
<Response [400]>
>>> r.text
'{"success": false, "error_code": "invalid-ia-identifier", "error": "not-there not found"}'

Which is as expected.

And an import attempt of an IA record which only has MARC XML:

>>> r = ol.session.post(url="%s/api/import/ia" % ol.base_url, data={'identifier': 'bdrc-W8LS18730'})
>>> r
<Response [200]>
>>> r.text
'{"success": true, "edition": {"key": "/books/OL16M", "status": "created"}, "work": {"key": "/works/OL8W", "status": "created"}}'

Also good. 👍

Testing Local SHA: version 3567c1c

openlibrary/catalog/get_ia.py Outdated Show resolved Hide resolved
hornc
hornc previously requested changes Feb 1, 2021
Copy link
Collaborator

@hornc hornc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR looks good, I've tested the IA individual binary MARC and XML only imports, as well as the bulk MARC path, and the imports behave as expected with requests -- great work!

The return '<!--' in urlopen_keep_trying(IA_DOWNLOAD_URL + loc).content line needs fixing (see comment above), but that is deprecated code, and I'm not sure how often it triggers. => b'<!--' (or use .text which is possibly better)

Once that is tidied, this can be merged.

@dherbst dherbst requested review from hornc and cclauss February 1, 2021 12:42
Copy link
Contributor

@cclauss cclauss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@cclauss cclauss dismissed hornc’s stale review February 1, 2021 13:01

Changed from requests.get().content to requests.get().text

@cclauss cclauss merged commit fad4a40 into internetarchive:master Feb 1, 2021
@cclauss
Copy link
Contributor

cclauss commented Feb 1, 2021

@dherbst and @hornc Thanks much for this!

Can you please keep the collaboration going in an effort to use requests in these remaining marc-related files?

  • openlibrary/catalog/marc/download.py
  • openlibrary/views/showmarc.py
  • scripts/z3950_view.py

A subtask of #2852

Sabreen-Parveen pushed a commit to Sabreen-Parveen/openlibrary that referenced this pull request Feb 5, 2021
…tarchive#4436)

* Move from urllib to requests for 2852

* Use string instead file object.

* Use named headers in call for documentation.

* Use the raw HTTPRequest when you need to read the response like a file.

* Use bytes if reading binary.

* Return bytes when needed.

* Remove TODO, and mention where this is called.

* Correct to use .text for string comparison.
Sabreen-Parveen pushed a commit to Sabreen-Parveen/openlibrary that referenced this pull request Feb 5, 2021
…tarchive#4436)

* Move from urllib to requests for 2852

* Use string instead file object.

* Use named headers in call for documentation.

* Use the raw HTTPRequest when you need to read the response like a file.

* Use bytes if reading binary.

* Return bytes when needed.

* Remove TODO, and mention where this is called.

* Correct to use .text for string comparison.
Sabreen-Parveen pushed a commit to Sabreen-Parveen/openlibrary that referenced this pull request Feb 5, 2021
…tarchive#4436)

* Move from urllib to requests for 2852

* Use string instead file object.

* Use named headers in call for documentation.

* Use the raw HTTPRequest when you need to read the response like a file.

* Use bytes if reading binary.

* Return bytes when needed.

* Remove TODO, and mention where this is called.

* Correct to use .text for string comparison.
Sabreen-Parveen pushed a commit to Sabreen-Parveen/openlibrary that referenced this pull request Feb 5, 2021
…tarchive#4436)

* Move from urllib to requests for 2852

* Use string instead file object.

* Use named headers in call for documentation.

* Use the raw HTTPRequest when you need to read the response like a file.

* Use bytes if reading binary.

* Return bytes when needed.

* Remove TODO, and mention where this is called.

* Correct to use .text for string comparison.
Sabreen-Parveen pushed a commit to Sabreen-Parveen/openlibrary that referenced this pull request Feb 5, 2021
…tarchive#4436)

* Move from urllib to requests for 2852

* Use string instead file object.

* Use named headers in call for documentation.

* Use the raw HTTPRequest when you need to read the response like a file.

* Use bytes if reading binary.

* Return bytes when needed.

* Remove TODO, and mention where this is called.

* Correct to use .text for string comparison.
Sabreen-Parveen pushed a commit to Sabreen-Parveen/openlibrary that referenced this pull request Feb 15, 2021
…tarchive#4436)

* Move from urllib to requests for 2852

* Use string instead file object.

* Use named headers in call for documentation.

* Use the raw HTTPRequest when you need to read the response like a file.

* Use bytes if reading binary.

* Return bytes when needed.

* Remove TODO, and mention where this is called.

* Correct to use .text for string comparison.
@mekarpeles mekarpeles added python Pull requests that update Python code and removed Python 3: bytes vs. str labels Sep 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
python Pull requests that update Python code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants