[MRG] download original file formats from Dataverse #1242 #1253

Merged
minrk merged 2 commits into jupyterhub:main on Mar 29, 2023

Conversation

pdurbin
Contributor

@pdurbin pdurbin commented Mar 10, 2023

Dataverse creates plain-text, preservation-friendly copies of certain file formats (some of which are proprietary, such as Stata or SPSS). By default, this .tab (tab-separated) copy is what gets downloaded. This pull request supplies format=original so that the original file is downloaded instead.

The original filename (e.g. foo.dta, a Stata file) comes from originalFileName, which is only populated when the preservation copy (e.g. foo.tab) has been successfully created.

Additional variables were created to distinguish between filename, original_filename, and filename_with_path. If original_filename is available, it's the right one to use.
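
As a rough sketch of the logic described above (not the actual diff in this pull request), the filename selection and download URL could look like the following. The helper name `download_target`, the example host, and the shape of the file entry (`label`, `directoryLabel`, `dataFile.id`, `dataFile.originalFileName`) are assumptions made for illustration.

```python
import posixpath


def download_target(host_url, file_entry):
    """Return (url, filename_with_path) for one file entry of a dataset.

    Hypothetical helper: the entry layout is assumed, not taken from the PR.
    """
    data_file = file_entry["dataFile"]

    # Without format=original, the tab-separated preservation copy is served.
    url = f"{host_url}/api/access/datafile/{data_file['id']}?format=original"

    # Start from the displayed name (e.g. foo.tab) ...
    filename = file_entry["label"]
    # ... and prefer the original name (e.g. foo.dta) when it is populated,
    # which only happens once the preservation copy has been created.
    original_filename = data_file.get("originalFileName")
    if original_filename:
        filename = original_filename

    # posixpath keeps forward slashes regardless of the host OS.
    filename_with_path = posixpath.join(
        file_entry.get("directoryLabel", ""), filename
    )
    return url, filename_with_path
```

A file that was never converted has no `originalFileName`, so in this sketch it simply keeps its displayed name.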

To allow the tests to continue passing, the query parameters are now removed so just the file id can be cast as an int.
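
A minimal sketch of that idea, assuming the download URL ends in the file id followed by the query string. The function name `file_id_from_url` and the example URL are hypothetical and are not taken from the test suite.

```python
from urllib.parse import urlsplit


def file_id_from_url(url):
    """Extract the numeric file id from a download URL, ignoring the query."""
    # urlsplit separates the path from "?format=original", so the last path
    # segment is just the id and can be cast to int safely.
    path = urlsplit(url).path
    return int(path.rsplit("/", 1)[-1])


print(file_id_from_url(
    "https://dataverse.example.org/api/access/datafile/42?format=original"
))  # 42
```

Casting the last segment of the raw URL instead would fail, because `int("42?format=original")` raises `ValueError`.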

I tested it with a random dataset that has Stata files (.dta):

beamish:repo2docker pdurbin$ repo2docker doi:10.7910/DVN/IVLEHB
beamish:repo2docker pdurbin$ 
beamish:repo2docker pdurbin$ docker exec -it inspiring_almeida /bin/bash
pdurbin@521a8dd1e1cc:~$ ls -1
'Codebook for Relational UNFCCC Data.pdf'
country_groups.dta
ENB_relationships.dta
statements_count.dta
unfccc_ratification.dta
'Variable and value labels.xlsx'
pdurbin@521a8dd1e1cc:~$ 

Without this fix you get country_groups.tab (tab separated), for example.

@welcome

welcome bot commented Mar 10, 2023

Thanks for submitting your first pull request! You are awesome! 🤗

If you haven't done so already, check out Jupyter's Code of Conduct. Also, please make sure you followed the pull request template, as this will help us review your contribution more quickly.
You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! 👋

Welcome to the Jupyter community! 🎉

@pdurbin pdurbin changed the title download original file formats from Dataverse #1242 [MRG] download original file formats from Dataverse #1242 Mar 10, 2023
@pdurbin
Contributor Author

pdurbin commented Mar 28, 2023

@betatim hi! Over in Discourse it was suggested to me to reach out to existing maintainers who have particular knowledge in the area and since you encouraged me (thanks!) to make this PR and merged the original Dataverse content provider PR (#739), I thought I'd start with you.

What do you think? Does the PR make sense? Any questions? Thanks!! ❤️ ❤️ ❤️

@minrk
Member

minrk commented Mar 29, 2023

@pdurbin this looks sensible to me. The linter's upset and the autofix bot didn't run because it doesn't have permission to push to IQSS (PRs from orgs don't usually grant maintainers edit access, unlike PRs from personal forks).

If you run pre-commit install and pre-commit run --all-files and commit the results it should be happy. It's only isort that needs appeasement, if you want to do it by hand (but the reason we use autoformatters is to not think about these things):

--- a/tests/unit/contentproviders/test_dataverse.py
+++ b/tests/unit/contentproviders/test_dataverse.py
@@ -4,8 +4,8 @@ import re
 from io import BytesIO
 from tempfile import TemporaryDirectory
 from unittest.mock import patch
-from urllib.request import Request, urlopen
 from urllib.parse import urlsplit
+from urllib.request import Request, urlopen

@pdurbin
Contributor Author

pdurbin commented Mar 29, 2023

@minrk thanks for approving!

Yes, I was able to fix the isort problem in 48f4cc6 with pre-commit run --all-files.

It gave the following output:

(venv) pdurbin@air repo2docker % pre-commit run --all-files
[INFO] Initializing environment for https://github.com/asottile/pyupgrade.
[INFO] Initializing environment for https://github.com/psf/black.
[INFO] Initializing environment for https://github.com/pycqa/isort.
[INFO] Initializing environment for https://github.com/pre-commit/mirrors-prettier.
[INFO] Initializing environment for https://github.com/pre-commit/mirrors-prettier:prettier@3.0.0-alpha.6.
[INFO] Installing environment for https://github.com/asottile/pyupgrade.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
[INFO] Installing environment for https://github.com/psf/black.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
[INFO] Installing environment for https://github.com/pycqa/isort.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
[INFO] Installing environment for https://github.com/pre-commit/mirrors-prettier.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
pyupgrade................................................................Passed
black....................................................................Passed
isort....................................................................Failed
- hook id: isort
- files were modified by this hook

Fixing /Users/pdurbin/github/jupyterhub/repo2docker/tests/unit/contentproviders/test_dataverse.py

prettier.................................................................Passed

I see now that the docs are pretty clear about pre-commit: https://repo2docker.readthedocs.io/en/2022.10.0/contributing/contributing.html#code-formatting

Sorry for missing that!

It looks like the pre-commit CI test passed. 🎉

@minrk minrk merged commit 43ff7bb into jupyterhub:main Mar 29, 2023
@welcome

welcome bot commented Mar 29, 2023

Congrats on your first merged pull request in this project! 🎉
Thank you for contributing, we are very proud of you! ❤️

@minrk
Member

minrk commented Mar 29, 2023

Thanks!

@pdurbin
Contributor Author

pdurbin commented Mar 29, 2023

@minrk thanks for merging!

Can you please give me a sense of when this will be available on mybinder.org?

I poked around in the docs and found https://repo2docker.readthedocs.io/en/latest/contributing/tasks.html#creating-a-release that says, "We make a release of whatever is on main every month."

However, it seems like actual releases may be a bit less frequent, which is understandable (Dataverse certainly doesn't release every month!). https://repo2docker.readthedocs.io/en/latest/changelog.html gives me a sense of the cadence:

  • Version 2022.10.0
  • Version 2022.02.0
  • Version 2021.08.0
  • Version 2021.03.0
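
For a rough sense of that cadence, the gaps between the versions listed above can be computed. This sketch assumes the version strings encode the year and month of each release; the function names are made up for illustration.

```python
# Versions from the changelog list above, oldest first.
versions = ["2021.03.0", "2021.08.0", "2022.02.0", "2022.10.0"]


def month_index(version):
    """Turn 'YYYY.MM.patch' into a running month count."""
    year, month, _patch = (int(part) for part in version.split("."))
    return year * 12 + month


def release_gaps(versions):
    """Months elapsed between each pair of consecutive releases."""
    stamps = [month_index(v) for v in versions]
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


print(release_gaps(versions))  # [5, 6, 8]
```

Under that assumption, the gaps are five to eight months rather than one.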

Hmm, and I'm guessing this is just the repo2docker side. I'm not sure how often mybinder.org pulls in releases from repo2docker.

I'm asking because once this "download the original format" fix is in production at mybinder.org, I'll make an announcement to the Dataverse community (the initial announcement about Binder support was in February) and mention it in our next release notes.

Thanks.

@minrk
Member

minrk commented Mar 29, 2023

Mybinder.org pulls in repo2docker updates automatically. Should be deployed later today or tomorrow.

@minrk
Member

minrk commented Mar 29, 2023

It is now deployed. Thanks!

@pdurbin
Contributor Author

pdurbin commented Mar 29, 2023

@minrk thanks! I've been playing around with it.

Going back to this country_groups.dta (original) vs country_groups.tab (archival), Binder is now downloading the original version from that dataset.

In the Dataverse UI, we show the .tab (archival) version. If you click the "Binder" button...

[Screenshot: Relational Data between Parties to the UN Framework Convention on Climate Change]

... a container spins up...

[Screenshot: Dataverse 10 7910_DVN_IVLEHB]

... and we can see that the original versions (.dta for a Stata file) have been downloaded in Binder!

Time to play with that data! 🎉

[Screenshot: JupyterLab (auto-E)]

Successfully merging this pull request may close these issues.

Dataverse content provider: download files in original format
3 participants