[MRG] download original file formats from Dataverse #1242 #1253

pdurbin · 2023-03-10T21:51:59Z

Dataverse creates plain-text, preservation-friendly copies of certain file formats (some of which are proprietary, such as Stata or SPSS). This .tab (tab-separated) copy is what gets downloaded unless you supply format=original. This pull request supplies that parameter so the original file is downloaded instead.

The original filename (e.g. foo.dta, a Stata file) comes from originalFileName, which is only populated when the preservation copy (e.g. foo.tab) has been successfully created.

Additional variables were created to distinguish between filename, original_filename, and filename_with_path. If original_filename is available, it's the right one to use.
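As a rough illustration of the filename logic described above, here is a minimal sketch, not the code in this pull request. The helper names (`choose_filename`, `original_download_url`) are hypothetical, and the metadata keys `label`, `directoryLabel`, and `dataFile`, as well as the `/api/access/datafile/` path, are assumptions about the shape of the Dataverse response; only `originalFileName` and `format=original` come from the description.

```python
import posixpath


def choose_filename(file_entry):
    """Return (filename, original_filename, filename_with_path) for one file.

    `file_entry` is assumed to be one item of a dataset's file list, with
    hypothetical keys `label`, `directoryLabel`, and `dataFile`.
    """
    data_file = file_entry.get("dataFile", {})
    # Default name: for ingested tabular files this is the .tab copy.
    filename = file_entry.get("label") or data_file.get("filename")
    # originalFileName is only present when a preservation copy was created.
    original_filename = data_file.get("originalFileName")
    if original_filename:
        filename = original_filename
    directory = file_entry.get("directoryLabel")
    filename_with_path = posixpath.join(directory, filename) if directory else filename
    return filename, original_filename, filename_with_path


def original_download_url(base_url, file_id):
    """Build a download URL that asks for the original format (assumed path)."""
    return f"{base_url.rstrip('/')}/api/access/datafile/{file_id}?format=original"
```

For files that were never converted, `originalFileName` is absent, so the sketch falls back to the regular name and the file is saved unchanged.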

To allow the tests to continue passing, the query parameters are now removed so just the file id can be cast as an int.
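A minimal sketch of what stripping the query parameters could look like, assuming the file id is the last path segment of the download URL. The helper name `file_id_from_url` and the example host are hypothetical; this is not necessarily how the test in this pull request is written.

```python
from urllib.parse import urlsplit


def file_id_from_url(url):
    """Extract the numeric file id from a download URL, ignoring any query string."""
    # urlsplit separates the path from "?format=original", so the query
    # parameters never reach int().
    path = urlsplit(url).path
    return int(path.rstrip("/").rsplit("/", 1)[-1])
```

Without this step, a trailing segment such as `123?format=original` would raise `ValueError` when passed to `int()`.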

I tested it with a random dataset that has Stata files (.dta):

beamish:repo2docker pdurbin$ repo2docker doi:10.7910/DVN/IVLEHB
beamish:repo2docker pdurbin$ 
beamish:repo2docker pdurbin$ docker exec -it inspiring_almeida /bin/bash
pdurbin@521a8dd1e1cc:~$ ls -1
'Codebook for Relational UNFCCC Data.pdf'
country_groups.dta
ENB_relationships.dta
statements_count.dta
unfccc_ratification.dta
'Variable and value labels.xlsx'
pdurbin@521a8dd1e1cc:~$

Without this fix you get country_groups.tab (tab separated), for example.

Closes Dataverse content provider: download files in original format #1242

welcome · 2023-03-10T21:52:01Z

Thanks for submitting your first pull request! You are awesome! 🤗

If you haven't done so already, check out Jupyter's Code of Conduct. Also, please make sure you followed the pull request template, as this will help us review your contribution more quickly.

You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! 👋

Welcome to the Jupyter community! 🎉

pdurbin · 2023-03-28T12:47:49Z

@betatim hi! Over in Discourse it was suggested to me to reach out to existing maintainers who have particular knowledge in the area and since you encouraged me (thanks!) to make this PR and merged the original Dataverse content provider PR (#739), I thought I'd start with you.

What do you think? Does the PR make sense? Any questions? Thanks!! ❤️ ❤️ ❤️

minrk · 2023-03-29T08:07:24Z

@pdurbin this looks sensible to me. The linter's upset and the autofix bot didn't run because it doesn't have permission to push to IQSS (PRs from orgs don't usually grant maintainers edit access, unlike PRs from personal forks).

If you run pre-commit install and pre-commit run --all-files and commit the results it should be happy. It's only isort that needs appeasement, if you want to do it by hand (but the reason we use autoformatters is to not think about these things):

--- a/tests/unit/contentproviders/test_dataverse.py
+++ b/tests/unit/contentproviders/test_dataverse.py
@@ -4,8 +4,8 @@ import re
 from io import BytesIO
 from tempfile import TemporaryDirectory
 from unittest.mock import patch
-from urllib.request import Request, urlopen
 from urllib.parse import urlsplit
+from urllib.request import Request, urlopen

pdurbin · 2023-03-29T09:41:16Z

@minrk thanks for approving!

Yes, I was able to fix the isort problem in 48f4cc6 with pre-commit run --all-files.

It gave the following output:

(venv) pdurbin@air repo2docker % pre-commit run --all-files
[INFO] Initializing environment for https://github.com/asottile/pyupgrade.
[INFO] Initializing environment for https://github.com/psf/black.
[INFO] Initializing environment for https://github.com/pycqa/isort.
[INFO] Initializing environment for https://github.com/pre-commit/mirrors-prettier.
[INFO] Initializing environment for https://github.com/pre-commit/mirrors-prettier:prettier@3.0.0-alpha.6.
[INFO] Installing environment for https://github.com/asottile/pyupgrade.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
[INFO] Installing environment for https://github.com/psf/black.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
[INFO] Installing environment for https://github.com/pycqa/isort.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
[INFO] Installing environment for https://github.com/pre-commit/mirrors-prettier.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
pyupgrade................................................................Passed
black....................................................................Passed
isort....................................................................Failed
- hook id: isort
- files were modified by this hook

Fixing /Users/pdurbin/github/jupyterhub/repo2docker/tests/unit/contentproviders/test_dataverse.py

prettier.................................................................Passed

I see now that the docs are pretty clear about pre-commit: https://repo2docker.readthedocs.io/en/2022.10.0/contributing/contributing.html#code-formatting

Sorry for missing that!

It looks like the pre-commit CI test passed. 🎉

welcome · 2023-03-29T10:29:58Z

Congrats on your first merged pull request in this project! 🎉

Thank you for contributing, we are very proud of you! ❤️

minrk · 2023-03-29T10:29:58Z

Thanks!

pdurbin · 2023-03-29T10:50:45Z

@minrk thanks for merging!

Can you please give me a sense of when this will be available on mybinder.org?

I poked around in the docs and found https://repo2docker.readthedocs.io/en/latest/contributing/tasks.html#creating-a-release that says, "We make a release of whatever is on main every month."

However, it seems like actual releases may be a bit less frequent, which is understandable (Dataverse certainly doesn't release every month!). https://repo2docker.readthedocs.io/en/latest/changelog.html gives me a sense of the cadence:

Version 2022.10.0
Version 2022.02.0
Version 2021.08.0
Version 2021.03.0

Hmm, and I'm guessing this is just the repo2docker side. I'm not sure how often mybinder.org pulls in releases from repo2docker.

I'm asking because once this "download the original format" fix is in production at mybinder.org, I'll make an announcement to the Dataverse community (the initial announcement about Binder support was in February) and mention it in our next release notes.

Thanks.

minrk · 2023-03-29T11:01:28Z

Mybinder.org pulls in repo2docker updates automatically. Should be deployed later today or tomorrow.

minrk · 2023-03-29T13:24:01Z

It is now deployed. Thanks!

pdurbin · 2023-03-29T14:28:44Z

@minrk thanks! I've been playing around with it.

Going back to this country_groups.dta (original) vs country_groups.tab (archival), Binder is now downloading the original version from that dataset.

In the Dataverse UI, we show the .tab (archival) version. If you click the "Binder" button...

... a container spins up...

... and we can see that the original versions (.dta for a Stata file) have been downloaded in Binder!

Time to play with that data! 🎉

pdurbin mentioned this pull request Mar 10, 2023

Dataverse content provider: download files in original format #1242

Closed

pdurbin changed the title ~~download original file formats from Dataverse #1242~~ [MRG] download original file formats from Dataverse #1242 Mar 10, 2023

pdurbin mentioned this pull request Mar 13, 2023

Fix Binder and Whole Tale (repo2docker) to download original files rather than archival .tab files IQSS/dataverse#9374

Closed

minrk approved these changes Mar 29, 2023

make isort happy with pre-commit run --all-files jupyterhub#1242

48f4cc6

minrk merged commit 43ff7bb into jupyterhub:main Mar 29, 2023

jupyterhub-bot mentioned this pull request Mar 29, 2023

Update quay.io/jupyterhub/repo2docker version to 2022.10.0-148.g43ff7bb jupyterhub/mybinder.org-deploy#2537

Merged

pdurbin mentioned this pull request Mar 29, 2023

Binder: files are now downloaded in original format (e.g. .dta instead of .tab) IQSS/dataverse#9483

Merged

yuvipanda added the enhancement label Jun 5, 2023

pdurbin mentioned this pull request Oct 8, 2023

Use/Maintain Appropriate File Formats for Preservation and Reproducibility IQSS/dataverse#6006

Closed
