[MRG] download original file formats from Dataverse #1242 #1253
Dataverse creates plain-text, preservation-friendly copies of certain file formats (some of which are proprietary, such as Stata or SPSS) and this .tab (tab-separated) file is downloaded unless you supply `format=original`, which is what this pull request does. The original filename (e.g. foo.dta, a Stata file) comes from `originalFileName`, which is only populated when the preservation copy (e.g. foo.tab) has been successfully created. Additional variables were created to distinguish between `filename`, `original_filename`, and `filename_with_path`. If `original_filename` is available, it's the right one to use. To allow the tests to continue passing, the query parameters are now removed so just the file id can be cast as an int.
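For readers skimming the thread, the three ideas in the description can be sketched roughly as below. This is only an illustration, not the code in this PR: the helper names, the `/api/access/datafile/` URL layout, and the shape of the file metadata dict (`label`, `dataFile.originalFileName`) are assumptions made for the example.

```python
from urllib.parse import urlsplit


def download_url(host_url, file_id):
    """Build a download URL that asks for the originally uploaded file.

    Hypothetical helper. Without format=original, Dataverse serves the
    preservation copy (e.g. foo.tab) rather than the upload (e.g. foo.dta).
    """
    return f"{host_url}/api/access/datafile/{file_id}?format=original"


def choose_filename(file_entry):
    """Prefer the original filename when Dataverse has populated it.

    file_entry is an assumed metadata dict. originalFileName is only
    present once the preservation copy has been created, so fall back
    to the regular filename otherwise.
    """
    filename = file_entry["label"]
    original_filename = file_entry.get("dataFile", {}).get("originalFileName")
    return original_filename or filename


def file_id_from_url(url):
    """Drop the query string so the trailing path segment casts to int."""
    # urlsplit(...).path never includes "?format=original"
    return int(urlsplit(url).path.rsplit("/", 1)[-1])
```

With these helpers, a file whose metadata carries `originalFileName` would be saved as `foo.dta`, while a file that was never converted keeps its existing name.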
Thanks for submitting your first pull request! You are awesome! 🤗
@betatim hi! Over in Discourse it was suggested to me to reach out to existing maintainers who have particular knowledge in the area and since you encouraged me (thanks!) to make this PR and merged the original Dataverse content provider PR (#739), I thought I'd start with you. What do you think? Does the PR make sense? Any questions? Thanks!! ❤️ ❤️ ❤️
@pdurbin this looks sensible to me. The linter's upset and the autofix bot didn't run because it doesn't have permission to push to IQSS (PRs from orgs don't usually grant maintainers edit access, unlike PRs from personal forks). If you run
--- a/tests/unit/contentproviders/test_dataverse.py
+++ b/tests/unit/contentproviders/test_dataverse.py
@@ -4,8 +4,8 @@ import re
from io import BytesIO
from tempfile import TemporaryDirectory
from unittest.mock import patch
-from urllib.request import Request, urlopen
from urllib.parse import urlsplit
+from urllib.request import Request, urlopen
@minrk thanks for approving! Yes, I was able to fix the isort problem in 48f4cc6 with It gave the following output:
I see now that the docs are pretty clear about pre-commit: https://repo2docker.readthedocs.io/en/2022.10.0/contributing/contributing.html#code-formatting Sorry for missing that! It looks like the pre-commit CI test passed. 🎉
Thanks!
@minrk thanks for merging! Can you please give me a sense of when this will be available on mybinder.org? I poked around in the docs and found https://repo2docker.readthedocs.io/en/latest/contributing/tasks.html#creating-a-release that says, "We make a release of whatever is on main every month." However, it seems like actual releases may be a bit less frequent, which is understandable (Dataverse certainly doesn't release every month!). https://repo2docker.readthedocs.io/en/latest/changelog.html gives me a sense of the cadence:
Hmm, and I'm guessing this is just the repo2docker side. I'm not sure how often mybinder.org pulls in releases from repo2docker. I'm asking because once this "download the original format" fix is in production at mybinder.org, I'll make an announcement to the Dataverse community (the initial announcement about Binder support was in February) and mention it in our next release notes. Thanks.
Mybinder.org pulls in repo2docker updates automatically. Should be deployed later today or tomorrow.
It is now deployed. Thanks!
@minrk thanks! I've been playing around with it. In the Dataverse UI, we show the .tab (archival) version. If you click the "Binder" button, a container spins up, and we can see that the original version (.dta for a Stata file) has been downloaded in Binder! Time to play with that data! 🎉
I tested it with a random dataset that has Stata files (.dta):
Without this fix you get `country_groups.tab` (tab separated), for example.