[MRG] Dataverse content provider #739
Conversation
Force-pushed from 7430652 to 3b84c52
Do we want to allow …? Is it "standard" to do something like …?
I'm certainly not an expert in this field, but there are other types of persistent digital object identifiers (DOI is kind of a loaded term...); one example is handle.net (e.g. hdl:1902.1/12379). They have a similar structure, as you can see.
With Zenodo you just got lucky ;) I don't know if that's the best example, because doi:1902.1/12379 and 1902.1/12379 work too, but I'm sure @pdurbin can give us some examples.
I am now thinking that I misunderstood what you are proposing :) I thought you were suggesting that we should make … all seem to be legal and "do the right thing". Which is pretty cool actually :) Now I'm thinking that we should allow both forms and encourage people to use the one with the "prefix" when there could be ambiguity (is it true that no prefix means "use the default", and the default is "doi:"?). I will try and take a look at the code tomorrow.
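For illustration, here is a minimal sketch of accepting all three input forms discussed above (bare DOI, `doi:` prefix, and resolver URL). This is a hypothetical helper, not repo2docker's actual `normalize_doi`:

```python
import re

def strip_doi_prefix(doi):
    """Strip an optional "doi:" prefix or resolver URL so that
    "doi:10.7910/DVN/TJCLKP", "10.7910/DVN/TJCLKP" and
    "https://doi.org/10.7910/DVN/TJCLKP" all yield the bare DOI."""
    doi = doi.strip()
    # drop a resolver URL prefix, e.g. https://doi.org/ or http://dx.doi.org/
    doi = re.sub(r"^https?://(dx\.)?doi\.org/", "", doi)
    # drop a "doi:" scheme prefix, case-insensitively
    doi = re.sub(r"^doi:", "", doi, flags=re.IGNORECASE)
    return doi
```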
Yes. The craziest DOI I'm aware of is https://doi.org/10.1002/(sici)1099-1409(199908/10)3:6/7<672::aid-jpp192>3.0.co;2-8 That said, in practice, I don't think you'll see DOIs in the wild like this very often, thankfully. 😄

@betatim mostly I dropped by to say thanks for taking a look at this pull request!! 🎉 🎉 🎉 (And thank you to @Xarthisius for making it, of course!! 🎉 🎉 🎉 ) I have some thoughts on URLs to pass to mybinder.org (what's possible today from "external tools" in Dataverse) but I decided to throw it into a comment in our "Binderverse" issue for now: IQSS/dataverse#4714 (comment) . I'm wondering where to have the conversation about Binder URLs. I assume the BinderHub repo but wherever is fine with me. Please let me know. 😄

@Xarthisius is aware that Dataverse supports an alternative to DOI called Handle. Collectively, people seem to refer to DOIs, Handles, and other identifiers as Persistent IDs (PIDs). There's even a conference called https://pidapalooza.org some colleagues of mine attended earlier in the year, and a new forum at https://www.pidforum.org where we can ask questions about PIDs. http://hdl.handle.net/11529/10016 is an example of what a Handle might look like. When you click it, it should resolve to a landing page on an installation of Dataverse that looks like the screenshot below. I don't have the numbers on the 46 installations of Dataverse, but I'd say most of them use DOIs rather than Handles.

I can't emphasize enough how thrilled I am about this pull request. A few weeks ago I gave a demo of launching a Jupyter Notebook from a dataset in Dataverse using Whole Tale, and while I did touch on Binder in my talk, I'd love to do a similar demo with Binder in the future. You'll find a Binder screenshot and a transcript in the write-up of my talk: https://scholar.harvard.edu/pdurbin/blog/2019/jupyter-notebooks-and-crazy-ideas-for-dataverse
After a quick browse on the train to EuroPython: LGTM 🎉! A few questions for my education, maybe we can turn some into comments in the code so people from the future find them.
For the code formatting pedant in me: can we stick to …? I look forward to tests and refactoring. (Will the …?)
Yes! Once you know the hostname of the potential Dataverse installation (due to the DOI resolving to it), you can check if the host indeed is running Dataverse by going to /api/info/version like this: https://demo.dataverse.org/api/info/version The expected response is something like this:
Obviously, this could be some other software but I suspect it's enough of a fingerprint to be pretty sure it's Dataverse. 😄
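Since the response body was not captured above, here is a hedged sketch of how that fingerprint check might look once the JSON from `/api/info/version` has been fetched and parsed. The exact response shape (`status`/`data`/`version` keys) is an assumption based on typical Dataverse API responses, not taken verbatim from this thread:

```python
def looks_like_dataverse(info):
    """Heuristic check on the parsed JSON from /api/info/version.

    A Dataverse installation is assumed to answer with something like
    {"status": "OK", "data": {"version": "4.15.1", "build": "..."}}.
    """
    return (
        isinstance(info, dict)
        and info.get("status") == "OK"
        and isinstance(info.get("data"), dict)
        and "version" in info["data"]
    )
```

In practice you would fetch the URL first (e.g. with `urlopen`) and feed the decoded JSON into this check.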
One of the things to look out for here is that Dataverse installations can limit the size of the zip file that is constructed on the fly based on this (potentially long) list of file IDs. The default is 100 MB, as mentioned at http://guides.dataverse.org/en/4.15.1/installation/config.html#zipdownloadlimit . If you can't get all the files at once, you could switch to downloading them individually.
Needs an update if #788 is merged; it has an overlapping refactoring into an abstract DOI-provider class.
@nuest hi! Now that your pull request has been merged, are you able to help me and @Xarthisius resolve merge conflicts in this one? Any help would be much appreciated!! 🎉 😄
Force-pushed from 5c68444 to bf1a5a1
@Xarthisius thanks for resolving those merge conflicts!! 🎉 🎉 🎉 One thing I noticed over at pull request #788 is that @nuest had "WIP" in the title of his pull request initially, but @betatim encouraged him to switch it to "MRG" at #788 (comment), saying, "We also change the title from WIP to MRG when it is ready for reviewing (and then maybe merge)." So should we switch the title to MRG? Are you ready for review? If not, is there any help you need? It looks like the entire process is documented at https://repo2docker.readthedocs.io/en/latest/contributing/contributing.html#guidelines-to-getting-a-pull-request-merged and here's a screenshot with the MRG bit highlighted in blue: Thanks!!!
Thank you @pdurbin, I'm aware of the r2d contributing process. This PR is still a WIP because it's literally a work in progress. If you could kindly read this PR's description, you'll notice a TODO list that I'd like to get done before switching the status to MRG. However, if you want to push it across the finish line faster, by all means, just take the code and continue working on it.
@Xarthisius perfect! Thanks for explaining! I was completely unaware of the repo2docker contribution process before this week, but it makes sense that because you build on top of this project you know all about it. Probably the best way for me to help is with advocating for the "add config to test Dataverse support in repo2docker" issue over at IQSS/dataverse-jenkins#13 Additionally, I could put a call out on the dataverse-community mailing list to get you some help on the code itself. Just say the word and I'll put out the call! Thanks!!!
Force-pushed from bf1a5a1 to 3852f7c
Force-pushed from ed4b401 to 063fd4d
I believe it's now in a "reviewable" state. I significantly reduced the complexity of …
Just a few questions and comments from my side.
Thanks for merging the upstream changes. It's great to see more data sources coming :-).
```python
search_url = urlunparse(
    parsed_url._replace(path="/api/search", query=search_query)
)
resp = self.urlopen(search_url).read()
```
Suggest to `log.debug` the URL before you're calling it; it could be important error-finding information for users.
```python
self.log.debug(" - resp = " + json.dumps(data))
return

self.record_id = deep_get(data, "items.0.dataset_persistent_id")
```
Could there be a case where more than 1 item will be returned? Should you warn about that here?
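To illustrate the suggested warning, a small sketch (the `pick_dataset` helper and the `log` parameter are hypothetical; the `items` and `dataset_persistent_id` field names come from the diff above):

```python
def pick_dataset(data, log=None):
    """Return the persistent ID of the first search hit, warning if the
    search API returned more than one match."""
    items = data.get("items", [])
    if not items:
        return None
    if len(items) > 1 and log is not None:
        log.warning("Search returned %d datasets; using the first one", len(items))
    return items[0].get("dataset_persistent_id")
```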
```python
self.record_id = deep_get(data, "items.0.dataset_persistent_id")
elif is_doi(doi):
    self.record_id = "doi:" + normalize_doi(doi)
```
In the other DOI-supporting content providers we only store the repository-specific identifier here, not the full DOI. Can you briefly confirm that this doesn't work here because multiple Dataverse installations can have different DOI schemes?
@nuest Dataverse supports both DOIs and Handles and collectively we refer to these as "persistent identifiers" or PIDs.
- DOI example: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TJCLKP
- Handle example: https://data.cimmyt.org/dataset.xhtml?persistentId=hdl:11529/10016
Does this help?
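Both example landing URLs above carry the PID in a `persistentId` query parameter, so extracting it works the same way for DOI- and Handle-based installations. A small standalone sketch:

```python
from urllib.parse import urlparse, parse_qs

def extract_pid(url):
    """Pull the persistentId query parameter from a Dataverse landing
    page URL. Returns None if the parameter is absent."""
    query = parse_qs(urlparse(url).query)
    return query.get("persistentId", [None])[0]
```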
Ah OK, I see. Thanks!
```python
# and then receive a zip file containing all of them.
# TODO: Dataverse has a limit for the zipfile size (see
# https://github.com/jupyter/repo2docker/pull/739#issuecomment-510834729)
# If the size of the dataset is greater than 100MB, individual files should be downloaded.
```
Suggest opening an issue for that in repo2docker and removing the TODO here.
OTOH, why not only implement an individual file download?
I might be wrong here, but I got the impression that a scenario with a lot of small files is way more popular than just a few big files case in Dataverse repositories. Going with that hunch, I'd rather see one API call than a bazillion calls.
Let me try to explain the pros and cons of downloading files individually from a Dataverse dataset vs. downloading them as a zip.

Pros of downloading files individually

- Easier on Dataverse servers that store files on S3, because a "redirect" feature can be enabled such that files are downloaded directly from S3 rather than passing through Dataverse servers.
- Easier on Dataverse servers because they do not have to create a zip file on the fly.

Cons of downloading files individually

- In order to know the file hierarchy, you have to parse metadata about the files and look for `directoryLabel` to get path/to/file information. (There can be other potentially interesting metadata about files in there, by the way, such as descriptions.)
- Lots of requests. Bazillion calls.

Pros of downloading zip files

- Single request.
- Easier to deal with a single file. Just unpack it. The file hierarchy is in the zip file.

Cons of downloading zip files

- May crash Dataverse servers, which have to construct the zip on the fly. Also, when not using the S3 redirect option mentioned above, files pass through Glassfish and Apache.

I hope this helps! I might be forgetting a point or two.
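As a sketch of the individual-download path, here is how the `directoryLabel` handling described above might look. The `label` and `directoryLabel` field names are taken from this discussion; this is an illustration, not the PR's actual implementation:

```python
import os

def local_path(file_meta):
    """Join directoryLabel (if present) with the file name so that the
    dataset's folder hierarchy is reproduced locally."""
    name = file_meta["label"]
    directory = file_meta.get("directoryLabel", "")
    return os.path.join(directory, name) if directory else name
```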
Thanks for the elaborate description! That's very helpful.
Re. the `directoryLabel`: that is what is done now in the Zenodo and Figshare providers, although I don't think directories are even possible in those repositories; see how `host["download"]` and `host["filename"]` are used in https://github.com/jupyter/repo2docker/blob/master/repo2docker/contentproviders/doi.py#L53
@betatim and @minrk should comment on what is preferable for BinderHub operators.
From a user perspective, I tend towards downloading files individually, because I want things not to break (e.g. if the ZIP would be larger than 1 GB) and don't care about load on the servers. The comment says "Dataverse has a limit for the zipfile size", so I think there is no way around implementing individual download (maybe as a second step if the zipfile download fails), unless you'd be fine with not supporting large repos.
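That two-step decision could be sketched like this. The 100 MB default comes from the `:ZipDownloadLimit` discussion above; the `filesize` field name is an assumption:

```python
ZIP_DOWNLOAD_LIMIT = 100 * 1024 * 1024  # Dataverse's default :ZipDownloadLimit

def use_zip_download(files, limit=ZIP_DOWNLOAD_LIMIT):
    """Only request the on-the-fly zip when the total dataset size stays
    under the server's limit; otherwise fall back to individual downloads."""
    total = sum(f.get("filesize", 0) for f in files)
    return total < limit
```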
I mean, if it's easier to implement zip download first just to get something working, that's fine with me. It should work fine with datasets with small files. 😄 As you see, I'm just trying to explain the pitfalls of creating zip files on the fly from the Dataverse server perspective. (And it's a bad experience for users if the zip file isn't complete, if it only has 5 of the 17 files the user wants because they exceeded the zip size limit.) So yes, downloading individual files, even if it's a bazillion requests, is probably safer.
I reimplemented `fetch` to download individual files.
```python
@property
def content_id(self):
    """The Dataverse persistent identifier (could use internal dataset_id too)."""
```
If you stick to the `doi:` prefix, I suggest mentioning that here. Also, remove the "could use internal dataset_id" note if it is not used eventually; see comment above.
@Xarthisius @nuest I left a couple comments just now. I hope they're helpful. Please let me know if you have any questions! Thanks!
Force-pushed from dc23326 to 30375d1
```python
    None,
)
if host is None:
    return
```
To check that I parsed the above lines properly (the use of `next()` to handle the "didn't work" case made me read it a few times): is the following code equivalent (I think yes)?

```python
for host_ in self.hosts:
    if urlparse(host_["url"]).netloc == parsed_url.netloc:
        host = host_
        break
else:
    return
```
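The two patterns can be compared in isolation; a standalone sketch with a made-up `hosts` list (the real list lives in the content provider):

```python
from urllib.parse import urlparse

hosts = [
    {"url": "https://demo.dataverse.org"},
    {"url": "https://dataverse.harvard.edu"},
]
parsed_url = urlparse("https://dataverse.harvard.edu/dataset.xhtml")

# generator + next() with a default, as in the diff under review
host_next = next(
    (h for h in hosts if urlparse(h["url"]).netloc == parsed_url.netloc),
    None,
)

# explicit for/else loop, as proposed in the comment; the else branch
# runs only when the loop finishes without hitting break
host_loop = None
for h in hosts:
    if urlparse(h["url"]).netloc == parsed_url.netloc:
        host_loop = h
        break

assert host_next is host_loop  # both find the same entry
```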
LGTM! Not quite sure why codecov is (that) unhappy. Merging this now. Thanks for the work and then coming back to finish it off :) Two more todos for a quick PR: …
Working on the docs update PR... |
[MRG] Dataverse content provider
This PR adds support for Dataverse datasets as a content provider for repo2docker, e.g.:
It's very much a WIP, but I'd like to solicit feedback at this point on whether this is a good way to go. Related to IQSS/dataverse#4714
Since it's very similar to the Zenodo provider, I abstracted parts of it into a base `DOIHandler` class. Additional refactoring is probably still possible, as I duplicated a lot of code.

TODO:

- Refactor `fetch` to remove code duplicated with the Zenodo content provider.