
[MRG] Dataverse content provider #739

Merged: 7 commits, Sep 18, 2019

Conversation

Xarthisius
Contributor

@Xarthisius Xarthisius commented Jul 10, 2019

This PR adds support for Dataverse datasets as a content provider for repo2docker, e.g.:

repo2docker doi:10.7910/DVN/6ZXAGT
repo2docker https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/6ZXAGT/3YRRYJ

It's very much a WIP, but I'd like to solicit feedback at this point on whether this is a good way to go. Related to IQSS/dataverse#4714

Since it's very similar to the Zenodo provider, I abstracted parts of it into a base DOIHandler class. Additional refactoring is probably still possible, as I duplicated a lot of code.

TODO:

  • add tests
  • refactor fetch to remove code duplicated with the Zenodo content provider.

@betatim
Member

betatim commented Jul 11, 2019

Do we want to allow repo2docker doi:10.7910/DVN/6ZXAGT? It looks like a URL but isn't. I don't think we can make it doi:// can we/should we? Do we even need a prefix or can we keep it prefix-less? The only time we end up with a problem is when there is a directory with the same name as the DOI and the user didn't mean to use it. I'd be Ok with saying that is a rare edge case.

Is it "standard" to do something like doi:...? I have no idea :-/

@Xarthisius
Contributor Author

Xarthisius commented Jul 11, 2019

Do we want to allow repo2docker doi:10.7910/DVN/6ZXAGT? It looks like a URL but isn't. I don't think we can make it doi:// can we/should we? Do we even need a prefix or can we keep it prefix-less? The only time we end up with a problem is when there is a directory with the same name as the DOI and the user didn't mean to use it. I'd be Ok with saying that is a rare edge case.

Is it "standard" to do something like doi:...? I have no idea :-/

I'm certainly not an expert in this field, but there are other types of persistent digital object identifiers (DOI is kinda a loaded term...); one example is handle.net (e.g. hdl:1902.1/12379). They have a similar structure, as you can see (some_number_prefix/some_other_number), and they resolve the same way. Check out:

curl -IL https://doi.org/hdl:1902.1/12379

With Zenodo you just got lucky ;)

I don't know if that's the best example, because doi:1902.1/12379 and 1902.1/12379 work too, but I'm sure @pdurbin can give us some examples.

@betatim
Member

betatim commented Jul 11, 2019

I am now thinking that I misunderstood what you are proposing :) I thought you were suggesting that we should make repo2docker doi:1902.1/12379 the required way of specifying a DOI and remove repo2docker 1902.1/12379 as a legal option. Now I think you are not suggesting that, and instead pointing out that DOIs come in many flavours and that we can treat them all the same way. If that isn't right then please help me ;)

curl -IL https://doi.org/hdl:1902.1/12379
curl -IL https://doi.org/doi:1902.1/12379
curl -IL https://doi.org/1902.1/12379
curl -IL https://doi.org/doi:10.5281/zenodo.3208700
curl -IL https://doi.org/10.5281/zenodo.3208700

all seem to be legal and "do the right thing". Which is pretty cool actually :) Now I'm thinking that we should allow both forms and encourage people to use the one with a prefix when there could be ambiguity (is it true that no prefix means "use the default", and the default is "doi:"?).

I will try to take a look at the code tomorrow.

@pdurbin
Contributor

pdurbin commented Jul 11, 2019

DOIs come in many flavours

Yes. The craziest DOI I'm aware of is https://doi.org/10.1002/(sici)1099-1409(199908/10)3:6/7<672::aid-jpp192>3.0.co;2-8

That said, in practice, I don't think you'll see DOIs in the wild like this very often, thankfully. 😄

@betatim mostly I dropped by to say thanks for taking a look at this pull request!! 🎉 🎉 🎉 (And thank you to @Xarthisius for making it, of course!! 🎉 🎉 🎉 ) I have some thoughts on URLs to pass to mybinder.org (what's possible today from "external tools" in Dataverse) but I decided to throw it into a comment in our "Binderverse" issue for now: IQSS/dataverse#4714 (comment) . I'm wondering where to have the conversation about Binder URLs. I assume the BinderHub repo but wherever is fine with me. Please let me know. 😄

@Xarthisius is aware that Dataverse supports an alternative to DOIs called Handles. Collectively, people seem to refer to DOIs, Handles, and other identifiers as Persistent IDs (PIDs). There's even a conference called https://pidapalooza.org that a colleague of mine attended earlier this year, and a new forum at https://www.pidforum.org where we can ask questions about PIDs.

http://hdl.handle.net/11529/10016 is an example of what a Handle might look like. When you click it, it should resolve to a landing page on an installation of Dataverse that looks like the screenshot below.

[Screenshot: dataset landing page on a Dataverse installation, resolved from the Handle above]

I don't have the numbers on the 46 installations of Dataverse but I'd say most of them use DOIs rather than Handles.

I can't emphasize enough how thrilled I am about this pull request. A few weeks ago I gave a demo of launching a Jupyter Notebook from a dataset in Dataverse using Whole Tale, and while I did touch on Binder in my talk, I'd love to do a similar demo with Binder in the future. You'll find a Binder screenshot and a transcript in the write-up of my talk: https://scholar.harvard.edu/pdurbin/blog/2019/jupyter-notebooks-and-crazy-ideas-for-dataverse

@betatim
Member

betatim commented Jul 12, 2019

After a quick browse on the train to EuroPython: LGTM 🎉!

A few questions for my education, maybe we can turn some into comments in the code so people from the future find them.

  • the dataverse.json is needed to detect if the link we get from resolving a DOI points to a Dataverse instance, right? We should add instructions on how to get and update this file. I assume it'll need updating every few months? Is there a way to make a request to a "potential Dataverse" host to find out if it is indeed one? I checked the Server response header but that doesn't contain anything. Maybe there is a well-known endpoint like /versions or /status or some such that we could query instead of having a list that we will need to update?
  • when fetching files we build a list of file IDs we want to fetch and then receive a zip file containing all of them? I left a comment in fetch where maybe we can add a comment. I read the code a few times to convince myself that it did do that (I think), so a comment would help

For the code formatting pedant in me: can we stick to snake_case as much as possible please?

I look forward to tests and refactoring. (Will the is_doi() function keep working with doi:1234.2/3425?)
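
(For illustration, a generic matcher that tolerates an optional doi:/hdl: prefix could look like the sketch below; this is not repo2docker's actual is_doi implementation.)

import re

# Hypothetical sketch: accept "prefix/suffix" identifiers, optionally preceded
# by a doi:/hdl: scheme or a doi.org URL.
PID_PATTERN = re.compile(
    r"^(doi:|hdl:|https?://(dx\.)?doi\.org/)?\d+(\.\d+)*/\S+$", re.IGNORECASE
)

def is_pid(text):
    return PID_PATTERN.match(text) is not None

print(is_pid("doi:10.7910/DVN/6ZXAGT"))  # True
print(is_pid("1902.1/12379"))            # True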

@pdurbin
Contributor

pdurbin commented Jul 12, 2019

  • I checked the Server response header but that doesn't contain anything. Maybe there is a well-known endpoint like /versions or /status or some such that we could query instead of having a list that we will need to update?

Yes! Once you know the hostname of the potential Dataverse installation (due to the DOI resolving to it), you can check if the host indeed is running Dataverse by going to /api/info/version like this:

https://demo.dataverse.org/api/info/version

The expected response is something like this:

{
  "status": "OK",
  "data": {
    "version": "4.15.1",
    "build": "1377-701b56b"
  }
}

Obviously, this could be some other software but I suspect it's enough of a fingerprint to be pretty sure it's Dataverse. 😄
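
For reference, a minimal probe of that endpoint could look like the sketch below (looks_like_dataverse is a hypothetical helper name, not code from this PR):

import json
from urllib.request import urlopen

def looks_like_dataverse(host):
    # Hypothetical sketch: True if the host answers /api/info/version with the
    # JSON shape shown above.
    try:
        with urlopen("https://%s/api/info/version" % host, timeout=10) as resp:
            data = json.load(resp)
    except (OSError, ValueError):
        return False
    return data.get("status") == "OK" and "version" in data.get("data", {})

print(looks_like_dataverse("demo.dataverse.org"))  # expected: True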

  • when fetching files we build a list of file IDs we want to fetch and then receive a zip file containing all of them?

One of the things to look out for here is that Dataverse installations can limit the size of the zip file that is constructed on the fly based on this (potentially long) list of file IDs. The default is 100 MB as mentioned at http://guides.dataverse.org/en/4.15.1/installation/config.html#zipdownloadlimit . If you can't get all the files at once, you could switch to downloading them individually.
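
For reference, the on-the-fly zip that this limit applies to is served by the multiple-file download API, e.g. a request like https://demo.dataverse.org/api/access/datafiles/101,102,103 (the file IDs here are placeholders).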

@nuest
Contributor

nuest commented Sep 9, 2019

Needs an update if #788 is merged; it has overlapping refactoring of an abstract DOI-provider class.

@Xarthisius
Contributor Author

Needs an update if #788 is merged; it has overlapping refactoring of an abstract DOI-provider class.

Thanks for the heads up @nuest ! I'll rebase when it happens. I'd really love to finish this off, but I'm constantly swamped with other work :(

@pdurbin
Contributor

pdurbin commented Sep 12, 2019

@nuest hi! Now that your pull request has been merged, are you able to help me and @Xarthisius resolve merge conflicts in this one? Any help would be much appreciated!! 🎉 😄

@Xarthisius Xarthisius force-pushed the dataverse branch 2 times, most recently from 5c68444 to bf1a5a1 on September 12, 2019 13:54
@pdurbin
Contributor

pdurbin commented Sep 13, 2019

@Xarthisius thanks for resolving those merge conflicts!! 🎉 🎉 🎉 One thing I noticed over at pull request #788 is that @nuest had "WIP" in the title of his pull request initially, but @betatim encouraged him to switch it to "MRG" at #788 (comment), saying, "We also change the title from WIP to MRG when it is ready for reviewing (and then maybe merge)."

So should we switch the title to MRG? Are you ready for review? If not, is there any help you need?

It looks like the entire process is documented at https://repo2docker.readthedocs.io/en/latest/contributing/contributing.html#guidelines-to-getting-a-pull-request-merged and here's a screenshot with the MRG bit highlighted in blue:

[Screenshot: repo2docker contributing guidelines with the MRG naming convention highlighted in blue]

Thanks!!!

@Xarthisius
Contributor Author

Thank you @pdurbin, I'm aware of the r2d contributing process. This PR is still a WIP because it's literally a work in progress. If you read this PR's description, you'll notice a TODO list that I'd like to get done before switching the status to MRG.

However, if you want to push it across the finish line faster, by all means, just take the code and continue working on it.

@pdurbin
Contributor

pdurbin commented Sep 13, 2019

@Xarthisius perfect! Thanks for explaining! I was completely unaware of the repo2docker contribution process before this week but it makes sense that because you build on top of this project you know all about it. Probably the best way for me to help is with advocating for the "add config to test Dataverse support in repo2docker" issue over at IQSS/dataverse-jenkins#13

Additionally, I could put a call out on the dataverse-community mailing list to get you some help on the code itself. Just say the word and I'll put out the call!

Thanks!!!

@Xarthisius Xarthisius changed the title from [WIP] Dataverse content provider to [MRG] Dataverse content provider on Sep 13, 2019
@Xarthisius
Contributor Author

Xarthisius commented Sep 13, 2019

I believe it's now in a "reviewable" state. I significantly reduced the complexity of detect(). Right now it doesn't handle all the possible cases, but I focused on the two that are, IMHO, going to be most common: a DOI pointing to a Dataverse resource and a URL created by Dataverse's External Tools.

Contributor

@nuest nuest left a comment


Just a few questions and comments from my side.

Thanks for merging the upstream changes. It's great to see more data sources coming :-).

repo2docker/contentproviders/dataverse.py (resolved review comments)
search_url = urlunparse(
parsed_url._replace(path="/api/search", query=search_query)
)
resp = self.urlopen(search_url).read()
Contributor

I suggest calling log.debug with the URL before requesting it; it could be important debugging information for users.
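
For example (a sketch, following the self.log usage visible in the snippet below):

self.log.debug("Dataverse search URL: %s", search_url)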

self.log.debug(" - resp = " + json.dumps(data))
return

self.record_id = deep_get(data, "items.0.dataset_persistent_id")
Contributor

Could there be a case where more than 1 item will be returned? Should you warn about that here?


self.record_id = deep_get(data, "items.0.dataset_persistent_id")
elif is_doi(doi):
self.record_id = "doi:" + normalize_doi(doi)
Contributor

In the other DOI-supporting content providers we only store the repository-specific identifier here, not the full DOI. Can you briefly confirm that this does not work here because multiple Dataverse installations can have different DOI schemes?

Contributor

@nuest Dataverse supports both DOIs and Handles and collectively we refer to these as "persistent identifiers" or PIDs.

Does this help?

Contributor

Ah OK, I see. Thanks!

# and then receive a zip file containing all of them.
# TODO: Dataverse has a limit for the zipfile size (see
# https://github.com/jupyter/repo2docker/pull/739#issuecomment-510834729)
# If the size of the dataset is greater than 100 MB, individual files should be downloaded.
Contributor

I suggest opening an issue for that in repo2docker and removing the TODO here.

On the other hand, why not implement only individual file downloads?

Contributor Author

I might be wrong here, but I got the impression that a scenario with a lot of small files is much more common in Dataverse repositories than the few-big-files case. Going with that hunch, I'd rather see one API call than a bazillion calls.

Contributor

Let me try to explain the pros and cons of downloading files individually from a Dataverse dataset vs. downloading them as a zip.

Pros of downloading files individually

  • Easier on Dataverse servers that store files on S3 because a "redirect" feature can be enabled such that files are downloaded directly from S3 rather than passing through Dataverse servers.
  • Easier on Dataverse servers because Dataverse servers do not have to create a zip file on the fly

Cons of downloading files individually

  • In order to know the file hierarchy, you have to parse metadata about the files and look for directoryLabel to get path/to/file information. (There can be other potentially interesting metadata about files in there, by the way, such as descriptions.)
  • Lots of requests. Bazillion calls.

Pros of downloading zip files

  • Single request.
  • Easier to deal with a single file. Just unpack it. The file hierarchy is in the zip file.

Cons of downloading zip files

  • May crash Dataverse servers. Dataverse servers have to construct the zip on the fly. Also when not using the S3 redirect option mentioned above, files pass through Glassfish and Apache.

I hope this helps! I might be forgetting a point or two.


Contributor

Thanks for the elaborate description! That's very helpful.

Re. the directoryLabel: that is what is done now in the Zenodo and Figshare providers, although I don't think directories are even possible in those repositories; see how host["download"] and host["filename"] are used in https://github.com/jupyter/repo2docker/blob/master/repo2docker/contentproviders/doi.py#L53

@betatim and @minrk should comment on what is preferable for BinderHub operators.

From a user perspective, I tend towards downloading files individually, because I want things not to break (e.g. if the ZIP would be larger than 1 GB) and don't care about load on the servers. The comment says

"Dataverse has a limit for the zipfile size"

so I think there is no way around implementing individual downloads (maybe as a second step if the zip file download fails), unless you'd be fine with not supporting large repos.

Contributor

I mean, if it's easier to implement zip download first just to get something working, that's fine with me. It should work fine with datasets with small files. 😄 As you see, I'm just trying to explain the pitfalls of creating zip files on the fly from the Dataverse server perspective. (And it's a bad experience for users if the zip file isn't complete, if it only has 5 of the 17 files the user wants because they exceeded the zip size limit.) So yes, downloading individual files, even if it's a bazillion requests, is probably safer.

Contributor Author

I reimplemented fetch to download files individually.
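
Roughly, per-file download against the Dataverse native API looks like the standalone sketch below (fetch_dataset_files is a hypothetical helper, not the exact code in this PR):

import json
import os
import shutil
from urllib.parse import quote
from urllib.request import urlopen

def fetch_dataset_files(host, persistent_id, target_dir):
    # Hypothetical sketch: list the dataset's files via the native API, then
    # stream each one through /api/access/datafile/<id>, recreating the
    # directoryLabel hierarchy on disk.
    meta_url = "https://%s/api/datasets/:persistentId/?persistentId=%s" % (
        host, quote(persistent_id, safe=":/"))
    with urlopen(meta_url) as resp:
        files = json.load(resp)["data"]["latestVersion"]["files"]
    for entry in files:
        datafile = entry["dataFile"]
        subdir = entry.get("directoryLabel", "")  # absent for top-level files
        dest = os.path.join(target_dir, subdir, datafile["filename"])
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        file_url = "https://%s/api/access/datafile/%s" % (host, datafile["id"])
        with urlopen(file_url) as src, open(dest, "wb") as dst:
            shutil.copyfileobj(src, dst)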


@property
def content_id(self):
"""The Dataverse persistent identifier (could use internal dataset_id too)."""
Contributor

If you stick to the doi: prefix, I suggest mentioning that here. Also, remove the "could use internal dataset_id" if that is not used eventually (see comment above).

setup.py (resolved review comments)
tests/unit/contentproviders/test_dataverse.py (resolved review comments)
@pdurbin
Contributor

pdurbin commented Sep 16, 2019

@Xarthisius @nuest I left a couple comments just now. I hope they're helpful. Please let me know if you have any questions! Thanks!

None,
)
if host is None:
return
Member

To check that I parsed the above lines properly (the use of next() to handle the "didn't work" case made me read it a few times), is the following code equivalent (I think yes):

for host_ in self.hosts:
    if urlparse(host_["url"]).netloc == parsed_url.netloc:
        host = host_
        break
else:
    return

@betatim
Member

betatim commented Sep 18, 2019

LGTM! Not quite sure why codecov is (that) unhappy. Merging this now. Thanks for the work and for coming back to finish it off :)

Two more todos for a quick PR:

@betatim betatim merged commit c98e6ac into jupyterhub:master Sep 18, 2019
@nuest
Contributor

nuest commented Sep 18, 2019

Working on the docs update PR...

Labels: workflow:wip (Work in Progress)
5 participants