Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Dataverse to UI. Fixes #900 #969

Merged
merged 4 commits into from
Dec 10, 2019

Conversation

Xarthisius
Copy link
Contributor

Note: This PR depends on an unreleased version of repo2docker

This adds Dataverse DOI as a valid target for BinderHub UI. Tests are expected to fail right now as I developed it against unreleased version of WT, where there's a slight change to DOI format returned by the lookup endpoint ("doi:10..." vs "10..."). However, it doesn't affect functionality and this PR in any way. Everything should be working as is.

@betatim
Copy link
Member

betatim commented Oct 2, 2019

Could you explain a bit why we need to involve an additional service instead of something like the following:

class DataverseProvider(RepoProvider):
    """Provide contents of a Dataverse record
    Users must provide a spec consisting of the Dataverse DOI.
    """
    name = Unicode("Dataverse")

    @gen.coroutine
    def get_resolved_ref(self):
        client = AsyncHTTPClient()
        req = HTTPRequest("https://doi.org/{}".format(self.spec),
                          user_agent="BinderHub")
        r = yield client.fetch(req)
        self.record_id = urllib.parse.parse_qs(urllib.parse.urlparse(r.effective_url).query)
        return self.record_id

    def get_repo_url(self):
        # While called repo URL, the return value of this function is passed
        # as argument to repo2docker, hence we return the spec as is.
        return self.spec

    def get_build_slug(self):
        return "dataverse-{}".format(self.record_id)

I tried to get a response as JSON but that doesn't seem to work :-(

@Xarthisius
Copy link
Contributor Author

Could you explain a bit why we need to involve an additional service instead of something like the following:

In order to avoid implementing this and this again.

I tried to get a response as JSON but that doesn't seem to work :-(

From data.wholetale.org?

$ curl -s --header 'Accept: application/json' 'https://data.wholetale.org/api/v1/repository/lookup?dataId=%5B%22doi%3A10.7910%2FDVN%2FTJCLKP%22%5D' | jq
[
  {
    "dataId": "https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TJCLKP",
    "doi": "10.7910/DVN/TJCLKP",
    "name": "Open Source at Harvard",
    "repository": "Dataverse",
    "size": 760854
  }
]

@Xarthisius
Copy link
Contributor Author

        req = HTTPRequest("https://doi.org/{}".format(self.spec),
                          user_agent="BinderHub")
        r = yield client.fetch(req)
        self.record_id = urllib.parse.parse_qs(urllib.parse.urlparse(r.effective_url).query)

If you're not interested in verifying that DOI belongs to Dataverse, why bother with resolving it at all? You can check if it's doi via regex and and generate some checksum of it for record_id.

@betatim
Copy link
Member

betatim commented Oct 2, 2019

I think there is no need to know if a DOI points to a dataverse instance or not, repo2docker will tell us that once we give it the DOI.

What BinderHub needs is an ID that identifies the content and that we can use as a cache key. For Git this means we resolve a "branch name" to a SHA1 which is used as cache key. For DOIs from Zenodo we resolve the DOI to the Zenodo ID because the same DOI can point to different Zenodo IDs over time (that is how they implemented versioning). That is IMHO the main reason for resolving it and we can show an error earlier to the user if the DOI doesn't exist (someone made a typo).

I'm pretty sure you could enter a Dataverse DOI and select Zenodo from the dropdown on mybinder.org right now and it would work (maybe, kinda).

@Xarthisius
Copy link
Contributor Author

Xarthisius commented Oct 2, 2019

I think there is no need to know if a DOI points to a dataverse instance or not, repo2docker will tell us that once we give it the DOI.

Makes sense. That's a cleaner solution, I wasn't sure that's an acceptable choice.

What BinderHub needs is an ID that identifies the content and that we can use as a cache key. For Git this means we resolve a "branch name" to a SHA1 which is used as cache key. For DOIs from Zenodo we resolve the DOI to the Zenodo ID because the same DOI can point to different Zenodo IDs over time (that is how they implemented versioning). That is IMHO the main reason for resolving it and we can show an error earlier to the user if the DOI doesn't exist (someone made a typo).

That applies to dataverse too. However, that information (dataset version) is not a part of the url unfortunately. By default the DOI is pointing to the latest version. Getting that info would require additional API call. I'm not sure how to proceed with it...

I'm pretty sure you could enter a Dataverse DOI and select Zenodo from the dropdown on mybinder.org right now and it would work (maybe, kinda).

Yeah, that's why I wondered whether /doi route on binderhub wouldn't make more sense if we're delegating handling to r2d, but I went with the pattern established by Zenodo/Figshare instead.

@betatim
Copy link
Member

betatim commented Oct 3, 2019

Yeah, that's why I wondered whether /doi route on binderhub wouldn't make more sense if we're delegating handling to r2d, but I went with the pattern established by Zenodo/Figshare instead.

I think until we support "many" (for some definition of many) DOI targets (or what ever we call Zenodo, figshare, etc) it feels better to make it explicit that "type a FooBar DOI here, not any other kind. If you do you shouldn't expect this to work" in terms of messaging to the user.

Getting that info would require additional API call. I'm not sure how to proceed with it...

I was hoping that with accept: application/json we'd get a machine readable response that contained this :-/ Can we derive (for all DataVerse instances) the API URL from the resolved URL?

@Xarthisius Xarthisius force-pushed the 900_dataverse_support branch from d22cb3b to 8424d1c Compare October 4, 2019 14:12
@Xarthisius
Copy link
Contributor Author

Getting that info would require additional API call. I'm not sure how to proceed with it...

I was hoping that with accept: application/json we'd get a machine readable response that contained this :-/ Can we derive (for all DataVerse instances) the API URL from the resolved URL?

That's the best I could come up with:

        search_url = urllib.parse.urlunparse(
            urllib.parse.urlparse(r.effective_url)._replace(
                path="/api/datasets/:persistentId"
            )
        )
        req = HTTPRequest(search_url, user_agent="BinderHub")
        r = yield client.fetch(req)
        resp = json.loads(r.body)

I'm not 100% it's always going to be the case. Maybe @pdurbin can chime in.

@pdurbin
Copy link
Contributor

pdurbin commented Oct 4, 2019

Hi! Sorry, I thought I was subscribed to this pull request but I didn't get an email until I was mentioned today.

Zenodo and Dataverse treat DOI differently. (I know less about Figshare, sorry.)

  • A DOI in Dataverse usually points to a dataset (a collection of files). Some installations of Dataverse turn on the ability to assign DOIs to individual files but let's not worry about that right now. I believe you'll want all the files in a dataset in most cases. So from now on, when I say DOI, I mean a DOI to a dataset (again, a collection of files).
  • When you create a dataset in Zenodo and then update that dataset, you now have two DOIs.
  • When you create a dataset in Dataverse and then update that dataset, you still only have one DOI. As @Xarthisius said, the way to distinguish between different versions of a dataset in Dataverse is to pass in the version of the dataset, which can be a 1, 1.1, 1.2, 2, etc. You can also pass in :latest-published as the version to always get the latest published version of the dataset, which is probably what you should always do for now to keep things simple. :latest-published, is documented at http://guides.dataverse.org/en/4.17/api/native-api.html#datasets , versionNumber is documented at http://guides.dataverse.org/en/4.17/api/native-api.html#get-version-of-a-dataset , and from the Dataverse "external tool" perspective (this is more for @Xarthisius for now, but we'd love to have an external tool for Binder some day), {datasetVersion} is documented at http://guides.dataverse.org/en/4.17/api/external-tools.html#reserved-words
  • @Xarthisius has it right for how to get the API URL for an installation of Dataverse based on the hostname. Basically, it's /api. 😄

I hope this helps! Please keep the questions coming!

@Xarthisius Xarthisius changed the title [WIP] Add Dataverse to UI. Fixes #900 Add Dataverse to UI. Fixes #900 Oct 4, 2019
@pdurbin
Copy link
Contributor

pdurbin commented Oct 16, 2019

@betatim it looks like @Xarthisius removed "WIP" from the title of this pull request. Are there any more questions I can answer? Thanks!

@pdurbin
Copy link
Contributor

pdurbin commented Nov 5, 2019

I'm pretty sure you could enter a Dataverse DOI and select Zenodo from the dropdown on mybinder.org right now and it would work (maybe, kinda).

@betatim my curiosity got the better of me and I finally just tried this. I plugged in the DOI from my dataset at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TJCLKP and it looks like files are getting downloaded from Dataverse (Harvard Dataverse in this case) into Binder. But then it hangs? I'll attach some screenshots.

Screen Shot 2019-11-05 at 2 13 00 PM
Screen Shot 2019-11-05 at 2 15 10 PM

(time passes)

Screen Shot 2019-11-05 at 2 18 06 PM

By the way, awesome shout out to Binder at https://www.nature.com/articles/d41586-019-03366-x

All of this is on my mind because we're talking a lot these days about this "Capsulization and Packaging of Replication Objects in Dataverse" issue at IQSS/dataverse#6085 and you and others are very welcome to join in the conversation!

@pdurbin
Copy link
Contributor

pdurbin commented Nov 5, 2019

I just thought to myself, "Who would be excited to see Binder working with Dataverse DOIs." The answer, of course, is @atrisovic and we picked a DOI at random from her recent "Building reproducible workflows for earth sciences - Responding to reproducibility challenges from physics to social sciences" talk: https://vimeo.com/366217811 (slides at https://events.ecmwf.int/event/116/contributions/937/attachments/239/432/Repwork-WS-Trisovic.pdf by the way.

We punched in "10.7910/DVN/VZSO5S" from this dataset: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VZSO5S

Vol_16(4)_Replication_Data_for_Disrupting_Regulation,Regulating_Disruption_The_Politics_of_Uber_in_the_United_States-Perspectives_on_Politics-_2019-11-05_14 37 26

And it worked! When I got back to my desk it said "image already built" like in the screenshots below but I was able to play with the code in Binder.

Screen Shot 2019-11-05 at 2 38 02 PM
Screen Shot 2019-11-05 at 2 38 23 PM
Screen Shot 2019-11-05 at 2 38 38 PM
Screen Shot 2019-11-05 at 2 38 46 PM
Screen Shot 2019-11-05 at 2 39 00 PM

I feel like we're so close to having the "Binderverse" we dreamed of in IQSS/dataverse#4714 🎉 🎉 🎉

@betatim @choldgraf et al., is there anything I can do to help move this pull request forward? 😄

@choldgraf
Copy link
Member

Hey all - just looking through the review and it seems reasonable to me. I'm also a fan of the "make the DOI target an explicit thing instead of just totally generic DOIs" approach - that way we can choose a subset of DOIs that we know will play nicely with BinderHubs.

Are there any lingering "to-do" items on this one? Or are we "just" waiting for a merge?

@Xarthisius
Copy link
Contributor Author

Are there any lingering "to-do" items on this one? Or are we "just" waiting for a merge?

I believe I addressed all Tim's comments. I guess it needs a 2nd pair of eyes?

@Xarthisius Xarthisius force-pushed the 900_dataverse_support branch from c3cfc67 to 1d59c7e Compare November 20, 2019 15:16
CHANGES.md Outdated Show resolved Hide resolved
@@ -294,6 +295,44 @@ def get_build_slug(self):
return "figshare-{}".format(self.record_id)


class DataverseProvider(RepoProvider):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks pretty reasonable to me. One thing I'd ask is if you could write out some short documentation for the spec for this provider. Could you:

  1. Add more contextual information to the docstrings in this class
  2. Add an entry for the Dataverse URL structure to the BinderHub docs here: https://binderhub.readthedocs.io/en/latest/api.html#providers ?
  3. (optionally) not necessary for this PR but, add one for the other repoproviders that are support but not listed there, if you want extra kudos :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that's fixed by Phil's commit.

@choldgraf
Copy link
Member

Hey all - just a few quick comments from me, mostly regarding documentation. I think it looks good though!

One quick thought: I saw that this also implements the "resolve the URL with doi.org first, then re-resolve another DOI from the result. Would it be worth having a helper function that we can re-use throughout to resolve doi.org DOIs? I'm thinking this would mostly just be to simplify the conceptual overhead of the module, though could also help with DRY a bit

@choldgraf
Copy link
Member

OK, we just got #1017 merged, so can you follow the instructions here:

https://binderhub.readthedocs.io/en/latest/developer/repoproviders.html#adding-a-new-repository-provider

In particular the bit about documenting this repoprovider, and then we can merge?

@pdurbin
Copy link
Contributor

pdurbin commented Dec 5, 2019

@choldgraf hi! I just opened Xarthisius#1 to merge some docs into this pull request by @Xarthisius

Can you please take a look? Thanks!

This is the the main commit to look at: Xarthisius@df83cc4

Here's how it looks on my machine:

developer page

Screen Shot 2019-12-05 at 2 55 25 PM

reference page

Screen Shot 2019-12-05 at 2 56 03 PM

@pdurbin
Copy link
Contributor

pdurbin commented Dec 5, 2019

@choldgraf @betatim the way this pull request is going it looks like the URL in "Copy the URL below" is going to look something like this:

https://mybinder.org/v2/dataverse/10.7910/DVN/TJCLKP/

Here it is in context:

This pull request (DOI in path)

Screen Shot 2019-12-05 at 4 46 14 PM

Instead of having the DOI in the path like this, would it be possible to have the DOI in a query parameter? I mean something like this:

https://mybinder.org/v2/dataverse?datasetPid=doi:10.7910/DVN/RLLL1V

Here's how that would look:

Proposed (DOI in query parameter)

Screen Shot 2019-12-05 at 4 51 05 PM

The main reason I'm asking this question is that on that call last week (thanks again for having us!) I mentioned that I could give you a JSON file to document how to add and "Explore" button to Dataverse for Binder. The problem is that Dataverse's "external tool" framework (the thing I was calling a glorified URL builder) does not support dynamically changing the path in the external tool. Rather, the base URL of the tool is set and query parameters are used to send dynamic information (like the DOI). That is to say, this is the JSON I had in mind to add a Binder button to Dataverse (as we discussed toolUrl could also be a local installation of Binder):

{
  "displayName": "MyBinder",
  "description": "Analyze in MyBinder",
  "type": "explore",
  "scope": "dataset",
  "toolUrl": "https://mybinder.org/v2/dataverse",
  "toolParameters": {
    "queryParameters": [
      {
        "datasetPid": "{datasetPid}"
      }
    ]
  }
}

And that JSON will send the Dataverse user who clicks "Explore" and then "Binder" to a URL like this (the same as above with a query parameter):

https://mybinder.org/v2/dataverse?datasetPid=doi:10.7910/DVN/RLLL1V

The problem, of course, is that I don't believe the Binder will know what to do with a URL with a query parameter like that.

I noticed in #712 some discussion about query parameters but it seemed to be more about sending additional information beyond the base spec (the provider and the DOI) rather than permitting the base spec (the provider and DOI) to handle query parameters.

In short, if this pull request gets merged, I'd love to be able to help you get a JSON file ready to put in the Binder docs of how to add a Binder button to Dataverse but (again) the Dataverse "external tool" framework always wants to send the DOI of the dataset as a query parameter. The Dataverse external tool framework currently does not have the ability to put the DOI of the dataset in the path of the URL.

I hope this makes sense! Please let me know if anything is unclear!

@choldgraf
Copy link
Member

The UI addition both look good to me! For your second issue about URL parameters etc, I think that'd best be discussed in a new issue, unless you'd like to block this PR on it. What do you think?

@pdurbin
Copy link
Contributor

pdurbin commented Dec 6, 2019

@choldgraf by UI addition I assume you mean the doc addition. The screenshots of the two pages in the docs I updated. Thanks!

I almost created a new issue about the query parameter thing. (And maybe some day I'll open another separate issue about supporting both DOIs and Handles. And yet another issue about accepting a version number for a dataset (1.1, 2.0, etc.)

I'm totally fine with merging this as is and will excitedly tell the Dataverse community about it. I'm a little bummed that an Dataverse external tool won't "just work" right now but I think we can get there eventually.

If it's easiest to merge this and cherry pick my doc commit at Xarthisius@df83cc4 that's fine with me. Whatever works for you. Thanks! 😄

@betatim
Copy link
Member

betatim commented Dec 6, 2019

I'd vote against discussing the adding of the ability for some repo providers to use query parameters in this PR.

@pdurbin
Copy link
Contributor

pdurbin commented Dec 6, 2019

@betatim sounds fine.

Is there anything else you need from me or @Xarthisius ?

@choldgraf
Copy link
Member

choldgraf commented Dec 6, 2019

I believe that @bitnik noticed there are two other methods that should be created for the repository provider, see his comment here:

#967 (comment)

Specifically, get_resolved_spec and get_resolved_ref_url are relatively new to the spec but aren't in this PR.

So to me there are 3 things needed to move forward:

  • Confirm that the repo2docker contentprovider is ready for DataVerse
  • Address get_resolved_spec and get_resolved_ref_url methods comment in this PR
  • Get @pdurbin's docs improvements merged into this branch

@Xarthisius Xarthisius force-pushed the 900_dataverse_support branch from 1d59c7e to b562b61 Compare December 8, 2019 11:21
@pdurbin
Copy link
Contributor

pdurbin commented Dec 8, 2019

@choldgraf thanks. It looks like @Xarthisius addressed these:

  • 8a0cee8 - Address get_resolved_spec and get_resolved_ref_url methods comment in this PR
  • b562b61 - Get @pdurbin's docs improvements merged into this branch

Is there anything you need for "Confirm that the repo2docker contentprovider is ready for Dataverse"?

@Xarthisius
Copy link
Contributor Author

Is there anything you need for "Confirm that the repo2docker contentprovider is ready for Dataverse"?

As per #969 (comment) repo2docker handles Dataverse DOIs on main Binderhub instance even without this PR.

@pdurbin
Copy link
Contributor

pdurbin commented Dec 9, 2019

As per #969 (comment) repo2docker handles Dataverse DOIs on main Binderhub instance even without this PR.

Yeah. @atrisovic also gave a more recent demo using a Dataverse DOI with the Zenodo content provider a couple weeks ago.

Here's the dataset: https://doi.org/10.7910/DVN/EOYZKH or https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/EOYZKH

Here's how we used a Dataverse DOI with the Zenodo content provider: https://mybinder.org/v2/zenodo/10.7910/DVN/EOYZKH/

One can find notes from this demo in the "JupyterHub and BinderHub Team Meeting" doc at https://hackmd.io/u2ghJJUCRWK-zRidCFid_Q?view

@choldgraf please advise! 😄

@choldgraf
Copy link
Member

The changes look good to me, and thanks for adding tests as well. I'll plan on merging this tomorrow morning to give time for anybody else to comment or raise a flag, otherwise LGTM

@choldgraf choldgraf merged commit c7b7fe4 into jupyterhub:master Dec 10, 2019
@choldgraf
Copy link
Member

ok, merged! let's see how this looks in the deployed sites once the henchbot makes its PR

@pdurbin
Copy link
Contributor

pdurbin commented Dec 10, 2019

@choldgraf it works!!!! 🎉 🎉 🎉

Thanks for merging! Here are some screenshots:

Entering a Dataverse DOI in Binder

Screen Shot 2019-12-10 at 1 25 08 PM

The files from the Dataverse dataset behind the DOI are in Jupyter Lab

Screen Shot 2019-12-10 at 1 10 39 PM

Clicking "Visit repo" returns me to the dataset in Dataverse

Screen Shot 2019-12-10 at 1 10 50 PM

Thank you so much! I can't wait to let the Dataverse community know!

@pdurbin
Copy link
Contributor

pdurbin commented Dec 10, 2019

Ok, I just gave a shout out to @Xarthisius (❤️ 🎉 ❤️ 🎉 ) in a post I made called "Binder supports Dataverse DOIs for Jupyter Notebooks" on the Dataverse mailing list and encouraged everyone to install Binder: https://groups.google.com/d/msg/dataverse-community/K7vKWSPLaQY/8go3fvV3AwAJ

@betatim
Copy link
Member

betatim commented Dec 11, 2019

Thanks a lot for persevering with getting this over the line!

Some feedback on how I experienced this PR. The enthusiasm is appreciated, understandable and shared! However a lot of comments in this thread are very long. In my opinion they don't contribute to moving things forward proportional to the amount of screenspace they take up. For me it is even the other way around, more comments with more content it makes it harder to move things forward. This is because I take a quick peek at issues with activity and if I can deal with in in 5min I try to do it then and there, otherwise I schedule it for a later moment when I have time, patience and generally in the right frame of mind for dealing with potentially tricky discussions. This means the iteration time increases a lot. I don't feel like I could just skip comments because people might refer to them or because they might contain relevant insights.

So for me a short comment like "yay, it just works if you select Zenodo, try it mybinder.org/v2/zenodo/1234..." is great because it lets us know someone tried it and that it works. Everything else we can probably figure out as people working on PRs are usually "deep into" the technical stuff. If there is a question we can always go deeper on request. When things don't work as expected then more details are needed.

To sum it up, for PRs and issues I try to keep that famous quote of "I apologise for sending such a long letter, I didn't have time to write a shorter one" in mind :) and I sometimes get carried away with excitement as well and post too much.

@pdurbin
Copy link
Contributor

pdurbin commented Dec 11, 2019

@betatim short and sweet from now on! 🍦🍦🍦 Thanks for everything!!! 🎉 ❤️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants