Add Binder button for operating on datasets with Jupyter notebooks, Python, R, etc. #208

Closed
pdurbin opened this issue Jan 19, 2023 · 13 comments
Labels: pm.netcdf-hdf5.d NIH ODSS supplement (NetCDF); Size: 3 A percentage of a sprint.

@pdurbin
Member

pdurbin commented Jan 19, 2023

Almost three years ago we made a big push thanks to @Xarthisius to teach https://mybinder.org to handle Dataverse datasets.

For anyone not familiar, Binder gives you an environment for computational reproducibility in the cloud, for free!

If there is code in the dataset, great, you can try to execute that code in Binder.

If there is no code in the dataset, no problem, by launching the dataset in Binder, you have an environment in which you can start exploring and writing code.

To be clear, the dataset author doesn't need to use Binder. Anyone wanting to explore the data (especially with code) can use Binder.

@siacus and I have been talking about containers (which Binder spins up for you) and I just gave a demo to @sbarbosadataverse on how we can add a Binder button to every dataset in Harvard Dataverse by loading up the Binder tool via curl (like any external tool).

First some screenshots. On a dev server, here's how a copy of my "Open Source at Harvard" dataset looks. Note that a "Binder" button is shown.

Screen Shot 2023-01-19 at 11 57 44 AM

When you click the "Binder" button, in a new tab you'll see Binder spinning up a Docker container with the dataset (code, data, docs, etc.) in it…

Screen Shot 2023-01-19 at 2 15 51 PM

This dataset happens to have two directories, code and data:

Screen Shot 2023-01-19 at 2 16 40 PM

As a proof of concept, here's the execution of a Python script, but the sky is the limit in terms of which languages and tools you want to run, such as Jupyter notebooks:

Screen Shot 2023-01-19 at 2 18 42 PM

To load up the tool, use curl like usual:

curl -X POST -H 'Content-type: application/json' http://localhost:8080/api/admin/externalTools --upload-file binder.json
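For anyone who would rather script this step, the same request can be sketched in Python with only the standard library. This is a rough equivalent of the curl command above, not an official client: the function name `build_register_request` is made up for illustration, `binder.json` is the manifest file named in the curl command, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_register_request(base_url, manifest):
    """Build (but do not send) the POST that registers an external tool.

    Mirrors the curl command above: JSON body, Content-type header,
    and the /api/admin/externalTools endpoint.
    """
    body = json.dumps(manifest).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/admin/externalTools",
        data=body,
        headers={"Content-type": "application/json"},
        method="POST",
    )


# Hypothetical usage: load binder.json and point at a local dev server.
# with open("binder.json") as f:
#     request = build_register_request("http://localhost:8080", json.load(f))
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Building the request separately from sending it makes it easy to inspect the URL and headers before touching an admin endpoint.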

Here's the external tool manifest:

{
  "displayName": "Binder",
  "description": "Run on Binder",
  "scope": "dataset",
  "type": "explore",
  "toolUrl": "https://girder.hub.yt/api/v1/ythub/dataverse",
  "toolParameters": {
    "queryParameters": [
      {
        "datasetPid": "{datasetPid}"
      },
      {
        "siteUrl": "{siteUrl}"
      },
      {
        "key": "{apiToken}"
      }
    ]
  }
}

(We are looking for a permanent home for this manifest at data-exp-lab/girder_ythub#10 which should help close IQSS/dataverse#6807 .)
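To make the manifest a bit more concrete: when someone clicks the button, Dataverse fills in the reserved words ({datasetPid}, {siteUrl}, {apiToken}) and appends the query parameters to toolUrl. Here is a minimal Python sketch of that substitution. The function name `expand_tool_url` and the sample token are made up for illustration, and the real substitution happens inside Dataverse, not in code like this.

```python
from urllib.parse import urlencode

# The parts of the manifest above that determine the launch URL.
MANIFEST = {
    "toolUrl": "https://girder.hub.yt/api/v1/ythub/dataverse",
    "toolParameters": {
        "queryParameters": [
            {"datasetPid": "{datasetPid}"},
            {"siteUrl": "{siteUrl}"},
            {"key": "{apiToken}"},
        ]
    },
}


def expand_tool_url(manifest, values):
    """Replace each reserved word with a concrete value and build the URL."""
    pairs = []
    for param in manifest["toolParameters"]["queryParameters"]:
        for name, placeholder in param.items():
            reserved_word = placeholder.strip("{}")
            pairs.append((name, values[reserved_word]))
    return manifest["toolUrl"] + "?" + urlencode(pairs)


# Illustrative values only; the API token is a placeholder.
print(expand_tool_url(MANIFEST, {
    "datasetPid": "doi:10.7910/DVN/TJCLKP",
    "siteUrl": "https://dataverse.harvard.edu",
    "apiToken": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
}))
```

Note that the query parameter name (`key`) and the reserved word it carries (`apiToken`) are independent, which is why the manifest lists them as name/value pairs.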

@siacus

siacus commented Jan 19, 2023

This is different from running a container with a proper workflow inside. The Binder container probably contains some preinstalled version of Python/R and libs. Your script may or may not work, or may work but produce different results. The Binder approach can be seen more as an exploratory tool than a reproducibility feature (which is what I am planning to have for DV). Definitely interesting.

@Xarthisius

> This is different from running a container with a proper workflow inside. The Binder container probably contains some preinstalled version of Python/R and libs. Your script may or may not work, or may work but produce different results. The Binder approach can be seen more as an exploratory tool than a reproducibility feature (which is what I am planning to have for DV). Definitely interesting.

You piqued my interest when you mentioned "reproducibility". Maybe you'd be interested in exploring Whole Tale, which integrates with DV directly and is "reproducibility-oriented" ?

@pdurbin
Member Author

pdurbin commented Jan 19, 2023

@siacus Binder might be more flexible than you imagine. 😄

You can definitely specify the version of Python/R and libs:

You can even specify your own Dockerfile:

And here are some docs on reproducibility:

That said, I'd certainly be happy to enable a Whole Tale button on Harvard Dataverse too, like @Xarthisius suggests! We already enabled it on demo:

Craig Willis recorded a couple nice videos about using Whole Tale with Dataverse:

@craig-willis

@siacus I'd certainly be interested in hearing your thoughts on reproducibility and Dataverse (maybe outside of this issue, if interested). The Whole Tale project has been working on this for a while, but never fully closed the loop with Dataverse. (@Xarthisius Binder/DV integration actually comes out of WT.)

I don't want to hijack the issue here, but would be happy to share information about the similarities/differences between the platforms.

@atrisovic
Member

atrisovic commented Jan 20, 2023

Hi All, very happy to see we are bringing back this discussion, and that @siacus is thinking about reproducibility. I have thought about it and have a lot of opinions on it!

I just left a comment on #6807 and this is what I'm thinking:

We can enable proactive reproducibility in Dataverse at the dataset upload stage, either by prompting the uploader for external dependencies or by adding a standard Dockerfile based on what is in the dataset. I can expand on this later as it is not immediately related to Binder.

However, Dataverse already has thousands of 'replication datasets' that contain code without an environment. This is why we should think of ways to enable retroactive reproducibility as best as we can. This can already be facilitated with Binder. In particular, for datasets that contain R/Python code or specific file formats (NetCDF/parquet/Rdata) it would be very useful to get a Dockerfile/env.yml on the fly so that a ready-to-go environment is initiated in Binder. I wonder if there is a spot in the Dataverse transfer to Binder where that could be smoothly implemented.

@pdurbin
Member Author

pdurbin commented Jan 20, 2023

@atrisovic a while ago I experimented with opening Binder with a specific version of R and the right libs. Then I tried with Python:

Definitely an (old) work in progress, so I'm not sure if it even works anymore. In short, for R you can have a file called install.R, and for Python you can have requirements.txt. That said, I'd check the Binder docs for the latest. 😄
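To check whether a dataset already ships one of these files, a small Python sketch like the following could help. The list of file names covers install.R and requirements.txt from above plus a few other names that repo2docker (the tool behind Binder) recognizes; it is not exhaustive, and the function name `find_binder_config` is made up for illustration.

```python
from pathlib import Path

# Configuration file names that repo2docker looks for.
# Not an exhaustive list; check the Binder docs for the current set.
CONFIG_FILES = (
    "environment.yml",
    "requirements.txt",
    "install.R",
    "runtime.txt",
    "Dockerfile",
)


def find_binder_config(dataset_dir):
    """List config files found in the root or in a binder/ or .binder/ folder."""
    root = Path(dataset_dir)
    hits = []
    for folder in (root, root / "binder", root / ".binder"):
        for name in CONFIG_FILES:
            candidate = folder / name
            if candidate.is_file():
                hits.append(candidate.relative_to(root).as_posix())
    return hits
```

An empty result suggests Binder will fall back to its default environment, which is the situation @siacus describes earlier in the thread.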

@craig-willis I'm just waiting for you to invite me back to Chicago for another reproducibility workshop! 🎉

@pdurbin pdurbin added the Size: 3 A percentage of a sprint. label Jan 27, 2023
@pdurbin pdurbin added this to the 5.13 milestone Jan 27, 2023
@pdurbin
Member Author

pdurbin commented Jan 27, 2023

@siacus and I just discussed this issue. He wants it in the current sprint, so I added it.

Again, it's a 3 (or less): just a curl command to load the JSON file above and enable Binder as an external tool.

@sbarbosadataverse heads up that this will be a new feature for Harvard Dataverse! The same one I demo'ed for you the other day.

He also expressed that he'd like the following issue to be done during this sprint too:

That's the one that had already been put in the sprint.

@pdurbin
Member Author

pdurbin commented Jan 27, 2023

@landreev added the button (thanks!!) and I just tried it out on my dataset at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TJCLKP

It worked great! Easy as 1, 2, 3. Here are some screenshots:

Screen Shot 2023-01-27 at 12 43 37 PM

Screen Shot 2023-01-27 at 12 43 45 PM

Screen Shot 2023-01-27 at 12 44 02 PM

@landreev landreev removed their assignment Jan 27, 2023
@landreev
Collaborator

Should I click "close" on this, or is there something else that needs to be done?

@pdurbin
Member Author

pdurbin commented Jan 27, 2023

@landreev from my perspective, we're all set! I say close it.

When I was demo'ing this to @siacus this morning I was saying we could consider this a "soft launch" of the Binder button in Harvard Dataverse. That is, we could ping people here and there who we think might be interested.

It occurs to me that I should try one of @atrisovic's datasets, since she wrote the book on reproducibility and Dataverse. 😄

I'm able to step through (execute) the Jupyter notebook in her "Word cloud of Reproducibility and Replicability in Science" dataset at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/HOLVXA

(Note that she already added, years ago, an HTML Binder button manually with "Run this code on Jupyter Binder here".)

Here are some screenshots:

Screen Shot 2023-01-27 at 2 42 44 PM
Screen Shot 2023-01-27 at 2 42 50 PM
Screen Shot 2023-01-27 at 2 43 11 PM
Screen Shot 2023-01-27 at 2 43 44 PM

@sbarbosadataverse this seems like a good one to demo, much better than my "Open Source at Harvard" dataset (but I'll try to improve based on Ana's teachings!). 😄

The other dataset by @atrisovic I tried was "Replication Data for: Repository approaches to improving quality of shared data and code" at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/EA3LC5

At the moment, I'm getting a weird "too many redirects" error so maybe Binder is having some trouble:

Screen Shot 2023-01-27 at 2 51 46 PM

@landreev anyway, I'd say it's fine to close this issue. Thanks again!

@mreekie mreekie added the pm.netcdf-hdf5.d NIH ODSS supplement (NetCDF) label Mar 30, 2023
@mreekie
Collaborator

mreekie commented Mar 30, 2023

grooming:

  • added the netcdf deliverable tag.

@ekcomputer

Is there a reason why some datasets don't have a Binder option? e.g. https://dataverse.harvard.edu/file.xhtml?fileId=7200392&version=1.0

@pdurbin
Member Author

pdurbin commented Nov 21, 2023

@ekcomputer you have to navigate to the parent (from the data file to the dataset) to see the Binder button:

Screenshot 2023-11-21 at 1 28 37 PM

The terminology (#4274) can be confusing.

  • data files are in datasets
  • datasets are in collections

8 participants