Add Binder button for operating on datasets with Jupyter notebooks, Python, R, etc. #208

pdurbin · 2023-01-19T20:08:28Z

Almost three years ago we made a big push thanks to @Xarthisius to teach https://mybinder.org to handle Dataverse datasets.

For anyone not familiar, Binder gives you an environment for computational reproducibility in the cloud, for free!

If there is code in the dataset, great, you can try to execute that code in Binder.

If there is no code in the dataset, no problem: by launching the dataset in Binder, you get an environment in which you can start exploring and writing code.

To be clear, the dataset author doesn't need to use Binder. Anyone wanting to explore the data (especially with code) can use Binder.

@siacus and I have been talking about containers (which Binder spins up for you) and I just gave a demo to @sbarbosadataverse on how we can add a Binder button to every dataset in Harvard Dataverse by loading up the Binder tool via curl (like any external tool).

First some screenshots. On a dev server, here's how a copy of my "Open Source at Harvard" dataset looks. Note that a "Binder" button is shown.

When you click the "Binder" button, in a new tab you'll see Binder spinning up a Docker container with the dataset (code, data, docs, etc.) in it…

This dataset happens to have two directories, code and data:

As a proof of concept, here's the execution of a Python script, but the sky is the limit in terms of which languages and tools you want to run, such as Jupyter notebooks:

To load up the tool, use curl like usual:

curl -X POST -H 'Content-type: application/json' http://localhost:8080/api/admin/externalTools --upload-file binder.json
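If you'd rather script the registration than type curl, a rough Python equivalent might look like the sketch below. It assumes the same local admin endpoint and the same binder.json file as the curl command; the function names build_register_request and register are made up for illustration and are not part of Dataverse.

```python
import json
import urllib.request
from pathlib import Path


def build_register_request(manifest_path, base_url="http://localhost:8080"):
    """Build (but do not send) the POST that registers an external tool.

    Mirrors the curl command above: same endpoint, same header, and the
    manifest file contents as the request body.
    """
    body = Path(manifest_path).read_bytes()
    json.loads(body)  # fail early if the manifest is not valid JSON
    return urllib.request.Request(
        url=f"{base_url}/api/admin/externalTools",
        data=body,
        headers={"Content-type": "application/json"},
        method="POST",
    )


def register(request):
    """Send the request and return (HTTP status, response text)."""
    with urllib.request.urlopen(request) as response:
        return response.status, response.read().decode("utf-8")
```

Calling register(build_register_request("binder.json")) on the server itself should then do the same job as the curl command. Building the request separately from sending it makes it easy to inspect before anything touches the admin API.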

Here's the external tool manifest:

{
  "displayName": "Binder",
  "description": "Run on Binder",
  "scope": "dataset",
  "type": "explore",
  "toolUrl": "https://girder.hub.yt/api/v1/ythub/dataverse",
  "toolParameters": {
    "queryParameters": [
      {
        "datasetPid": "{datasetPid}"
      },
      {
        "siteUrl": "{siteUrl}"
      },
      {
        "key": "{apiToken}"
      }
    ]
  }
}

(We are looking for a permanent home for this manifest at data-exp-lab/girder_ythub#10 which should help close IQSS/dataverse#6807 .)
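To make the manifest a bit more concrete: when someone clicks the button, the reserved words in queryParameters are filled in and appended to toolUrl as a query string. The sketch below only illustrates that idea. The helper build_tool_url and the token value are hypothetical, and the exact encoding Dataverse applies may differ.

```python
from urllib.parse import urlencode


def build_tool_url(manifest, values):
    """Expand a manifest's queryParameters into a launch URL.

    `values` maps reserved words such as "{datasetPid}" to concrete strings.
    Placeholders without a value are passed through unchanged.
    """
    pairs = []
    for param in manifest["toolParameters"]["queryParameters"]:
        for name, placeholder in param.items():
            pairs.append((name, values.get(placeholder, placeholder)))
    return manifest["toolUrl"] + "?" + urlencode(pairs)


# The relevant parts of the manifest above.
manifest = {
    "toolUrl": "https://girder.hub.yt/api/v1/ythub/dataverse",
    "toolParameters": {
        "queryParameters": [
            {"datasetPid": "{datasetPid}"},
            {"siteUrl": "{siteUrl}"},
            {"key": "{apiToken}"},
        ]
    },
}

# Example values; the API token here is a made-up placeholder.
print(
    build_tool_url(
        manifest,
        {
            "{datasetPid}": "doi:10.7910/DVN/TJCLKP",
            "{siteUrl}": "https://dataverse.harvard.edu",
            "{apiToken}": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
        },
    )
)
```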


siacus · 2023-01-19T22:20:06Z

This is different from running a container with a proper workflow inside. The Binder container probably contains some preinstalled version of Python/R and libs. Your script may or may not work, or it may work but produce different results. The Binder approach can be seen more as an exploratory tool than a reproducibility feature (which is what I am planning to have for DV). Definitely interesting.

Xarthisius · 2023-01-19T22:51:34Z

> This is different from running a container with a proper workflow inside. The Binder container probably contains some preinstalled version of Python/R and libs. Your script may or may not work, or it may work but produce different results. The Binder approach can be seen more as an exploratory tool than a reproducibility feature (which is what I am planning to have for DV). Definitely interesting.

You piqued my interest when you mentioned "reproducibility". Maybe you'd be interested in exploring Whole Tale, which integrates with DV directly and is "reproducibility-oriented" ?

pdurbin · 2023-01-19T23:08:19Z

@siacus Binder might be more flexible than you imagine. 😄

You can definitely specify the version of Python/R and libs:

You can even specify your own Dockerfile:

https://mybinder.readthedocs.io/en/latest/tutorials/dockerfile.html

And here are some docs on reproducibility:

https://mybinder.readthedocs.io/en/latest/tutorials/reproducibility.html

That said, I'd certainly be happy to enable a Whole Tale button on Harvard Dataverse too, like @Xarthisius suggests! We already enabled it on demo:

Add Whole Tale to Dataverse demo site dataverse#6446

Craig Willis recorded a couple nice videos about using Whole Tale with Dataverse:

craig-willis · 2023-01-19T23:32:34Z

@siacus I'd certainly be interested in hearing your thoughts on reproducibility and Dataverse (maybe outside of this issue, if interested). The Whole Tale project has been working on this for a while, but never fully closed the loop with Dataverse. (@Xarthisius Binder/DV integration actually comes out of WT.)

I don't want to hijack the issue here, but would be happy to share information about the similarities/differences between the platforms.

atrisovic · 2023-01-20T19:10:31Z

Hi All, very happy to see we are bringing back this discussion, and that @siacus is thinking about reproducibility. I have thought about it and have a lot of opinions on it!

I just left a comment on #6807 and this is what I'm thinking:

We can enable proactive reproducibility in Dataverse at the dataset upload stage, either by prompting the uploader for external dependencies or by adding a standard Dockerfile based on what is in the dataset. I can expand on this later as it is not immediately related to Binder.

However, Dataverse already has thousands of 'replication datasets' that contain code without an environment. This is why we should think of ways to enable retroactive reproducibility as best as we can. This can already be facilitated with Binder. In particular, for datasets that contain R/Python code or specific file formats (NetCDF/parquet/Rdata) it would be very useful to get a Dockerfile/env.yml on the fly so that a ready-to-go environment is initiated in Binder. I wonder if there is a spot in the Dataverse transfer to Binder where that could be smoothly implemented.
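As a very rough illustration of the "on the fly" idea for Python code, one could scan a dataset's .py files for imports and draft a requirements.txt from them. This is only a sketch under several assumptions: the function names are hypothetical, it needs Python 3.10+ to filter out the standard library, and import names do not always match package names (for example, sklearn is installed as scikit-learn), so the output would need review before use.

```python
import ast
import sys
from pathlib import Path


def third_party_imports(dataset_dir):
    """Return sorted top-level modules imported by .py files in dataset_dir.

    Standard-library modules and modules defined inside the dataset itself
    are left out.
    """
    # sys.stdlib_module_names exists on Python 3.10+; older versions filter nothing.
    stdlib = set(getattr(sys, "stdlib_module_names", ()))
    files = list(Path(dataset_dir).rglob("*.py"))
    local = {path.stem for path in files}
    found = set()
    for path in files:
        try:
            tree = ast.parse(path.read_text(encoding="utf-8", errors="replace"))
        except (SyntaxError, ValueError):
            continue  # skip files this Python cannot parse (e.g. Python 2 code)
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
                names = [node.module]
            else:
                continue
            found.update(name.split(".")[0] for name in names)
    return sorted(found - stdlib - local)


def write_requirements(dataset_dir):
    """Write a draft, unpinned requirements.txt at the top of the dataset."""
    out = Path(dataset_dir) / "requirements.txt"
    lines = [f"{module}\n" for module in third_party_imports(dataset_dir)]
    out.write_text("".join(lines), encoding="utf-8")
    return out
```

A draft like this pins no versions, so it helps the code start in Binder but does not guarantee the original results. Something similar for R would need a different parser.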

pdurbin · 2023-01-20T20:59:52Z

@atrisovic a while ago I experimented with opening Binder with a specific version of R and the right libs. Then I tried with Python:

Definitely an (old) work in progress, so I'm not sure if it even works anymore, but in short: for R you can have a file called install.R, and for Python you can have requirements.txt. That said, I'd check the Binder docs for the latest. 😄

@craig-willis I'm just waiting for you to invite me back to Chicago for another reproducibility workshop! 🎉

pdurbin · 2023-01-27T15:51:28Z

@siacus and I just discussed this issue. He wants it in the current sprint, so I added it.

Again, it's a 3 (or less): just a curl command to load the JSON file above and enable Binder as an external tool.

@sbarbosadataverse heads up that this will be a new feature for Harvard Dataverse! The same one I demo'ed for you the other day.

He also expressed that he'd like the following issue to be done during this sprint too:

Explore button for Binder dataverse#6807

That's the one that had already been put in the sprint.

pdurbin · 2023-01-27T17:49:58Z

@landreev added the button (thanks!!) and I just tried it out on my dataset at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TJCLKP

It worked great! Easy as 1, 2, 3. Here are some screenshots:

landreev · 2023-01-27T19:36:57Z

Should I click "close" on this, or is there something else that needs to be done?

pdurbin · 2023-01-27T19:53:09Z

@landreev from my perspective, we're all set! I say close it.

When I was demo'ing this to @siacus this morning I was saying we could consider this a "soft launch" of the Binder button in Harvard Dataverse. That is, we could ping people here and there who we think might be interested.

It occurs to me that I should try one of @atrisovic 's datasets, since she wrote the book on reproducibility and Dataverse. 😄

I'm able to step through (execute) the Jupyter notebook in her "Word cloud of Reproducibility and Replicability in Science" dataset at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/HOLVXA

(Note that she already added, years ago, an HTML Binder button manually with "Run this code on Jupyter Binder here".)

Here are some screenshots:

@sbarbosadataverse this seems like a good one to demo, much better than my "Open Source at Harvard" dataset (but I'll try to improve based on Ana's teachings!). 😄

The other dataset by @atrisovic I tried was "Replication Data for: Repository approaches to improving quality of shared data and code" at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/EA3LC5

At the moment, I'm getting a weird "too many redirects" error so maybe Binder is having some trouble:

@landreev anyway, I'd say it's fine to close this issue. Thanks again!

mreekie · 2023-03-30T19:12:15Z

grooming:

added the netcdf deliverable tag.

ekcomputer · 2023-11-21T17:36:50Z

Is there a reason why some datasets don't have a Binder option? e.g. https://dataverse.harvard.edu/file.xhtml?fileId=7200392&version=1.0

pdurbin · 2023-11-21T18:29:39Z

@ekcomputer you have to navigate to the parent (from the data file to the dataset) to see the Binder button:

The terminology (#4274) can be confusing.

data files are in datasets
datasets are in collections

pdurbin mentioned this issue Jan 19, 2023

Explore button for Binder IQSS/dataverse#6807

Closed

pdurbin mentioned this issue Jan 20, 2023

Add Whole Tale button for operating on datasets with Jupyter notebooks, Python, R, etc. #208 #209

Open

pdurbin added the Size: 3 A percentage of a sprint. label Jan 27, 2023

pdurbin added this to the 5.13 milestone Jan 27, 2023

pdurbin assigned landreev Jan 27, 2023

pdurbin mentioned this issue Jan 27, 2023

Find a place for DV external tool manifest redirecting for binder data-exp-lab/girder_ythub#10

Closed

mreekie added this to IQSS Dataverse Project Jan 27, 2023

mreekie moved this to 4️⃣▶⏱In This Sprint in IQSS Dataverse Project Jan 27, 2023

landreev removed their assignment Jan 27, 2023

pdurbin mentioned this issue Jan 30, 2023

in the docs, add Binder as an external tool #6807 IQSS/dataverse#9341

Merged

landreev self-assigned this Jan 30, 2023

landreev closed this as completed Jan 30, 2023

landreev removed their assignment Jan 30, 2023

pdurbin mentioned this issue Feb 1, 2023

Files used in notebooks are not all in the Dataverse yet AIRCentre/JuliaEO#36

Open

mreekie moved this from ▶Sprint Kickoff! to 🚮Clear of the Backlog in IQSS Dataverse Project Feb 6, 2023

This was referenced Feb 7, 2023

Dataverse content provider: download files in original format jupyterhub/repo2docker#1242

Closed

Fix Binder and Whole Tale (repo2docker) to download original files rather than archival .tab files IQSS/dataverse#9374

Closed

mreekie added the pm.netcdf-hdf5.d NIH ODSS supplement (NetCDF) label Mar 30, 2023
