Add Binder button for operating on datasets with Jupyter notebooks, Python, R, etc. #208
Comments
This is different from running a container with a proper workflow inside. The Binder container probably contains some preinstalled version of Python/R and libs. Your script may or may not work, or it may work but produce a different result. The Binder approach can be seen more as an exploratory tool than a reproducibility feature (which is what I am planning to have for DV). Definitely interesting. |
You piqued my interest when you mentioned "reproducibility". Maybe you'd be interested in exploring Whole Tale, which integrates with DV directly and is "reproducibility-oriented"? |
@siacus Binder might be more flexible than you imagine. 😄 You can definitely specify the version of Python/R and libs:
You can even specify your own Dockerfile: And here are some docs on reproducibility: That said, I'd certainly be happy to enable a Whole Tale button on Harvard Dataverse too, like @Xarthisius suggests! We already enabled it on demo: Craig Willis recorded a couple nice videos about using Whole Tale with Dataverse: |
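To make the version-pinning point concrete, here is a minimal sketch of the configuration files that repo2docker (the tool Binder uses to build images) looks for in the repository root: `runtime.txt`, `install.R`, and `requirements.txt`. The helper names, version numbers, snapshot date, and package names below are placeholders chosen for illustration, not values taken from this thread.

```python
def r_binder_config(r_version, snapshot_date, packages):
    """Return {filename: content} for a minimal R setup.

    runtime.txt pins the R version and a package snapshot date;
    install.R lists the packages to install at build time.
    """
    runtime = f"r-{r_version}-{snapshot_date}\n"
    install = "".join(f'install.packages("{name}")\n' for name in packages)
    return {"runtime.txt": runtime, "install.R": install}


def python_binder_config(python_version, requirements):
    """Return {filename: content} for a minimal Python setup.

    runtime.txt pins the Python version; requirements.txt pins
    each library to an exact version.
    """
    runtime = f"python-{python_version}\n"
    pinned = "".join(f"{name}=={version}\n" for name, version in requirements.items())
    return {"runtime.txt": runtime, "requirements.txt": pinned}


if __name__ == "__main__":
    # Placeholder versions and packages, for illustration only.
    for filename, content in r_binder_config("4.1", "2022-06-23", ["dplyr"]).items():
        print(f"--- {filename} ---\n{content}", end="")
```

Writing these files next to the code in a dataset should give Binder enough information to build a matching environment, though whether old scripts actually reproduce still depends on the packages themselves.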
@siacus I'd certainly be interested in hearing your thoughts on reproducibility and Dataverse (maybe outside of this issue, if interested). The Whole Tale project has been working on this for a while, but never fully closed the loop with Dataverse. (@Xarthisius Binder/DV integration actually comes out of WT.) I don't want to hijack the issue here, but would be happy to share information about the similarities/differences between the platforms. |
Hi All, very happy to see we are bringing back this discussion, and that @siacus is thinking about reproducibility. I have thought about it and have a lot of opinions on it! I just left a comment on #6807 and this is what I'm thinking: We can enable proactive reproducibility in Dataverse at dataset upload stage by either having a prompt to ask the uploader for external dependencies or by adding a standard Dockerfile based on what is in the dataset. I can expand on this later as it is not immediately related to Binder. However, Dataverse already has thousands of 'replication datasets' that contain code without an environment. This is why we should think of ways to enable retroactive reproducibility as best as we can. This can already be facilitated with Binder. In particular, for datasets that contain R/Python code or specific file formats (NetCDF/parquet/Rdata) it would be very useful to get a Dockerfile/env.yml on the fly so that a ready-to-go environment is initiated in Binder. I wonder if there is a spot in the Dataverse transfer to Binder where that could be smoothly implemented. |
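The "environment on the fly" idea above could be sketched roughly as follows: look at the file extensions in a dataset and emit a conda `environment.yml` that Binder can build from. This is only an illustration of the approach; the extension-to-package mapping, the function name, and the environment name are assumptions, and a real implementation would need to inspect the code itself for imports.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to conda-forge package names.
EXTENSION_PACKAGES = {
    ".py": ["python"],
    ".ipynb": ["python", "notebook"],
    ".r": ["r-base"],
    ".rdata": ["r-base"],
    ".nc": ["python", "netcdf4"],
    ".parquet": ["python", "pyarrow"],
}


def guess_environment_yml(filenames, name="dataset-env"):
    """Build environment.yml text from the file names found in a dataset."""
    packages = set()
    for filename in filenames:
        suffix = PurePosixPath(filename).suffix.lower()
        packages.update(EXTENSION_PACKAGES.get(suffix, []))
    if not packages:
        # Nothing recognized: fall back to a plain Python environment.
        packages.add("python")
    lines = [f"name: {name}", "channels:", "  - conda-forge", "dependencies:"]
    lines += [f"  - {package}" for package in sorted(packages)]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(guess_environment_yml(["code/analysis.py", "data/temps.nc"]), end="")
```

The output is deliberately unpinned; pinning versions retroactively would require extra information, such as the dataset's publication date.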
@atrisovic a while ago I experimented with opening Binder with a specific version of R and the right libs. Then I tried with Python:
Definitely an (old) work in progress, so I'm not sure if it even works anymore but in short for R you can have a file called @craig-willis I'm just waiting for you to invite me back to Chicago for another reproducibility workshop! 🎉 |
@siacus and I just discussed this issue. He wants it in the current sprint, so I added it. Again, it's a 3 (or less): just a curl command to add the JSON file above to enable Binder as an external tool. @sbarbosadataverse heads up that this will be a new feature for Harvard Dataverse! It's the same one I demo'ed for you the other day. He also expressed that he'd like the following issue to be done during this sprint too: That's the one that had already been put in the sprint. |
@landreev added the button (thanks!!) and I just tried it out on my dataset at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TJCLKP It worked great! Easy as 1, 2, 3. Here are some screenshots: |
Should I click "close" on this, or is there something else that needs to be done? |
@landreev from my perspective, we're all set! I say close it. When I was demo'ing this to @siacus this morning I was saying we could consider this a "soft launch" of the Binder button in Harvard Dataverse. That is, we could ping people here and there who we think might be interested. It occurs to me that I should try one of @atrisovic 's datasets, since she wrote the book on reproducibility and Dataverse. 😄 I'm able to step through (execute) the Jupyter notebook in her "Word cloud of Reproducibility and Replicability in Science" dataset at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/HOLVXA (Note that she already added, years ago, an HTML Binder button manually with "Run this code on Jupyter Binder here".) Here are some screenshots: @sbarbosadataverse this seems like a good one to demo, much better than my "Open Source at Harvard" dataset (but I'll try to improve based on Ana's teachings!). 😄 The other dataset by @atrisovic I tried was "Replication Data for: Repository approaches to improving quality of shared data and code" at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/EA3LC5 At the moment, I'm getting a weird "too many redirects" error so maybe Binder is having some trouble: @landreev anyway, I'd say it's fine to close this issue. Thanks again! |
grooming:
|
Is there a reason why some datasets don't have a Binder option? e.g. https://dataverse.harvard.edu/file.xhtml?fileId=7200392&version=1.0 |
@ekcomputer you have to navigate to the parent (from the data file to the dataset) to see the Binder button: The terminology (#4274) can be confusing.
|
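Since the Binder button lives on the dataset page rather than the file page, it can help to build the dataset URL directly from the DOI. The sketch below reproduces the `dataset.xhtml?persistentId=doi:...` form used by the links in this thread. The second helper assumes that mybinder.org accepts launch URLs of the form `/v2/dataverse/<doi>/`; treat that pattern, and both function names, as assumptions to verify against the BinderHub documentation.

```python
from urllib.parse import urlencode


def dataset_page_url(site_url, doi):
    """Return the dataset landing page URL, where the Binder button appears."""
    # Keep ':' and '/' unescaped so the URL matches the form used in this thread.
    query = urlencode({"persistentId": f"doi:{doi}"}, safe=":/")
    return f"{site_url.rstrip('/')}/dataset.xhtml?{query}"


def binder_launch_url(doi, binder_url="https://mybinder.org"):
    """Return a Binder launch URL for a Dataverse DOI (assumed URL pattern)."""
    return f"{binder_url.rstrip('/')}/v2/dataverse/{doi}/"


if __name__ == "__main__":
    doi = "10.7910/DVN/TJCLKP"
    print(dataset_page_url("https://dataverse.harvard.edu", doi))
    print(binder_launch_url(doi))
```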
Almost three years ago we made a big push thanks to @Xarthisius to teach https://mybinder.org to handle Dataverse datasets.
For anyone not familiar, Binder gives you an environment for computational reproducibility in the cloud, for free!
If there is code in the dataset, great, you can try to execute that code in Binder.
If there is no code in the dataset, no problem, by launching the dataset in Binder, you have an environment in which you can start exploring and writing code.
To be clear, the dataset author doesn't need to use Binder. Anyone wanting to explore the data (especially with code) can use Binder.
@siacus and I have been talking about containers (which Binder spins up for you) and I just gave a demo to @sbarbosadataverse on how we can add a Binder button to every dataset in Harvard Dataverse by loading up the Binder tool via curl (like any external tool).
First some screenshots. On a dev server, here's how a copy of my "Open Source at Harvard" dataset looks. Note that a "Binder" button is shown.
When you click the "Binder" button, in a new tab you'll see Binder spinning up a Docker container with the dataset (code, data, docs, etc.) in it…
This dataset happens to have two directories, code and data:
As a proof of concept, here's the execution of a Python script, but the sky is the limit in terms of which languages and tools you want to run, such as Jupyter notebooks:
To load up the tool, use curl as usual:
curl -X POST -H 'Content-type: application/json' http://localhost:8080/api/admin/externalTools --upload-file binder.json
Here's the external tool manifest:
(We are looking for a permanent home for this manifest at data-exp-lab/girder_ythub#10 which should help close IQSS/dataverse#6807 .)
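For anyone who would rather script the registration than run curl by hand, here is a rough Python equivalent of the command above, using only the standard library. It posts the contents of `binder.json` to the same `http://localhost:8080/api/admin/externalTools` endpoint with the same `Content-type` header. The function names are made up for this sketch, and the manifest itself is not reproduced here, so the code only checks that the file parses as JSON.

```python
import json
import urllib.request

# Same admin endpoint as in the curl command above.
ADMIN_URL = "http://localhost:8080/api/admin/externalTools"


def build_register_request(manifest_bytes, url=ADMIN_URL):
    """Build (but do not send) the POST request that registers an external tool."""
    json.loads(manifest_bytes)  # fail early if the manifest is not valid JSON
    return urllib.request.Request(
        url,
        data=manifest_bytes,
        headers={"Content-type": "application/json"},
        method="POST",
    )


def register_tool(manifest_path, url=ADMIN_URL):
    """Send the manifest file to the admin API and return (status, body)."""
    with open(manifest_path, "rb") as manifest_file:
        request = build_register_request(manifest_file.read(), url)
    with urllib.request.urlopen(request) as response:
        return response.status, response.read().decode("utf-8")


if __name__ == "__main__":
    # Requires a running Dataverse installation and a binder.json file.
    print(register_tool("binder.json"))
```

Splitting request construction from sending keeps the part that touches the network small, which makes the rest easy to check without a running server.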