Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stephen's edits #54

Merged
merged 3 commits into from
Apr 2, 2020
Merged

Stephen's edits #54

merged 3 commits into from
Apr 2, 2020

Conversation

sje30
Copy link
Collaborator

@sje30 sje30 commented Mar 11, 2020

hi daniel @nuest
here are some edits!

this is a mix of simple typos and comments.
search for my initials for the comments, which should also be in [...]
and then ask me to explain if anything unclear.
@@ -83,8 +88,7 @@ The differences between a helpful, stable `Dockerfile` and one that is misleadin
A commitment to this article's rules can ensure that workflows are reproducible and reusable, that computing environments are understandable by others, and researchers can collaborate effectively.
Their application should not be triggered by the publication of a finished project, but be weaved into day-to-day habits (cf. thoughts on openness as an afterthought by @chen_open_2019 and on computational reproducibility by @donoho_invitation_2010).

To start with, we assume you have a scripted scientific workflow, i.e. you can, at least at a certain point in time, execute the full process with a single command, for example `make my_workflow`, `Rscript analysis.R`, or `python3 paper-plots.py`.
This execution should be able to be triggered by command-line instruction, which is possible for all generic programming languages and widely used workflow tools, but may also be opening and starting a process with a graphical user interface.
To start with, we assume you have a scripted scientific workflow, i.e. you can, at least at a certain point in time, execute the full process with a single command, for example `make my_workflow`, `Rscript analysis.R`, or `python3 paper-plots.py`. (A key constraint is Docker containers can only run open source software, so environments like Mathematica and Matlab cannot currently be easily used within containers.) This execution should be able to be triggered by command-line instruction, which is possible for all generic programming languages and widely used workflow tools, but may also be opening and starting a process with a graphical user interface.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, in my experience most (what I would consider robust scientific workflows) don't get executed with one python or R script. It's more likely to see:

snakemake --use-conda <other params>
nextflow workflow.nf --in 'dataset/*.fa'
java -jar cromwell-XY.jar run myWorkflow.wdl

At least for workflows with sophisticted DAGS, we've gone well beyond <executable> <run my script> - the above abstracts away / misleads that there isn't complexity in suggesting that workflows are started like that. See https://docs.google.com/spreadsheets/d/1plkAsT_S3CzSeb7ivxyjRnHyrK3JclUCXeUMf_azraY/edit#gid=0

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also contradict ourselves by saying it opens a GUI but then directly after that GUIs are not reproducible.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all your comments today, v. helpful.
I agree that many workflows may use a variety of tools to assemble the results. Having an example Dockerfile showing that would be good. (Most of mine tend to be at the simpler end, with just some R scripts that need running.)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vsoch I've started an example Dockerfile in the latest draft - not sure how to capture your thoughts in the example though. I totally agree we should not oversimplify things!

I did make the sentence a bit more encompassing though, please check.

databases, you can readily find containers [SJE: where do you look to find containers?] that come installed with
all the software that you need. Second, in the case that there isn't
an existing container for your needs, you might next look to
well-maintained tools to help with `Dockerfile` generation. These
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good question and mostly a mess - I think people do Google searches and wind up at container registries, and the registries take them to GitHub somewhere to inspect the container recipe. Often the container recipes aren't provided and they use the base verbatim anyway.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we cover that in the introduction, where we say: because people look everywhere, we wrote this article.


## Tools for container generation

Repo2docker [@jupyter_binder_2018] is a tool that is maintained by [Project Jupyter](https://jupyter.org/) that can help to transform a repository with common simple configuration files for defining software dependencies and versioninto a container.
Repo2docker [@jupyter_binder_2018] is a tool that is maintained by [Project Jupyter](https://jupyter.org/) that can help to transform a [SJE github/gitlab? any other repos?] repository with common simple configuration files for defining software dependencies and versioninto a container.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From most places that people would use for version control, yes to all of those, see https://repo2docker.readthedocs.io/en/latest/

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One quick global comment - we need to establish early on the scope (small data science scripts) of this paper. We cannot open with a flowery and misleading statement about reproducible workflows because we don’t get anywhere near discussing scale, workflow execution, and generally more complex (real world) pipelines that use multiple containers, might interact with TB of data, and be run at scale with some cloud provider or HPC job manager or similar. We don’t want someone reading this that falls into that boot camp to roll their eyes because they intuit that the authors don’t have a clue about this scale of work.

agree re: this. Perhaps we could have a section at the start regarding a typical reader that might benefit from these tips?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! @nuest if you want to ping me when the rounds of edits / changes are done, I can do a final read over and I'll write this scoped section, which I think is super important. I'll open an issue and you can assign me to it so that we don't forget :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#55

@@ -144,6 +162,8 @@ It can provide a starting point for users unfamiliar with writing `Dockerfile`s,
# 2. Use versioned and automatically built base images {-}
\rulelabel{2}{rule:base}

[SJE: this feels WAY too long. can't we say it much shorter, i.e. it is good to pin version numbers if you want a stable archive of reproducible code, or use "latest" to check that your code continues to work as other software environments evolve. I think having a table here with some recommendations for e.g. python, R and julia, base images would go a long way to practically serve about 80% + of data scienceusers? The bullet pointed list at the end of the section is probably most useful.]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want a table - the state of versions change so quickly it would be outdated before the paper comes out.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @sje30 that it's too long. What seems out of scope for this section is going from tags /versions to talking about automated builds. I would nix that content or move it elsewhere.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: table. I'm okay about not having it. But if the table will get out of date, presumably so will the example Dockerfiles that we provide?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most definitely - although they are provided in a repository so (in the case that the repository is kept up to date) the paper linking to them would be as well. The references to tags (e.g., we reference a python tag) will definitely be dated, but I see this as different from a table, because a table is intended to be as a lookup / reference, which we don't want to provide. This is different from mentioning an example in the text that a user won't come back looking for.

@@ -292,7 +316,7 @@ A recommended ordering based on these considerations is as follows.
4. labels
5. `RUN`/`ENTRYPOINT`

Finally, as a supplement to content inside the `Dockerfile`, it is good practice to also write a section in a `README` alongside the `Dockerfile` for exactly how to build, run, and otherwise interact with the container.
Finally, as a supplement to content inside the `Dockerfile`, it is good practice to also write a section in a `README` alongside the `Dockerfile` for exactly how to build, run, and otherwise interact with the container. [SJE: really? Doesn't that mean you end up writing the instructions twice -- once in the Dockerfile as a comment, and once in the README? Surely the README could just say: "Please read the end of the Dockerfile for instructions on how to run the docker image."]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably the redundancy is good, although problematic for the points you raise. Some users will read the Dockerfile, and others the README, and sometimes the Dockerfile is separated from the README. It's messy at best.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think referencing the instructions from the README is a nice idea. It also forces readers to take a look at the Dockerfile and know that it is there. OTOH it's a matter of taste, so maybe we just point out options and risks?

Finally, as a supplement to content inside the Dockerfile, it is good practice to also write a section in a README about how to use the container. This section may duplicate the documentation from the end of the Dockerfile at the risk of diverging and inconsistency, or simply point the reader to the respective lines in the Dockerfile.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

@@ -355,7 +379,7 @@ It is common to install all software as the default user `root`.
In fact it is _required_ for system dependencies.
You can switch the user running subsequent `RUN`, `CMD` and `ENTRYPOINT` instructions with the `USER` instruction, but you have to make sure yourself that the user exists within the image.
This is useful if a software aborts installations as the `root` user, or to avoid permission problems when using the container (see&nbsp;\ruleref{rule:mount}).
Be sure to document a non-default user in detail and to examine the used users in base images if you encounter errors on user permissions.
Be sure to document a non-default user in detail and to examine the used users in base images if you encounter errors on user permissions. [SJE: okay, this para confused me -- just tell me what to do! should I not bother creating a user, and run as root, or should I create a user? rstudio complictes matters here I think, don't they create a "rstudio" username?]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For most scientific containers that run on HPC, the conversion to Singularity makes the user directive useless. For docker containers that run on the host, using a user with specific permissions usually makes it harder for the person to get it working. If you are going to use Docker and take the security implications with that, I think the path to least resistance is using standard root, but not installing things in a way that would break if run in a user space (e.g., no executables in root's home, etc.)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, Rocker images have an rstudio user. Not having a non-root user can also be pretty annoying if you mount files into containers and cannot delete them (without sudo-ing) on the host afterwards...

I like the suggestion ("make sure root works"), and we should update the text accordingly. At the same time we should concisely mentioning the potential complications of users, because the reader might come across them via used base images.

@@ -418,6 +442,9 @@ For example, software versions, maintainer contact information, along with vendo
The OCI Image Format Specification provides some common label keys (see the "Annotations" section in @opencontainers_image-spec_2017) to help standardize field names across container tools, as shown below.
These labels match the `org.label-schema.*` specification, which has been deprecated in favour or the new namespace but are still found a lot in existing containers.


[SJE: this was new to me. On the one hand, I like the idea of LABEL, but have they really been adopted by the community? The document from 2017 cited above hasn't moved on since Jun 2017. Is this really a thing? And is the Dockerfile the right place for all this metadata? It feels like a bit too fine-grained for my liking. But if this is good practice, then I shall start adopting it! ]

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LABEL directives are slow to be adopted, not adopted by most users for quick work, but if you look around they are used. The problem with labels isn't specific to labels, but reflects an overall systemic problem of people not documenting / setting metadata generally.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I'm guilty of not providing metadata, but part of doing this is to get people (including me) to adopt good habits, so I'll learn...

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think whatever rules I come up with providing metadata, users are likely to ignore them :-) So I gave up the extra rule for "metadata", and

  1. moved the "build arguments" content to the "Favour clarity" rule and stopped recommending them.
  2. shortend the Label-content and moved it up to the "document" rule
  3. Moved the "formatting" part out of the new "document" rule

A lot of reorganising trigger by this comment :-) Hope you agree it is an improvement.

@@ -473,6 +500,9 @@ With plain Docker, you can override the defaults as part of the `docker run` com
In any case you should document different variants, if you choose to provide them, in a `Makefile` [@wikipedia_contributors_make_2019].
To support both one click execution and interactive interfaces and even allow for custom configuration, it's helpful to expose settings via a configuration file which can be bound from the host, via environment variables [@knoth_reproducibility_2017], or special Docker-based wrappers such as Kliko [@molenaar_klikoscientific_2018].


[SJE: I think I get the two style of working -- batch and interactive might be my way of thinking about them. But it feels a bit clunky above and certainly too detailed. might be better served with two example dockerfiles: once doing a batch analysis, and the other whose goal is to allow some interaction.]

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a bit contradictory to talk about reproducible execution and then advocate for any graphical user interface. If someone brings a sys admin a python notebook someone will undoubtably roll in their grave.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the example though of a docker container that generates a pdf (from a knitr doucument) but then also allows the user to run rstudio to further edit the knitr doc.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah I understand what you mean - yes that's a very good point! Just don't run a rockr rstudio container that is converted to singularity, results in a very bad outcome :P

@@ -524,6 +556,9 @@ This variety of approaches render seemingly more convenient uncontainerised envi
# 8. Establish templates for new projects {-}
\rulelabel{8}{rule:templates}


[SJE: this seemed a bit lacking in meat, compared to the other rules. Do you have an example Dockerfile template that you are thinking of?]

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This rule says to me essentially what we've said in another rule - "Don't start with a blank file, copy paste something that is almost what you want." I would nix this section and replace with something about automation to clean up / shorten the section that is too long.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've moved the "rich" ideas to other sections and dropped this rule.

@@ -617,14 +656,14 @@ Together we can develop common practices and shared materials for better transpa

# Acknowledgements {#acknowledgements .unnumbered}

DN is supported by the project Opening Reproducible Research II (\href{https://o2r.info/}{https://o2r.info/}; \href{https://www.uni-muenster.de/forschungaz/project/12343}{https://www.uni-muenster.de/forschungaz/project/12343}) funded by the German Research Foundation (DFG) under project number PE 1632/17-1.
DN is supported by the project Opening Reproducible Research II (\href{https://o2r.info/}{https://o2r.info/}; \href{https://www.uni-muenster.de/forschungaz/project/12343}{https://www.uni-muenster.de/forschungaz/project/12343}) funded by the German Research Foundation (DFG) under project number PE 1632/17-1. DN and SJE are supported by a Mozilla mini science grant.

# Contributions {#contributions .unnumbered}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vsoch is running on fumes because she is a research software engineer with no way to self-fund other than begging for... oh wait nevermind. :)

@vsoch
Copy link
Collaborator

vsoch commented Mar 11, 2020

One quick global comment - we need to establish early on the scope (small data science scripts) of this paper. We cannot open with a flowery and misleading statement about reproducible workflows because we don’t get anywhere near discussing scale, workflow execution, and generally more complex (real world) pipelines that use multiple containers, might interact with TB of data, and be run at scale with some cloud provider or HPC job manager or similar. We don’t want someone reading this that falls into that boot camp to roll their eyes because they intuit that the authors don’t have a clue about this scale of work.

@nuest nuest changed the title Sje Stephen's edits Mar 12, 2020
@nuest
Copy link
Owner

nuest commented Mar 12, 2020

Thanks @sje30 for the comments and pointers, and @vsoch for already pushing ideas further. I think there are some really good ideas for improvement here!

@vsoch How should we proceed? I'd suggest to merge this as is, and then we can take on the concrete suggestions made: You could do the intro (#55) and I could handle what Stephen and you discussed (working on the "[SJE:...]" stuff), including the suggestion to sacrifice the "template" rule for splitting up another long rule. Both of us can

Should we do this with HackMD's GitHub integration, so we can work simultaneously?

@vsoch
Copy link
Collaborator

vsoch commented Mar 12, 2020

I agree to merge, and my preference would be to do my changes after you are finished. Don’t worry about time, I can be pretty speedy.

@nuest nuest merged commit dc07543 into master Apr 2, 2020
databases, you can readily find containers [SJE: where do you look to find containers?] that come installed with
all the software that you need. Second, in the case that there isn't
an existing container for your needs, you might next look to
well-maintained tools to help with `Dockerfile` generation. These
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we cover that in the introduction, where we say: because people look everywhere, we wrote this article.

@@ -83,8 +88,7 @@ The differences between a helpful, stable `Dockerfile` and one that is misleadin
A commitment to this article's rules can ensure that workflows are reproducible and reusable, that computing environments are understandable by others, and researchers can collaborate effectively.
Their application should not be triggered by the publication of a finished project, but be weaved into day-to-day habits (cf. thoughts on openness as an afterthought by @chen_open_2019 and on computational reproducibility by @donoho_invitation_2010).

To start with, we assume you have a scripted scientific workflow, i.e. you can, at least at a certain point in time, execute the full process with a single command, for example `make my_workflow`, `Rscript analysis.R`, or `python3 paper-plots.py`.
This execution should be able to be triggered by command-line instruction, which is possible for all generic programming languages and widely used workflow tools, but may also be opening and starting a process with a graphical user interface.
To start with, we assume you have a scripted scientific workflow, i.e. you can, at least at a certain point in time, execute the full process with a single command, for example `make my_workflow`, `Rscript analysis.R`, or `python3 paper-plots.py`. (A key constraint is Docker containers can only run open source software, so environments like Mathematica and Matlab cannot currently be easily used within containers.) This execution should be able to be triggered by command-line instruction, which is possible for all generic programming languages and widely used workflow tools, but may also be opening and starting a process with a graphical user interface.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vsoch I've started an example Dockerfile in the latest draft - not sure how to capture your thoughts in the example though. I totally agree we should not oversimplify things!

I did make the sentence a bit more encompassing though, please check.

@@ -292,7 +316,7 @@ A recommended ordering based on these considerations is as follows.
4. labels
5. `RUN`/`ENTRYPOINT`

Finally, as a supplement to content inside the `Dockerfile`, it is good practice to also write a section in a `README` alongside the `Dockerfile` for exactly how to build, run, and otherwise interact with the container.
Finally, as a supplement to content inside the `Dockerfile`, it is good practice to also write a section in a `README` alongside the `Dockerfile` for exactly how to build, run, and otherwise interact with the container. [SJE: really? Doesn't that mean you end up writing the instructions twice -- once in the Dockerfile as a comment, and once in the README? Surely the README could just say: "Please read the end of the Dockerfile for instructions on how to run the docker image."]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

@@ -418,6 +442,9 @@ For example, software versions, maintainer contact information, along with vendo
The OCI Image Format Specification provides some common label keys (see the "Annotations" section in @opencontainers_image-spec_2017) to help standardize field names across container tools, as shown below.
These labels match the `org.label-schema.*` specification, which has been deprecated in favour or the new namespace but are still found a lot in existing containers.


[SJE: this was new to me. On the one hand, I like the idea of LABEL, but have they really been adopted by the community? The document from 2017 cited above hasn't moved on since Jun 2017. Is this really a thing? And is the Dockerfile the right place for all this metadata? It feels like a bit too fine-grained for my liking. But if this is good practice, then I shall start adopting it! ]

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think whatever rules I come up with providing metadata, users are likely to ignore them :-) So I gave up the extra rule for "metadata", and

  1. moved the "build arguments" content to the "Favour clarity" rule and stopped recommending them.
  2. shortend the Label-content and moved it up to the "document" rule
  3. Moved the "formatting" part out of the new "document" rule

A lot of reorganising trigger by this comment :-) Hope you agree it is an improvement.

@@ -601,6 +638,8 @@ You can do this shortly before submitting your reproducible workflow for peer-re

To demonstrate the ten rules, we maintain a GitHub repository with example `Dockerfile`s, some of which we took from public repositories and updated to adhere to the rules (see `Dockerfile.original`): [https://github.com/nuest/ten-simple-rules-dockerfiles/](https://github.com/nuest/ten-simple-rules-dockerfiles/)

[SJE: I would like to see these annotated. Perhaps this is something I can meaningfully do?]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if you could take a stab at finding Dockerfiles we can both "test" the rules and demonstrate them. See also #4

@@ -524,6 +556,9 @@ This variety of approaches render seemingly more convenient uncontainerised envi
# 8. Establish templates for new projects {-}
\rulelabel{8}{rule:templates}


[SJE: this seemed a bit lacking in meat, compared to the other rules. Do you have an example Dockerfile template that you are thinking of?]

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've moved the "rich" ideas to other sections and dropped this rule.

@@ -584,6 +619,8 @@ Third, from time to time you can reduce the system resources occupied by Docker
After a prune is performed, it follows naturally to rebuild a container for local usage, or to pull it again from a newly built registry image.
This habit can be automated with a cron job [@wikipedia_contributors_cron_2019].


[SJE: "sure, okay, I'd like to export an image. Give me a recipe to show me how to do it, e.g. for this paper!"]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sje30 Do you suggest to export an image to render the article to Zenodo and use that as a live example here?

@@ -490,7 +520,9 @@ docker run -p 8888:8888 jupyter/datascience-notebook:7a0c7325e470
[I 15:44:31.323 NotebookApp] Use Control-C to stop this server and [..]
```

A person who is unfamiliar with Docker but wants to use your image may rely on graphical tools like Kitematic [@docker_kitematic_2019] or [ContainDS](https://containds.com/) for assisstance in managing containers on their machine without using the Docker CLI.
[SJE: that's good, does rstudio do similar? need to check.]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sje30 No, rstudio containers do not do that, you always have to login manually.

nuest added a commit that referenced this pull request Apr 2, 2020
@nuest
Copy link
Owner

nuest commented Apr 2, 2020

@sje30 @vsoch I added a few comments to document my changes, which are now in the master branch: #54 (review)

@vsoch Can you do the next iteration? IMO the structure improved thanks to you and Stephen discussing, and now we need to cut out some content to make the size manageable.

Let me know when you will do this - if you can only do it next week then I'll try to resolve some of the open issues, namely #16 #17 #18 #22

@vsoch
Copy link
Collaborator

vsoch commented Apr 2, 2020

sure! I can do another pass over and PR today. Stay tuned!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants