Stephen's edits #54

sje30 · 2020-03-11T10:55:55Z

hi daniel @nuest
here are some edits!

this is a mix of simple typos and comments. search for my initials for the comments, which should also be in [...] and then ask me to explain if anything unclear.

vsoch · 2020-03-11T11:05:38Z

ten-simple-rules-dockerfiles.Rmd

@@ -83,8 +88,7 @@ The differences between a helpful, stable `Dockerfile` and one that is misleadin
 A commitment to this article's rules can ensure that workflows are reproducible and reusable, that computing environments are understandable by others, and researchers can collaborate effectively.
 Their application should not be triggered by the publication of a finished project, but be weaved into day-to-day habits (cf. thoughts on openness as an afterthought by @chen_open_2019 and on computational reproducibility by @donoho_invitation_2010).

-To start with, we assume you have a scripted scientific workflow, i.e. you can, at least at a certain point in time, execute the full process with a single command, for example `make my_workflow`, `Rscript analysis.R`, or `python3 paper-plots.py`.
-This execution should be able to be triggered by command-line instruction, which is possible for all generic programming languages and widely used workflow tools, but may also be opening and starting a process with a graphical user interface.
+To start with, we assume you have a scripted scientific workflow, i.e. you can, at least at a certain point in time, execute the full process with a single command, for example `make my_workflow`, `Rscript analysis.R`, or `python3 paper-plots.py`. (A key constraint is Docker containers can only run open source software, so environments like Mathematica and Matlab cannot currently be easily used within containers.) This execution should be able to be triggered by command-line instruction, which is possible for all generic programming languages and widely used workflow tools, but may also be opening and starting a process with a graphical user interface.


hmm, in my experience most (what I would consider robust scientific workflows) don't get executed with one python or R script. It's more likely to see:

snakemake --use-conda <other params> nextflow workflow.nf --in 'dataset/*.fa' java -jar cromwell-XY.jar run myWorkflow.wdl

At least for workflows with sophisticted DAGS, we've gone well beyond <executable> <run my script> - the above abstracts away / misleads that there isn't complexity in suggesting that workflows are started like that. See https://docs.google.com/spreadsheets/d/1plkAsT_S3CzSeb7ivxyjRnHyrK3JclUCXeUMf_azraY/edit#gid=0

We also contradict ourselves by saying it opens a GUI but then directly after that GUIs are not reproducible.

Thanks for all your comments today, v. helpful.
I agree that many workflows may use a variety of tools to assemble the results. Having an example Dockerfile showing that would be good. (Most of mine tend to be at the simpler end, with just some R scripts that need running.)

@vsoch I've started an example Dockerfile in the latest draft - not sure how to capture your thoughts in the example though. I totally agree we should not oversimplify things!

I did make the sentence a bit more encompassing though, please check.

ten-simple-rules-dockerfiles.Rmd

vsoch · 2020-03-11T11:10:38Z

ten-simple-rules-dockerfiles.Rmd

+databases, you can readily find containers [SJE: where do you look to find containers?] that come installed with
+all the software that you need.  Second, in the case that there isn't
+an existing container for your needs, you might next look to
+well-maintained tools to help with `Dockerfile` generation.  These


This is a good question and mostly a mess - I think people do Google searches and wind up at container registries, and the registries take them to GitHub somewhere to inspect the container recipe. Often the container recipes aren't provided and they use the base verbatim anyway.

I think we cover that in the introduction, where we say: because people look everywhere, we wrote this article.

vsoch · 2020-03-11T11:12:15Z

ten-simple-rules-dockerfiles.Rmd


 ## Tools for container generation

-Repo2docker [@jupyter_binder_2018] is a tool that is maintained by [Project Jupyter](https://jupyter.org/) that can help to transform a repository with common simple configuration files for defining software dependencies and versioninto a container.
+Repo2docker [@jupyter_binder_2018] is a tool that is maintained by [Project Jupyter](https://jupyter.org/) that can help to transform a [SJE github/gitlab? any other repos?] repository with common simple configuration files for defining software dependencies and versioninto a container.


From most places that people would use for version control, yes to all of those, see https://repo2docker.readthedocs.io/en/latest/

One quick global comment - we need to establish early on the scope (small data science scripts) of this paper. We cannot open with a flowery and misleading statement about reproducible workflows because we don’t get anywhere near discussing scale, workflow execution, and generally more complex (real world) pipelines that use multiple containers, might interact with TB of data, and be run at scale with some cloud provider or HPC job manager or similar. We don’t want someone reading this that falls into that boot camp to roll their eyes because they intuit that the authors don’t have a clue about this scale of work.

agree re: this. Perhaps we could have a section at the start regarding a typical reader that might benefit from these tips?

Awesome! @nuest if you want to ping me when the rounds of edits / changes are done, I can do a final read over and I'll write this scoped section, which I think is super important. I'll open an issue and you can assign me to it so that we don't forget :)

vsoch · 2020-03-11T11:13:08Z

ten-simple-rules-dockerfiles.Rmd

@@ -144,6 +162,8 @@ It can provide a starting point for users unfamiliar with writing `Dockerfile`s,
 # 2. Use versioned and automatically built base images {-}
 \rulelabel{2}{rule:base}

+[SJE: this feels WAY too long.  can't we say it much shorter, i.e. it is good to pin version numbers if you want a stable archive of reproducible code, or use "latest" to check that your code continues to work as other software environments evolve.  I think having a table here with some recommendations for e.g. python, R and julia, base images would go a long way to practically serve about 80% + of data scienceusers? The bullet pointed list at the end of the section is probably most useful.]


I don't think we want a table - the state of versions change so quickly it would be outdated before the paper comes out.

I agree with @sje30 that it's too long. What seems out of scope for this section is going from tags /versions to talking about automated builds. I would nix that content or move it elsewhere.

re: table. I'm okay about not having it. But if the table will get out of date, presumably so will the example Dockerfiles that we provide?

Most definitely - although they are provided in a repository so (in the case that the repository is kept up to date) the paper linking to them would be as well. The references to tags (e.g., we reference a python tag) will definitely be dated, but I see this as different from a table, because a table is intended to be as a lookup / reference, which we don't want to provide. This is different from mentioning an example in the text that a user won't come back looking for.

vsoch · 2020-03-11T11:18:28Z

ten-simple-rules-dockerfiles.Rmd

@@ -292,7 +316,7 @@ A recommended ordering based on these considerations is as follows.
 4. labels
 5. `RUN`/`ENTRYPOINT`

-Finally, as a supplement to content inside the `Dockerfile`, it is good practice to also write a section in a `README` alongside the `Dockerfile` for exactly how to build, run, and otherwise interact with the container. 
+Finally, as a supplement to content inside the `Dockerfile`, it is good practice to also write a section in a `README` alongside the `Dockerfile` for exactly how to build, run, and otherwise interact with the container.  [SJE: really?  Doesn't that mean you end up writing the instructions twice -- once in the Dockerfile as a comment, and once in the README?  Surely the README could just say:  "Please read the end of the Dockerfile for instructions on how to run the docker image."]


Probably the redundancy is good, although problematic for the points you raise. Some users will read the Dockerfile, and others the README, and sometimes the Dockerfile is separated from the README. It's messy at best.

I think referencing the instructions from the README is a nice idea. It also forces readers to take a look at the Dockerfile and know that it is there. OTOH it's a matter of taste, so maybe we just point out options and risks?

Finally, as a supplement to content inside the Dockerfile, it is good practice to also write a section in a README about how to use the container. This section may duplicate the documentation from the end of the Dockerfile at the risk of diverging and inconsistency, or simply point the reader to the respective lines in the Dockerfile.

vsoch · 2020-03-11T11:20:10Z

ten-simple-rules-dockerfiles.Rmd

@@ -355,7 +379,7 @@ It is common to install all software as the default user `root`.
 In fact it is _required_ for system dependencies.
 You can switch the user running subsequent `RUN`, `CMD` and `ENTRYPOINT` instructions with the `USER` instruction, but you have to make sure yourself that the user exists within the image.
 This is useful if a software aborts installations as the `root` user, or to avoid permission problems when using the container (see&nbsp;\ruleref{rule:mount}).
-Be sure to document a non-default user in detail and to examine the used users in base images if you encounter errors on user permissions.
+Be sure to document a non-default user in detail and to examine the used users in base images if you encounter errors on user permissions.  [SJE: okay, this para confused me -- just tell me what to do!  should I not bother creating a user, and run as root, or should I create a user?  rstudio complictes matters here I think, don't they create a "rstudio" username?]


For most scientific containers that run on HPC, the conversion to Singularity makes the user directive useless. For docker containers that run on the host, using a user with specific permissions usually makes it harder for the person to get it working. If you are going to use Docker and take the security implications with that, I think the path to least resistance is using standard root, but not installing things in a way that would break if run in a user space (e.g., no executables in root's home, etc.)

Yes, Rocker images have an rstudio user. Not having a non-root user can also be pretty annoying if you mount files into containers and cannot delete them (without sudo-ing) on the host afterwards...

I like the suggestion ("make sure root works"), and we should update the text accordingly. At the same time we should concisely mentioning the potential complications of users, because the reader might come across them via used base images.

vsoch · 2020-03-11T11:21:26Z

ten-simple-rules-dockerfiles.Rmd

@@ -418,6 +442,9 @@ For example, software versions, maintainer contact information, along with vendo
 The OCI Image Format Specification provides some common label keys (see the "Annotations" section in @opencontainers_image-spec_2017) to help standardize field names across container tools, as shown below.
 These labels match the `org.label-schema.*` specification, which has been deprecated in favour or the new namespace but are still found a lot in existing containers.

+
+[SJE: this was new to me.  On the one hand, I like the idea of LABEL, but have they really been adopted by the community?  The document from 2017 cited above hasn't moved on since Jun 2017.  Is this really a thing?  And is the Dockerfile the right place for all this metadata?  It feels like a bit too fine-grained for my liking.  But if this is good practice, then I shall start adopting it!  ]
+


LABEL directives are slow to be adopted, not adopted by most users for quick work, but if you look around they are used. The problem with labels isn't specific to labels, but reflects an overall systemic problem of people not documenting / setting metadata generally.

Ok. I'm guilty of not providing metadata, but part of doing this is to get people (including me) to adopt good habits, so I'll learn...

I think whatever rules I come up with providing metadata, users are likely to ignore them :-) So I gave up the extra rule for "metadata", and

moved the "build arguments" content to the "Favour clarity" rule and stopped recommending them.

shortend the Label-content and moved it up to the "document" rule

Moved the "formatting" part out of the new "document" rule

A lot of reorganising trigger by this comment :-) Hope you agree it is an improvement.

vsoch · 2020-03-11T11:22:52Z

ten-simple-rules-dockerfiles.Rmd

@@ -473,6 +500,9 @@ With plain Docker, you can override the defaults as part of the `docker run` com
 In any case you should document different variants, if you choose to provide them, in a `Makefile` [@wikipedia_contributors_make_2019].
 To support both one click execution and interactive interfaces and even allow for custom configuration, it's helpful to expose settings via a configuration file which can be bound from the host, via environment variables [@knoth_reproducibility_2017], or special Docker-based wrappers such as Kliko [@molenaar_klikoscientific_2018].

+
+[SJE: I think I get the two style of working -- batch and interactive might be my way of thinking about them.  But it feels a bit clunky above and certainly too detailed.  might be better served with two example dockerfiles: once doing a batch analysis, and the other whose goal is to allow some interaction.]
+


I think it's a bit contradictory to talk about reproducible execution and then advocate for any graphical user interface. If someone brings a sys admin a python notebook someone will undoubtably roll in their grave.

I like the example though of a docker container that generates a pdf (from a knitr doucument) but then also allows the user to run rstudio to further edit the knitr doc.

ah I understand what you mean - yes that's a very good point! Just don't run a rockr rstudio container that is converted to singularity, results in a very bad outcome :P

vsoch · 2020-03-11T11:24:32Z

ten-simple-rules-dockerfiles.Rmd

@@ -524,6 +556,9 @@ This variety of approaches render seemingly more convenient uncontainerised envi
 # 8. Establish templates for new projects {-}
 \rulelabel{8}{rule:templates}

+
+[SJE: this seemed a bit lacking in meat, compared to the other rules.  Do you have an example Dockerfile template that you are thinking of?]
+


This rule says to me essentially what we've said in another rule - "Don't start with a blank file, copy paste something that is almost what you want." I would nix this section and replace with something about automation to clean up / shorten the section that is too long.

I've moved the "rich" ideas to other sections and dropped this rule.

vsoch · 2020-03-11T11:25:28Z

ten-simple-rules-dockerfiles.Rmd

@@ -617,14 +656,14 @@ Together we can develop common practices and shared materials for better transpa

 # Acknowledgements {#acknowledgements .unnumbered}

-DN is supported by the project Opening Reproducible Research II (\href{https://o2r.info/}{https://o2r.info/}; \href{https://www.uni-muenster.de/forschungaz/project/12343}{https://www.uni-muenster.de/forschungaz/project/12343}) funded by the German Research Foundation (DFG) under project number PE 1632/17-1.
+DN is supported by the project Opening Reproducible Research II (\href{https://o2r.info/}{https://o2r.info/}; \href{https://www.uni-muenster.de/forschungaz/project/12343}{https://www.uni-muenster.de/forschungaz/project/12343}) funded by the German Research Foundation (DFG) under project number PE 1632/17-1.  DN and SJE are supported by a Mozilla mini science grant.

 # Contributions {#contributions .unnumbered}


@vsoch is running on fumes because she is a research software engineer with no way to self-fund other than begging for... oh wait nevermind. :)

vsoch · 2020-03-11T11:54:48Z

One quick global comment - we need to establish early on the scope (small data science scripts) of this paper. We cannot open with a flowery and misleading statement about reproducible workflows because we don’t get anywhere near discussing scale, workflow execution, and generally more complex (real world) pipelines that use multiple containers, might interact with TB of data, and be run at scale with some cloud provider or HPC job manager or similar. We don’t want someone reading this that falls into that boot camp to roll their eyes because they intuit that the authors don’t have a clue about this scale of work.

nuest · 2020-03-12T12:30:38Z

Thanks @sje30 for the comments and pointers, and @vsoch for already pushing ideas further. I think there are some really good ideas for improvement here!

@vsoch How should we proceed? I'd suggest to merge this as is, and then we can take on the concrete suggestions made: You could do the intro (#55) and I could handle what Stephen and you discussed (working on the "[SJE:...]" stuff), including the suggestion to sacrifice the "template" rule for splitting up another long rule. Both of us can

Should we do this with HackMD's GitHub integration, so we can work simultaneously?

vsoch · 2020-03-12T15:02:37Z

I agree to merge, and my preference would be to do my changes after you are finished. Don’t worry about time, I can be pretty speedy.

nuest · 2020-04-02T12:04:46Z

ten-simple-rules-dockerfiles.Rmd

+databases, you can readily find containers [SJE: where do you look to find containers?] that come installed with
+all the software that you need.  Second, in the case that there isn't
+an existing container for your needs, you might next look to
+well-maintained tools to help with `Dockerfile` generation.  These


I think we cover that in the introduction, where we say: because people look everywhere, we wrote this article.

nuest · 2020-04-02T12:06:39Z

ten-simple-rules-dockerfiles.Rmd

@@ -83,8 +88,7 @@ The differences between a helpful, stable `Dockerfile` and one that is misleadin
 A commitment to this article's rules can ensure that workflows are reproducible and reusable, that computing environments are understandable by others, and researchers can collaborate effectively.
 Their application should not be triggered by the publication of a finished project, but be weaved into day-to-day habits (cf. thoughts on openness as an afterthought by @chen_open_2019 and on computational reproducibility by @donoho_invitation_2010).

-To start with, we assume you have a scripted scientific workflow, i.e. you can, at least at a certain point in time, execute the full process with a single command, for example `make my_workflow`, `Rscript analysis.R`, or `python3 paper-plots.py`.
-This execution should be able to be triggered by command-line instruction, which is possible for all generic programming languages and widely used workflow tools, but may also be opening and starting a process with a graphical user interface.
+To start with, we assume you have a scripted scientific workflow, i.e. you can, at least at a certain point in time, execute the full process with a single command, for example `make my_workflow`, `Rscript analysis.R`, or `python3 paper-plots.py`. (A key constraint is Docker containers can only run open source software, so environments like Mathematica and Matlab cannot currently be easily used within containers.) This execution should be able to be triggered by command-line instruction, which is possible for all generic programming languages and widely used workflow tools, but may also be opening and starting a process with a graphical user interface.


@vsoch I've started an example Dockerfile in the latest draft - not sure how to capture your thoughts in the example though. I totally agree we should not oversimplify things!

I did make the sentence a bit more encompassing though, please check.

nuest · 2020-04-02T15:35:26Z

ten-simple-rules-dockerfiles.Rmd

@@ -292,7 +316,7 @@ A recommended ordering based on these considerations is as follows.
 4. labels
 5. `RUN`/`ENTRYPOINT`

-Finally, as a supplement to content inside the `Dockerfile`, it is good practice to also write a section in a `README` alongside the `Dockerfile` for exactly how to build, run, and otherwise interact with the container. 
+Finally, as a supplement to content inside the `Dockerfile`, it is good practice to also write a section in a `README` alongside the `Dockerfile` for exactly how to build, run, and otherwise interact with the container.  [SJE: really?  Doesn't that mean you end up writing the instructions twice -- once in the Dockerfile as a comment, and once in the README?  Surely the README could just say:  "Please read the end of the Dockerfile for instructions on how to run the docker image."]


nuest · 2020-04-02T17:14:45Z

ten-simple-rules-dockerfiles.Rmd

@@ -418,6 +442,9 @@ For example, software versions, maintainer contact information, along with vendo
 The OCI Image Format Specification provides some common label keys (see the "Annotations" section in @opencontainers_image-spec_2017) to help standardize field names across container tools, as shown below.
 These labels match the `org.label-schema.*` specification, which has been deprecated in favour or the new namespace but are still found a lot in existing containers.

+
+[SJE: this was new to me.  On the one hand, I like the idea of LABEL, but have they really been adopted by the community?  The document from 2017 cited above hasn't moved on since Jun 2017.  Is this really a thing?  And is the Dockerfile the right place for all this metadata?  It feels like a bit too fine-grained for my liking.  But if this is good practice, then I shall start adopting it!  ]
+


I think whatever rules I come up with providing metadata, users are likely to ignore them :-) So I gave up the extra rule for "metadata", and

moved the "build arguments" content to the "Favour clarity" rule and stopped recommending them.

shortend the Label-content and moved it up to the "document" rule

Moved the "formatting" part out of the new "document" rule

A lot of reorganising trigger by this comment :-) Hope you agree it is an improvement.

nuest · 2020-04-02T17:33:15Z

ten-simple-rules-dockerfiles.Rmd

@@ -601,6 +638,8 @@ You can do this shortly before submitting your reproducible workflow for peer-re

 To demonstrate the ten rules, we maintain a GitHub repository with example `Dockerfile`s, some of which we took from public repositories and updated to adhere to the rules (see `Dockerfile.original`): [https://github.com/nuest/ten-simple-rules-dockerfiles/](https://github.com/nuest/ten-simple-rules-dockerfiles/)

+[SJE: I would like to see these annotated.  Perhaps this is something I can meaningfully do?]


It would be great if you could take a stab at finding Dockerfiles we can both "test" the rules and demonstrate them. See also #4

nuest · 2020-04-02T17:48:15Z

ten-simple-rules-dockerfiles.Rmd

@@ -524,6 +556,9 @@ This variety of approaches render seemingly more convenient uncontainerised envi
 # 8. Establish templates for new projects {-}
 \rulelabel{8}{rule:templates}

+
+[SJE: this seemed a bit lacking in meat, compared to the other rules.  Do you have an example Dockerfile template that you are thinking of?]
+


I've moved the "rich" ideas to other sections and dropped this rule.

nuest · 2020-04-02T17:50:13Z

ten-simple-rules-dockerfiles.Rmd

@@ -584,6 +619,8 @@ Third, from time to time you can reduce the system resources occupied by Docker
 After a prune is performed, it follows naturally to rebuild a container for local usage, or to pull it again from a newly built registry image.
 This habit can be automated with a cron job [@wikipedia_contributors_cron_2019].

+
+[SJE:  "sure, okay, I'd like to export an image.  Give me a recipe to show me how to do it, e.g. for this paper!"]


@sje30 Do you suggest to export an image to render the article to Zenodo and use that as a live example here?

nuest · 2020-04-02T17:51:42Z

ten-simple-rules-dockerfiles.Rmd

@@ -490,7 +520,9 @@ docker run -p 8888:8888 jupyter/datascience-notebook:7a0c7325e470
 [I 15:44:31.323 NotebookApp] Use Control-C to stop this server and [..]
 ```

-A person who is unfamiliar with Docker but wants to use your image may rely on graphical tools like Kitematic [@docker_kitematic_2019] or [ContainDS](https://containds.com/) for assisstance in managing containers on their machine without using the Docker CLI.
+[SJE: that's good, does rstudio do similar?  need to check.]


@sje30 No, rstudio containers do not do that, you always have to login manually.

nuest · 2020-04-02T18:08:36Z

@sje30 @vsoch I added a few comments to document my changes, which are now in the master branch: #54 (review)

@vsoch Can you do the next iteration? IMO the structure improved thanks to you and Stephen discussing, and now we need to cut out some content to make the size manageable.

Let me know when you will do this - if you can only do it next week then I'll try to resolve some of the open issues, namely #16 #17 #18 #22

vsoch · 2020-04-02T18:15:19Z

sure! I can do another pass over and PR today. Stay tuned!

sje30 added 3 commits September 16, 2019 10:02

Life is better with a Makefile

cac0837

Merge branch 'master' of github.com:nuest/ten-simple-rules-dockerfiles

507107a

brain dump ...

1261df8

this is a mix of simple typos and comments. search for my initials for the comments, which should also be in [...] and then ask me to explain if anything unclear.

vsoch reviewed Mar 11, 2020

View reviewed changes

ten-simple-rules-dockerfiles.Rmd Show resolved Hide resolved

vsoch reviewed Mar 11, 2020

View reviewed changes

nuest changed the title ~~Sje~~ Stephen's edits Mar 12, 2020

nuest merged commit dc07543 into master Apr 2, 2020

nuest reviewed Apr 2, 2020

View reviewed changes

nuest added a commit that referenced this pull request Apr 2, 2020

(WIP) following up on #54

4d5e3d3

		@@ -601,6 +638,8 @@ You can do this shortly before submitting your reproducible workflow for peer-re

		To demonstrate the ten rules, we maintain a GitHub repository with example `Dockerfile`s, some of which we took from public repositories and updated to adhere to the rules (see `Dockerfile.original`): [https://github.com/nuest/ten-simple-rules-dockerfiles/](https://github.com/nuest/ten-simple-rules-dockerfiles/)

		[SJE: I would like to see these annotated. Perhaps this is something I can meaningfully do?]

Stephen's edits #54

Stephen's edits #54

Conversation

sje30 commented Mar 11, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vsoch commented Mar 11, 2020 • edited Loading

nuest commented Mar 12, 2020

vsoch commented Mar 12, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nuest commented Apr 2, 2020

vsoch commented Apr 2, 2020

vsoch commented Mar 11, 2020 •

edited

Loading