
comments about rule 8: "Make the image one-click runnable" #103

Open

sdettmer opened this issue Jul 6, 2022 · 5 comments

Comments

sdettmer commented Jul 6, 2022


To be reproducible, the exact data needs to be part of the image. It could then simply be processed during the build, which is what reproducibility is all about: it does not matter when something is built or processed, the result is always exactly the same. If someone needs to click, it is not maximally automated, and if there is something to click with, the environment is probably not well suited for reproducible builds. So it is possible to automate until no clicks are needed at all, because the results are already there; this is the most reproducible. For new data, a new image (or result) can be created automatically.

vsoch (Collaborator) commented Jul 6, 2022

This is what workflow managers are for, and many of them use containers. This paper is scoped to talking only about containers.

sdettmer (Author) commented Jul 6, 2022

@vsoch Thank you for your quick reply. The document is called "Ten Simple Rules for Writing Dockerfiles for Reproducible Data Science", and I think it is clear that the simplest and most reproducible approach is to include the data in the Dockerfile (as discussed under rule 7). If so, the result can also be included in the Docker image, and then nothing even needs to be run. That way, it cannot be run wrongly, which can be an advantage in corner cases.
Of course, other requirements such as maintainability may force separating container images from the data being processed, thus preventing storing results in the container, but then this rule should be in a "Ten Simple Rules for Writing Dockerfiles for Maintainable Data Science" document :)
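As a sketch of what sdettmer is describing (all file names here are hypothetical: a small dataset `data.csv` and an analysis script `analyze.py`), both the data and the computed results can live inside the image, so that `docker build` itself is the reproduction step:

```dockerfile
# Hypothetical sketch: data and analysis are part of the build,
# so the finished results are stored in the image itself.
FROM python:3.10-slim

WORKDIR /project

# Rule 7: the exact input data is part of the image
COPY data.csv analyze.py ./

# The analysis runs once, at build time; rebuilding the image
# reproduces the results deterministically
RUN mkdir -p /project/results && \
    python analyze.py data.csv --out /project/results/

# The default command only shows the precomputed results;
# nothing needs to be clicked or re-run
CMD ["ls", "-l", "/project/results/"]
```

Anyone pulling this image gets the results without running the analysis at all.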

vsoch (Collaborator) commented Jul 6, 2022

Yes, but if your data is 7TB you aren’t going to put it in a container. That statement applies to small data only (which to be fair, is quite a lot). If there are identifiers in the data you also couldn’t easily share it publicly. So it’s not always possible or feasible to do so.

sdettmer (Author) commented Jul 6, 2022

@vsoch Yes, I see, but the rule requires that even small data (which could easily be shared) not be stored inside the container but mounted, doesn't it?
So I read it as: "If the data is larger, it (unfortunately) cannot be stored in the container, so it can only be mounted at run time."
(I see that for maintainability it is probably better to mount smaller data as well, especially assuming that it is available in some archive anyway.)
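For contrast, the mount-at-run-time approach discussed here keeps the data outside the image (the image name, paths, and script below are illustrative, not from the paper):

```shell
# Data stays on the host; the container sees it via a bind mount.
# The image is reusable across datasets, but reproduction now
# depends on the host providing the same data at the same path.
docker run --rm \
  -v "$PWD/data:/project/data:ro" \
  -v "$PWD/results:/project/results" \
  my-analysis-image \
  python analyze.py /project/data/data.csv --out /project/results/
```

This trades the self-contained reproducibility of baked-in data for maintainability and support of large or non-shareable datasets.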

vsoch (Collaborator) commented Jul 6, 2022

The rule does not explicitly state that: it targets "large" datasets time and time again, and suggests that small data are OK (though the point could have been made clearer).
