
comments about rule 8: "Make the image one-click runnable" #103

Open

sdettmer opened this issue Jul 6, 2022 · 5 comments

Comments

sdettmer commented Jul 6, 2022


To be reproducible, the exact data needs to be part of the image. It could then simply be processed during the build, which is what reproducibility is all about: it does not matter when something is built or processed, the result is always exactly the same. If someone needs to click, it is not maximally automated, and if there is something to click with, the environment is probably not well suited for reproducible builds. So it is possible to automate until no clicks are needed at all, because the results are already there; this is the most reproducible. For new data, a new image (or result) can be created automatically.

vsoch (Collaborator) commented Jul 6, 2022

This is what workflow managers are for, and many of them use containers. This paper is scoped to talking only about containers.

sdettmer (Author) commented Jul 6, 2022

@vsoch Thank you for your quick reply. The document is called "Ten Simple Rules for Writing Dockerfiles for Reproducible Data Science", and I think it is clear that the simplest and most reproducible approach is to include the data in the Dockerfile (as discussed under rule 7). If so, the result can also be included in the Docker image, and then nothing even needs to be run. That way, it cannot be run wrongly, which can be an advantage in corner cases.
Of course, other requirements such as maintainability may force separating container images from the data being processed, thus preventing storing results in the container, but then this rule should be in a "Ten Simple Rules for Writing Dockerfiles for Maintainable Data Science" document :)
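As a sketch of what sdettmer is describing (all file names here are hypothetical: a small dataset `data.csv` and an analysis script `analyze.py`), both the data and the computed results can live inside the image, so that `docker build` itself is the reproduction step:

```dockerfile
# Hypothetical sketch: data and analysis are part of the build,
# so the finished results are stored in the image itself.
FROM python:3.10-slim

WORKDIR /project

# Rule 7: the exact input data is part of the image
COPY data.csv analyze.py ./

# The analysis runs once, at build time; rebuilding the image
# reproduces the results deterministically
RUN mkdir -p /project/results && \
    python analyze.py data.csv --out /project/results/

# The default command only shows the precomputed results;
# nothing needs to be clicked or re-run
CMD ["ls", "-l", "/project/results/"]
```

Anyone pulling this image gets the results without running the analysis at all.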

vsoch (Collaborator) commented Jul 6, 2022

Yes, but if your data is 7TB you aren’t going to put it in a container. That statement applies to small data only (which to be fair, is quite a lot). If there are identifiers in the data you also couldn’t easily share it publicly. So it’s not always possible or feasible to do so.

sdettmer (Author) commented Jul 6, 2022

@vsoch Yes, I see, but the rule requires that even small data (which could easily be shared) not be stored inside the container but mounted, doesn't it?
So I read it as: "If the data is larger, it (unfortunately) cannot be stored in the container, so it can only be mounted at run time."
(I see that for maintainability it is probably better to mount smaller data as well, especially assuming that it is available in some archive anyway.)
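For contrast, the mount-at-run-time approach discussed here keeps the data outside the image (the image name, paths, and script below are illustrative, not from the paper):

```shell
# Data stays on the host; the container sees it via a bind mount.
# The image is reusable across datasets, but reproduction now
# depends on the host providing the same data at the same path.
docker run --rm \
  -v "$PWD/data:/project/data:ro" \
  -v "$PWD/results:/project/results" \
  my-analysis-image \
  python analyze.py /project/data/data.csv --out /project/results/
```

This trades the self-contained reproducibility of baked-in data for maintainability and support of large or non-shareable datasets.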

vsoch (Collaborator) commented Jul 6, 2022

The rule does not explicitly state that: it targets "large" datasets time and time again, and suggests that small data are OK (though the point could have been made clearer).
