(Comment) Rule 0 - Don't use docker #96

Open
sdettmer opened this issue Jul 6, 2022 · 6 comments

Comments

sdettmer commented Jul 6, 2022

I would like to add a "Rule 0" to "Writing Dockerfiles for Reproducible Artifacts", and it is:

“Do not use docker”.

Docker makes it hard to be reproducible: the ecosystem (Docker Hub, tutorials…) keeps changing and is hard to "burn on a DVD and put into a safe" (e.g. for escrow). Whatever starts with "apt update" cannot be reproducible by definition (it depends on the internet), and this is very common in Docker communities.

Even if all input data is reproducible, the Docker images still are not, because Docker has no way to omit or fix the timestamps included in the built artifacts, so we can at most be "functionally reproducible", which is hard to prove. (Reproducibility, in contrast, is easy to prove: just secure-hash the inputs and the result; the same input hashes must lead to the same result hash, and if and only if the hashes match is the build reproduced.)
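As an illustration of that "hash the inputs and the result" check, here is a minimal Python sketch; the file names (Dockerfile, context.tar, image.tar) are placeholders, not part of any particular tooling:

```python
# Sketch: hash the build inputs and the produced artifact. If a rebuild from
# inputs with the same hashes yields an artifact with the same hash, the build
# was reproduced bit for bit; if the artifact hash differs, it was not.
import hashlib

def sha256_of(path: str) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Recorded at the original build (file names are placeholders):
recorded = {
    "Dockerfile": sha256_of("Dockerfile"),
    "context.tar": sha256_of("context.tar"),   # the build context
    "image.tar": sha256_of("image.tar"),       # the built artifact
}

# After rebuilding from byte-identical inputs, compare the artifact hash:
rebuilt = sha256_of("image.tar")
print("bit-for-bit reproduced:", rebuilt == recorded["image.tar"])
```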

vsoch (Collaborator) commented Jul 6, 2022

I'm going to disagree here - reproducibility is a continuum, and using a Docker image (e.g., stored in a registry with care taken so the image isn't purged) is more reproducible than not, and definitely needing to rebuild is less reproducible than having the image already built. You could then easily pull it down to a read-only SIF (Singularity container) to have a more time-worthy artifact.

sdettmer (Author) commented Jul 6, 2022

@vsoch thank you for your quick reply!
What do you mean by "reproducibility is a continuum"?

Why is an existing externally stored Docker image more reproducible than one that is built automatically? In the best case, the image is reproducible somehow by someone (maybe it is not even documented how), given that its hidden dependencies are. But in practice it is likely that this image or one of its dependencies invokes some apt update and is therefore not reproducible (there are heaps of other possible issues, of course - maybe the image was created with docker commit after manual crafting). In practice, it might not even be available any longer: maybe the maintainer came into legal focus and decided to stop publishing all their work (this has already happened), or the registry goes bankrupt (most registries don't earn money and have no working business model, such as codehaus.org). Things evolve; for example, try to find all source code packages for SuSE Linux 1.0 to check whether the SuSE archives really have the right file. I try to collect some old libraries every now and then, and often it is hard or impossible, or it turns out that the supplier used some patches (which got lost).

How could you take care of any registry outside your control? codehaus.org purged everything (nowadays not even the domain exists); two years earlier nobody could imagine that this would ever happen. Maven and nodejs packages were deleted after new maintainers took over or bought packages, and in some cases malicious content was even provided instead. Remember when "left-pad" was deleted? It was in the news everywhere. It broke all builds using left-pad, and every build that broke obviously was not reproducible.

vsoch (Collaborator) commented Jul 6, 2022

> What do you mean by "reproducibility is a continuum"?

It's not a binary state - reproducible or not reproducible - it's a continuum that ranges from perhaps poorly reproducible to "best effort" reproducible (and I'd argue perfect reproducibility is more of a pipe dream).

> Why is an existing externally stored Docker image more reproducible than one that is built automatically?

Rebuilding an image requires all remote installs, etc. to still be present. As you noted, this isn't always reliable. Instead, pulling a pre-built container at least promises to get the previously built layers. Is it perfect? Of course not - as you noted even registries can go away. But retrieving the same layers and containers that someone used for an analysis is slightly more reproducible (in my opinion) than building afresh and not being sure you have the same software, updates, etc. And of course you would want to pull by digest and not tag, which is a moving target.
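For what it's worth, here is a small sketch of that pinning step (assuming the docker CLI is on PATH and the image has been pulled at least once; the image name is only an example):

```python
# Sketch: resolve a tag to its immutable repository digest so that later pulls
# can reference "name@sha256:..." instead of a moving tag.
# Assumes the docker CLI is installed and the image exists locally.
import subprocess

def repo_digest(image: str) -> str:
    """Ask the local Docker daemon for the image's first RepoDigest."""
    out = subprocess.run(
        ["docker", "inspect", "--format", "{{index .RepoDigests 0}}", image],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.strip()

pinned = repo_digest("python:3.11-slim")   # e.g. "python@sha256:<64 hex chars>"
print("record and pull this instead of the tag:", pinned)
# later, e.g.: subprocess.run(["docker", "pull", pinned], check=True)
```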

> How could you take care of any registry outside your control?

You cannot. You must be resilient to the likely migration and change. E.g., in the CI world we've jumped around from Travis to Circle, from Circle to GitHub workflows, and I'm sure I'll need to jump again. It's part of the "scrappy and resilient researcher / research software engineer" life - we take advantage of what is available to us at a particular time, and when the time comes (a service goes away) we refactor for a different one.

sdettmer (Author) commented Jul 6, 2022

@vsoch thank you for your detailed answer. Ah, I see.
I see two major levels of reproducibility: "functionally reproducing" and "reproduced". The first means the re-created artifacts behave exactly the same, or maybe "very very similar", so this could be a continuum range. The second means exactly the same bits; it is absolutely binary, yes or no, the hash matches or it does not (and for complex systems this is very hard to reach). The advantage is that it is very easy to prove (same hash and you are done), and in that case the build process is validated (in typical practice, not in theory, and I have seen such systems fail). There are tools that unify properties like timestamps or build IDs in e.g. binaries. They can sometimes then produce binary-equal results and thus help "prove" (practically, not mathematically) that the artifacts are probably functionally equal, i.e. "functionally reproduced".
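To illustrate the kind of normalization such tools perform (this is only a sketch in the spirit of SOURCE_DATE_EPOCH, not one of those tools; the file names are placeholders): force every member's mtime in a tar archive to a fixed epoch, so that two functionally identical archives also become hash-identical.

```python
# Sketch: normalize the mtime of every member of a tar archive to a fixed
# epoch, then compare hashes. Two builds that differ only in timestamps become
# bit-identical after normalization ("functionally reproduced").
import hashlib
import tarfile

FIXED_EPOCH = 0  # in the spirit of SOURCE_DATE_EPOCH

def normalize_tar(src: str, dst: str) -> None:
    """Copy src to dst with every member's mtime forced to FIXED_EPOCH."""
    with tarfile.open(src, "r") as fin, tarfile.open(dst, "w") as fout:
        for member in fin.getmembers():
            member.mtime = FIXED_EPOCH
            fileobj = fin.extractfile(member) if member.isfile() else None
            fout.addfile(member, fileobj)

def sha256_of(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

normalize_tar("build-a.tar", "a.norm.tar")   # placeholder input archives
normalize_tar("build-b.tar", "b.norm.tar")
print("equal after normalization:", sha256_of("a.norm.tar") == sha256_of("b.norm.tar"))
```

Real tools from the Reproducible Builds effort normalize more than timestamps (owners, ordering, build IDs), but the principle is the same.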

If an image has remote installs, it is not reproducible, and so those inputs must be stored as source (a local copy). Additionally, it might become impossible to maintain: for example, if in ten years a small bug needs to be fixed, let's say the zlib overflow bug, it can be almost impossible; you won't find the needed packages to install, nor the sources, and so on...
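A minimal sketch of that "local copy" discipline (the file name and hash below are placeholders): keep the dependency in your own tree next to its expected SHA-256 and refuse to build on a mismatch, so the build never has to reach for the internet.

```python
# Sketch: verify vendored (locally stored) inputs against recorded SHA-256
# hashes before building; abort on any mismatch. Entries are placeholders.
import hashlib
import sys

VENDORED = {
    "third_party/zlib-1.2.13.tar.gz":
        "0000000000000000000000000000000000000000000000000000000000000000",
}

for path, expected in VENDORED.items():
    with open(path, "rb") as f:
        actual = hashlib.sha256(f.read()).hexdigest()
    if actual != expected:
        sys.exit(f"checksum mismatch for {path}: expected {expected}, got {actual}")

print("all vendored inputs verified; the build can proceed offline")
```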

If you additionally rely on external build systems, you cannot know whether anything is reproducible at all: maybe GitHub uses a secret Microsoft technology to make things look as if they work in order to keep you locked in, and your scripts fail on any other infrastructure, who knows.

If such a service goes away, your reproducibility goes away: if your package builds on GitHub only, it can be impossible to produce the same output anywhere else, and at the very least you have to modify the input, e.g. the Dockerfile, and since Docker cannot produce bit-identical output from the same input, it can be hard to prove that the new output is even functionally reproducing...

vsoch (Collaborator) commented Jul 6, 2022

I would be very interested to see an image that is used ten years later. Software is living, in a way - the libraries that are valued and used will be updated (and have new containers), and the people that use them will follow suit. I much prefer older software dying/going away and the ecosystem continuing to change and grow.

Again, part of being a research software engineer is flexibility and portability. When something goes away, we move and find another way. This "perfectly reproducible and reliable" means that you speak of is a fantasy.

sdettmer (Author) commented Jul 6, 2022

@vsoch I already had to build such old software to (help others) prove certain aspects. Actually, it was easier with, let's say, 20-year-old software, and with "modern" technologies it seems to get more and more difficult. We use Docker with certain specific rules to keep our build process predictable (reproducibility is still a long way off here). That's why I posted my comments here: my experiences and the conclusions / rules derived from them differ essentially in some points, or are even contrary (for example, my 1st rule is: "during the build, do not use anything from the internet").
