comment about rule 6: "Use version control" #101

Open
sdettmer opened this issue Jul 6, 2022 · 3 comments

Comments

@sdettmer

sdettmer commented Jul 6, 2022

comment about rule 6: "Use version control"

Surely there are many reasons that force any reasonable process to use version control, but reproducibility, IMHO, is not one of them.

Burning everything (all archives, the build VM, all sources, packages and tools) to DVDs and storing them together with suitable hardware in a safe allows you to reproduce that version without version control.

In fact, policies exist that require storing every release as ZIP archive(s) on WORM (write once, read many) media.
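As a rough illustration of the "release as ZIP archive" idea, here is a minimal Python sketch that packs a release directory into a ZIP file and records a SHA-256 digest, so a later read-back from the medium can be checked. The function names `archive_release` and `verify_archive` and the directory layout are hypothetical and not taken from any actual policy. Writing to WORM media is out of scope here.

```python
import hashlib
import zipfile
from pathlib import Path


def archive_release(source_dir, zip_path):
    """Pack source_dir into zip_path and return the archive's SHA-256 hex digest.

    Assumption: zip_path lies outside source_dir, so the archive does not
    try to include itself.
    """
    source = Path(source_dir)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # Sorted order keeps the entry order stable between runs.
        for path in sorted(source.rglob("*")):
            if path.is_file():
                archive.write(str(path), path.relative_to(source).as_posix())
    return hashlib.sha256(Path(zip_path).read_bytes()).hexdigest()


def verify_archive(zip_path, expected_digest):
    """Return True if the archive on disk still matches the recorded digest."""
    actual = hashlib.sha256(Path(zip_path).read_bytes()).hexdigest()
    return actual == expected_digest
```

The digest would be stored separately from the archive (for example printed and put into the same safe), so that a copy read back years later can be compared against it.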

@vsoch
Collaborator

vsoch commented Jul 6, 2022

I don't think I've ever once written something to DVD, at least in my adult years :)

This is controversial to say, but I think "perfect" reproducibility is impossible. We do our best, and version control helps not only by providing a local copy and a remote, but also because others clone the repository, so there is redundancy. Is that perfect? Maybe not. But it's better than not using it. GitHub/GitLab, and version control generally, have probably done more for science than anything else I can think of, and I don't see these same researchers having the resources or desire to burn to DVD. And frankly, if software isn't valuable enough to be used by many or supported, I think it should go away. The best and most valuable will have energy put into them to persist and survive. A blind goal of "saving all software because" is not a realistic one.

@sdettmer
Author

sdettmer commented Jul 6, 2022

@vsoch thank you for your quick reply.

Indeed, DVD is more of a placeholder for (logically) immutable storage, because people tend to intuitively assign suitable properties to it (like read-only -> fixed content, self-contained, interoperable, free of links to external dependencies). In practice these are hard disks (RDX) or artifacts in software systems (repositories or version control systems). These are only immutable because of organizational measures (such as being put into a safe). An encrypted ZIP file in a blockchain would also work :)

I think it depends on the damage that can happen when a reproduction fails. Of course in many cases reproducibility might not be urgently required, and I think that is why it is so difficult to achieve nowadays (especially with "modern" systems). However, for security reasons it is getting more and more attention: if you can rebuild the Signal app APK and its content is the same as the one you downloaded from the Play Store, you can trust the app on your phone even if you do not want to build and install it yourself, or even cannot, as in the Apple world.
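To make the APK comparison concrete, here is a simplified Python sketch, not Signal's actual verification procedure. It relies on the fact that an APK is a ZIP-format archive and compares the SHA-256 digest of every entry in two archives. Skipping entries under `META-INF/` is an assumption made for this sketch, on the grounds that signing data differs between a self-built and a store-signed package; the function names are hypothetical.

```python
import hashlib
import zipfile


def entry_digests(archive_path, ignore_prefixes=("META-INF/",)):
    """Map each entry name in a ZIP-format archive to the SHA-256 of its content.

    Assumption: entries under ignore_prefixes hold signing data and are
    expected to differ between a local build and a store download.
    """
    digests = {}
    with zipfile.ZipFile(archive_path) as archive:
        for info in archive.infolist():
            if info.is_dir() or info.filename.startswith(ignore_prefixes):
                continue
            digests[info.filename] = hashlib.sha256(archive.read(info)).hexdigest()
    return digests


def same_content(path_a, path_b):
    """Return True if both archives hold the same entries with identical content."""
    return entry_digests(path_a) == entry_digests(path_b)
```

A call such as `same_content("rebuilt.apk", "downloaded.apk")` (both file names are placeholders) returns `True` only if every compared entry matches, so a single changed file inside the package is enough to make the check fail.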

I don't know how research data is handled in practice, but I expect that, for example, all LHC data at CERN is, after initial processing to make it manageable, stored on read-only storage (like distributed file storage where scientists have no write permissions), logically like a DVD, just a few orders of magnitude larger :)

Unsupported software may go away, but how then do you verify old research results? With AI we will get many such questions, like "how was it possible, we checked all the training data, but still the net missed a case", as when Google Translate was said to be racist. Of course they checked whether anybody did something wrong to cause the racist behavior, so the process was reproduced, showing that the effects came solely from the training data.

Of course you don't need to save all software, only what you are using, and only if you need to be reproducible. Suppose that in ten years, possibly in front of a court, you want to understand why the AI network in a car killed several children by repeatedly driving over them in some exceptional case: any hidden detail might be responsible. (In forensics that really happens often. The obviously wrong cases are spotted during testing, so only the rare hidden cases can show up later.)

Let's assume that ten years ago you built an image using codehaus.org packages. Now you need to upgrade everything to newer versions, maybe much newer versions, and you probably need to adjust a lot of code to new API details and so on. Nobody would be surprised if one detail or another is different in the resulting software, so of course it can behave differently in corner cases, and thus it can and will produce different results!

@vsoch
Collaborator

vsoch commented Jul 6, 2022

Most large centers do have the means and workflows to store things in formats akin to DVDs, but probably not the little person, the single researcher with a laptop at home and an HPC system with limited storage. As a software engineer I will not be doing that.

And yes, different results are expected and OK with me.
