
comments about rule 7: "Mount dataset at run time" #102

Open
sdettmer opened this issue Jul 6, 2022 · 9 comments

Comments

sdettmer commented Jul 6, 2022

For a run to be reproducible, nothing may differ between runs; everything must be identical. Thus it does not matter when the data is included. Of course, for practical use on different data, or for maintenance, mounting at run time might be a good habit.

vsoch (Collaborator) commented Jul 6, 2022

The data itself can be "the same" when it comes from an archive; that still doesn't make it feasible to store a very large dataset in the container.

sdettmer (Author) commented Jul 6, 2022

@vsoch It might not be the most efficient or the easiest to maintain, but it is best for reproducibility. Of course there are other requirements besides reproducibility, such as, say, "maintainability", so the rule should be moved there. Actually, many rules lead to conflicts: the more reproducible something is, the more expensive maintenance may become (storing everything is a burden by itself).

But if you have an archive of the data anyway, why not add it as a layer in the OCI image and be done?
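For illustration, baking an archived dataset into an image layer might look like the following sketch (the Dockerfile, `dataset.tar.gz`, and all paths here are hypothetical, not from the thread):

```dockerfile
# Hypothetical sketch: bake an archived dataset into an image layer.
FROM python:3.11-slim
# ADD auto-extracts a local tar archive into the target directory,
# producing a single image layer that contains the whole dataset.
ADD dataset.tar.gz /data/
COPY analysis.py /app/analysis.py
# The analysis then reads from the fixed in-image path.
CMD ["python", "/app/analysis.py", "--data", "/data"]
```

This is workable for small datasets; for large ones, registries commonly enforce per-layer and per-image size limits, which is one practical reason the rule discourages it.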

vsoch (Collaborator) commented Jul 6, 2022

It doesn’t sound like you’ve worked with the large datasets that I have in the past, where it would not be reasonable or feasible to add to a container.

sdettmer (Author) commented Jul 6, 2022

@vsoch Yes, I have not worked with large data; my biggest RAID is only 0.1 PB, so I couldn't even store that (I know the LHC produces 2 GB/s and has some hundred PB stored), but I work on reproducible systems (reproducible in the "continuum" sense from the previous item, unfortunately :)).
But for small datasets, why should they be mounted at runtime? Because the rule says so :D

vsoch (Collaborator) commented Jul 6, 2022

No, small datasets are appropriate to add to the container, given the data is de-identified.

As an exception, you should include dummy or small test datasets in the image to ensure that a container is functional without the actual dataset, e.g., for automated tests, instructions in the user manual, or peer review (see also “functional testing logic” in [12]). For all these cases, you should provide clear instructions in the README file on how to use the actual (or dummy) data, and how to obtain and mount it if it is kept outside of the image. When publishing your workspace, e.g., on Zenodo, having datasets outside of the container also makes them more accessible to others, for example, for reuse or analysis.

The rule targets large datasets time and again, so perhaps we didn't make this clear enough, because we framed it in the context of review/testing, which it doesn't need to be. Small datasets are fine to include as long as they are reasonably small.

sdettmer (Author) commented Jul 6, 2022

@vsoch Yes, I see. Please remember I don't want to criticise the work; I just want to give input for possible future improvements.
If the rule applies only to large data, where it is needed anyway, why have it at all?

vsoch (Collaborator) commented Jul 6, 2022

The rule is saying:

  • for large datasets, bind mounting is useful (the new user isn't going to know about this functionality)
  • for small datasets and files, put them in the container.

It's reasonably stated, and although the focus is on large datasets to tell the user about bind mounts, small datasets inside the container (along with other small files) are also in scope. This feels like nitpicking to me, and loses sight of the audience the writing was intended for.
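As a concrete sketch of the two cases above (the host path and image name are hypothetical, not from the thread):

```shell
# Large dataset: keep it outside the image and bind-mount it
# read-only at run time (host path /scratch/bigdata and image
# name "myanalysis" are hypothetical).
docker run --rm -v /scratch/bigdata:/data:ro myanalysis

# Small dataset: copy it into the image at build time instead,
# with a Dockerfile line such as:
#   COPY testdata/ /opt/testdata/
```

The bind mount keeps the image small and the registry happy, at the cost of the data path becoming a run-time dependency that the README must document.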

sdettmer (Author) commented Jul 6, 2022

@vsoch Hehe, yes, it might be.
I just wonder why you have a rule like "if you cannot put the data in the container, then don't put it in the container" :)

vsoch (Collaborator) commented Jul 6, 2022

Because it is possible to put large data in a container, and people do try (and registries vary in the size and number of layers they accept), so it needs to be explicitly stated. A container is not a storage vehicle for large data; it's for small data and/or software and analysis scripts. We have better means (data archives, object storage) that are more appropriate for that.
