Improve reproducibility of artifacts #2140

Closed
msarahan opened this issue Jun 30, 2017 · 14 comments
Labels
locked [bot] locked due to inactivity stale::closed [bot] closed after being marked as stale stale [bot] marked as stale due to inactivity

Comments

@msarahan
Contributor

msarahan commented Jun 30, 2017

@bollwyvl raised this privately. He would like conda packages to be completely verifiable. There are some good guidelines on this at https://reproducible-builds.org/

@bollwyvl has some preliminary work implemented in a jupyter notebook that I'm happy to share if anyone would like to see.

@bollwyvl

bollwyvl commented Jun 30, 2017

Thanks, @msarahan! Also thanks for not just posting that notebook... it's of course broken 😈 .

In summary:

My early work has a lot of shortcomings:

  • only tried with conda-build 2.1.6
  • only tried with pure python packages: a toy example, and conda-build 2.1.16 itself
  • only tried on OSX

Observations in no particular order:

  • info/index.json wasn't being sorted (has been fixed on master) 🎆
  • tarballs preserve timestamp and owner
    • easy enough to fix, either by post-processing or something
  • info/files is not sorted
  • info/recipe/meta.yaml has non-deterministic ordering of sets (mainly package spec listings)
  • versioneer.py (and maybe other things that hack how pip/setuptools do their thing) apparently don't respect PYTHONHASHSEED and DETERMINISTIC_BUILD, so they end up with 1-2 bytes of header differences in pyc files
  • info/about.json
    • there's no easy way to extract root_pkgs directly to reproduce the build environment
    • env_vars includes username and other transient path information
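
A rough sketch of the tarball post-processing idea mentioned above, assuming it is acceptable to rewrite the archive after conda-build has produced it. `normalize_tarball` is a hypothetical helper, not part of conda-build; it only uses the standard-library `tarfile` module to sort members by name and scrub the timestamp and owner fields.

```python
import io
import tarfile


def normalize_tarball(src_bytes: bytes, mtime: int = 0) -> bytes:
    """Return a .tar.bz2 with members sorted by name and metadata scrubbed.

    Hypothetical post-processing step, not a conda-build API.
    """
    out = io.BytesIO()
    with tarfile.open(fileobj=io.BytesIO(src_bytes), mode="r:*") as src, \
            tarfile.open(fileobj=out, mode="w:bz2") as dst:
        for member in sorted(src.getmembers(), key=lambda m: m.name):
            # Drop the fields that vary between machines and runs.
            member.mtime = mtime
            member.uid = member.gid = 0
            member.uname = member.gname = ""
            member.pax_headers = {}
            data = src.extractfile(member) if member.isfile() else None
            dst.addfile(member, data)
    return out.getvalue()
```

Two archives with the same file contents but different member order, timestamps, or owners should come out byte-identical after this pass. It does nothing about differences inside the files themselves (info/files, meta.yaml, pyc headers).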

In working on this, I have found diffoscope super valuable, and am trying to get it built for conda, but don't have any success there yet!

@mingwandroid
Contributor

Good to see someone paying attention to this stuff!

@bollwyvl

An update: diffoscope PR on conda-forge:

@bollwyvl

bollwyvl commented Jul 25, 2017

@mandeep you are a reproducibility-enhancing MACHINE. 🎸 on.

Been a bit heads-down, but just updated conda-forge/staged-recipes#3281 and conda-forge/staged-recipes#3282 with some naming issues, extra deps, and typos on my part. Also started sneaking in:

export DETERMINISTIC_BUILD=1
export PYTHONHASHSEED=0

...just to see what happens. Of course, I'd love to remove those things if conda-build could start setting them as a matter of course in the temporary build-time scripts!
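
For anyone wondering what PYTHONHASHSEED actually buys you, here is a small sketch. It runs the same one-liner in fresh interpreters and compares the iteration order of a set of strings. The package names in the set and the `set_order` helper are made up for illustration; only PYTHONHASHSEED is exercised, since that is the variable stock CPython reads.

```python
import os
import subprocess
import sys

# Iteration order of a set of strings depends on the string hashes.
SNIPPET = "print(list({'numpy', 'python', 'setuptools', 'six', 'yaml'}))"


def set_order(hash_seed: str) -> str:
    """Run SNIPPET in a fresh interpreter with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=hash_seed)
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # With a fixed seed, separate processes iterate the set identically.
    print(set_order("0") == set_order("0"))
```

Without a fixed seed the order can change from run to run, which is one plausible source of the shuffled package spec listings in the rendered meta.yaml.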

Additionally, I've started up a reprotest recipe: conda-forge/staged-recipes#3358

@bollwyvl

Ha, again reading too much internet will make me dumb.

export DETERMINISTIC_BUILD=1

Apparently, this is a nix-specific thing, which they use (with a patch to deep bits of cpython) to clobber the timestamp pyc header bits. Presumably #2234 will fulfill that need... though it may be necessary to reset timestamps after running any recipe-provided patches and before running the build scripts.
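
A minimal sketch of what "reset timestamps" could look like, assuming the work directory is a plain tree on disk. `reset_mtimes` is a hypothetical helper (not something conda-build provides), and reading the default from SOURCE_DATE_EPOCH is just borrowing the reproducible-builds.org convention.

```python
import os


def reset_mtimes(root: str, epoch=None) -> int:
    """Set atime/mtime of root and everything under it; return count touched.

    Hypothetical helper: would run after patches and before the build script.
    """
    if epoch is None:
        # Assumption: fall back to 0 when SOURCE_DATE_EPOCH is not set.
        epoch = int(os.environ.get("SOURCE_DATE_EPOCH", "0"))
    touched = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symlinks rather than following them out of the tree
            os.utime(path, (epoch, epoch))
            touched += 1
    os.utime(root, (epoch, epoch))
    return touched + 1
```

Anything the build script writes afterwards gets a fresh timestamp again, so this only pins the inputs, not the outputs.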

@mandeep
Contributor

mandeep commented Jul 25, 2017

@bollwyvl I've been reading a lot into this as well and it seems there's a bit more to do on our side regarding timestamps. I checked the pyc files to see if the magic number was being changed and everything looked okay there, which makes me believe that the difference in bytecode is due to timestamps. #2234 won't fix this unfortunately. I think we need to introduce the $DETERMINISTIC_BUILD and $PYTHONHASHSEED environment variables. The PR at python/cpython#296 has some good discussion on this.
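
The check described above can be sketched roughly like this. One caveat: the code assumes the 16-byte header that CPython 3.7 and later write (magic, flags, mtime, source size); the 3.6 interpreters discussed in this thread wrote a 12-byte header without the flags field. `pyc_header` and `compile_with_mtime` are illustrative helpers only.

```python
import importlib.util
import os
import py_compile
import struct


def pyc_header(pyc_path: str) -> dict:
    """Split the 16-byte header of a timestamp-based CPython 3.7+ .pyc."""
    with open(pyc_path, "rb") as f:
        magic, flags, mtime, size = struct.unpack("<4sIII", f.read(16))
    return {"magic": magic, "flags": flags, "mtime": mtime, "size": size}


def compile_with_mtime(src_path: str, pyc_path: str, mtime: int) -> dict:
    """Force the source mtime, compile, and return the parsed header."""
    os.utime(src_path, (mtime, mtime))
    py_compile.compile(
        src_path, cfile=pyc_path, doraise=True,
        invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP,
    )
    return pyc_header(pyc_path)
```

Compiling the same source twice with different mtimes should give the same magic number (`importlib.util.MAGIC_NUMBER`) and source size but a different mtime field, which matches the "few bytes of header difference" seen in the diffs.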

@bollwyvl

@mandeep the rabbit hole is deep indeed! Unfortunately I can't contribute a whole lot right now, but I definitely want to revisit my reproducibly-building-conda-build once some more of your stuff lands!

@bollwyvl

bollwyvl commented Aug 1, 2017

Did some more work on the proposed conda-build --reproduce. This is indeed sticky. The state of what you end up with is dependent on many steps: .condarcs, command line args, etc.

Those seem reasonable; however, the recipe is a bit tougher.

If you try to reuse the info/recipe/meta.yaml.template (by copying or something), it re-templates and runs the solver again, so you'll (maybe) get newer versions. This also regenerates .template and (potentially) injects (local) timestamps into the output meta.yaml!

If you use the info/recipe/meta.yaml, you definitely get a different output tarball, since the original meta.yaml-for-humans will have been lost in the churn.

Starting from (and including) the "human-centric" (multi-)recipe (and build_config), it is reasonable to expect some changes between builds, and as such to get different "final" meta.yamls out. However, to reproduce the exact build env and still get out the same "final" meta.yaml, conda-build could:

  • special-case around recipe dirs that contain a meta.yaml and a meta.yaml.template
  • or offer something like --no-template

@bollwyvl

bollwyvl commented Aug 3, 2017

So some more thinking on this:

  • Some crucial things, like where packages are sourced from when conda-build is run, aren't captured anywhere.
    • once a package is reproducible, and built from reproducible packages, this won't matter anyway, as the crypto hash will be the same... but we have to get to that point first...
  • the relationship between the source meta.yaml and the meta.yaml.template and meta.yaml in info/recipe/ is tricky for reproducing from a tarball...
  • probably some more stuff currently hidden by the above...
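
The "crypto hash will be the same" test is easy to sketch. The helper names here are hypothetical and SHA-256 is only an assumed choice of hash; the idea is just that a rebuild counts as reproduced when the two tarballs are byte-identical.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large tarballs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_reproduced(original: str, rebuilt: str) -> bool:
    """True when both artifacts have the same SHA-256 digest."""
    return sha256_of(original) == sha256_of(rebuilt)
```

When this returns False, a tool like diffoscope is what tells you where the two artifacts diverge.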

So in the near term, it might make sense to start an experimental wrapper, e.g. conda-build-reproducible, that uses the conda-build API and yields packages compatible with the rest of the conda ecosystem, but does pre- and post-processing to move some information around in order to make more guarantees, i.e. buildinfo, as discussed on #2239.

Continuing down the pipedream:

The "pop quiz" would be that such a tool could, for at least linux-64 and python3.6, reproducibly rebuild the 43 packages in the conda-build runtime dependency chain, the wrapper itself, as well as the TBD packages in the build dependencies of those, ad-hopefully-not-infinitum.

Once that was demonstrated, the next step (which I am not sure has even been explored yet) might be a variant of constructor, e.g. reconstructor, that could build reproducible appliances from reproducible packages, and similarly rebuild given a reconstructor appliance and a .buildinfo.

The "midterm exam" from this would be a reconstructor appliance for linux-64, self-hosted in the sense that (given sources) it could reproduce all of the ~45 packages in itself, etc. and (finally) itself.

I'm not sure what the "final exam" would be, presumably other architectures!

@jjhelmus
Contributor

jjhelmus commented Jan 26, 2018

Looks like Python 3.7 may provide a method to create deterministic .pyc files via the accepted PEP 552 -- Deterministic pycs
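
As a sketch of what PEP 552 enables, assuming Python 3.7 or later: `py_compile.compile` accepts an `invalidation_mode`, and the hash-based modes put a hash of the source in the header instead of its mtime. `hash_pyc_header` is an illustrative helper name, not an existing API.

```python
import os
import py_compile


def hash_pyc_header(src_path: str, pyc_path: str, mtime: int) -> bytes:
    """Compile to a hash-based pyc and return its 16-byte header."""
    # Change the source mtime on purpose; it should not leak into the pyc.
    os.utime(src_path, (mtime, mtime))
    py_compile.compile(
        src_path, cfile=pyc_path, doraise=True,
        invalidation_mode=py_compile.PycInvalidationMode.UNCHECKED_HASH,
    )
    with open(pyc_path, "rb") as f:
        return f.read(16)
```

Calling this twice on the same source with different mtimes should return identical headers, with the low bit of the flags field (bytes 4 to 8) set to mark the pyc as hash-based.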

@bollwyvl

Python 3.7 may provide a method to create deterministic .pyc files

Very exciting!

I've been looking for a chance to pick this back up! Perhaps there's a confluence of py37, conda-build and conda-forge and something like miniforge that will be a better "midterm" than a 3.6-based solution could be, given the number of python recipes that would need to be updated.

@github-actions

Hi there, thank you for your contribution!

This issue has been automatically marked as stale because it has not had recent activity. It will be closed automatically if no further activity occurs.

If you would like this issue to remain open please:

  1. Verify that you can still reproduce the issue at hand
  2. Comment that the issue is still reproducible and include:
    - What OS and version you reproduced the issue on
    - What steps you followed to reproduce the issue

NOTE: If this issue was closed prematurely, please leave a comment.

Thanks!

@github-actions github-actions bot added the stale [bot] marked as stale due to inactivity label Aug 23, 2022
@github-actions github-actions bot added the stale::closed [bot] closed after being marked as stale label Sep 22, 2022
@jaimergp
Contributor

In my opinion this issue should remain open, as conda-build doesn't provide reproducible builds as of today.

@leofang

leofang commented Feb 2, 2023

xref: #4762

@github-actions github-actions bot added the locked [bot] locked due to inactivity label Feb 3, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 3, 2024

7 participants