Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-29708: support SOURCE_DATE_EPOCH env var in py_compile (allow for reproducible builds of python packages) #296

Closed
wants to merge 2 commits into from

Conversation

bmwiedemann
Copy link
Contributor

@bmwiedemann bmwiedemann commented Feb 25, 2017

See https://reproducible-builds.org/ for why this is good
and https://reproducible-builds.org/specs/source-date-epoch/
for the definition of this variable.

Background:
In some distributions like openSUSE, binary rpms contain precompiled .pyc files.

And packages like amqp or twisted dynamically generate .py files at build time
so those have the current time and that timestamp gets embedded
into the .pyc file header.
When we then adapt file timestamps in rpms to be constant,
the timestamp in the .pyc header will no more match
the .py timestamp in the filesystem.
The software will still work, but it will not use the .pyc file as it should.

https://bugs.python.org/issue29708

@the-knights-who-say-ni
Copy link

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA).

Unfortunately we couldn't find an account corresponding to your GitHub username on bugs.python.org (b.p.o) to verify you have signed the CLA. This is necessary for legal reasons before we can look at your contribution. Please follow these steps to help rectify the issue:

  1. If you don't have an account on b.p.o, please create one
  2. Make sure your GitHub username is listed in "Your Details" at b.p.o
  3. If you have not already done so, please sign the PSF contributor agreement. The "bugs.python.org username " requested by the form is the "Login name" field under "Your Details".
  4. If you just signed the CLA, please wait at least one US business day and then check "Your Details" on bugs.python.org to see if your account has been marked as having signed the CLA (the delay is due to a person having to manually check your signed CLA)
  5. Reply here saying you have completed the above steps

Thanks again to your contribution and we look forward to looking at it!

@@ -137,6 +137,10 @@ def compile(file, cfile=None, dfile=None, doraise=False, optimize=-1):
except FileExistsError:
pass
source_stats = loader.path_stats(file)
sde = os.environ.get('SOURCE_DATE_EPOCH')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Debian, we don't ship binary packages with pyc files, instead we generate them when the package is installed. That makes more sense for us because we can support multiple versions of Python simultaneously, so we generate pycs based on the versions installed on the target system. That way we generally don't have to build all of our binary packages when we add a new Python interpreter, and we don't bloat our binary packages with pyc files.

But in any case, $SOURCE_DATE_EPOCH is the de facto standard for transmitting this to programs for reproducible builds, so this change is reasonable. However I would like to see documentation and a test.

@bmwiedemann
Copy link
Contributor Author

Will do the changes in a week (after vacation).

We had that discussion recently about dropping .pyc files in http://lists.opensuse.org/opensuse-packaging/2017-02/msg00086.html and it seems, many people prefer to keep them, because they want the package manager to track all files belonging to a package and because it makes it easier to predict disk-space consumption. Then there are some advantages in processing speed. Some even went as far as only wanting to ship .pyc files (similar to how we only ship binaries in C)

@bmwiedemann
Copy link
Contributor Author

did testing, found of the originally affected 476 packages in openSUSE, most were fixed by this patch, except 12 (eric5 nml pcp python3 python3-base python3-Markdown python3-pycparser python-base python-Markdown python-pycparser qpid-cpp salt) which seem to use some other method of producing .pyc files that will need to be patched separately.

@warsaw
Copy link
Member

warsaw commented Feb 27, 2017 via email

@vstinner
Copy link
Member

vstinner commented Mar 2, 2017

Please open a general issue on bugs.python.org to track multiple changes to support reproducible Python builds and link PR to this issue.

@vstinner vstinner changed the title allow for reproducible builds of python packages support SOURCE_DATE_EPOCH env var in pycompile (allow for reproducible builds of python packages) Mar 2, 2017
@@ -137,6 +137,10 @@ def compile(file, cfile=None, dfile=None, doraise=False, optimize=-1):
except FileExistsError:
pass
source_stats = loader.path_stats(file)
sde = os.environ.get('SOURCE_DATE_EPOCH')
if sde and source_stats['mtime'] > int(sde):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I concur for the unit test. Document the variable in Doc/library/py_compile.rst

What happens if SOURCE_DATE_EPOCH is not a valid integer? Should it be ignored or raise a nice error message?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The traceback from int() should be enough, right?

Copy link
Contributor

@ogrisel ogrisel Mar 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The traceback if int does not tell the end user to set the `SOURCE_DATE_EPOCH' environment variable with a valid integer value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might convert sde to int to raise a nice error message if the conversion fails.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I addressed 4 of 5 comments, but for this one I don't know how you would prefer it to be...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, if SOURCE_DATE_EPOCH is set, it is very uncommon (aka exceptional) to be a non-int, because it is set by programs rather than users most of the time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python should emulate what other compilers do, for example https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=e3e8c48c4a494d9da741c1c8ea6c4c0b7c4ff934

Also see https://reproducible-builds.org/specs/source-date-epoch/#idm53

If the value is malformed, the build process SHOULD exit with a non-zero error code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I say convert the string value to an int in a separate step in a try, catch the ValueError, and raise a chained ValueError that SOURCE_DATE_EPOCH is malformed.

@vstinner vstinner changed the title support SOURCE_DATE_EPOCH env var in pycompile (allow for reproducible builds of python packages) support SOURCE_DATE_EPOCH env var in py_compile (allow for reproducible builds of python packages) Mar 2, 2017
@bmwiedemann bmwiedemann changed the title support SOURCE_DATE_EPOCH env var in py_compile (allow for reproducible builds of python packages) bpo-29708: support SOURCE_DATE_EPOCH env var in py_compile (allow for reproducible builds of python packages) Mar 3, 2017
@bmwiedemann
Copy link
Contributor Author

added test and documentation

@Conan-Kudo
Copy link

Conan-Kudo commented Mar 6, 2017

@warsaw Yes. Fedora, OpenSUSE, and others do support multiple Python interpreters. In OpenSUSE's case, they have a global flavors thing that controls this, and in Fedora's case, we can target them specifically. However, as a matter of course, we ship one and only one per major version in the core distribution.

Because all paths are already versioned by Python (we didn't do the thing that Debian did and mutate the package install paths so that they are unversioned), parallel installability isn't an issue. We do rebuild things, but it also allows us to be sure things still work as we upgrade Python versions.

f.read(4)
timebytes = f.read(4) # read timestamp from pyc header
f.close()
self.assertEqual(timebytes, b'\x15\xcd\x5b\x07')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make this test a little more obvious, how about changing the bytes literal to (123456789).to_bytes(4, 'little') ?

@warsaw
Copy link
Member

warsaw commented Mar 6, 2017

I have just one suggestion left, but everything else about the branch LGTM.

@bmwiedemann
Copy link
Contributor Author

bmwiedemann commented Mar 6, 2017

updated again.
Are these failures in 'cpython/Lib/test/test_descr.py", line 2888, in test_keywords' really related to my change?
Then I wondered if the added test would fail on big-endian platforms (not many left though)
And there might be a more elegant way to save+restore the environment in the test.
And directly accessing bytes from .pyc files might be a bit fragile - e.g. the header format might need to be changed to avoid the year-2038 bug... I guess the test can be updated then as needed.

@brettcannon
Copy link
Member

@bmwiedemann those failures have been fixed, so just rebase with master again and they should go away.

@bmwiedemann bmwiedemann force-pushed the pycrb branch 2 times, most recently from 574735a to aeab488 Compare March 7, 2017 19:52
@bmwiedemann
Copy link
Contributor Author

rebased.

@bmwiedemann
Copy link
Contributor Author

@warsaw anything left for me to do here?

@FRidh
Copy link

FRidh commented Mar 17, 2017

In Nixpkgs we have also a patch to reproducibly build the Python interpreters.

Furthermore, whenever we build a Python package we also set PYTHONHASHSEED=0 and DETERMINISTIC_BUILD=1.

@FRidh
Copy link

FRidh commented Mar 17, 2017

did testing, found of the originally affected 476 packages in openSUSE, most were fixed by this patch, except 12 (eric5 nml pcp python3 python3-base python3-Markdown python3-pycparser python-base python-Markdown python-pycparser qpid-cpp salt) which seem to use some other method of producing .pyc files that will need to be patched separately.

@bmwiedemann you will also need to modify Lib/importlib/_bootstrap_external.py

Furthermore, for packages I found the following along with PYTHONHASHSEED=0 is needed:
https://bitbucket.org/pypa/wheel/pull-requests/77/bdist_wheel-deterministic-order-record/diff

@bmwiedemann
Copy link
Contributor Author

I still think, this patch is good to merge as it is.

@bmwiedemann
Copy link
Contributor Author

I have made the requested changes; please review again.

I dropped the mtime modification as it is not strictly needed (rpm can already adjust file timestamps (but it is enabled by a different macro than the one setting SOURCE_DATE_EPOCH)) and even if in some cases it leads to .pyc timestamps being inconsistent with .py mtime it only slightly decreases performance, but does not break functionality.

As for other compilers, I only know about Free Pascal 'fpc' and one other (forgot which) but AFAIK they dont have good solutions yet either, so pyc files are sort of a prototype in that area.

@bedevere-bot
Copy link

Thanks for making the requested changes!

@vstinner, @warsaw, @brettcannon: please review the changes made to this pull request.

except ValueError:
raise ValueError("SOURCE_DATE_EPOCH is not a valid integer")
if source_stats['mtime'] > source_date_epoch:
source_stats['mtime'] = source_date_epoch
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we making judgment calls about which one is greater? SOURCE_DATE_EPOCH means I decided I know better than you what the time is, that is the entire point. For example, Arch Linux stores all files in the package archive using the modification time of SOURCE_DATE_EPOCH.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OTOH rpm and tar's --clamp-mtime option implement mtime clamping, so that old files keep their timestamps and just timestamps of files created during a build are modified to be SOURCE_DATE_EPOCH
can be tricky to get this right for both cases...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in that case, the mtime clamp will set the pyc modification time to SOURCE_DATE_EPOCH anyway.

It would result in general files being clamped to a "latest" mtime of SOURCE_DATE_EPOCH and the pyc header timestamps being exactly equal to SOURCE_DATE_EPOCH, so everyone wins.

Copy link
Contributor Author

@bmwiedemann bmwiedemann Jan 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But python does an exact match when checking if it can use pyc files, so old .py files having .pyc files with 'too new' SOURCE_DATE_EPOCH in header would also result in .pyc files not to be used.

I think we might need something like

if source_stats['mtime'] > source_date_epoch or os.environ.get('ALWAYS_SET_MTIME'):
    source_stats['mtime'] = source_date_epoch

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think worrying about whether .pyc files will be regenerated based on the static date set in them is getting beyond the scope of reproducible builds. If this is going to have a chance to land in Python 3.7 then let's keep this PR just to the production of .pyc files and not their consumption.

Copy link
Member

@brettcannon brettcannon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think once the clamping is dropped so that it's a straight setting of the timestamp based on the envvar then this PR is ready to go (assuming my docs changes build appropriately 😁 ).

@bedevere-bot
Copy link

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@bmwiedemann
Copy link
Contributor Author

IMHO that does not make sense. The original bug was about .pyc header timestamps not matching what .py timestamps end up in packages and some of it will still be there, depending on whether all timestamps or just new ones are modified.
One simple, but ugly solution would be to have python always set mtime of .py files, so that those match .pyc headers.
The other less ugly solution would be to give hints to python about what the packaging code will do to mtimes later and take that into account.
Or distributions add a small custom patch to the python they ship to match things up... might actually be the cleanest variant, but would probably need a bit more documentation.

@eli-schwartz
Copy link
Contributor

The original report was about python not respecting SOURCE_DATE_EPOCH. Respecting SOURCE_DATE_EPOCH imho means using exactly that. Note that Arch Linux has recently gained support for reproducing packages, and we determined that the best policy was to make the package manager itself respect SOURCE_DATE_EPOCH, so makepkg will use it internally for the timestamps of packaged files, among other things.

(Reproducible builds should not have implementation-defined dependencies like setting SOURCE_DATE_EPOCH to the past rather than the future. The reproducible-builds documentation firts references using --mtime, and then says "in some cases" you might prefer to use --clamp-mtime, which does not sound like a ringing endorsement.)

@brettcannon
Copy link
Member

We should take this discussion over to https://bugs.python.org/issue29708 to get this resolved.

@bmwiedemann
Copy link
Contributor Author

PR #5200 was merged, so we do not need this approach anymore => closing

akruis pushed a commit to akruis/cpython that referenced this pull request Aug 1, 2021
It is now possible to cross-compile Stackless for Windows on ARM and ARM64.
jaraco pushed a commit that referenced this pull request Dec 2, 2022
mcepl added a commit to openSUSE-Python/cpython that referenced this pull request Apr 3, 2024
Created by Bernhard M. Wiedemann <bmwiedemann@suse.de> -- gh#python#296
mcepl added a commit to openSUSE-Python/cpython that referenced this pull request Apr 4, 2024
Created by Bernhard M. Wiedemann <bmwiedemann@suse.de> -- gh#python#296

Patch: 0001-allow-for-reproducible-builds-of-python-packages.patch
mcepl added a commit to openSUSE-Python/cpython that referenced this pull request Apr 4, 2024
Created by Bernhard M. Wiedemann <bmwiedemann@suse.de> -- gh#python#296

Patch: 0001-allow-for-reproducible-builds-of-python-packages.patch
mcepl pushed a commit to openSUSE-Python/cpython that referenced this pull request Apr 4, 2024
Code is originally from gh#python#296 (never released in
this form).

Patch: 0001-allow-for-reproducible-builds-of-python-packages.patch
mcepl pushed a commit to openSUSE-Python/cpython that referenced this pull request Apr 4, 2024
See https://reproducible-builds.org/ for why this is good idea
and https://reproducible-builds.org/specs/source-date-epoch/
for the definition of this variable.

Background:
In some distributions like openSUSE, binary rpms contain precompiled .pyc files.

And packages like amqp or twisted dynamically generate .py files at build time
so those have the current time and that timestamp gets embedded
into the .pyc file header.
When we then adapt file timestamps in rpms to be constant,
the timestamp in the .pyc header will no more match
the .py timestamp in the filesystem.
The software will still work, but it will not use the .pyc file as it should.

Original version which doesn't force hashed *.pyc files

Code is originally from gh#python#296 (never released in
this form).

Patch: 0001-allow-for-reproducible-builds-of-python-packages.patch
daregit pushed a commit to daregit/yocto-combined that referenced this pull request May 22, 2024
The compiled .pyc files contain time stamp corresponding to the compile time.
This prevents binary reproducibility. This patch allows to achieve binary
reproducibility by overriding the build time stamp by the value
exported via SOURCE_DATE_EPOCH.

Patch by Bernhard M. Wiedemann, backported from python/cpython#296

[YOCTO#11241]

(From OE-Core rev: 2a044f1e4f5c63e11e631b31f741c7aabfa6f601)

Signed-off-by: Juro Bystricky <juro.bystricky@intel.com>
Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
daregit pushed a commit to daregit/yocto-combined that referenced this pull request May 22, 2024
The compiled .pyc files contain time stamp corresponding to the compile time.
This prevents binary reproducibility. This patch allows to achieve binary
reproducibility by overriding the build time stamp by the value
exported via SOURCE_DATE_EPOCH.

Patch by Bernhard M. Wiedemann, backported from python/cpython#296

[YOCTO#11241]

(From OE-Core rev: 2a044f1e4f5c63e11e631b31f741c7aabfa6f601)

Signed-off-by: Juro Bystricky <juro.bystricky@intel.com>
Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting changes type-feature A feature request or enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.