-
-
Notifications
You must be signed in to change notification settings - Fork 30.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bpo-29708: support SOURCE_DATE_EPOCH env var in py_compile (allow for reproducible builds of python packages) #296
Conversation
Hello, and thanks for your contribution! I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA). Unfortunately we couldn't find an account corresponding to your GitHub username on bugs.python.org (b.p.o) to verify you have signed the CLA. This is necessary for legal reasons before we can look at your contribution. Please follow these steps to help rectify the issue:
Thanks again to your contribution and we look forward to looking at it! |
Lib/py_compile.py
Outdated
@@ -137,6 +137,10 @@ def compile(file, cfile=None, dfile=None, doraise=False, optimize=-1): | |||
except FileExistsError: | |||
pass | |||
source_stats = loader.path_stats(file) | |||
sde = os.environ.get('SOURCE_DATE_EPOCH') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Debian, we don't ship binary packages with pyc files, instead we generate them when the package is installed. That makes more sense for us because we can support multiple versions of Python simultaneously, so we generate pycs based on the versions installed on the target system. That way we generally don't have to build all of our binary packages when we add a new Python interpreter, and we don't bloat our binary packages with pyc files.
But in any case, $SOURCE_DATE_EPOCH
is the de facto standard for transmitting this to programs for reproducible builds, so this change is reasonable. However I would like to see documentation and a test.
Will do the changes in a week (after vacation). We had that discussion recently about dropping .pyc files in http://lists.opensuse.org/opensuse-packaging/2017-02/msg00086.html and it seems, many people prefer to keep them, because they want the package manager to track all files belonging to a package and because it makes it easier to predict disk-space consumption. Then there are some advantages in processing speed. Some even went as far as only wanting to ship .pyc files (similar to how we only ship binaries in C) |
did testing, found of the originally affected 476 packages in openSUSE, most were fixed by this patch, except 12 (eric5 nml pcp python3 python3-base python3-Markdown python3-pycparser python-base python-Markdown python-pycparser qpid-cpp salt) which seem to use some other method of producing .pyc files that will need to be patched separately. |
On Feb 25, 2017, at 09:37 PM, Bernhard M. Wiedemann wrote:
We had that discussion recently about dropping .pyc files in
http://lists.opensuse.org/opensuse-packaging/2017-02/msg00086.html and it
seems, many people prefer to keep them, because they want the package manager
to track all files belonging to a package and because it makes it easier to
predict disk-space consumption. Then there are some advantages in processing
speed. Some even went as far as only wanting to ship .pyc files (similar to
how we only ship binaries in C)
Just out of curiosity, does OpenSUSE support multiple parallel installed
versions of Python at the same time? E.g. can I install from system packages
both Python 3.5 and 3.6? If so, do you have to rebuild all your library
packages when you add a new version of Python?
(In the Debian ecosystem, we can support multiple versions, although we tend
to only do that during transitions. Still, it makes transitions easier for
the archive and users. Since we don't ship pyc files in binary packages, we
only have to rebuild packages with extension modules.)
|
Please open a general issue on bugs.python.org to track multiple changes to support reproducible Python builds and link PR to this issue. |
Lib/py_compile.py
Outdated
@@ -137,6 +137,10 @@ def compile(file, cfile=None, dfile=None, doraise=False, optimize=-1): | |||
except FileExistsError: | |||
pass | |||
source_stats = loader.path_stats(file) | |||
sde = os.environ.get('SOURCE_DATE_EPOCH') | |||
if sde and source_stats['mtime'] > int(sde): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I concur for the unit test. Document the variable in Doc/library/py_compile.rst
What happens if SOURCE_DATE_EPOCH is not a valid integer? Should it be ignored or raise a nice error message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The traceback from int()
should be enough, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The traceback if int
does not tell the end user to set the `SOURCE_DATE_EPOCH' environment variable with a valid integer value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might convert sde to int to raise a nice error message if the conversion fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I addressed 4 of 5 comments, but for this one I don't know how you would prefer it to be...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, if SOURCE_DATE_EPOCH
is set, it is very uncommon (aka exceptional) to be a non-int, because it is set by programs rather than users most of the time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python should emulate what other compilers do, for example https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=e3e8c48c4a494d9da741c1c8ea6c4c0b7c4ff934
Also see https://reproducible-builds.org/specs/source-date-epoch/#idm53
If the value is malformed, the build process SHOULD exit with a non-zero error code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I say convert the string value to an int in a separate step in a try
, catch the ValueError
, and raise a chained ValueError
that SOURCE_DATE_EPOCH
is malformed.
added test and documentation |
@warsaw Yes. Fedora, OpenSUSE, and others do support multiple Python interpreters. In OpenSUSE's case, they have a global flavors thing that controls this, and in Fedora's case, we can target them specifically. However, as a matter of course, we ship one and only one per major version in the core distribution. Because all paths are already versioned by Python (we didn't do the thing that Debian did and mutate the package install paths so that they are unversioned), parallel installability isn't an issue. We do rebuild things, but it also allows us to be sure things still work as we upgrade Python versions. |
Lib/test/test_py_compile.py
Outdated
f.read(4) | ||
timebytes = f.read(4) # read timestamp from pyc header | ||
f.close() | ||
self.assertEqual(timebytes, b'\x15\xcd\x5b\x07') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to make this test a little more obvious, how about changing the bytes literal to (123456789).to_bytes(4, 'little')
?
I have just one suggestion left, but everything else about the branch LGTM. |
updated again. |
@bmwiedemann those failures have been fixed, so just rebase with master again and they should go away. |
574735a
to
aeab488
Compare
rebased. |
@warsaw anything left for me to do here? |
In Nixpkgs we have also a patch to reproducibly build the Python interpreters.
Furthermore, whenever we build a Python package we also set |
@bmwiedemann you will also need to modify Furthermore, for packages I found the following along with |
I still think, this patch is good to merge as it is. |
I have made the requested changes; please review again. I dropped the mtime modification as it is not strictly needed (rpm can already adjust file timestamps (but it is enabled by a different macro than the one setting SOURCE_DATE_EPOCH)) and even if in some cases it leads to .pyc timestamps being inconsistent with .py mtime it only slightly decreases performance, but does not break functionality. As for other compilers, I only know about Free Pascal 'fpc' and one other (forgot which) but AFAIK they dont have good solutions yet either, so pyc files are sort of a prototype in that area. |
Thanks for making the requested changes! @vstinner, @warsaw, @brettcannon: please review the changes made to this pull request. |
except ValueError: | ||
raise ValueError("SOURCE_DATE_EPOCH is not a valid integer") | ||
if source_stats['mtime'] > source_date_epoch: | ||
source_stats['mtime'] = source_date_epoch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we making judgment calls about which one is greater? SOURCE_DATE_EPOCH means I decided I know better than you what the time is, that is the entire point. For example, Arch Linux stores all files in the package archive using the modification time of SOURCE_DATE_EPOCH.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OTOH rpm and tar's --clamp-mtime option implement mtime clamping, so that old files keep their timestamps and just timestamps of files created during a build are modified to be SOURCE_DATE_EPOCH
can be tricky to get this right for both cases...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in that case, the mtime clamp will set the pyc modification time to SOURCE_DATE_EPOCH anyway.
It would result in general files being clamped to a "latest" mtime of SOURCE_DATE_EPOCH and the pyc header timestamps being exactly equal to SOURCE_DATE_EPOCH, so everyone wins.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But python does an exact match when checking if it can use pyc files, so old .py files having .pyc files with 'too new' SOURCE_DATE_EPOCH in header would also result in .pyc files not to be used.
I think we might need something like
if source_stats['mtime'] > source_date_epoch or os.environ.get('ALWAYS_SET_MTIME'):
source_stats['mtime'] = source_date_epoch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think worrying about whether .pyc
files will be regenerated based on the static date set in them is getting beyond the scope of reproducible builds. If this is going to have a chance to land in Python 3.7 then let's keep this PR just to the production of .pyc
files and not their consumption.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think once the clamping is dropped so that it's a straight setting of the timestamp based on the envvar then this PR is ready to go (assuming my docs changes build appropriately 😁 ).
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
IMHO that does not make sense. The original bug was about .pyc header timestamps not matching what .py timestamps end up in packages and some of it will still be there, depending on whether all timestamps or just new ones are modified. |
The original report was about python not respecting SOURCE_DATE_EPOCH. Respecting SOURCE_DATE_EPOCH imho means using exactly that. Note that Arch Linux has recently gained support for reproducing packages, and we determined that the best policy was to make the package manager itself respect SOURCE_DATE_EPOCH, so (Reproducible builds should not have implementation-defined dependencies like setting SOURCE_DATE_EPOCH to the past rather than the future. The reproducible-builds documentation firts references using --mtime, and then says "in some cases" you might prefer to use --clamp-mtime, which does not sound like a ringing endorsement.) |
We should take this discussion over to https://bugs.python.org/issue29708 to get this resolved. |
PR #5200 was merged, so we do not need this approach anymore => closing |
It is now possible to cross-compile Stackless for Windows on ARM and ARM64.
Created by Bernhard M. Wiedemann <bmwiedemann@suse.de> -- gh#python#296
Created by Bernhard M. Wiedemann <bmwiedemann@suse.de> -- gh#python#296 Patch: 0001-allow-for-reproducible-builds-of-python-packages.patch
Created by Bernhard M. Wiedemann <bmwiedemann@suse.de> -- gh#python#296 Patch: 0001-allow-for-reproducible-builds-of-python-packages.patch
Code is originally from gh#python#296 (never released in this form). Patch: 0001-allow-for-reproducible-builds-of-python-packages.patch
See https://reproducible-builds.org/ for why this is good idea and https://reproducible-builds.org/specs/source-date-epoch/ for the definition of this variable. Background: In some distributions like openSUSE, binary rpms contain precompiled .pyc files. And packages like amqp or twisted dynamically generate .py files at build time so those have the current time and that timestamp gets embedded into the .pyc file header. When we then adapt file timestamps in rpms to be constant, the timestamp in the .pyc header will no more match the .py timestamp in the filesystem. The software will still work, but it will not use the .pyc file as it should. Original version which doesn't force hashed *.pyc files Code is originally from gh#python#296 (never released in this form). Patch: 0001-allow-for-reproducible-builds-of-python-packages.patch
The compiled .pyc files contain time stamp corresponding to the compile time. This prevents binary reproducibility. This patch allows to achieve binary reproducibility by overriding the build time stamp by the value exported via SOURCE_DATE_EPOCH. Patch by Bernhard M. Wiedemann, backported from python/cpython#296 [YOCTO#11241] (From OE-Core rev: 2a044f1e4f5c63e11e631b31f741c7aabfa6f601) Signed-off-by: Juro Bystricky <juro.bystricky@intel.com> Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
The compiled .pyc files contain time stamp corresponding to the compile time. This prevents binary reproducibility. This patch allows to achieve binary reproducibility by overriding the build time stamp by the value exported via SOURCE_DATE_EPOCH. Patch by Bernhard M. Wiedemann, backported from python/cpython#296 [YOCTO#11241] (From OE-Core rev: 2a044f1e4f5c63e11e631b31f741c7aabfa6f601) Signed-off-by: Juro Bystricky <juro.bystricky@intel.com> Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
See https://reproducible-builds.org/ for why this is good
and https://reproducible-builds.org/specs/source-date-epoch/
for the definition of this variable.
Background:
In some distributions like openSUSE, binary rpms contain precompiled .pyc files.
And packages like amqp or twisted dynamically generate .py files at build time
so those have the current time and that timestamp gets embedded
into the .pyc file header.
When we then adapt file timestamps in rpms to be constant,
the timestamp in the .pyc header will no more match
the .py timestamp in the filesystem.
The software will still work, but it will not use the .pyc file as it should.
https://bugs.python.org/issue29708