Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rules/python: Option to generate .pyc files for py_library. #1761

Closed
adam-azarchs opened this issue Sep 10, 2022 · 43 comments · Fixed by #1902
Closed

rules/python: Option to generate .pyc files for py_library. #1761

adam-azarchs opened this issue Sep 10, 2022 · 43 comments · Fixed by #1902

Comments

@adam-azarchs
Copy link

Description of the feature request:

Add options on one or more of the py_library rule itself, the python toolchain configuration, or the bazel command line, to generate python bytecode files as part of the build.

In addition to the basic on/off option, it might be useful to include options to

  1. Control the "optimization" level (for what that's worth)
  2. Leave the source .py file out of the build outputs - the .pyc-only use case.

What underlying problem are you trying to solve with this feature?

When python loads a module, it parses the source code to an AST, which it then attempts to cache in bytecode format. Building those as part of the build process solves a few issues:

  1. To quote the documentation for the py_compile module:

    Though not often needed, this function can be useful when installing modules for shared use, especially if some of the users may not have permission to write the byte-code cache files in the directory containing the source code.

    Particularly in the case where bazel is being used to build a tarball of code that includes (but may not be limited to) python, and might then be deployed somewhere that's read-only to most users, it would be useful to be able to include these precompiled bytecode files.

  2. The attempt to compile the bytecode files would fail on syntactically-invalid python code, which is probably a good thing for catching failures earlier on in the build process.

  3. Having .pyc files available improves application startup time. Especially for large python codebases, if some module is transitively imported from thousands of unit tests, currently each of those tests would end up re-parsing the python source file, which is a waste of time. Having the .pyc files is also helpful for improving startup times for "serverless" platforms.

  4. The .pyc files can be substantially smaller than the source files. For situations where application distribution size is important, e.g. "serverless" platforms, this can matter.

  5. Some people place value on the marginal degree of obfuscation and tamper resistance offered by .pyc-only distributions. While reverse-engineering the source from a .pyc file isn't hard, it's also not nothing.

Which operating system are you running Bazel on?

Linux

What is the output of bazel info release?

release 5.3.0

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

References/notes:
https://docs.python.org/3/library/py_compile.html
The default compilation mode is TIMESTAMP; this is probably a bad idea. In a bazel build we'd probably want to use UNCHECKED_HASH. Would also need to ensure that the embedded path name was appropriately relative.

From PEP-3147:

Python will still support pyc-only distributions, however it will only do so when the pyc file lives in the directory where the py file would have been, i.e. not in the pycache directory. pyc file outside of pycache will only be imported if the py source file is missing.

This means that in the case where the .py file is still being included, the output path would need to depend on the python interpreter version. This probably would require an attribute to be added to py_runtime for that purpose.

@comius
Copy link
Contributor

comius commented Oct 6, 2022

cc @rickeylev

@rickeylev
Copy link
Collaborator

This is probably fairly easy to do by adding a precompiler attribute to the toolchain that the rules can use.

The tricky part I'm not sure of is what that attribute points to. It can't point to a py_binary itself, since that would be a circular dependency. But we have to use Python to run py_compile somehow.

Maybe allow it to be an executable or a single py file? If it's an executable, run it. If it's a py file, then run "$interpreter $file". This allows creating a pre-built executable if desired, while also allowing it to Just Work.

(this again touches on the problem of how there's several cases where py-binaries want to be used as part of the toolchain)

In a bazel build we'd probably want to use UNCHECKED_HASH

Probably, yes. CHECKED_HASH is somewhat appealing, too, though, because it allows you to modify the source file and see the new behavior without having to rebuild, though.

I'm +1 on defaulting to one and having an option for the other. Maybe base the default on -c opt so optimized builds used unchecked_hash by default, while fastbuilds use CHECKED_HASH by default. I can easily see a server not caring about this difference, while a CLI app would.

I also like the idea of a library-level control here somehow, too. A scenario we've enountered with large packages (e.g. tensorflow), is the sheer number of modules has its own overhead; pyc helps, but being as how modifying them is the exception, not the rule, making them check their hash is wasted work. We'll have to figure out how a target-level and global-flag-level option should interact, though (which has precedence?).

This means that in the case where the .py file is still being included, the output path would need to depend on the python interpreter version. This probably would require an attribute to be added to py_runtime for that purpose.

Yeah, the details of the pyc infix aren't well understood to me. I think PEP 488 is another half of the puzzle. I swear I thought there was a part of the specification that included the "compile mode" of Python itself, too (e.g. debug for non-debug), but I don't see that mentioned.

We also have to make sure that the generated name is recognized by the runtime that will be used. I think this is controlled by BYTECODE_SUFFIXES? One option is to simply stick this onto the toolchain, too. Another would be to pass the necessary suffix to the stub template so it can configure the runtime at startup.

Additionally, the output path depends on if the original .py files is included or not.

The Major.Minor Python version is in the --python_version flag.

@adam-azarchs
Copy link
Author

The tricky part I'm not sure of is what that attribute points to. It can't point to a py_binary itself, since that would be a circular dependency. But we have to use Python to run py_compile somehow.

Maybe allow it to be an executable or a single py file? If it's an executable, run it. If it's a py file, then run "$interpreter $file". This allows creating a pre-built executable if desired, while also allowing it to Just Work.

(this again touches on the problem of how there's several cases where py-binaries want to be used as part of the toolchain)

Yeah, this is kind of a similar problem as the one faced for the coverage tool in bazelbuild/bazel#15590. However, there's an important difference: py_compile is part of the python standard library. Given that, I don't really see a need to offer an option to customization of the compilation tool - bazel could just hard-code running $interpreter -m py_compile, which as far as I can tell would work for all of CPython, PyPy, and MicroPython at least (though I've only tried with CPython).

Probably, yes. CHECKED_HASH is somewhat appealing, too, though, because it allows you to modify the source file and see the new behavior without having to rebuild, though.

I'm +1 on defaulting to one and having an option for the other. Maybe base the default on -c opt so optimized builds used unchecked_hash by default, while fastbuilds use CHECKED_HASH by default. I can easily see a server not caring about this difference, while a CLI app would.

Absolutely! However, that's something you can handle with select, so it doesn't really have to impact the design much.

I think the options we'd want exposed look something like

  1. No compilation.
  2. CHECKED_HASH
  3. UNCHECKED_HASH
  4. pyc-only

As well as the mostly orthogonal choices for "optimization" level. (which I will continue to put in quotes. It's really too bad that there isn't an optimization level which strips out docstrings, which you don't need in a server context, but leaves in assertions, which might be a bit dangerous to strip out)

I also like the idea of a library-level control here somehow, too. A scenario we've enountered with large packages (e.g. tensorflow), is the sheer number of modules has its own overhead; pyc helps, but being as how modifying them is the exception, not the rule, making them check their hash is wasted work. We'll have to figure out how a target-level and global-flag-level option should interact, though (which has precedence?).

This is why I suggested pre-compilation should be an attribute on the py_library rule, as well as possibly the ability to set a default at the toolchain level. For one thing, you might want to have some things be distributed as pyc-only, while others might want to include source. You might also want to be able to control the "optimization" level on a per-target basis. I think the safe approach there would be to just say local specification takes precedence over global specification - I can't really imagine a situation where someone would set an override on a library and we'd not want to trust that they had a good reason to do so. At a nitty-gritty API level, however, it does mean we have to be careful about the value of the attribute one sets for this; one might think that compile_mode = None would be a default of "use the toolchain default," but that would catch someone by surprise if they (somewhat reasonably) read that as meaning "no compilation". Easy enough to handle - just make the default for the attribute be e.g. "from_toolchain" or whatever.

@rickeylev
Copy link
Collaborator

Yeah, this is kind of a similar problem as the one faced for the coverage tool in bazelbuild/bazel#15590. However, there's an important difference: py_compile is part of the python standard library. Given that, I don't really see a need to offer an option to customization of the compilation tool - bazel could just hard-code running $interpreter -m py_compile, which as far as I can tell would work for all of CPython, PyPy, and MicroPython at least (though I've only tried with CPython).

I think invoking the interpreter directly like that is a workable intermediate solution. So I'm +1 on doing that for now. I'm fairly certain we need a custom program to ensure deterministic behavior and pass the necessary info, though. The cli of -m py_compile isn't sufficient.

However, I need to make a few architectural points about why such solutions make it hard or impossible to take full advantage of Bazel's processing model and features.

As some concrete examples:

When using embedded Python, there's no need for a standalone interpreter, so building one simply for precompilation is unnecessary work. (This is particularly salient to me because that's my primary way of building Python programs).

Precompilation is a near perfect fit for using persistent workers, which would eliminate runtime startup overhead. This requires special support by the executable invoked.

Precompilation must run in the exec configuration. This means the runtime itself may be built differently than the runtime actually used, may be run remotely, and/or run on an entirely different platform. So more steps have to be taken to ensure such details don't leak into the output. For example, the launcher script does several things to prevent system settings from affecting the runtime which have to be re-implemented.

Similarly, the exec configuration usually emphasizes time-to-build over runtime-efficiency. This makes sense for typical user build tools, but less so for something like precompilation; precompilation is well defined and it's rate of invocation scales with the size of the overall build, ie the larger your transitive closure, the more it runs, so you're probably going to benefit from the precompiler (and runtime) being built with optimizations enabled. Such optimizations can extend beyond simply e.g. pyc -O levels, and into C-level optimizations that affect the runtime itself; these optimizations are hard to apply during a build, but are easy to apply to to an prebuilt executable which is later run.

This isn't to say simply doing "$runtime -m py_compile" is bad or wrong, just that it leaves a lot sitting at the Bazel Buffet Table.

As well as the mostly orthogonal choices for "optimization" level. (which I will continue to put in quotes. It's really too bad that there isn't an optimization level which strips out docstrings, which you don't need in a server context, but leaves in assertions, which might be a bit dangerous to strip out)

Srsly, right? This has bugged me for years! It's the sort of thing a custom precompiler executable could do, though. This also makes me think, if we exposed an optimization level, it'd probably be better done as a list of values instead of a simple number.

I think the safe approach there would be to just say local specification takes precedence over global specification - I can't really imagine a situation where someone would set an override on a library and we'd not want to trust that they had a good reason to do so.

I agree. A case we have internally is that protocol buffers (which are handled by a special rule) are compiled by default. This is pretty akin to a library-level setting.

At a nitty-gritty API level, however, it does mean we have to be careful about the value of the attribute one sets for this; one might think that compile_mode = None would be a default of "use the toolchain default," but that would catch someone by surprise if they (somewhat reasonably) read that as meaning "no compilation". Easy enough to handle - just make the default for the attribute be e.g. "from_toolchain" or whatever.

I agree.

@adam-azarchs
Copy link
Author

Worth noting mostly for the sake of code archeology that there's this comment in the code currently: https://github.com/bazelbuild/bazel/blob/6d72ca979f8cf582f53452d5f905346e7effb113/src/main/java/com/google/devtools/build/lib/rules/python/PythonSemantics.java#L69-L71
however as far as I can tell the implementation https://github.com/bazelbuild/bazel/blob/9d0163002ca63f4cdbaff9420380d72e2a6e38b3/src/main/java/com/google/devtools/build/lib/bazel/rules/python/BazelPythonSemantics.java#L117-L120
simply copy the sources unaltered. But for what it's worth I'm guessing that at some point in the past someone had an implementation, or at least and attempt at one, probably Google-internal-only, for at least some of this functionality.

@comius comius added P3 and removed untriaged labels Oct 17, 2022
@antonio-arbo
Copy link

Thanks for filing & tracking!
It would be great to have this option, particularly since this seems to render x-machine remote cache in bazel quite less effective since each machine has its own timestamp of the pyc. So it would be much better to have a more stable and safer hash instead of just timestamps.

Do you have a suggestion of a workaround for the time-being?

BTW, for others looking into this. A simple bazel clean -expunge and then rebuild will show many remote cache misses. Confirm it looking into the exec logs as in https://bazel.build/remote/cache-remote .

@Ryang20718
Copy link

I would very much <3 to see this feature.

we have a lot of heavy imports :(

Some of our heavy hitter imports bog down both test & runtime (mainly test)

One of the pain points right now is that we use pytorch in our bazel monorepo. However, because we import it in so many of our libraries, it incurs a large import time overhead (3.8 seconds).

With pycache, it drops down to ~2.5 seconds.

Is the general idea to generate pyc files once when all external third party repos are fetched at analysis time and then to ingest these files as input in py_library?

Would love some pointers on where to start (even if it's just a hacky prototype 😅 )

@rickeylev
Copy link
Collaborator

I'm going to transfer this issue to the rules_python repo because, as of rules_python 0.31.0, we've enabled the rules_python-based implementation of the rules by default for Bazel 7+. This basically means this feature won't land in the Bazel builtin rules, but would instead be in the rules_python implementation.

@rickeylev rickeylev transferred this issue from bazelbuild/bazel Feb 15, 2024
@rickeylev
Copy link
Collaborator

Would love some pointers on where to start (even if it's just a hacky prototype 😅 )

Can do! I've had parts of this mentally sketched out, but haven't thought it through entirely. This is already partially wired into the rules_python implementation, too.

There's a stub function, maybe_precompile() src that gets called. As output, it returns the new set of sources a library should use, which is either (a) the py files (no precompiling), or (b) both the py files and pyc files, or (c) just the pyc files.

All this function really has to do is loop over the sources, run a tool to generate the pyc files, and return them. The important thing is that it generates deterministic output (i.e. no timestamp based pyc files). It'll probably need some flags to control behavior a bit, as discussed, but lets focus on a prototype first. Anyways, the core code of this function is pretty simple:

pycs = []
for src in src_files:
    # Generating a file in another package is an error, so we have to skip
    # such cases.
    if ctx.label.package != src.owner.package:
        continue
    pyc = ctx.actions.declare_file(..., sibling = src)
    ctx.actions.run(
        executable = ...,
        inputs = [src],
        arguments = ... + [src.path, pyc.path],
        outputs = [pyc],
        mnemonic = "PyCompile",
        progress_message = "Compiling Python %{input}",
        toolchain = ...,
    )
    pycs.append(pyc)
return pycs

That's the first part, which is easy.

The second part is defining the tool that is run to generate the pyc. For the purposes of experimenting, you can add an implicit attribute on py_library pointing to some executable to run. The tricky part here is the executable it points to can't be a regular py_binary -- that would result in a circular dependency. For experimenting, define it how you like to work out details (e.g. a sh_binary that just calls the system python would suffice for testing purposes).

Within Google, we prebuild a binary and use an implicit attribute. A prebuilt binary is actually a good thing in this case, but not required. It allows neat things like building an binary with optimizations, having a generic binary that can generate byte code for any python version, building a cc_binary that embeds python, or heck, you could implement a byte code compiler in Rust or whatever if you really wanted to. I'm getting ahead of myself. Anyways.

It's probably best to just start with invoking something like python precompile.py <src> <pycout>. That's going to be pretty close to how a non-bazel python works.

Using an implicit attribute, however -- that won't work well in the Bazel world where the environment isn't as tightly controlled and where there are many more platform configurations to worry about.

In the open Bazel world, this precompiler tool needs to come from a toolchain. This is because we need toolchain resolution to find a tool that matches the target config (the target python version we need to generate byte code for) that can run on one of our available execution platforms. A basic implementation that re-used the same runtime would look something like this, I think:

toolchain(
  name = "precompiler_3_10_linux_toolchain",
  target_compatible_with = [
    "@rules_python//python/config_settings:is_python_3.10",
  ],
  exec_compatible_with = ["@platforms//os:linux"],
  toolchain = ":precompiler_toolchain_impl",
  toolchain_type = "//python:precompiler_toolchain_type",
)

py_precompiler_toolchain(
  name = "precompiler_toolchain_impl",
  interpreter = ":bin/python3.10",
  precompiler_src = "@rules_python//tools/precompiler:precompiler.py"
)

py_precompiler_toolchain = rule(..., attrs = {
  "interpreter": attr.label(cfg="exec"),
  "precompiler_src": attr.label(cfg="exec"),
})

# In the maybe_precompile function
ctx.actions.run(
  executable = ctx.toolchains["//python:precompiler_toolchain_type"].interpreter
  args = [ctx.toolchains["//python:precompiler_toolchain_type"].precompiler_src] + [src.path, pyc.path],
  toolchain = "//python:precompiler_toolchain_type",
  ...
)

(there's variations of the above that would also work, but thats the gist).

@Ryang20718
Copy link

Was prototyping a bit. I really appreciate all the pointers here!

I used the system python interpreter to prototype for now and noticed some weird behavior and perhaps I'm missing something fundamental here, but when bazel "compiles" these python files into pyc, sometimes, we run into the case where the sandbox only contains the directory bazel-out. cding into the sandbox debug path, we see only bazel-out exists and no "external" directory exists for external/pip_grpcio/site-packages/grpc/framework/interfaces/face/utilities.py

  (cd /dev/shm/bazel-sandbox.2566462e83a3701bdcf2405eb4d15ec8f4c1eed3744dd6958e7c02b8ef25ffcf/linux-sandbox/153/execroot/test_pyc && \
  exec env - \
    TMPDIR=/tmp \
  /home/ryang/.cache/bazel/_bazel_ryang/install/14fb027596f626f2526df4873ea20b8b/linux-sandbox -t 15 -w /dev/shm -w /dev/shm/bazel-sandbox.2566462e83a3701bdcf2405eb4d15ec8f4c1eed3744dd6958e7c02b8ef25ffcf/linux-sandbox/153/execroot/test_pyc -w /tmp -e /tmp -S /dev/shm/bazel-sandbox.2566462e83a3701bdcf2405eb4d15ec8f4c1eed3744dd6958e7c02b8ef25ffcf/linux-sandbox/153/stats.out -D /dev/shm/bazel-sandbox.2566462e83a3701bdcf2405eb4d15ec8f4c1eed3744dd6958e7c02b8ef25ffcf/linux-sandbox/153/debug.out -- /bin/bash -c '/bin/python3.10 -m py_compile external/pip_grpcio/site-packages/grpc/framework/interfaces/face/utilities.py')
 ....
 

 ls -a
.  ..  bazel-out
 

Debugging this prototype, we see src.path for this situation is external/pip_grpcio/site-packages/grpc/framework/interfaces/face/utilities.py and pyc.path is bazel-out/k8-opt/bin/external/pip_grpcio/site-packages/grpc/framework/interfaces/face/utilities.cpython-310.pyc

Is there something fundamental that I'm missing?

Prototype:

def maybe_precompile(ctx, srcs):
    pycs = []
    for src in srcs:
        if "site-packages" not in src.path:
            pycs.append(src)
            continue


        basename = src.basename
        dirname = src.path.rsplit("/", 1)[0].split("/", 2)[2]

        pyc_out = dirname + "/__pycache__/" + basename.replace(".py", ".cpython-310.pyc")
        pyc = ctx.actions.declare_file(pyc_out)
        ctx.actions.run_shell(
            outputs = [pyc],
            command = "/bin/python3.10 -m py_compile " + src.path
        )

        pycs.append(pyc)
    return pycs
    ```

@rickeylev
Copy link
Collaborator

That action invocation doesn't look correct.

  • The src File object should be an input to the action. This ensures the file is available when the action runs.
  • The output path should also be passed to the command. I don't see how the py_compile module would know that the output pyc file is in the same location as where bazel is going to create it.

The second should result in an error from bazel about an output file not being created.

@adam-azarchs
Copy link
Author

adam-azarchs commented Mar 5, 2024

Worth emphasizing the second point there,

  • The output path should also be passed to the command. I don't see how the py_compile module would know that the output pyc file is in the same location as where bazel is going to create it.

You cannot assume that the tool can derive the output path from the input path, because for example that prefix bazel-out/k8-opt could also be bazel-out/k8-fastbuild or e.g. something like k8-opt-exec-2B5CBBC6 if there are configuration transitions happening, which there probably will be in some cases.

py_compile will just try to create it as a sibling of the source file in the input tree, or in a __pycache__ subdirectory, which will most likely fail due to that being a read-only filesystem within the sandbox. But I think it'll fail silently because python wants you to be able to use python libraries in directories you don't own.

@adam-azarchs
Copy link
Author

Another point worth noting here is that if you're using a python script to generate the .pyc then you can lean on the existing python toolchain configuration to find the python executable. This doesn't necessarily obviate the need for a separate compilation toolchain, because toolchain resolution in this case will give you a python interpreter that is compatible with the execution environment, but you might need something other than that to produce output compatible with the target environment.

@Ryang20718
Copy link

Ryang20718 commented Mar 6, 2024

Prototype done!

Thanks for all the pointers! not pretty, but it does generate all the pyc files! (in our case, we use python310)

def maybe_precompile(ctx, srcs):
    """Computes all the outputs (maybe precompiled) from the input srcs.

    See create_binary_semantics_struct for details about this function.

    Args:
        ctx: Rule ctx.
        srcs: List of Files; the inputs to maybe precompile.

    Returns:
        List of Files; the desired output files derived from the input sources.
    """

    pycs = []
    for src in srcs:
        if ctx.label.package != src.owner.package:
            continue
        if src.extension != "py":
            continue
        if "site-packages" not in src.path:
            pycs.append(src)
            continue

        basename = src.basename
        dirname = src.path.rsplit("/", 1)[0].split("/", 2)[2]

        pyc_out = dirname + "/__pycache__/" + basename.replace(".py", ".cpython-310.pyc")
        pyc = ctx.actions.declare_file(pyc_out)
        command = "'from py_compile import compile ; compile(\"" + src.path + "\", cfile=\"" + pyc.path + "\")'"
        ctx.actions.run_shell(
            outputs = [pyc],
            inputs = [src],
            command = "export SOURCE_DATE_EPOCH=123 && /bin/python3.10 -c " + command
        )
        pycs.append(pyc)
        pycs.append(src)

    return pycs

@adam-azarchs
Copy link
Author

Can clean it up a bit by doing

args = ctx.actions.args()
args.add("-c")
args.add("""
from sys import argv
from py_compile import compile
compile(argv[1], cfile=argv[2])
""")
args.add(src)
args.add(pyc)
runtime=ctx.toolchains["@bazel_tools//tools/python:toolchain_type"].py3_runtime
ctx.actions.run(
    executable=runtime.interpreter,
    arguments=[args],
    env={"SOURCE_DATE_EPOCH": "123"},
    tools=[runtime.files]
    ...
)

Modulo issues with cross-compiling, that at least will ensure use of the right python executable for the current build configuration, avoids the pitfalls of string interpolation in command lines, and is probably a bit more efficient for bazel to analyze.

I'd also note that setting SOURCE_DATE_EPOCH like that isn't really a great solution here; better to set invalidation_mode to PycInvalidationMode.CHECKED_HASH or UNCHECKED_HASH (see discussion above) rather than using a fake timestamp that will either always or never result in recompilation.

@rickeylev
Copy link
Collaborator

Yay, a working prototype!

if "site-packages" not in src.path:

What is this line for? I'm guessing its just debug code because you're trying to isolate it to just the pip-generated files being processed?

Adam's comments are in the right direction for cleaning it up.

For an initial PR, the important parts are:

  • As discussed, use the hash-based invalidation setting
  • Get the executable from toolchain lookup
  • Use sys.implementation.cache_tag to get the cpython-xx magic tag part; works well enough to start with.

Before making it generally available we need to:

  • Using a file instead of string will probably work better. This also makes it easier use a different implementation using select.
  • PYTHONHASHSEED=0 should also be set when running the action to help guard against non-determinism
  • The magic tag should should also come from toolchain lookup.
  • Address target vs exec config mismatch. This enabled cross-building.

@adam-azarchs
Copy link
Author

To keep things simple my inclination would be to leave out the "magic tag" altogether, since it's optional and at least in the context of a hermetic build it's very important to have it (worst case you run a different interpreter at runtime and it just rejects and ignores the .pyc). Also, if we want to support .pyc-only mode, we can't be stuffing it into a __pycache__ subdirectory. In that respect I think ideally we'd want to be able to defer the decision of whether or not to go .pyc-only until later, e.g. a packaging action that might want to be able to build a .pyc-only "release" as well as a "debug" package that includes the .pys.

There's also the part of this where it needs to get added in to the py_library. Ideally we'd carry it through the PyInfo provider. If we want to support "pyc-only" mode on a per-library basis, one has to be a little careful here with transitive dependencies to keep track of which .py/.pyc files do/don't exist as a pair, so downstream consumers, e.g. packaging rules, could decide to

  1. Ship both .py and .pyc files.
  2. Ship .py files, but only .pyc files that don't have a corresponding .py
  3. Ship .pyc files, except in cases where a .pyc wasn't generated for whatever reason.

There's also the question of whether we want to support "optimized" (.pyo) mode, but TBH I'm not sure if anyone actually uses that anywhere so I wouldn't worry about it for initial implementation.

@rickeylev
Copy link
Collaborator

leave out the "magic tag" altogether, ... worst case you ... it just rejects and ignores the .pyc

I think its OK to leave it out for the initial version, but it's necessary in order to support multiple versions of Python in the same build. Otherwise, it'll be a build error because different actions will attempt to create the same file with different content.

if we want to support .pyc-only mode, we can't be stuffing it into a pycache subdirectory.

Good catch. This would put pyc-only builds and multi-version builds at odds. From what I'm reading, the foo.<magic>.pyc is only read from __pycache__ directories, but plain foo.pyc is still read from along side foo.py. We can't know if multiple versions are going to be used, so we can't decide automatically.

The options I see here are:

  1. For PYC_ONLY mode, document the potential risk with multiple versions in the same build. cest le vie
  2. For PYC_ONLY mode, use unchecked hash mode and put a no-op .py file so __pycache__ is used.
  3. Have separate PYC_ONLY_ONE_VERSION and PYC_ONLY_MULTI_VERSION settings. One could futz with the import system to make multi-version pyc-only work.

In any case, I think we can just target py+pyc initially.

In that respect I think ideally we'd want to be able to defer the decision of whether or not to go .pyc-only until later, e.g. a packaging action that might want to be able to build a .pyc-only "release" as well as a "debug" package that includes the .pys.

This is an interesting idea, but I'm not sure how it can be well accommodated. The two ways it could be done is using an aspect or transition. An aspect is the more natural way for e.g. a higher level rule, like a packaging one, to add extra information to its dependencies, but the issue with an aspect is it would conflict with the action performed by py_library. It could work OK if the py_library had any pyc-actions suppressed.

It might work better for such a rule to do a split transition.

Looking at how some other rules handle this might be informative (the case that comes to mind is c++; iirc, there's a combination of build flags, compiler flags, and providers that determine and provide debug information)

where it needs to get added in to the py_library. Ideally we'd carry it through the PyInfo provider.

Yes, it belongs in PyInfo.

iirc, maybe_precompile() will put it into PyInfo.transitive_srcs (it also goes into runfiles). I'm a bit mixed on this because I'm torn about whether transitive_srcs should only contain source files (.py files) or not. Things work OK with it having pyc files in there, and splitting out a separate depset just for pyc files seems overkill, i guess? I didn't find much indication that a lot of thought went into this question originally.

should we support "optimized" (.pyo) mode ... skip for initial implementation

Yeah, I agree. Skip for the initial implementation. Note that .pyo files were dropped and that info is part of the file name after the magic tag (the optimize arg control it). See https://peps.python.org/pep-0488/

This would be easy to add, though: just add a second arg to set the compilation optimization level and pass it onto the action.

@Ryang20718
Copy link

Ryang20718 commented Mar 8, 2024

leave out the "magic tag" altogether, ... worst case you ... it just rejects and ignores the .pyc
From what I'm reading, the foo..pyc is only read from pycache directories, but plain foo.pyc is still read from along side foo.py

😅 Wondering if there's docs for when plain foo.pyc should also be read (or perhaps I'm misinterpreting) ? Doing some testing, I'm not quite sure placing the same .py & .pyc file in the same dir actually will result in the pyc file being used.

toy example

in a directory we have the following

ls -a
.  ..  main.py  ryang.py  ryang.pyc

ryang.py:

def hello():
	print("hi")

ryang.pyc : manually generated pyc via compile

main.py

from ryang import hello

If we run the following file, we see from the trace that we don't actually read from the "magic tag"

python3 -v main.py

Python 3.10.13 (main, Aug 25 2023, 13:20:03) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
# code object from /tmp/ryang.py
# created '/tmp/__pycache__/ryang.cpython-310.pyc'
import 'ryang' # <_frozen_importlib_external.SourceFileLoader object at 0x7f1c292c00a0>

What is this line for?

yeah 😓

@adam-azarchs
Copy link
Author

Wondering if there's docs for when plain foo.pyc should also be read

pep-3147 has a handy flow-chart:
pep-3147

This is an interesting idea, but I'm not sure how it can be well accommodated.

What I had envisioned looks more or less like this:

  • PyInfo maintains four sets of transitive sources:
    1. py-only sources
    2. pyc-only sources
    3. py sources where there's also a pyc
    4. pyc sources where there's also a py
  • Down-stream consumer rules always consume 1 and 2, but have a choice whether to take 3, 4, or both.

@Ryang20718
Copy link

ah I see, python will only import the pyc in the same directory as the py file if and only if the original py file doesn't exist.

For the consumer speedup at import time, we would only get speedups (meaning no need to generate a pyc file again) for the following

  • PyInfo maintains four sets of transitive sources:
  1. py-only sources --> no speedup
  2. pyc-only sources --> speedup
  3. py sources where there's also a pyc --> no speedup
  4. pyc sources where there's also a py --> no speedup

Wondering what downstream users would use options 3/4 for?

@adam-azarchs
Copy link
Author

adam-azarchs commented Mar 14, 2024

That's not quite true. I think it'll still foo.pyc from the same directory as foo.py if there isn't a __pycache__/ and it isn't able to create one, which would normally be the case in a bazel sandbox.

Regardless, if you are using the __pycache__ path, you you should be seeing a speedup so long as you're generating it correctly. Meaning, not using timestamp-based invalidation. Setting SOURCE_DATE_EPOCH will make the timestamp reproducible but will also make the timestamp always be older than the source file, so python won't ever use the .pyc. Using CHECKED_HASH, it doesn't need to parse the source file but does still need to read it, so the speedup on a modern system that is usually I/O-bound will be minimal. Though, in a lot of bazel contexts it may be putting everything on tmpfs in which case the difference will be more noticeable. UNCHECKED_HASH should see a speedup regardless of whether the .py sources are present.

@Ryang20718
Copy link

UNCHECKED_HASH should see a speedup regardless of whether the .py sources are present.

hm... maybe my testing methodology is wrong or perhaps I'm missing something?

Using the site-packages for debugging purposes, with pyc only, I see a speedup for bazel test (we're using linux-sandbox) spawn strategy

i.e

import torch
pyc only

def maybe_precompile(ctx, srcs):
"""Computes all the outputs (maybe precompiled) from the input srcs.

See create_binary_semantics_struct for details about this function.

Args:
    ctx: Rule ctx.
    srcs: List of Files; the inputs to maybe precompile.

Returns:
    List of Files; the desired output files derived from the input sources.
"""
pycs = []
for src in srcs:
    if "site-packages" not in src.path:
        pycs.append(src)
        continue
    basename = src.basename
    dirname = src.path.rsplit("/", 1)[0].split("/", 2)[2]

    # pyc_out = dirname + "/__pycache__/" + basename.replace(".py", ".cpython-310.pyc")
    pyc_out = dirname + "/" + basename.replace(".py", ".pyc")
    pyc = ctx.actions.declare_file(pyc_out)
    args = ctx.actions.args()
    args.add("-c")
    args.add("""from sys import argv;from py_compile import compile, PycInvalidationMode;compile(argv[1], cfile=argv[2], invalidation_mode=PycInvalidationMode.UNCHECKED_HASH)""")
    args.add(src)
    args.add(pyc)
    runtime=ctx.toolchains["@bazel_tools//tools/python:toolchain_type"].py3_runtime
    ctx.actions.run(
        inputs = [src],
        outputs = [pyc],
        executable=runtime.interpreter,
        arguments=[args],
        tools=[runtime.files]
    )
    # pycs.append(src)
    pycs.append(pyc)
return pycs

pyc + py doesn't see a speedup. Even if I set the interpreter to use PYTHONDONTWRITEBYTECODE=x which disables pyc creation upon import, I don't see a speedup 🤔

pyc + py

def maybe_precompile(ctx, srcs):
"""Computes all the outputs (maybe precompiled) from the input srcs.

See create_binary_semantics_struct for details about this function.

Args:
    ctx: Rule ctx.
    srcs: List of Files; the inputs to maybe precompile.

Returns:
    List of Files; the desired output files derived from the input sources.
"""
pycs = []
for src in srcs:
    if "site-packages" not in src.path:
        pycs.append(src)
        continue
    basename = src.basename
    dirname = src.path.rsplit("/", 1)[0].split("/", 2)[2]

    # pyc_out = dirname + "/__pycache__/" + basename.replace(".py", ".cpython-310.pyc")
    pyc_out = dirname + "/" + basename.replace(".py", ".pyc")
    pyc = ctx.actions.declare_file(pyc_out)
    args = ctx.actions.args()
    args.add("-c")
    args.add("""from sys import argv;from py_compile import compile, PycInvalidationMode;compile(argv[1], cfile=argv[2], invalidation_mode=PycInvalidationMode.UNCHECKED_HASH)""")
    args.add(src)
    args.add(pyc)
    runtime=ctx.toolchains["@bazel_tools//tools/python:toolchain_type"].py3_runtime
    ctx.actions.run(
        inputs = [src],
        outputs = [pyc],
        executable=runtime.interpreter,
        arguments=[args],
        tools=[runtime.files]
    )
    pycs.append(src)
    pycs.append(pyc)
return pycs

@Ryang20718
Copy link

image

😅 I was under the impression that py + pyc in the same directory would not result in speedups https://peps.python.org/pep-3147/

@adam-azarchs
Copy link
Author

No, you're right, I missed that bit. So if .py is present, .pyc must be in __pycache__, and otherwise it can't be. That does certainly make it more difficult to defer the decision of pyc-only vs py+pyc.

@Ryang20718
Copy link

Ryang20718 commented Mar 15, 2024

I think we have the following paths:

  • pyc + py --> we'd need to use the magic tag + pycache directory
  • pyc only --> no need for magic tag, but we can only support one version of python
  • py only --> default behavior

Going back to why would pyc be useful and tradeoffs of generating pyc, I think the pros / cons of each decision would be

Situation Pros Cons
pyc + py import speedups complexity with magic tags
pyc simple to implement issues with multiple python version and would probably require some sort of opt in functionality since this might break downstream consumers
py default behavior default behavior

Coming at this from a python rules consumer perspective, It would seem the first option of pyc + py just moves the pyc creation stage earlier rather than at runtime (import time)

Default behavior (unless one uses a cli arg otherwise, is to generate pyc files in pycache upon import)
https://docs.python.org/3/using/cmdline.html#cmdoption-B

Wondering what the use case for pyc only is (just curious 😅)?

@adam-azarchs
Copy link
Author

There's basically three use cases for pyc-only.

  1. The bad one is if you're trying to keep your source code secret from a casual investigator. It won't do a great job of that, because reverse-compiling is pretty easy, but at least keeps search tools from indexing it.

  2. Maybe you just want to put a speed bump in the way of your customers who know just enough python to get themselves into trouble making changes to your code to "fix" things they perceive as bugs and then sending problem reports to your support team with stack traces that don't make any sense any more. Not that I'd have any experience with that sort of thing...

  3. The third, and probably best, use case would be to reduce the size and file count of the package you are distributing, particularly for deployment in k8s or serverless contexts. It would also probably help in a bazel test context, particularly with remote execution. That's the kind of use case where you might want pyc-only for an opt build but would want to keep the sources around for a dbg build.

@Ryang20718
Copy link

Assuming we want to solve for the use case of

  • pyc only ( will need to document caveats for multi python version build environments)
  • pyc + py (default behavior)

Would consumers have something like? As well as some sort of flag that can set this attribute for all py_libraries? i.e

PY_SRCS_ATTRS = union_attrs(
?

py_library(
   name = "<name>",
   srcs = ["src.py"],
   deps = ["<deps>"],
   pyc_only = true, # default is pyc + py? (using magic tag + pycache)
)

If we were to do it this way, I think we'd need to make changes in a couple of areas? (haven't worked too much with starlark rules, so I may be doing something wrong here)

Would we need to

  • add a field for pyInfo here to allow users to select pyc or pyc +py
  • modify transitive depsets field to filter for files based on pyc or pyc +py setting
  • modify Attributes here to accept this new pyc_only mode?

@adam-azarchs
Copy link
Author

I don't think we'd want the attribute to be boolean; as discussed above I think there's probably 4 modes of interest:

  1. No compilation (.py-only)
  2. .py+__pycache__/*.pyc with CHECKED_HASH mode.
  3. .py+__pycache__/*.pyc with UNCHECKED_HASH mode.
  4. .pyc-only.

The difference between CHECKED_HASH and UNCHECKED_HASH is probably something we'd want to set at the toolchain level, though I can imagine situations where you might want to be more selective.

I'd probably call it compile_mode or maybe pyc_mode or even just pyc.

It would probably make sense for the initial default to be 1, for backwards compatibility, though really we'd also probably want the default to be "default", meaning use whatever was set on the toolchain configuration1, so users can configure it globally but override locally.

So, the place to start would probably be with the toolchain configuration. And for an MVP at least, we can just leave it there and not yet expose options on a per-library basis.

We don't actually have to change PyInfo if we do it this way (non-deferred decision as to whether to compile).

However, if we allow ourselves to make changes to PyInfo, we can defer the mode decision until later on. In that case, we'd need PyInfo to keep track depsets of transitive files for

  1. pyc-only distribution: .pyc files where the .py files not in __pycache__, except for targets where compilation was disabled, in which case we need the .py files.
  2. py-only distribution files. This would include any pyc files which might have been configured for pyc-only mode, e.g. something the author is especially interested in obfuscating.
  3. __pycache__/*.pyc files corresponding to .py files in 3. Downstream targets that wanted py+pyc mode would use this together with 3.
  4. (optional) __pycache__/*.pyc for .py files where pyc-only was disabled but py+pyc was allowed, if we want to support that. A pyc-only distribution might choose to use this or not depending on their size/speed tradeoff calculation. Or we could just include this in 1 unconditionally. It probably isn't worth supporting this case, but listing it here for completeness.

A distribution could use either 1 (mostly pyc-only), 1+4 (mostly pyc-only plus compilation of whatever isn't), 2 (mostly-py-only), or 2+3 (mostly-py+pyc).

Footnotes

  1. Typically, defaults that delegate to some sort of global configuration would be using None, however a user would be justifiably confused by compile_mode = None resulting in compilation.

@rickeylev
Copy link
Collaborator

Yes, have a string-valued setting. The default should be "default", or "auto", or some other sentinel. This allows easier detecting of "did someone set this?", plus some freedom in implementation.

who sets checked vs unchecked hash mode? other more selective cases

Yeah, this is a tricky question. For "released" code, unchecked hash is the obvious choice. When doing development, you want checked hash (or disable pyc entirely) for sources you're going to modify, and unchecked for ones you haven't.

This also makes me think that, for something like pip.parse-generated libraries, it would make sense for those to set unchecked hash (i.e. a library-level setting), because you're really unlikely to modify those (if you are, you might as well just modify the generated build file, too).

Another case that comes to is generated code (e.g. py_proto_library). Such code has to go through a build step to see changes anyways, so adding an extra call to precompile isn't that much overhead. (In the case of py_proto_library, because protos tends to generate a lot of files, skipping the py->pyc is a net win).

I can imagine some extra set of flags to control behavior here from the "outside" (I'm thinking a label flag that points to a richer config), but lets keep it simple for now -- I think a library level setting for this makes sense (with the 5 modes listed: auto, py-only, pyc-only, pyc-checked-hash, pyc-unchecked-hash).

@adam-azarchs
Copy link
Author

A trickier case to handle where you'd want to use UNCHECKED_HASH is unit tests, for which bazel should always be guaranteeing the .pyc is up to date (assuming they're running in either sandbox or remote), so you might as well take that optimization.

@Ryang20718
Copy link

Was doing some prototyping to pipeline this pyc_mode through to PyInfo and had a couple of questions

I notice that in areas wheremaybe_precompile is used, PyInfo is generated with the outputs of maybe_precompile. I.e here

Assuming we'd want the end user to be able to do something like

py_library(
    name = "blah",
    srcs = ["a.py"],
    ...
    pyc_mode = "pyc_only",
)

Would we need to propagate from PY_SRCS_ATTRS? Or is there a way to do this solely via PyInfo?

  1. For creating a pre-compiler toolchain, after reading some docs + Richard's suggestions, I was wondering if there's a way to pass in the interpreter from the current python toolchain when instantiating this precompiler toolchain?

Currently, I just use a sh_binary which wraps a local python, but ideally it would match the python toolchain defined?

def _py_precompiler_toolchain_impl(ctx):
    return [platform_common.ToolchainInfo(
        interpreter = ctx.attr.interpreter,
        precompiler_src = ctx.attr.precompiler_src,
        python_version = ctx.attr.python_version,
    )]

py_precompiler_toolchain = rule(
    doc = """
    This rule exists so that the precompiler toolchain can be set for precompiling python files to pyc without.
    This cannot be a regular py_binary, otherwise there will be a circular dependency.
    """,
    implementation = _py_precompiler_toolchain_impl,
    attrs = {
        "interpreter": attr.label(cfg="exec"),
        "precompiler_src": attr.label(cfg="exec", default = Label("//tools/precompiler:precompiler_py")),
        "python_version": attr.string(doc = "python version. I.e 310 for python 3.10. Used for generating magic tag for pyc file", mandatory = True),
    },
)

  1. Multiple Python Versions. we currently haven't yet migrated to bzlmod. we also aren't using multiple python versions so this comment may be out of date.

Does bzlmod support pytest with multiple pip/python versions? I see there's this comment here

@rickeylev
Copy link
Collaborator

Coincidentally, we have some internal work in this area planned, so I thought I'd post some of our analysis that seemed relevant.

Note that this isn't documenting a decision. It's merely trying to inform from looking at the use case we're trying to solve (in brief, precompiling enabled by default) and the constraints we have.


A precompile attribute with 5 values seems very useful:

  • all: include both py and pyc files
  • only: include only pyc files
  • none: include only py files
  • inherit: pick one of the above from a flag
  • always: always include a pyc file, but decide if py is included from a flag

It seems like there are 3 flags to decide behavior?

  • --precompile_default: used when a target doesn't have a value set (or when its inherit). It would have values:

    • all, only, none: same as for the attribute
    • generated_only: For generated files, acts as "only". For source files, acts as "none".
    • generated_all: For generated files, as as "all", for source files, as as "none"
  • --precompile_force: Like --precompile_default but overrides target level settings. Probably the same value as --precompile_default. (The use cases for this seem valid but niche, see below).

  • --precompile_force_retain_sources: Forces that py sources are retained. e.g. precompile=only acts like precompile=all (mostly an alternative to precompile_force, see below)


re: flag vs attribute precedence: this is a really tough question. We could come up with reasonable scenarios where both are sensible behaviors. Overall, the attribute taking precedence seems like the better default behavior. The main case for having the flag take precedence is: what if a target you don't own sets precompile=only (having just pyc makes sense if it just generates code), but you have to deploy to an environment where that pyc won't be valid (i.e. your build is for Python 3.10, but you deploy to something with python 3.9). This is somewhat of an edge case for us, but seems like it would be much more common in OSS. This case is also where the --precompile_force_retain_sources flag comes from: by forcing sources to be kept, the invalid pyc wouldn't be an issue in practice.


The above case also implies that there's a distinction between binaries and libraries in this regard (i.e. a binary level-setting applies transitively). As a general rule, if there's a global flag to set the default or force the state, then it probably should also be a binary-level attribute. It's not always possible or desirable to set a global flag, especially if there's just a handful of targets that need it.

That said, a binary level setting also comes with problems. As an example, given:

  • Binary setting has precedence
  • binary sets precompile=all
  • library elsewhere sets precompile=only
  • Transitions are not allowed to be used (performance reasons)

Then the problem is the library can't directly see what the binary needs, and only outputs foo.pyc, when it should have output both foo.py and __pycache__/foo.pyc. We think an aspect can address this, but its illustrative of some of the caveats and complex interactions that can happen (and also how transitions make it easy for an unfortunately high cost).

@rickeylev
Copy link
Collaborator

Would we need to propagate from PY_SRCS_ATTRS? Or is there a way to do this solely via PyInfo?

I'm not sure I understand the question. What goes into PyInfo is derived from the target attributes (srcs and deps, usually). The core logic, when unrolled, should basically be:

PyInfo(
  direct_sources = ctx.files.srcs,
  transitive_sources = direct_sources + [d[PyInfo].transitive_sources for d in deps],
  direct_pyc = get_precompile_outputs(direct_sources),
  transitive_pyc = direct_pycs + [d[PyInfo].transitive_pyc for d in deps]
  ...
)

Does that answer your question?

a way to pass the current python toolchain interpreter in

I think you're hinting at the issue of you need an exec-config interpreter, but the python toolchain only has a target-config interpreter?

I think your 3 basic options are:

  1. Use manually defined exec groups, see https://bazel.build/extending/exec-groups. A helper target is probably necessary, e.g. //python:current_py_toolchain or similar.
  2. Use auto exec groups, see https://bazel.build/extending/auto-exec-groups. I think this requires a separate toolchain type with cfg=exec parts.
  3. Create a "py_exec_interpreter_binary" rule and have an implicit, cfg=exec, dependency on it. All it does is create an executable that acts like the interpreter. This is akin to what you're doing with sh_binary, except a dedicated rule to do it instead.

I think a combination of (2) and (3) is probably the best? (2) invokes toolchain resolution, which permits the target config to influence selection (i.e. allows the target config pythonversion=3.10 to select a toolchain that can generate 3.10 bytecode). (3) allows abstracting how the precompiler binary is implemented (i.e. it could be a pre-compiled binary or simply lookup the "regular" interpreter).

ideally it'd match the current toolchain

Yes, but recall how toolchain resolution and matching works: it's taking the key:value constraints and looking over lists to find something that matches. I say this because it sounds like you might be thinking/desiring this: regular_toolchain = match(X, "regular"); precompile_toolchain = create_precompile_toolchain_from(regular_toolchain), but that isn't how toolchains operate. It's more like this: regular_toolchain = match(X, "regular"); precompile_toolchain = match(X, "precompile").

Does bzlmod support pytest with multiple pip/python versions?

Yes, it supports multiple versions. That comment is out of date.

@adam-azarchs
Copy link
Author

Then the problem is the library can't directly see what the binary needs, and only outputs foo.pyc, when it should have output both foo.py and __pycache__/foo.pyc. We think an aspect can address this, but its illustrative of some of the caveats and complex interactions that can happen (and also how transitions make it easy for an unfortunately high cost).

This can be avoided if we're willing to complicate the PyInfo to separately track sets of files for each use case as I suggested in #1761 (comment). This is probably not a great approach, however, as there are probably a lot of existing rules out there that propagate PyInfo and might start silently dropping these additional fields.

@rickeylev
Copy link
Collaborator

rickeylev commented May 12, 2024

I've been working on adding this functionality. You can see the WIP here: main...rickeylev:rules_python:precompile.toolchain

Everything appears to work, so I've started cleaning it up.

The basic gist of what I have implemented is:

  • A new toolchain to carry an exec-config interpreter and precompiler tool.
  • The precompiler can be either run with the exec-config interpreter, or be a prebuilt binary.
  • The precompiler that comes with rules_python is a fully multiplexed, sandbox-aware,
    persistent worker. (I started on this path for fun, but then it turned out to be a
    huge performance benefit -- 60+ seconds vs 4 seconds when building the 2000 file
    bzlmod example)
  • Provider fields PyInfo.transitive_pyc_files, PyInfo.direct_pyc_files
  • Several new attribute and flags to control things:
    • precompile=inherit|enabled|disabled|if_generated_source: controls if a target
      performs precompilation.
    • --precompile=auto|enabled|disabled|if_generated_source|force_enabled|force_disabled
    • precompile_invalidation_mode=auto|checked_hash|unchecked_hash: determines what invalidation mode to use.
    • precompile_optimize_level=<int>: Controls the optimization level.
    • --precompile_source_retention=auto|keep_source|omit_source|omit_if_generated_source: controls if the py is included in the output or not.
    • (binaries/tests only) pyc_collection: controls if binaries explicitly include pyc files
      from dependencies. This derives from the Google situation, where controlling behavior
      at the binary level makes it easier to incrementally roll out the change.
    • --precompile_add_to_runfiles=auto|always|decided_elsewhere: controls if the pyc files
      are added to the runfiles of the generating target (e.g. if py_library adds its pyc
      files to the runfiles it returns). This is used in combination with pyc_collection.

The two basic ways to use precompilation are:

  1. Each target (e.g. py_library) compiles its files and they're included in runfiles and it Just Works. This will eventually be the default mode.
  2. Libraries can create compiled files, but not include them in the runfiles. Binaries
    can then opt-into including the pyc files. This is more for the Google situation, but
    really any situation where enabling precompiling at large doesn't Just Work.

If you like optimizing Python code, then optimizing the precompiler tool might be fun. Unfortunately, precompiling adds a decent amount of overhead. As an example, the bzlmod example has about 2,000 py files. This takes about 60+ seconds to precompile as a regular action (1 py file per action) 🙁.

So I implemented a second precompiler that is a multiplexed, sandbox-aware, persistent worker. This reduced build time from 60s to about 4 seconds (yes, single-digit). I think we can do better, but I haven't explored where the bottleneck is. If we can change it to be a prebuilt binary, then we'd have more options for optimizing it, too.

@adam-azarchs
Copy link
Author

A few observations:

Persistent worker

This approach worries me a bit because there's a lot of rough edges around using a persistent worker with remote execution (which my team uses exclusively).

Python interpreter startup time is a lot; it's not surprising that you get significant benefits from a persistent worker, especially since AFAKT the implementation for py_compile is to actually fully import the module, which ends up processing all of its transitive imports etc. Could probably improve on that very significantly with a purpose-built binary. I haven't actually looked into how hard that would be to write, but at first glance at least if you're allowed to link libpython it looks like you'd need to

  1. PyImport_GetMagicNumber for the header
  2. Add the flags
  3. Add source hash. This part is maybe the tricky part, seems like _imp_source_hash is the right call to make here but that's a private symbol. You could go through the python extension module _imp.source_hash but at that point I'm not sure how much benefit from avoiding a full interpreter is preserved...
  4. call Py_CompileStringObject to get the code object
  5. PyMarshal_WriteObjectToString to serialize it

pyc-only mode

I'm not sure this works in if precompile="disabled" is set on any dependencies. I think you can get around that by adding the sources into the "pyc" depset in that situation, though that muddies the meaning a bit.

You're also not supporting the use case of permitting specific libraries to be pyc-only without setting it overall. This could be done; similarly to precompile=disabled you'd need to add the pyc to the normal sources as well as as pycs in order to support cases that were not collecting pyc.

@rickeylev
Copy link
Collaborator

re: persistent worker: Yes, all good points. The precompiler works as a non-worker, regular worker, or multiplexed worker, and there's a flag to set the execution_requirements for the precompile action.

re: py_compile fully imports: It doesn't import (execute) the py code. It just uses the importlib libraries to read the file and compile it to pyc. I found that surprising, but maybe not so much if this is closely coupled to the import system's processing of things (e.g. _imp_source_hash being part of import.c).

re: linking with libpython: Yes, that'd be awesome. I'd love to get a cc_binary that statically links with libpython working. For something like the precompiler, which is a single file that only uses the stdlib, such a thing seems doable.

re: library-level pyc-only: Ah yeah, I forgot about that. That should be doable with target-level attribute to control that.

@adam-azarchs
Copy link
Author

I forgot that you can't actually choose to be pyc-only at the py_binary level because the file naming requirements are different for pyc-only vs. py+pyc. So that decision needs to be done at the py_library level, but can have a global default at least.

I'll maybe take a time-boxed crack at trying to get such a cc_binary working tomorrow.

@rickeylev
Copy link
Collaborator

re: remote builds, performance, persistent workers: A thought: Right now, each source file is processed with a separate action. Some performance could be regained with non-worker execution by sending all of a target's sources to a single invocation. This would at least change the overhead to once-per-target instead of once-per-file. For targets with many sources (e.g. something from pypi), this would probably help a decent amount.

@adam-azarchs
Copy link
Author

This would at least change the overhead to once-per-target instead of once-per-file.

One of our repos using remote execution:

$ bazel query 'kind(py_library, //...) + kind(py_test, //...) + kind(py_binary, //...)' | wc -l
2660

@adam-azarchs
Copy link
Author

Didn't really have time to figure out a truly portable way to do the source hashing, but I did try a different, much simpler python implementation:

from sys import argv
import py_compile

src_path, src_name, dest, optimize, checked = argv[1:]

py_compile.compile(
    src_path,
    cfile=dest,
    dfile=src_name,
    optimize=int(optimize),
    invalidation_mode=(
        py_compile.PycInvalidationMode.CHECKED_HASH
        if bool(int(checked))
        else py_compile.PycInvalidationMode.UNCHECKED_HASH
    ),
)

This avoids argparse importing of all of the stuff you need for a persistent worker, and seems to reliably take less than half the time as your precompile.py when invoked on a single file. This isn't that surprising; what argparse is doing is pretty expensive if you're going to be doing it a huge number of times, though it's worth it for a longer-lived process.

$ time (for f in $(git ls-files '*.py'); do python3 tools/precompiler/precompiler.py --src $f --src_name $f --pyc /tmp/temp2.pyc --invalidation_mode UNCHECKED_HASH --optimize=0; done)

real    0m10.133s
user    0m8.744s
sys     0m1.381s
$ time (for f in $(git ls-files '*.py'); do python3 tools/precompiler/precompiler2.py $f $f /tmp/temp.pyc 0 0; done)

real    0m3.492s
user    0m2.867s
sys     0m0.600s

This does suggest that the persistent worker implementation should at the very least be kept separate from the single-file implementation. Or at least should process imports lazily as-needed.

@rickeylev
Copy link
Collaborator

Oh, nice analysis. I see similar. All those unused imports add up.

re: lazy imports: Yeah, I'll do this, at the least. This is an easy change to make.

re: splitting worker impl into another file: That's a good idea, too; it would avoid having to parse all the worker-related code. It'll have to wait for another PR, though.

The argparse import itself isn't too bad, so I'm going to keep the --flag based calling convention. Actually, its more an @file convention with the args within in. We don't know if the action will use a worker or not, so the args have to always be passed to the param file.

github-merge-queue bot pushed a commit that referenced this issue May 18, 2024
This implements precompiling: performing Python source to byte code
compilation at build time. This allows improved program startup time by
allowing the byte code compilation step to be skipped at runtime.

Precompiling is disabled by default, for now. A subsequent release will
enable it by default. This allows the necessary flags and attributes to
become available so users can opt-out prior to it being enabled by
default. Similarly, `//python:features.bzl` is introduced to allow
feature detection.

This implementation is made to serve a variety of use cases, so there
are several attributes and flags to control behavior. The main use cases
being served are:
* Large mono-repos that need to incrementally enable/disable
precompiling.
* Remote execution builds, where persistent workers aren't easily
available.
* Environments where toolchains are custom defined instead of using the
ones created by rules_python.

To that end, there are several attributes and flags to control behavior,
and the toolchains allow customizing the tools used.

Fixes #1761
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants