Is there a way to use parallelism when building the wheels? #1088
Comments
You can use the GitHub Actions matrix, or some equivalent feature in the CI system you use. I did that recently on setproctitle and the build went down from hours to 10 minutes. The result is some 30 parallel jobs.
That does help, but the individual jobs still take 10 minutes to complete. When I build the wheel locally using Bazel, after clearing Bazel's cache, the build and tests finish in 1 minute. So there's still something making things very slow in cibuildwheel compared to other strategies for building the wheel. A likely candidate is that each worker is only using one thread to build the C++ code, so it goes one... file... at... a... time... 10 minutes on GitHub (the worst one takes 20 minutes, but 10 seems to be the common value); 1 minute locally (on a 12-core machine).
CI runners have at most 2 cores on most CIs, and run on shared resources. So that's 6x slower than local, or the equivalent of 6 minutes of local time. Plus cibuildwheel is downloading Python, setting up multiple virtual environments, downloading dependencies, running tests, etc. - that can easily account for the remaining 4 minutes. So that sounds perfectly reasonable. Your build should be using both available cores as long as you've set it up to do that. Running cibuildwheel itself in parallel would not give you much at all. Though I guess you could do it yourself with
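As a minimal sketch (not from this thread) of what "setting it up to do that" could look like in a GitHub Actions step: cibuildwheel's CIBW_ENVIRONMENT option can forward a job count into the build environment. Which variable actually takes effect depends on your own build backend, so treat the values below as illustrative.

- uses: pypa/cibuildwheel@v2.10.0
  env:
    # MAKEFLAGS is read by make-driven builds, CMAKE_BUILD_PARALLEL_LEVEL by
    # CMake 3.12+; both are just passed through to the build that cibuildwheel runs.
    CIBW_ENVIRONMENT: "MAKEFLAGS=-j2 CMAKE_BUILD_PARALLEL_LEVEL=2"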
If you actually look at what is taking time, it is not downloading dependencies or running tests. In one of my "fast" cases, the setup and teardown take about a minute and the tests also take about a minute. Which leaves 8 minutes of building. I'm already doing only one wheel build per worker, so I'll get no benefit from two invocations. What I need is for cibuildwheel to use two cores for one build, so that it is not processing one C++ file at a time. I did realize that my worst offenders are actually building numpy and pandas, in addition to my wheel, which is why they take something like 40 minutes instead of 10-20. The actual time spent building my wheel is 12.5 minutes. Maybe I should go poke numpy and pandas to have wheels for
At least numpy and pandas do build in parallel. Setuptools doesn't directly support it (unless you have multiple extensions), but it's easy to patch in; pybind11 and numpy both have utilities for it (and of course CMake via scikit-build, etc. all support it). As long as you are doing that, you aren't wasting that much time by not running cibuildwheel in two threads. Yes, if you provide wheels for a platform your dependencies don't, I'm not sure it's very helpful: users will have to build numpy & pandas to use your packaged binary anyway. Also make sure you are using the same manylinux family they are using, or newer (like manylinux2014 for Python 3.10, which we default to these days). If you go older, you'll have to build them from source if you use them.
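For reference, a minimal setup.py sketch of the pybind11 utility mentioned above; the package and source names are placeholders, not anything from this thread.

# setup.py - sketch of per-extension parallel compilation.
# ParallelCompile patches distutils' compile step so the source files of a
# single extension are compiled concurrently; NPY_NUM_BUILD_JOBS is the
# environment variable it consults for the job count.
from setuptools import setup
from pybind11.setup_helpers import Pybind11Extension, ParallelCompile

ParallelCompile("NPY_NUM_BUILD_JOBS").install()

setup(
    name="mypackage",  # placeholder project name
    ext_modules=[
        Pybind11Extension("mypackage._core", ["src/core.cpp"]),  # placeholder sources
    ],
)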
I don't currently think it makes sense to provide build parallelism within cibuildwheel. There are already options to do it at a lower level (compiler flags) or at a higher level (CI build matrices). Adding it in cibuildwheel is probably not going to be worth the complexity.
One thing that could potentially improve build performance would be to be a bit cleverer about the network I/O. E.g. we could download the next Docker image or Python version while the previous build is running. That might save a little time. But again, would it be worth the added complexity? Not sure...
I like the high-level (CI build matrices) solution; I can split out arch and Linux flavour like below. But since QEMU is painfully slow for aarch64, and cibuildwheel runs a bunch of tests on the wheels after building, the wait for aarch64 to finish at a concurrency of 2 still results in hours of CI. The solution is to additionally split out Python versions. I would however prefer not to hardcode Python versions in CI, rather to let the cibuildwheel
Edit: found a more or less neat way; the updated snippet below evenly distributes current and future Python releases over 5 build jobs, assuming a package that supports 3.6+ (and stable cibuildwheel currently distributing up to ...).
# gh actions
jobs:
  build-wheels:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        include:
          - {os: macos-latest, arch: x86_64, build: "*"}
          - {os: macos-latest, arch: arm64, build: "*"}
          - {os: windows-latest, arch: AMD64, build: "*"}
          - {os: windows-latest, arch: x86, build: "*"}
          - {os: ubuntu-latest, arch: x86_64, build: "*"}
          - {os: ubuntu-latest, arch: aarch64, build: "*[61]-manylinux*"}
          - {os: ubuntu-latest, arch: aarch64, build: "*[72]-manylinux*"}
          - {os: ubuntu-latest, arch: aarch64, build: "*[83]-manylinux*"}
          - {os: ubuntu-latest, arch: aarch64, build: "*[94]-manylinux*"}
          - {os: ubuntu-latest, arch: aarch64, build: "*[05]-manylinux*"}
          - {os: ubuntu-latest, arch: aarch64, build: "*[61]-musllinux*"}
          - {os: ubuntu-latest, arch: aarch64, build: "*[72]-musllinux*"}
          - {os: ubuntu-latest, arch: aarch64, build: "*[83]-musllinux*"}
          - {os: ubuntu-latest, arch: aarch64, build: "*[94]-musllinux*"}
          - {os: ubuntu-latest, arch: aarch64, build: "*[05]-musllinux*"}
    steps:
      - uses: docker/setup-qemu-action@v2
        if: matrix.os == 'ubuntu-latest'
      - uses: pypa/cibuildwheel@v2.10.0
        env:
          CIBW_BUILD_VERBOSITY: 1
          CIBW_ARCHS: ${{ matrix.arch }}
          CIBW_BUILD: ${{ matrix.build }}
(Though that's kind of neat too)
Waiting for cibuildwheel to finish is really painful, even for a single wheel. Is there a flag I can use in order to force it to use multiple threads when building? For example, when building code with make I can specify --jobs 8 to build 8 source files at a time instead of 1 at a time. For scale, cibuildwheel is one hundred (!!!!!) times slower than all the other checks I do. Being able to reduce that to 10x slower by using parallelism would be hugely helpful.