
Reduce wheel package size for faiss-gpu CUDA 11.0 build #57

Open · kyamagu opened this issue Apr 11, 2022 · 19 comments
Labels: enhancement (New feature or request)

kyamagu (Owner) commented Apr 11, 2022

The CUDA 11.0 build in #56 bloats the wheel package size from 85.5 MB to 216.5 MB. We need to investigate how to reduce the file size.

kyamagu added the enhancement label on Apr 11, 2022
kyamagu (Owner) commented Apr 11, 2022

Relevant: pytorch/pytorch#56055

kyamagu (Owner) commented Apr 11, 2022

It seems one approach is to drop architecture-specific binaries from the CUDA static libraries via nvprune, like this:

nvprune \
  -gencode arch=compute_60,code=sm_60 \
  -gencode arch=compute_70,code=sm_70 \
  -gencode arch=compute_75,code=sm_75 \
  -gencode arch=compute_80,code=sm_80 \
  -gencode arch=compute_80,code=compute_80 \
  -o /usr/local/cuda/lib64/libcublas_static_slim.a \
  /usr/local/cuda/lib64/libcublas_static.a

Currently there are four static-library dependencies, and applying nvprune slightly reduces the binary size.

  • libcublas_static.a
  • libcublasLt_static.a
  • libcudart_static.a
  • libculibos.a

In Python 3.9, the original file size of _swigfaiss.cpython-39-x86_64-linux-gnu.so was 341MB; applying nvprune to all the static libs brings it down to 310MB. This is still huge.
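For reference, a minimal sketch of the pruning step across the archives (paths assume the default CUDA install location; libcudart_static.a and libculibos.a contain little or no device code, so pruning them gains almost nothing):

# Architectures to keep; everything else is stripped from the fat binaries.
GENCODE="-gencode arch=compute_60,code=sm_60 \
  -gencode arch=compute_70,code=sm_70 \
  -gencode arch=compute_75,code=sm_75 \
  -gencode arch=compute_80,code=sm_80 \
  -gencode arch=compute_80,code=compute_80"

# The two cuBLAS archives carry nearly all of the device code.
for lib in libcublas_static libcublasLt_static; do
  nvprune $GENCODE \
    -o "/usr/local/cuda/lib64/${lib}_slim.a" \
    "/usr/local/cuda/lib64/${lib}.a"
done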

kyamagu (Owner) commented Apr 11, 2022

The major problem is that CUDA 11.0 splits the cublasLt API into a separate static lib, which seems to significantly increase the final binary size. In CUDA 10.x, the cublasLt API was part of a single static lib.

libcublasLt_static.a 224M
libcublas_static.a 82M
libcudart_static.a 910K
libculibos.a 31K

kyamagu (Owner) commented Apr 11, 2022

Strangely, faiss does not use the cublasLt API. But when omitting -lcublasLt_static from the linker flags in setup.py, we see the following error on import. Why does that happen?

ImportError: /workspace/faiss-wheels/build/lib.linux-x86_64-3.9/faiss/_swigfaiss.cpython-39-x86_64-linux-gnu.so: undefined symbol: cublasLtMatrixTransformDescDestroy
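One plausible explanation (an assumption worth checking, not something confirmed here) is that libcublas_static.a itself carries undefined references to cublasLt symbols; dropping -lcublasLt_static then leaves them unresolved in the extension, and the failure only surfaces when the module is loaded. nm can confirm this:

# If the symbol appears with a 'U' (undefined) marker in libcublas_static.a,
# cublas depends on cublasLt even though faiss itself never calls it.
nm /usr/local/cuda/lib64/libcublas_static.a 2>/dev/null \
  | grep cublasLtMatrixTransformDescDestroy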

kyamagu (Owner) commented Apr 11, 2022

OK, changing the order of the linker flags in setup.py seems to reduce the binary size.
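For context: with static archives, ld scans left to right and extracts only the members that satisfy symbols still unresolved at that point, so the order of the archives determines how much of each library is pulled in. A hypothetical link line illustrating the idea (not the actual setup.py flags):

# Archives that reference symbols must come before the archives that
# provide them; placing cublasLt_static after cublas_static lets the
# linker extract only the cublasLt members that cublas actually needs.
g++ -shared -o _swigfaiss.so build/*.o \
  -lcublas_static -lcublasLt_static -lcudart_static -lculibos \
  -lpthread -ldl -lrt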

kyamagu (Owner) commented Apr 12, 2022

With CUDA 11.6, the resulting wheel grows further, to 345MB on Linux. After nvprune, we get 276MB. This is still not good, as the PyPI default size limit is 60MB.

kyamagu (Owner) commented Apr 12, 2022

An alternative is to give up static linking and rely on dynamic linking. This would significantly reduce the wheel size but would require users to install the CUDA runtime libraries separately.

kyamagu (Owner) commented Nov 17, 2022

With the avx2 extension, the package is ~430MB.

kyamagu (Owner) commented Jan 5, 2023

It seems there are CUDA runtime packages on PyPI.
https://pypi.org/project/nvidia-cuda-runtime-cu11/
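A sketch of how that could be combined with dynamic linking (package names assume the cu11 series; the extension would additionally have to locate these libraries at runtime rather than bundling them):

# Pull the CUDA runtime and cuBLAS from PyPI instead of shipping them in the wheel.
pip install nvidia-cuda-runtime-cu11 nvidia-cublas-cu11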

theLastOfCats commented

Hi!

Did you consider placing the package on a GitLab PyPI index or publishing it to Docker Hub as an image?

Ping me if you need help.

kyamagu (Owner) commented Mar 16, 2023

@theLastOfCats You can manually download packages from the release page.

Di-Is commented Apr 22, 2024

Hi @kyamagu!

For your reference, by switching from static to dynamic linking of CUDA, the wheel size has been reduced to 63MB.
It is dynamically linked against the shared libraries from the nvidia-cublas-cu12 and nvidia-cuda-runtime-cu12 packages, which are published on PyPI.

It seems possible to reduce the wheel size to less than 60MB by either narrowing down the target architecture or switching from static linking to dynamic linking of OpenBLAS.

Fork repository: https://github.com/Di-Is/faiss-wheels/tree/pypi-cuda

Build Script
# Test CMD
CPU_TEST_CMD="pytest {project}/faiss/tests && pytest -s {project}/faiss/tests/torch_test_contrib.py"
GPU_TEST_CMD="cp {project}/faiss/tests/common_faiss_tests.py {project}/faiss/faiss/gpu/test/ && pytest {project}/faiss/faiss/gpu/test/test_*.py && pytest {project}/faiss/faiss/gpu/test/torch_*.py"

# Common Setup
export CIBW_BEFORE_ALL="bash scripts/build_Linux.sh"
export CIBW_TEST_COMMAND="${CPU_TEST_CMD}"
export CIBW_BEFORE_TEST_LINUX="pip install torch --index-url https://download.pytorch.org/whl/cpu"
export CIBW_ENVIRONMENT_LINUX="FAISS_OPT_LEVEL=${FAISS_OPT_LEVEL:-generic} BUILD_PARALLELISM=${BUILD_PARALLELISM:-3} CUDA_VERSION=12.1"
export CIBW_DEBUG_KEEP_CONTAINER=TRUE

if [ "$FAISS_ENABLE_GPU" = "ON" ]; then
    if [ "$CONTAINER_GPU_ACCESS" = "ON" ]; then
        export CIBW_TEST_COMMAND="${CIBW_TEST_COMMAND} && ${GPU_TEST_CMD}"
        export CIBW_CONTAINER_ENGINE="docker; create_args: --gpus all"
        export -n CIBW_BEFORE_TEST_LINUX
    fi
    export CIBW_ENVIRONMENT_LINUX="${CIBW_ENVIRONMENT_LINUX} FAISS_ENABLE_GPU=ON"
    export CIBW_REPAIR_WHEEL_COMMAND="auditwheel repair -w {dest_dir} {wheel} --exclude libcublas.so.12 --exclude libcublasLt.so.12 --exclude libcudart.so.12"
else
    export CIBW_ENVIRONMENT_LINUX="${CIBW_ENVIRONMENT_LINUX} FAISS_ENABLE_GPU=OFF"
    export CIBW_REPAIR_WHEEL_COMMAND="auditwheel repair -w {dest_dir} {wheel}"
fi

python3 -m cibuildwheel --output-dir wheelhouse --platform linux
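A caveat with the --exclude approach: the excluded CUDA libraries must be discoverable at import time. A minimal sketch of how that can be verified, assuming the nvidia-* wheels are installed and use their usual nvidia/<lib>/lib layout (the exact paths are an assumption, not part of the build script above):

# Point the loader at the shared libraries shipped by the nvidia-* wheels,
# then check that the GPU extension imports and can see a device.
SITE=$(python3 -c "import site; print(site.getsitepackages()[0])")
export LD_LIBRARY_PATH="$SITE/nvidia/cublas/lib:$SITE/nvidia/cuda_runtime/lib:$LD_LIBRARY_PATH"
python3 -c "import faiss; print(faiss.get_num_gpus())"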

kyamagu (Owner) commented Apr 23, 2024

@Di-Is CUDA backward compatibility is complicated, and a PyPI release should not expect any external dependencies beyond the few allowed by the manylinux policy. https://github.com/pypa/manylinux

You can build a wheel from the source package for your own environment, but that wheel will not be compatible with other environments.


Di-Is commented Apr 23, 2024

CUDA backward compatibility is complicated,

I believe that installing the appropriate NVIDIA drivers is not a matter of package management but part of system setup, so the responsibility for that lies with the user.
(This is also true for other package managers, e.g., Conda.)
Fortunately, installing the latest driver will work with any version of CUDA and the binaries linked against it.

the PyPI release should not expect any external dependency other than a few linked to CPython binary.

It is correct that wheel files should be self-contained.
However, this matter has been discussed in auditwheel issue #368, and a feature to relax the restriction has been merged into auditwheel.

Di-Is commented Apr 23, 2024

You can build a source package for your environment, but that wheel will not be compatible with other environments.

If the following conditions are met, Faiss installed from the created wheel should work properly.

  1. Run Faiss in an environment with an NVIDIA driver installed that is compatible with the CUDA version being used.
  2. Do not load multiple versions of CUDA shared libraries in a single process (to avoid troublesome issues like symbol conflicts).

Regarding 1., as mentioned earlier, it is the user's responsibility.
Regarding 2., I believe the system/package configuration should be reviewed.

kyamagu (Owner) commented Apr 24, 2024

@Di-Is

However, this matter has been discussed in auditwheel issue pypa/auditwheel#368 (comment), and a feature to relax the restriction has been merged into auditwheel.

This is not a matter of auditwheel but a more fundamental issue in Python dependency management. Under the current PyPI policy, managing GPU dependencies is hard unless there is a standardized toolchain to build and test wheels for combinations of compiler / CUDA / driver / CPU arch / OS / Python versions, and, more recently, compatibility with other packages like PyTorch. At the very least, the current PyPI distribution is not designed for multiple CUDA runtimes. If we ignore that and ship wheels for a very specific runtime configuration, we will end up with a flood of error reports both here and upstream, which is obviously not a good thing. Conda is different from PyPI in that conda does manage runtime environments (e.g., CUDA).

My current approach is to at least keep a source distribution that works in any custom environment. Right now I can't spend time on the GPU binary distribution, but you could try designing a build-and-test matrix to resolve the issues across the configurations above.

CandiedCode commented

@theLastOfCats You can manually download packages from the release page.

Hi @kyamagu,

Until the PyPI situation is resolved, will all releases have wheel packages available for download? Currently only 1.7.3 has them; they are missing from the 1.7.4 and 1.8.0 releases.

Thanks!

kyamagu (Owner) commented Jul 5, 2024

@CandiedCode Currently, there is no plan to support GPU binary wheels. You can build from the source package in your own environment.
